版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
Algorithms
for
Nearest
NeighborSearchPiotr
IndykMITNearest
Neighbor
SearchGiven:
a
set
P
of
n
points
in
Rd
Goal:
a
data
structure,
which
given
a
quepoint
q,
finds
the
nearest
neighbor
p
ofin
PpqOutline
of
this
talkVariantsMotivationMain
memory
algorithms:quadtreeskd-treesLocality
Sensitive
HashingSecondary
storage
algorithms:R-tree
(and
its
variants)VA-fileVariants
of
nearest
neighbor
Near
neighbor
(range
search):
find
one/alpoints
in
P
within
distance
r
from
q
Spatial
join:
given
two
sets
P,Q,
find
allpairs
p
in
P,
q
in
Q,
such
that
p
is
withindistance
r
from
q
Approximate
near
neighbor:
find
one/allpoints
p’
in
P,
whose
distance
to
q
is
atmost
(1+e)
times
the
distance
from
q
to
itsnearest
neighborMotivationDepends
on
the
value
of
d:low
d:
graphics,
vision,
GIS,
etchigh
d:similarity
search
in
databases
(text,
imagesfinding
pairs
of
similar
objects
(e.g.,
copyrviolation
detection)useful
subroutine
for
clusteringAlgorithmsMain
memory
(Computational
Geometry)linear
scantree-based:quadtreekd-treehashing-based:
Locality-Sensitive
HashingSecondary
storage
(Databases)R-tree
(and
numerous
variants)Vector
Approximation
File
(VA-file)QuadtreeSimplest
spatial
structure
on
Earth
!Quadtree
ctd.Split
the
space
into
2d
equal
subsquaresRepeat
until
done:only
one
pixel
leftonly
one
point
leftonly
a
few
points
leftVariants:split
only
one
dimension
at
a
timek-d-trees
(in
a
moment)Range
searchNear
neighbor
(range
search):put
the
root
on
the
stackrepeatpop
the
next
node
T
from
the
stackfor
each
child
C
of
T:if
C
is
a
leaf,
examine
point(s)
in
Cif
C
intersects
with
the
ball
of
radius
r
around
q,
add
Cthe
stackNear
neighbor
ctdNearest
neighborStart
range
search
with
r
=Whenever
a
point
is
found,
update
r
Only
investigate
nodes
with
respect
tocurrent
rQuadtree
ctd.Simple
data
structureVersatile,
easy
to
implementSo
why
doesn’t
this
talk
end
here
?Empty
spaces:
if
the
points
form
sparse
cloudit
takes
a
while
to
reach
themSpace
exponential
in
dimensionTime
exponential
in
dimension,
e.g.,
points
othe
hypercubeSpace
issues:
exampleK-d-trees
[Bentley’75]Main
ideas:only
one-dimensional
splitsinstead
of
splitting
in
the
middle,
choose
thsplit
“carefully”
(many
variations)near(est)
neighbor
queries:
as
for
quadtreesAdvantages:no
(or
less)
empty
spacesonly
linear
spaceExponential
query
time
still
possibleExponential
query
timeWhat
does
it
mean
exactly
?Unless
we
do
something
really
stupid,
query
time
ismost
dnTherefore,
the
actual
query
time
isMin[
dn,
exponential(d)
]
This
is
still
quite
bad
though,
when
the
dimensiois
around
20-30
Unfortunately,
it
seems
inevitable
(both
in
theoand
practice)Approximate
nearest
neighbor
Can
do
it
using
(augmented)
k-d
trees,
byinterrupting
search
earlier
[Arya
et
al’Still
exponential
time
(in
the
worst
caseTry
a
different
approach:for
exact
queries,
we
can
use
binary
searchtrees
or
hashingcan
we
adapt
hashing
to
nearest
neighborsearch
?Locality-Sensitive
Hashing[Indyk-Motwani’98]
Hash
functions
are
locality-sensitive,
ia
random
hash
random
function
h,
for
anypair
of
points
p,q
we
have:Pr[h(p)=h(q)]
is
“high”
if
p
is
“close”
tqPr[h(p)=h(q)]
is
“l(fā)ow”
if
p
is”far”
fromqDo
such
functions
exist
?Consider
the
hypercube,
i.e.,pointsfrom{0,1}dHamming
distance
D(p,q)=
#
positions
onwhich
p
and
q
differDefine
hash
function
h
by
choosing
a
set
Iof
k
random
coordinates,
and
settingh(p)
=projection
of
p
onIExampleTake–
d=10,
p=0101110010–
k=2,
I={2,5}Then
h(p)=11h’s
are
locality-sensitivePr[h(p)=h(q)]=(1-D(p,q)/d)kWe
can
vary
the
probability
by
changing
kk=1k=2distancedistancePrPrHow
can
we
use
LSH
?Choose
several
h1..hlInitialize
a
hash
array
for
each
hiStore
each
point
p
in
the
bucket
hi(p)
of
ti-th
hash
array,
i=1...lIn
order
to
answer
query
qfor
each
i=1..l,
retrieve
points
in
a
bucket
hreturn
the
closest
point
foundWhat
does
this
algorithm
do
?
By
proper
choice
of
parameters
k
and
l,
we
canmake,
for
any
p,
the
probability
thathi(p)=hi(q)
for
some
ilook
like
this:Can
control:Position
of
the
slopeHow
steep
it
isdistanceThe
LSH
algorithm
Therefore,
we
can
solve
(approximately)
the
nearneighbor
problem
with
given
parameter
rWorst-case
analysis
guarantees
dn1/(1+e)
query
ti
Practical
evaluation
indicates
much
better
beha[GIM’99,HGI’00,Buh’00,BT’00]Drawbacks:
works
best
for
Hamming
distance
(although
can
be
generalizeto
Euclidean
space)requires
radius
r
to
be
fixed
in
advanceSecondary
storage
Seek
time
same
as
time
needed
to
transferhundreds
of
KBsGrouping
the
data
is
crucialDifferent
approach
required:in
main
memory,
any
reduction
in
the
numberof
inspected
points
was
goodon
disk,
this
is
not
the
case
!Disk-based
algorithmsR-tree
[Guttman’84]departing
point
for
many
variationsover
600
citations
!
(according
to
CiteSeer)“optimistic”
approach:
try
to
answer
queries
inlogarithmic
timeVector
Approximation
File
[WSB’98]“pessimistic”
approach:
if
we
need
to
scan
the
whdata
set,
we
better
do
it
fastLSH
works
also
on
diskR-tree
“Bottom-up”
approach
(k-d-tree
was“top-down”)
:Start
with
a
set
of
points/rectanglesPartition
the
set
into
groups
of
small
cardinFor
each
group,
find
minimum
rectanglecontaining
objects
from
this
groupRepeatR-tree
ctd.R-tree
ctd.Advantages:Supports
near(est)
neighbor
search
(similarbefore)Works
for
points
and
rectanglesAvoids
empty
spacesMany
variants:
X-tree,
SS-tree,
SR-tree
etcWorks
well
for
low
dimensionsNot
so
great
for
high
dimensionsVA-file
[Weber,
Schek,Blott’98]Approach:In
high-dimensional
spaces,
all
tree-basedindexing
structures
examine
large
fraction
ofleavesIf
we
need
to
visit
so
many
nodes
anyway,
it
isbetter
to
scan
the
whole
data
set
and
avoidperforming
seeks
altogether1
seek
=
transfer
of
few
hundred
KBVA-file
ctd.
Natural
question:
how
to
speed-up
linearscan
?Answer:
use
approximationUse
only
i
bits
per
dimension
(and
speed-up
thscan
by
a
factor
of
32/i)Identify
all
points
which
could
be
returned
aan
answerVerify
the
points
using
original
data
setTime
to
sum
up“Curse
of
dimensionality”
is
indeed
a
curse
In
main
memory,
we
can
perform
sublinear-timesearch
using
trees
or
hashing
In
secondary
storage,
linear
scan
is
p
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 二零二五版建筑臨時(shí)工勞動(dòng)合同示范文本3篇
- 2025年度個(gè)人二手房買賣合同范本示例4篇
- 二零二五版集體合同范本:企業(yè)規(guī)章制度解讀3篇
- 二零二五年度環(huán)保設(shè)施運(yùn)營維護(hù)與管理合同4篇
- 二零二五版旅行社旅游行業(yè)標(biāo)準(zhǔn)化服務(wù)合同4篇
- 2025版外墻清洗作業(yè)施工許可證廢止合同樣本3篇
- 2025年度車庫門智能識(shí)別系統(tǒng)集成合同4篇
- 二零二五版金融機(jī)構(gòu)第三方擔(dān)保業(yè)務(wù)合同樣本3篇
- 2025年度木結(jié)構(gòu)建筑維護(hù)木工承包合同范本4篇
- 銷售業(yè)務(wù)員勞務(wù)合同
- 2025年度房地產(chǎn)權(quán)證辦理委托代理合同典范3篇
- 柴油墊資合同模板
- 湖北省五市州2023-2024學(xué)年高一下學(xué)期期末聯(lián)考數(shù)學(xué)試題
- 城市作戰(zhàn)案例研究報(bào)告
- 【正版授權(quán)】 ISO 12803:1997 EN Representative sampling of plutonium nitrate solutions for determination of plutonium concentration
- 道德經(jīng)全文及注釋
- 2024中考考前地理沖刺卷及答案(含答題卡)
- 多子女贍養(yǎng)老人協(xié)議書范文
- 彩票市場銷售計(jì)劃書
- 骨科抗菌藥物應(yīng)用分析報(bào)告
- 支付行業(yè)反洗錢與反恐怖融資
評(píng)論
0/150
提交評(píng)論