Algorithms for Nearest Neighbor Search (university course slides)



Algorithms for Nearest Neighbor Search
Piotr Indyk
MIT

Nearest Neighbor Search
Given: a set P of n points in $R^d$
Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

Outline of this talk
Variants
Motivation
Main memory algorithms:
  quadtrees
  k-d-trees
  Locality Sensitive Hashing
Secondary storage algorithms:
  R-tree (and its variants)
  VA-file

Variants of nearest neighbor
Near neighbor (range search): find one/all points in P within distance r from q
Spatial join: given two sets P, Q, find all pairs p in P, q in Q such that p is within distance r from q
Approximate near neighbor: find one/all points p' in P whose distance to q is at most $(1+\epsilon)$ times the distance from q to its nearest neighbor

Motivation
Depends on the value of d:
  low d: graphics, vision, GIS, etc.
  high d:
    similarity search in databases (text, images)
    finding pairs of similar objects (e.g., copyright violation detection)
    useful subroutine for clustering

Algorithms
Main memory (Computational Geometry)
  linear scan
  tree-based: quadtree, k-d-tree
  hashing-based: Locality-Sensitive Hashing
Secondary storage (Databases)
  R-tree (and numerous variants)
  Vector Approximation File (VA-file)

Quadtree
Simplest spatial structure on Earth!

Quadtree ctd.
Split the space into equal subsquares
Repeat until done:
  only one pixel left
  only one point left
  only a few points left
Variants:
  split only one dimension at a time
  k-d-trees (in a moment)

Range search
Near neighbor (range search):
  put the root on the stack
  repeat
    pop the next node T from the stack
    for each child C of T:
      if C is a leaf, examine point(s) in C
      if C intersects with the ball of radius r around q, add C to the stack
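The quadtree splitting rule and the stack-based range search can be sketched in Python roughly as follows. This is a minimal 2-D illustration under stated assumptions, not code from the talk: the dict-based node layout, the `capacity` and `min_size` parameters, and the names `build`, `intersects_ball` and `range_search` are all made up for the example, and points are assumed to lie inside the half-open root square. One small deviation from the slide: the leaf test happens when a node is popped rather than once per child.

```python
import math

def build(points, x0, y0, size, capacity=1, min_size=1e-9):
    """Build a 2-D quadtree node covering the square [x0, x0+size) x [y0, y0+size)."""
    # Stop when few points are left or the cell has shrunk to "one pixel".
    if len(points) <= capacity or size <= min_size:
        return {"box": (x0, y0, size), "points": points, "children": []}
    half = size / 2
    children = []
    for cx in (x0, x0 + half):
        for cy in (y0, y0 + half):
            inside = [p for p in points
                      if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            if inside:  # empty subsquares are simply not stored
                children.append(build(inside, cx, cy, half, capacity, min_size))
    return {"box": (x0, y0, size), "points": [], "children": children}

def intersects_ball(box, q, r):
    """True if the square `box` intersects the ball of radius r around q."""
    x0, y0, size = box
    # Per-axis distance from q to the closest point of the square.
    dx = max(x0 - q[0], 0.0, q[0] - (x0 + size))
    dy = max(y0 - q[1], 0.0, q[1] - (y0 + size))
    return dx * dx + dy * dy <= r * r

def range_search(root, q, r):
    """Return all stored points within distance r of q."""
    found = []
    stack = [root]                      # put the root on the stack
    while stack:
        node = stack.pop()
        if not node["children"]:        # leaf: examine its point(s)
            found += [p for p in node["points"] if math.dist(p, q) <= r]
            continue
        for child in node["children"]:
            if intersects_ball(child["box"], q, r):
                stack.append(child)
    return found

pts = [(1.0, 1.0), (2.0, 2.0), (6.0, 6.0), (7.0, 1.0)]
tree = build(pts, 0.0, 0.0, 8.0)
print(sorted(range_search(tree, (1.5, 1.5), 1.0)))
```

Only cells that touch the query ball are ever pushed, which is where the saving over a full scan comes from when the points are spread out.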

Near neighbor ctd.

Nearest neighbor
Start range search with $r = \infty$
Whenever a point is found, update r
Only investigate nodes with respect to current r
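A sketch of the shrinking-radius idea in Python, assuming a compact 2-D quadtree with dict-based nodes. The node layout and the names `build`, `box_distance` and `nearest_neighbor` are hypothetical, and visiting the closer children first is an extra heuristic that the slide does not state.

```python
import math

def build(points, x0, y0, size, capacity=1, min_size=1e-9):
    """Compact 2-D quadtree; points are assumed to lie in [x0, x0+size) x [y0, y0+size)."""
    if len(points) <= capacity or size <= min_size:
        return {"box": (x0, y0, size), "points": points, "children": []}
    half = size / 2
    children = []
    for cx in (x0, x0 + half):
        for cy in (y0, y0 + half):
            inside = [p for p in points
                      if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            if inside:
                children.append(build(inside, cx, cy, half, capacity, min_size))
    return {"box": (x0, y0, size), "points": [], "children": children}

def box_distance(box, q):
    """Distance from q to the nearest point of the square `box` (0 if q is inside)."""
    x0, y0, size = box
    dx = max(x0 - q[0], 0.0, q[0] - (x0 + size))
    dy = max(y0 - q[1], 0.0, q[1] - (y0 + size))
    return math.hypot(dx, dy)

def nearest_neighbor(root, q):
    """Return (nearest point, distance) using a range search whose radius shrinks."""
    best, r = None, math.inf              # start with r = infinity
    stack = [root]
    while stack:
        node = stack.pop()
        if box_distance(node["box"], q) > r:
            continue                      # nothing in this cell can beat the current r
        for p in node["points"]:          # non-empty only at leaves
            dist = math.dist(p, q)
            if dist < r:
                best, r = p, dist         # a point was found: update r
        # Push the farthest child first so that the closest one is popped first.
        stack.extend(sorted(node["children"],
                            key=lambda c: box_distance(c["box"], q),
                            reverse=True))
    return best, r

pts = [(1.0, 1.0), (2.0, 2.0), (6.0, 6.0), (7.0, 1.0)]
tree = build(pts, 0.0, 0.0, 8.0)
print(nearest_neighbor(tree, (5.5, 5.0)))
```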

Quadtree ctd.
Simple data structure
Versatile, easy to implement
So why doesn't this talk end here?
  Empty spaces: if the points form sparse clouds, it takes a while to reach them
  Space exponential in dimension
  Time exponential in dimension, e.g., points on the hypercube

Space issues: example

K-d-trees [Bentley '75]
Main ideas:
  only one-dimensional splits
  instead of splitting in the middle, choose the split "carefully" (many variations)
  near(est) neighbor queries: as for quadtrees
Advantages:
  no (or less) empty spaces
  only linear space
Exponential query time still possible
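One common way to choose the split "carefully" is to split at the median of one coordinate per level, cycling through the coordinates. The sketch below assumes that rule (the talk mentions many variations and does not pick one), and the names `build_kdtree` and `kd_nearest` are made up for the example.

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree with one-dimensional median splits; one point is stored per node."""
    if not points:
        return None
    axis = depth % len(points[0])             # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                    # median instead of the middle of the cell
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kd_nearest(node, q, best=None, r=math.inf):
    """Return (nearest point, distance), pruning subtrees farther away than the current r."""
    if node is None:
        return best, r
    dist = math.dist(node["point"], q)
    if dist < r:
        best, r = node["point"], dist
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best, r = kd_nearest(near, q, best, r)
    if abs(diff) <= r:                        # the ball around q crosses the splitting plane
        best, r = kd_nearest(far, q, best, r)
    return best, r

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(kd_nearest(tree, (9, 2)))
```

The tree stores every point exactly once, which matches the "only linear space" advantage; the pruning test can still fail on most nodes in high dimensions, which is the exponential query time the slide warns about.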

Exponential query time
What does it mean exactly?
  Unless we do something really stupid, query time is at most dn
  Therefore, the actual query time is min[dn, exponential(d)]
This is still quite bad though, when the dimension is around 20-30
Unfortunately, it seems inevitable (both in theory and practice)
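The dn bound is what a plain linear scan achieves: n distance computations, each costing O(d). A minimal sketch, with the function name `linear_scan` chosen for the example:

```python
import math

def linear_scan(points, q):
    """Exact nearest neighbor by scanning all n points; each distance costs O(d)."""
    best, best_dist = None, math.inf
    for p in points:                  # n iterations
        dist = math.dist(p, q)        # O(d) work per point
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist

pts = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)]
print(linear_scan(pts, (0.9, 1.0, 1.2)))
```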

Approximate nearest neighbor
Can do it using (augmented) k-d trees, by interrupting search earlier [Arya et al.]
Still exponential time (in the worst case)
Try a different approach:
  for exact queries, we can use binary search trees or hashing
  can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani '98]
Hash functions are locality-sensitive if, for a random hash function h, for any pair of points p, q we have:
  Pr[h(p)=h(q)] is "high" if p is "close" to q
  Pr[h(p)=h(q)] is "low" if p is "far" from q

Do such functions exist?
Consider the hypercube, i.e., points from $\{0,1\}^d$
Hamming distance D(p,q) = # positions on which p and q differ
Define hash function h by choosing a set I of k random coordinates, and setting
  h(p) = projection of p on I

Example
Take
  d=10, p=0101110010
  k=2, I={2,5}
Then h(p)=11
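The example can be reproduced in a few lines of Python. The sketch assumes points are given as bit strings and that coordinates are numbered from 1, which is the reading under which I={2,5} gives 11; the helper names `project`, `hamming` and `random_coords` are made up here, and sampling coordinates with replacement is an assumption.

```python
import random

def project(p, coords):
    """h(p): the bits of p at the 1-based positions listed in coords, in order."""
    return "".join(p[i - 1] for i in coords)

def hamming(p, q):
    """Hamming distance D(p, q): number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def random_coords(d, k, seed=None):
    """Pick k random coordinates out of 1..d (assumed here: with replacement)."""
    rng = random.Random(seed)
    return [rng.randint(1, d) for _ in range(k)]

p = "0101110010"
I = [2, 5]
print(project(p, I))
```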

h's are locality-sensitive
$\Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k$
We can vary the probability by changing k
[Figure: two plots, k=1 and k=2, of Pr against distance]
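A short derivation of the collision probability, assuming the k coordinates $i_1, \ldots, i_k$ of I are drawn independently and uniformly at random (with replacement; the slide does not say which sampling is meant). A single random coordinate hits one of the D(p,q) differing positions with probability D(p,q)/d, so it agrees with probability 1 - D(p,q)/d, and the k draws are independent:

```latex
\Pr[h(p)=h(q)]
  = \prod_{j=1}^{k} \Pr\left[p_{i_j} = q_{i_j}\right]
  = \left(1 - \frac{D(p,q)}{d}\right)^{k}
```

For instance, with d=10 and D(p,q)=2 this gives 0.8 for k=1 and 0.64 for k=2, so larger k makes the probability fall off faster with distance.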

How can we use LSH?
Choose several $h_1, \ldots, h_l$
Initialize a hash array for each $h_i$
Store each point p in the bucket $h_i(p)$ of the i-th hash array, $i = 1, \ldots, l$
In order to answer query q:
  for each $i = 1, \ldots, l$, retrieve points in the bucket $h_i(q)$
  return the closest point found
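The indexing and query steps can be sketched as a small Python class over bit strings. The class name `BitSamplingLSH`, the use of dictionaries as the "hash arrays", and the seeded sampling of coordinates with replacement are assumptions for illustration only.

```python
import random
from collections import defaultdict

def hamming(p, q):
    """Number of positions on which the bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))

class BitSamplingLSH:
    """l hash tables, each keyed by the projection of a bit string on k random coordinates."""

    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # One coordinate list I_i per hash function h_i (0-based positions).
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _hash(self, i, p):
        return "".join(p[j] for j in self.coords[i])

    def insert(self, p):
        """Store p in bucket h_i(p) of the i-th table, for every i."""
        for i, table in enumerate(self.tables):
            table[self._hash(i, p)].append(p)

    def query(self, q):
        """Return the closest point among those sharing a bucket with q, or None."""
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._hash(i, q), []))
        return min(candidates, key=lambda p: hamming(p, q), default=None)

index = BitSamplingLSH(d=8, k=3, l=4)
for point in ["00000000", "11111111", "00001111", "10101010"]:
    index.insert(point)
print(index.query("00001110"))
```

A query may return None or a point that is not the true nearest neighbor when no close point shares a bucket with q; that is the price of only inspecting l buckets.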

What does this algorithm do?
By proper choice of parameters k and l, we can make, for any p, the probability that $h_i(p)=h_i(q)$ for some i look like this:
[Figure: probability plotted against distance]
Can control:
  Position of the slope
  How steep it is
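The shape of that curve can be computed directly, assuming the l hash functions are independent and each uses k coordinates sampled with replacement, so that one hash collides with probability $(1 - D/d)^k$. The function name `collision_probability` is made up for this sketch.

```python
def collision_probability(distance, d, k, l):
    """Probability that h_i(p) = h_i(q) for at least one of the l hash functions."""
    p_one = (1 - distance / d) ** k     # a single k-coordinate hash collides
    return 1 - (1 - p_one) ** l         # at least one of l independent hashes collides

# Tabulate the curve for d = 100 with two settings of (k, l).
for k, l in [(5, 5), (20, 50)]:
    row = [round(collision_probability(dist, 100, k, l), 3) for dist in range(0, 101, 10)]
    print(k, l, row)
```

Under these assumptions, raising k pushes the probability down at every distance, and raising l pushes it back up, so the pair (k, l) controls where the drop happens and how sharp it is.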

The LSH algorithm
Therefore, we can solve (approximately) the near neighbor problem with given parameter r
Worst-case analysis guarantees $dn^{1/(1+\epsilon)}$ query time
Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00]
Drawbacks:
  works best for Hamming distance (although it can be generalized to Euclidean space)
  requires radius r to be fixed in advance

Secondary storage
Seek time is the same as the time needed to transfer hundreds of KBs
Grouping the data is crucial
Different approach required:
  in main memory, any reduction in the number of inspected points was good
  on disk, this is not the case!

Disk-based algorithms
R-tree [Guttman '84]
  departing point for many variations
  over 600 citations (according to CiteSeer)
  "optimistic" approach: try to answer queries in logarithmic time
Vector Approximation File [WSB '98]
  "pessimistic" approach: if we need to scan the whole data set, we had better do it fast
LSH also works on disk

R-tree
"Bottom-up" approach (k-d-tree was "top-down"):
  Start with a set of points/rectangles
  Partition the set into groups of small cardinality
  For each group, find the minimum rectangle containing the objects from this group
  Repeat
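The bottom-up construction can be sketched as a simple packing loop in Python. The grouping rule used here (sort by the x-coordinate of the box center and cut into runs of `group_size`) is a stand-in chosen for brevity, not the partitioning from Guttman's paper or from the talk, and the names `mbr` and `pack_rtree` are hypothetical. Rectangles are tuples (xmin, ymin, xmax, ymax); a point is a rectangle with xmin = xmax and ymin = ymax.

```python
def mbr(rects):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a list of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def pack_rtree(rects, group_size=2):
    """Bottom-up packing: group, bound each group, repeat until one root is left."""
    if not rects or group_size < 2:
        raise ValueError("need at least one rectangle and group_size >= 2")
    level = [{"box": r, "children": []} for r in rects]       # leaf entries
    while len(level) > 1:
        # Assumed grouping rule: neighbors in center-x order go into the same group.
        level.sort(key=lambda n: n["box"][0] + n["box"][2])
        groups = [level[i:i + group_size] for i in range(0, len(level), group_size)]
        level = [{"box": mbr([n["box"] for n in g]), "children": g} for g in groups]
    return level[0]

rects = [(0, 0, 1, 1), (2, 2, 3, 3), (10, 10, 11, 11), (12, 0, 13, 1)]
root = pack_rtree(rects)
print(root["box"], [child["box"] for child in root["children"]])
```

On disk, each group would be sized to fill one page, which is what makes the grouping step matter for seek counts.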

R-tree ctd.
Advantages:
  Supports near(est) neighbor search (similar as before)
  Works for points and rectangles
  Avoids empty spaces
  Many variants: X-tree, SS-tree, SR-tree, etc.
  Works well for low dimensions
Not great for high dimensions

VA-file [Weber, Schek, Blott '98]
Approach:
  In high-dimensional spaces, all tree-based indexing structures examine a large fraction of leaves
  If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
  1 seek = transfer of a few hundred KB

VA-file ctd.
Natural question: how to speed up linear scan?
Answer: use approximation
  Use only i bits per dimension (and speed up the scan by a factor of 32/i)
  Identify all points which could be returned as an answer
  Verify the points using original data set
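The filter-and-verify idea can be sketched in Python as follows. This is an illustration under assumptions, not the VA-file implementation from the paper: points are assumed to lie in $[0,1)^d$, each coordinate is quantized on a uniform grid of `bits` bits, the data set is assumed non-empty, and the names `approximate`, `bounds` and `va_nearest` are made up.

```python
import math

def approximate(p, bits):
    """Quantize a point in [0, 1)^d to `bits` bits per dimension (one cell index per coordinate)."""
    cells = 1 << bits
    return tuple(min(int(x * cells), cells - 1) for x in p)

def bounds(cell, q, bits):
    """Lower and upper bounds on the distance from q to any point inside the cell."""
    width = 1.0 / (1 << bits)
    lo2 = hi2 = 0.0
    for c, x in zip(cell, q):
        left, right = c * width, (c + 1) * width
        near = max(left - x, 0.0, x - right)      # 0 if x lies inside the interval
        far = max(abs(x - left), abs(x - right))
        lo2 += near * near
        hi2 += far * far
    return math.sqrt(lo2), math.sqrt(hi2)

def va_nearest(points, approx, q, bits):
    """Scan the approximations, keep possible answers, verify them on the original points."""
    scored = [bounds(cell, q, bits) + (idx,) for idx, cell in enumerate(approx)]
    cutoff = min(ub for _, ub, _ in scored)       # some point is at least this close
    candidates = sorted((lb, idx) for lb, _, idx in scored if lb <= cutoff)
    best, best_dist = None, math.inf
    for lb, idx in candidates:
        if lb > best_dist:
            break                                 # remaining candidates cannot win
        dist = math.dist(points[idx], q)          # the only accesses to the full vectors
        if dist < best_dist:
            best, best_dist = points[idx], dist
    return best, best_dist

pts = [(0.1, 0.2), (0.8, 0.9), (0.45, 0.55), (0.9, 0.1)]
approx = [approximate(p, 2) for p in pts]
print(va_nearest(pts, approx, (0.5, 0.5), 2))
```

In this example only one of the four original points is read in the verification step; the other three are ruled out from their 2-bit approximations alone.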

Time to sum up
"Curse of dimensionality" is indeed a curse
In main memory, we can perform sublinear-time search using trees or hashing
In secondary storage, linear scan
