Spectral Approaches to
Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni,
Ravi Kannan
Les Houches, January 2015
Nearest Neighbor Search (NNS)

Preprocess: a set P of n points in ℝ^d

Query: given a query point q, report a point p* ∈ P with the smallest distance to q
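As a concrete baseline (not from the original slides), here is a minimal linear-scan implementation of exact NNS in NumPy; it corresponds to the “no indexing” row of the table a few slides below.

```python
import numpy as np

def nearest_neighbor(P, q):
    """Exact NNS by brute-force linear scan: O(n*d) time per query.

    P: (n, d) array of data points, q: (d,) query point.
    Returns the point p* of P closest to q in Euclidean distance.
    """
    dists = np.linalg.norm(P - q, axis=1)    # distances from q to all n points
    return P[np.argmin(dists)]

# toy usage
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 64))          # n = 1000 points in d = 64 dimensions
q = rng.standard_normal(64)
p_star = nearest_neighbor(P, q)
```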

Motivation

Generic setup:
• Points model objects (e.g. images)
• Distance models (dis)similarity measure

Application areas:
• machine learning: k-NN rule
• signal processing, vector quantization, bioinformatics, etc.

Distance can be:
• Hamming, Euclidean, edit distance, earth-mover distance, …

[Figure: pairs of binary strings illustrating Hamming distance; the closest one is marked p*]

Curse of Dimensionality

All exact algorithms degrade rapidly with the dimension d

Algorithm | Query time | Space
Full indexing | O(log n ⋅ d) | n^{O(d)} (Voronoi diagram size)
No indexing (linear scan) | O(n ⋅ d) | O(n ⋅ d)
Approximate NNS

Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c ⋅ min_{p∈P} ||p − q||

• c ≥ 1: approximation factor
• randomized: return such a p′ with probability ≥ 90%
• Heuristic perspective: gives a set of candidates (hopefully small)
NNS algorithms
It’s all about space partitions!

Low-dimensional:
[Arya-Mount’93], [Clarkson’94], [Arya-Mount-Netanyahu-Silverman-Wu’98], [Kleinberg’97], [Har-Peled’02], [Arya-Fonseca-Mount’11], …

High-dimensional:
[Indyk-Motwani’98], [Kushilevitz-Ostrovsky-Rabani’98], [Indyk’98, ’01], [Gionis-Indyk-Motwani’99], [Charikar’02], [Datar-Immorlica-Indyk-Mirrokni’04], [Chakrabarti-Regev’04], [Panigrahy’06], [Ailon-Chazelle’06], [Andoni-Indyk’06], [Andoni-Indyk-Nguyen-Razenshteyn’14], [Andoni-Razenshteyn’15]
Low-dimensional
kd-trees,…
c = 1 + ε
runtime: ε^{-O(d)} ⋅ log n
High-dimensional
Locality-Sensitive Hashing
Crucial use of random projections



• Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(ε^{-2} log n) for 1 + ε approximation
• Runtime: n^{1/c} for c-approximation
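A minimal sketch (not from the slides) of the random-projection step behind the JL lemma: reduce to m = O(ε^{-2} log n) dimensions with a Gaussian map and run any NNS method in the reduced space. The constant 8 below is an arbitrary illustrative choice, and the LSH machinery achieving n^{1/c} query time is not shown.

```python
import numpy as np

def jl_project(P, eps, rng):
    """Gaussian random projection in the spirit of Johnson-Lindenstrauss.

    Maps the (n, d) data matrix P to m = O(eps^-2 log n) dimensions;
    pairwise distances are preserved up to a factor 1+eps w.h.p.
    """
    n, d = P.shape
    m = int(np.ceil(8 * np.log(n) / eps**2))      # target dimension (constant is illustrative)
    G = rng.standard_normal((d, m)) / np.sqrt(m)  # scaling keeps norms roughly unchanged
    return P @ G, G

rng = np.random.default_rng(1)
P = rng.standard_normal((2000, 512))
P_low, G = jl_project(P, eps=0.5, rng=rng)
q_low = rng.standard_normal(512) @ G              # queries must use the same projection G
```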
Practice
Data-aware partitions





• optimize the partition to your dataset
• PCA-tree [Sproull’91, McNames’01, Verma-Kpotufe-Dasgupta’09]
• randomized kd-trees [Silpa-Anan-Hartley’08, Muja-Lowe’09]
• spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus’08, Wang-Kumar-Chang’09, Salakhutdinov-Hinton’09, Yagnik-Strelow-Ross-Lin’11]
Practice vs Theory

• Data-aware projections often outperform (vanilla) random-projection methods
• But no guarantees (correctness or performance)
• JL generally optimal [Alon’03, Jayram-Woodruff’11]
  • Even for some NNS setups! [Andoni-Indyk-Patrascu’06]

Why do data-aware projections outperform random projections?
Algorithmic framework to study the phenomenon?
Plan for the rest



• Model
• Two spectral algorithms
• Conclusion
Our model

• “low-dimensional signal + large noise”
  • inside a high-dimensional space
• Signal: P ⊂ U for a subspace U ⊂ ℝ^d of dimension k ≪ d
• Data: each point is perturbed by full-dimensional Gaussian noise N(0, σ² I_d)
Model properties

• Data: p̃ = p + t
• Query: q̃ = q + t_q, s.t.:
  • points in P have at least unit norm
  • ||q − p*|| ≤ 1 for the “nearest neighbor” p*
  • ||q − p|| ≥ 1 + ε for everybody else
• Noise entries N(0, σ²), with σ ≈ 1/d^{1/4} (up to a factor poly(ε^{-1} log n))
• Claim: the exact nearest neighbor is still the same
• Noise is large:
  • it has magnitude σ√d ≈ d^{1/4} ≫ 1
  • the top k dimensions of the noise capture only sub-constant mass
  • JL would not work: after the noise, the gap is very close to 1
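To make the model concrete, here is a small (non-authoritative) generator for “low-dimensional signal + large noise” data: points with at least unit norm inside a random k-dimensional subspace U of ℝ^d, perturbed by Gaussian noise with σ = d^{-1/4} (poly-logarithmic factors ignored).

```python
import numpy as np

def sample_model(n, d, k, sigma=None, rng=None):
    """Signal in a random k-dim subspace U of R^d plus N(0, sigma^2 I_d) noise.

    Returns (P_tilde, P, U): noisy points, clean signal points, orthonormal basis of U.
    """
    rng = rng or np.random.default_rng(0)
    sigma = d ** -0.25 if sigma is None else sigma     # noise level ~ 1/d^{1/4}
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random signal subspace
    coords = rng.standard_normal((n, k))
    coords /= np.linalg.norm(coords, axis=1, keepdims=True)
    coords *= 1.0 + rng.random((n, 1))                 # norms in [1, 2): at least unit norm
    P = coords @ U.T                                   # clean points lying in U
    T = sigma * rng.standard_normal((n, d))            # noise of total magnitude ~ d^{1/4}
    return P + T, P, U
```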
Algorithms via PCA

• Find the “signal subspace” U?
  • then we can project everything to U and solve NNS there
• Use Principal Component Analysis (PCA)?
  • ≈ extract top direction(s) from the SVD
  • e.g., the k-dimensional space S that minimizes Σ_{p∈P} d²(p, S)
• If PCA removes the noise “perfectly”, we are done:
  • S = U
  • Can reduce to k-dimensional NNS
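A sketch (mine, not the paper’s code) of the vanilla PCA step above: take S to be the span of the top-k singular directions of the data matrix and project everything onto it; the next slide explains why this naive version can fail under noise.

```python
import numpy as np

def top_k_subspace(X, k):
    """S = span of the top-k right singular vectors of X: the k-dim subspace
    (through the origin) minimizing the sum of squared distances d^2(p, S)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T                      # (d, k) orthonormal basis of S

def project(X, S):
    """Coordinates of the rows of X in the subspace spanned by S's columns."""
    return X @ S

# usage with the generator sketched earlier:
# P_tilde, P, U = sample_model(n=5000, d=400, k=5)
# S = top_k_subspace(P_tilde, k=5)
# low-dimensional NNS would then be run on project(P_tilde, S)
```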
Best we can hope for

• NNS performance as if we are in k dimensions, for the full model?
  • the dataset contains a “worst-case” k-dimensional instance
• Reduction from dimension d to k
• Spoiler: Yes
PCA under noise fails


• Does PCA find the “signal subspace” U under noise?
• No!
• PCA minimizes Σ_{p∈P} d²(p, S)
  • good only on “average”, not “worst-case”
  • weak signal directions are overpowered by noise directions
  • a typical noise direction contributes Σ_{j=1}^{n} ⟨t_j, v⟩² ≈ n σ² = Θ(n σ²)
1st Algorithm: intuition

• Extract “well-captured points”
  • points whose signal is mostly inside the top PCA space
  • should work for a large fraction of points
• Iterate on the rest
Iterative PCA
• Find top PCA subspace S
• C = points well-captured by S
• Build NNS d.s. on {C projected onto S}
• Iterate on the remaining points, P ∖ C
• Query: query each NNS d.s. separately

To make this work:
• Nearly no noise in S: ensuring S is close to U
  • S determined by heavy-enough spectral directions (dimension may be less than k)
• Capture only points whose signal is fully in S
  • well-captured: distance to S explained by noise only

(A code sketch of this loop appears below.)
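A simplified sketch of the iterative PCA loop just described; the capture threshold and the “heavy-enough direction” cutoff below are illustrative stand-ins for the parameters used in the paper.

```python
import numpy as np

def iterative_pca(P_tilde, k, capture_dist, max_iters=100):
    """Iterative PCA (simplified): repeatedly take a top-(<= k) PCA subspace,
    peel off the points that are well-captured by it, and recurse on the rest.

    Returns a list of buckets (S, indices, low-dim points); a query is answered
    by projecting onto each S and doing low-dimensional NNS in each bucket.
    """
    remaining = np.arange(len(P_tilde))
    buckets = []
    for _ in range(max_iters):
        if len(remaining) == 0:
            break
        X = P_tilde[remaining]
        _, s, Vt = np.linalg.svd(X, full_matrices=False)
        # keep only heavy-enough spectral directions, at most k of them
        # (the n/(2k) threshold is an illustrative choice, not the paper's constant)
        heavy = s[:k] ** 2 >= len(X) / (2 * k)
        S = Vt[:k][heavy].T
        resid = np.linalg.norm(X - (X @ S) @ S.T, axis=1)   # distance to S
        captured = resid <= capture_dist                    # "explained by noise only"
        if not captured.any():
            captured[:] = True                              # fallback: avoid an empty iteration
        buckets.append((S, remaining[captured], X[captured] @ S))
        remaining = remaining[~captured]
    return buckets
```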
Simpler model

• Assume: small noise
  • p̃ = p + α_p, where ||α_p|| ≪ ε
  • can be even adversarial
• Algorithm:
  • Find top-k PCA subspace S
  • C = points well-captured by S
    • well-captured if d(p̃, S) ≤ 2ε
  • Build NNS on {C projected onto S}
  • Iterate on remaining points, P ∖ C
  • Query: query each NNS separately
• Claim 1: if p̃* is captured by C, we will find it in the NNS
  • for any captured p̃: ||q̃ − p̃_S|| = ||q̃ − p̃|| ± 4ε = ||q − p|| ± 5ε, where p̃_S is the projection of p̃ onto S
• Claim 2: number of iterations is O(log n)
  • Σ_{p̃∈P} d²(p̃, S) ≤ Σ_{p̃∈P} d²(p̃, U) ≤ n ⋅ ε²
  • so for at most a 1/4-fraction of points, d²(p̃, S) ≥ 4ε²
  • hence a constant fraction of points is captured in each iteration
Analysis of general model



• Need to use randomness of the noise
• Want to say that the “signal” is stronger than the “noise” (on average)
• Use random matrix theory
  • P̃ = P + T
  • T is a random n × d matrix with entries N(0, σ²)
    • all of its squared singular values are ≤ σ²n ≈ n/√d
  • P has rank ≤ k and (Frobenius norm)² ≥ n
    • important directions have squared singular value ≥ Ω(n/k)
    • can ignore directions with squared singular value ≪ n/k
  • Important signal directions are stronger than the noise!
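A quick numerical sanity check (not from the slides) of these scales: the top squared singular value of an n × d Gaussian noise matrix is about σ²(√n + √d)², which for n ≫ d is ≈ σ²n = n/√d, well below the Ω(n/k) mass of the important signal directions when k ≪ √d.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 20000, 400, 3
sigma = d ** -0.25                           # model noise level ~ d^{-1/4}

T = sigma * rng.standard_normal((n, d))      # pure noise matrix
top_noise_sq = np.linalg.norm(T, 2) ** 2     # largest squared singular value of T

print("noise: top singular value^2              ", round(top_noise_sq))
print("random-matrix estimate sigma^2(sqrt n + sqrt d)^2",
      round(sigma**2 * (np.sqrt(n) + np.sqrt(d)) ** 2))
print("slide's scale sigma^2 * n = n/sqrt(d)    ", round(sigma**2 * n))
print("important-signal scale n/k               ", n // k)
```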
Closeness of subspaces?

• Trickier than singular values
  • The top singular vector is not stable under perturbation!
  • Only stable if the second singular value is much smaller
• How to even define “closeness” of subspaces?
• To the rescue: Wedin’s sin-theta theorem

  sin θ(S, U) = max_{x∈S, ||x||=1} min_{y∈U} ||x − y||
Wedin’s sin-theta theorem

• Developed by [Davis-Kahan’70], [Wedin’72]
• Theorem:
  • Consider P̃ = P + T
  • S is the top-k subspace of P̃
  • U is the k-space containing P
  • Then: sin θ(S, U) ≤ ||T|| / σ_k(P)
• Another way to see why we need to take directions with sufficiently heavy singular values
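Not part of the slides: a small numerical illustration of this quantity, computing sin θ(S, U) directly from orthonormal bases and comparing it to a Wedin-style bound ||T|| / σ_k(P); the exact denominator and constants vary across statements of the theorem, so treat the comparison as indicative only.

```python
import numpy as np

def sin_theta(S, U):
    """sin of the largest principal angle between the column spans of S and U
    (both assumed to have orthonormal columns): max over unit x in span(S)
    of the distance from x to span(U)."""
    d = S.shape[0]
    return np.linalg.norm((np.eye(d) - U @ U.T) @ S, 2)

# an instance of the model: clean points P inside a k-dim subspace U, noisy P_tilde
rng = np.random.default_rng(3)
n, d, k = 5000, 200, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = rng.standard_normal((n, k)) @ U.T
T = (d ** -0.25) * rng.standard_normal((n, d))
P_tilde = P + T

_, _, Vt = np.linalg.svd(P_tilde, full_matrices=False)
S = Vt[:k].T                                  # top-k subspace of the noisy data

bound = np.linalg.norm(T, 2) / np.linalg.svd(P, compute_uv=False)[k - 1]
print("sin theta(S, U) =", sin_theta(S, U), "  Wedin-style bound =", bound)
```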
Additional issue: Conditioning

• After an iteration, the noise is not random anymore!
  • non-captured points might be “biased” by the capturing criterion
• Fix: estimate the top PCA subspace from a small sample of the data
• Might be needed purely for the analysis
  • But it does not sound like a bad idea in practice either
Performance of Iterative PCA

• Can prove there are O(√d ⋅ log n) iterations
• In each, we have NNS in a ≤ k-dimensional space
• Overall query time: ε^{-O(k)} ⋅ √d ⋅ log^{3/2} n
• Reduced to O(√d ⋅ log n) instances of k-dimensional NNS!
2nd Algorithm: PCA-tree
Closer to algorithms used in practice

• Find top PCA direction v
• Partition into slabs ⊥ v
• Snap points to the ⊥ hyperplanes
• Recurse on each slab
(slab width ≈ ε/√k)
Query:
• follow all tree paths that may contain p*
2 algorithmic modifications
• Find top PCA direction v
• Partition into slabs ⊥ v
• Snap points to the ⊥ hyperplanes
• Recurse on each slab

Query:
• follow all tree paths that may contain p*

• Centering:
  • Need to use centered PCA (subtract the average)
  • Otherwise errors from perturbations accumulate
• Sparsification:
  • Need to sparsify the set of points in each node of the tree
  • Otherwise we can get a “dense” cluster:
    • not enough variance in the signal
    • lots of noise

(A simplified code sketch of the resulting tree follows below.)
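A simplified sketch of the PCA-tree of the last two slides, with the centering modification built in; the slab width, stopping rule, and the candidate-slab rule at query time are illustrative choices, and the sparsification step is omitted.

```python
import numpy as np

class PCATreeNode:
    """PCA-tree sketch: split on the top *centered* PCA direction into slabs,
    snap points onto the hyperplane through the mean, and recurse."""

    def __init__(self, points, idx, width, min_size=32):
        self.idx = idx
        self.children = {}
        self.direction = None
        self.mean = points.mean(axis=0)                  # centering (subtract average)
        if len(points) <= min_size:
            return                                       # leaf: keep indices for a final scan
        X = points - self.mean
        _, svals, Vt = np.linalg.svd(X, full_matrices=False)
        if svals[0] < 1e-9:                              # no variance left: stop
            return
        self.direction = Vt[0]                           # top centered PCA direction
        proj = X @ self.direction
        slabs = np.floor(proj / width).astype(int)       # partition into slabs orthogonal to it
        snapped = points - np.outer(proj, self.direction)  # snap onto the hyperplane
        for s in np.unique(slabs):
            m = slabs == s
            self.children[s] = PCATreeNode(snapped[m], idx[m], width, min_size)

    def query(self, q, width, out):
        """Collect candidate indices by following every slab that may contain p*."""
        if self.direction is None:
            out.update(self.idx.tolist())
            return
        t = float((q - self.mean) @ self.direction)
        q_snapped = q - t * self.direction
        s0 = int(np.floor(t / width))
        for s in (s0 - 1, s0, s0 + 1):                   # neighboring slabs may also hold p*
            if s in self.children:
                self.children[s].query(q_snapped, width, out)

# usage sketch:
# tree = PCATreeNode(P_tilde, np.arange(len(P_tilde)), width=0.25)
# cand = set(); tree.query(q_tilde, 0.25, cand)   # then check true distances on the candidates
```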
Analysis

• An “extreme” version of the Iterative PCA algorithm:
  • just use the top PCA direction: it is guaranteed to have signal!
• Main lemma: the tree depth is ≤ 2k
  • because each discovered direction is close to U
  • snapping: like orthogonalizing with respect to each one
  • cannot have too many such directions
• Query runtime: (k/ε)^{O(k)}
• Overall performs like O(k ⋅ log k)-dimensional NNS!
Wrap-up
Why do data-aware projections outperform random projections?
Algorithmic framework to study the phenomenon?

• Here:
  • Model: “low-dimensional signal + large noise”
  • like NNS in a low-dimensional space
  • via the “right” adaptation of PCA
• Immediate questions:
  • Other, less-structured signal/noise models?
  • Algorithms with runtime dependent on the spectrum?
• Broader question: analysis that explains empirical success?
