
Spectral Approaches to Nearest Neighbor Search
arXiv:1408.0751
Robert Krauthgamer (Weizmann Institute)
Joint with: Amirali Abdullah, Alexandr Andoni, Ravi Kannan
Les Houches, January 2015

Nearest Neighbor Search (NNS)
• Preprocess: a set P of n points in ℝ^d
• Query: given a query point q, report a point p* ∈ P with the smallest distance to q

Motivation
• Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure
• Application areas: machine learning (k-NN rule), signal processing, vector quantization, bioinformatics, etc.
• Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, …
• (figure: bit strings illustrating nearest neighbors under Hamming distance)

Curse of Dimensionality
• All exact algorithms degrade rapidly with the dimension d

  Algorithm                    Query time     Space
  Full indexing                O(d · log n)   n^O(d)  (Voronoi diagram size)
  No indexing (linear scan)    O(n · d)       O(n · d)

Approximate NNS
• Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c · min_{p∈P} ||p − q||
• c ≥ 1: the approximation factor
• Randomized: return such a p′ with probability ≥ 90%
• Heuristic perspective: gives a set of candidates (hopefully small)

NNS algorithms
• It's all about space partitions!
• Low-dimensional: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
• High-dimensional: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

Low-dimensional
• kd-trees, …
• c = 1 + ε
• Runtime: (1/ε)^O(d) · log n

High-dimensional
• Locality-Sensitive Hashing: crucial use of random projections
• Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(ε^−2 · log n) for a 1 + ε approximation
• Runtime: n^(1/c) for c-approximation

Practice
• Data-aware partitions: optimize the partition to your dataset
• PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
• Randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
• Spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]

Practice vs Theory
• Data-aware projections often outperform (vanilla) random-projection methods
• But there are no guarantees (correctness or performance)
• JL is generally optimal [Alon'03, Jayram-Woodruff'11], even for some NNS setups! [Andoni-Indyk-Patrascu'06]
• Why do data-aware projections outperform random projections?
• Is there an algorithmic framework to study this phenomenon?

Plan for the rest
• Model
• Two spectral algorithms
• Conclusion

Our model
• "Low-dimensional signal + large noise" inside a high-dimensional space
• Signal: P ⊂ U for a subspace U ⊂ ℝ^d of dimension k ≪ d
• Data: each point is perturbed by full-dimensional Gaussian noise N(0, σ²·I_d)

Model properties
• Data: p̃ = p + t, query: q̃ = q + t_q, such that:
  – points in P have at least unit norm
  – ||q − p*|| ≤ 1 for the "nearest neighbor" p*
  – ||q − p|| ≥ 1 + ε for everybody else
  – noise entries are N(0, σ²) with σ ≈ 1/d^(1/4), up to a factor poly(ε^−1 log n)
• Claim: the exact nearest neighbor is still the same after the noise
• The noise is large: it has magnitude σ·√d ≈ d^(1/4) ≫ 1, yet its top k dimensions capture only a sub-constant mass
• JL would not work: after the noise, the gap is very close to 1
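To make the model concrete, here is a minimal NumPy sketch (not from the talk; all parameter values are illustrative choices) that plants a "low-dimensional signal + large noise" instance and reports the two effects above: the noise magnitude σ·√d exceeds the signal scale 1, the exact nearest neighbor typically survives, yet the multiplicative distance gap collapses toward 1, which is why JL-style guarantees become useless here.

    import numpy as np

    rng = np.random.default_rng(0)
    # Illustrative parameters (not from the talk): ambient dimension d, signal
    # dimension k, number of points n, separation eps, and noise level
    # sigma ~ 1/d^{1/4}, shaved by a small polylog factor as the model allows.
    n, d, k, eps = 1000, 2000, 5, 1.0
    sigma = d ** -0.25 / np.sqrt(np.log(n))

    # Random k-dimensional signal subspace U of R^d (orthonormal columns).
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))

    def unit_dirs(m):
        v = rng.standard_normal((m, k))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    # Planted instance inside U: a query q, its nearest neighbor at distance 0.9,
    # and n-1 other points at distance >= 1+eps from q.
    q_sig = unit_dirs(1)[0]
    near = q_sig + 0.9 * unit_dirs(1)[0]
    far = q_sig + unit_dirs(n - 1) * rng.uniform(1 + eps, 3, (n - 1, 1))
    P = np.vstack([near, far]) @ U.T          # n x d data matrix; row 0 is p*
    q = q_sig @ U.T

    # Perturb every point (and the query) by full-dimensional noise N(0, sigma^2 I_d).
    P_noisy = P + sigma * rng.standard_normal((n, d))
    q_noisy = q + sigma * rng.standard_normal(d)

    dist0 = np.linalg.norm(P - q, axis=1)
    dist1 = np.linalg.norm(P_noisy - q_noisy, axis=1)
    print("noise magnitude sigma*sqrt(d):", round(sigma * np.sqrt(d), 2), "(signal scale is 1)")
    print("nearest neighbor index before / after noise:", dist0.argmin(), "/", dist1.argmin())
    print("gap 2nd-nearest/nearest before:", round(np.partition(dist0, 1)[1] / dist0.min(), 2),
          " after:", round(np.partition(dist1, 1)[1] / dist1.min(), 2))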
Algorithms via PCA
• Find the "signal subspace" U? Use Principal Component Analysis (PCA)?
  – ≈ extract the top direction(s) from the SVD
  – e.g., the k-dimensional space U that minimizes Σ_{p∈P} dist²(p, U)
• Then we can project everything to U and solve NNS there
• If PCA removes the noise "perfectly", we are done: this reduces to k-dimensional NNS
• Can we get NNS performance as if we were in k dimensions, for the full model?
  – Best we can hope for: the dataset contains a "worst-case" k-dimensional instance
  – i.e., a reduction from dimension d to dimension k
  – Spoiler: yes

PCA under noise fails
• Does PCA find the "signal subspace" U under noise? No
• PCA minimizes Σ_{p∈P} dist²(p, U): good only on "average", not in the "worst case"
• Weak signal directions are overpowered by noise directions
• A typical noise direction contributes Σ_{i=1}^{n} t_{i,j}² = Θ(n·σ²)

1st Algorithm: intuition
• Extract "well-captured points": points whose signal lies mostly inside the top PCA space
• This should work for a large fraction of the points
• Iterate on the rest

Iterative PCA
• Find the top PCA subspace U
• C = points well-captured by U
• Build an NNS data structure on {C projected onto U}
• Iterate on the remaining points, P ∖ C
• Query: query each NNS data structure separately
To make this work:
• Nearly no noise in U (ensuring U is close to the signal subspace): U is determined by heavy-enough spectral directions (its dimension may be less than k)
• Capture only points whose signal is fully inside U; well-captured: the distance to U is explained by the noise alone

Simpler model
• Assume small noise: p̃ = p + t, where ||t|| ≪ ε (the noise can even be adversarial)
• Algorithm:
  – Find the top-k PCA subspace U
  – C = points well-captured by U, i.e., d(p̃, U) ≤ 2ε
  – Build NNS on {C projected onto U}
  – Iterate on the remaining points, P ∖ C
  – Query: query each NNS separately
• Claim 1: if p̃* is captured by C, we will find it in that NNS
  – for any captured p̃: ||q − proj_U(p̃)|| = ||q − p̃|| ± 4ε = ||q − p|| ± 5ε
• Claim 2: the number of iterations is O(log n)
  – Σ_{p∈P} d²(p̃, U) ≤ Σ_{p∈P} d²(p̃, U_signal) ≤ n·ε²
  – so for at most a 1/4-fraction of the points, d²(p̃, U) ≥ 4ε²
  – hence a constant fraction of the points is captured in each iteration

Analysis of the general model
• Need to use the randomness of the noise: want to say that the "signal" is stronger than the "noise" (on average)
• Use random matrix theory: P̃ = P + T
  – T is a random n × d matrix with entries N(0, σ²)
  – P has rank ≤ k and (Frobenius norm)² ≥ n
• All singular values of T satisfy σ_T² ≤ σ²·n ≈ n/√d
• Important directions have σ² ≥ Ω(n/k); directions with σ² ≪ n/k can be ignored
• Important signal directions are stronger than the noise!

Closeness of subspaces?
• Trickier than singular values: the top singular vector is not stable under perturbation!
  – it is only stable if the second singular value is much smaller
• How to even define "closeness" of subspaces?
• To the rescue: Wedin's sin-theta theorem
  – sin θ(A, B) = max_{a∈A, ||a||=1} min_{b∈B} ||a − b||

Wedin's sin-theta theorem
• Developed by [Davis-Kahan'70], [Wedin'72]
• Theorem: consider M = M̂ + T, let U be the top-k subspace of M, and let S be the k-dimensional space containing (the rows of) M̂; then sin θ(U, S) ≤ ||T|| / σ_k(M)
• Another way to see why we need to take only directions with sufficiently heavy singular values

Additional issue: conditioning
• After an iteration, the noise is not random anymore! Non-captured points might be "biased" by the capturing criterion
• Fix: estimate the top PCA subspace from a small sample of the data
• This might be needed purely for the analysis, but it does not sound like a bad idea in practice either

Performance of Iterative PCA
• Can prove there are O(√d · log n) iterations
• In each iteration, we do NNS in a space of dimension ≤ k
• Overall query time: essentially that of O(√d · log n) instances of k-dimensional NNS!
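The following is a minimal NumPy sketch of the Iterative PCA loop described above, written for the simpler (small, possibly adversarial noise) setting. The function names and the use of a plain array with a linear scan in place of a genuine k-dimensional NNS data structure are my own stand-ins, not the paper's.

    import numpy as np

    def build_iterative_pca(P, k, eps):
        # One "round" per iteration: the top-k PCA subspace U of the points that
        # remain, the points captured by U, and their projections onto U.
        # A real implementation would build a k-dimensional NNS structure on each
        # projected set; here a plain array (searched by linear scan) stands in.
        rounds, rest = [], P
        while len(rest) > 0:
            _, _, Vt = np.linalg.svd(rest, full_matrices=False)
            U = Vt[:k].T                                    # d x k orthonormal basis
            coords = rest @ U                               # coordinates inside U
            resid = np.linalg.norm(rest - coords @ U.T, axis=1)
            captured = resid <= 2 * eps                     # "well-captured" points
            if not captured.any():                          # safeguard against an empty round
                captured[:] = True
            rounds.append((U, rest[captured], coords[captured]))
            rest = rest[~captured]
        return rounds

    def query_iterative_pca(rounds, q):
        # Query each round's structure separately; keep the best candidate overall.
        best, best_dist = None, np.inf
        for U, pts, coords in rounds:
            cand = pts[np.linalg.norm(coords - q @ U, axis=1).argmin()]
            dist = np.linalg.norm(cand - q)
            if dist < best_dist:
                best, best_dist = cand, dist
        return best, best_dist

In the talk's full algorithm, U would keep only the heavy-enough singular directions (so its dimension can be below k) and would be estimated from a small sample to sidestep the conditioning issue; both refinements are omitted in this sketch.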
2nd Algorithm: PCA-tree
• Closer to the algorithms used in practice (see the code sketch at the end)
• Find the top PCA direction
• Partition into slabs orthogonal to it, of width ≈ ε/√k
• Snap the points to the slabs' hyperplanes
• Recurse on each slab
• Query: follow all tree paths that may contain p*

Two algorithmic modifications
• Centering: need to use centered PCA (subtract the average); otherwise errors from the perturbations accumulate
• Sparsification: need to sparsify the set of points in each node of the tree; otherwise we can get a "dense" cluster, with not enough variance in the signal and lots of noise

Analysis
• An "extreme" version of Iterative PCA: just use the top PCA direction, which is guaranteed to contain signal!
• Main lemma: the tree depth is ≤ 2k
  – each discovered direction is close to the signal subspace
  – snapping acts like orthogonalizing with respect to each such direction, and one cannot have too many of them
• Query runtime: overall, it performs like an O(k · log k)-dimensional NNS!

Wrap-up
• Why do data-aware projections outperform random projections? Is there an algorithmic framework to study this phenomenon?
• Here: a model of "low-dimensional signal + large noise"; NNS becomes like NNS in a low-dimensional space, via the "right" adaptation of PCA
• Immediate questions: other, less-structured signal/noise models? Algorithms with runtime dependent on the spectrum?
• Broader question: an analysis that explains the empirical success?
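Finally, a minimal NumPy sketch of the PCA-tree partitioning from the second algorithm above: a centered top PCA direction per node, slabs of width ε/√k, snapping onto each slab's hyperplane, and a query that follows every slab the query ball can touch. The class name, the leaf-size and depth cutoffs, and all parameter values in the usage example are illustrative choices of mine; the sparsification step from the talk is omitted.

    import numpy as np

    class PCATree:
        def __init__(self, points, idx=None, depth=0, eps=0.5, k=5, leaf_size=20):
            self.idx = np.arange(len(points)) if idx is None else idx
            self.children, self.direction = {}, None
            if len(points) <= leaf_size or depth >= 2 * k:   # stop at small nodes / bounded depth
                return
            self.mean = points.mean(axis=0)
            # Centered PCA: top principal direction of this node's points.
            _, _, Vt = np.linalg.svd(points - self.mean, full_matrices=False)
            self.direction, self.width = Vt[0], eps / np.sqrt(k)
            t = (points - self.mean) @ self.direction        # coordinate along that direction
            slabs = np.floor(t / self.width).astype(int)
            for s in np.unique(slabs):
                m = slabs == s
                # Snap the slab's points onto its central hyperplane, then recurse.
                snapped = points[m] - np.outer(t[m] - (s + 0.5) * self.width, self.direction)
                self.children[s] = PCATree(snapped, self.idx[m], depth + 1, eps, k, leaf_size)

        def candidates(self, q, radius):
            # Follow every slab that the ball B(q, radius) can intersect.
            if self.direction is None:
                return self.idx
            tq = (q - self.mean) @ self.direction
            lo, hi = int(np.floor((tq - radius) / self.width)), int(np.floor((tq + radius) / self.width))
            hit = [self.children[s].candidates(q, radius) for s in range(lo, hi + 1) if s in self.children]
            return np.concatenate(hit) if hit else self.idx[:0]

    # Tiny usage example on a signal-plus-small-noise point set (illustrative numbers).
    rng = np.random.default_rng(1)
    n, d, k = 500, 100, 5
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    P = (3 * rng.standard_normal((n, k))) @ U.T + 0.02 * rng.standard_normal((n, d))
    q = P[0] + 0.03 * rng.standard_normal(d)
    tree = PCATree(P, eps=0.5, k=k)
    cand = tree.candidates(q, radius=0.5)
    print("candidates scanned:", len(cand), "of", n, "| planted neighbor among them:", 0 in cand)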