
Random Grids
Presented by Yonatan Glassner
A work of: Dror Aiger, Efi Kokiopoulou & Ehud Rivlin

What's for today?
• Problem & Motivation
• Previous work
• Current algorithm
• Results
• Discussion

The NN problem
Given a set P of points in R^d:
• Nearest neighbor – for any query q, return a point p ∈ P minimizing ‖p − q‖
[Figure: a query point q among the points of P, with its nearest neighbor highlighted]

Motivation 1 – Image similarity
[Figure: two images, each described by a set of local descriptors; deciding image similarity reduces to nearest-neighbor queries between descriptors]

Motivation 2 – Suggestion algorithms
• Each item is a point x ∈ R^d
• Large number of dimensions (d) and of examples (N)

KNN performance is key in these domains.

Short history lesson – KNN computation

Exact algorithms
• 1960s – k-nearest-neighbors classification
• 1975, Bentley – k-d tree search
What is the complexity? In practice:
• k-d trees work "well" in "low–medium" dimensions
• Near-linear query time in high dimensions
What can we do? (see next slide)

Problem formulation
• Approximate near neighbor: given a radius r and an approximation factor c, if some point of P lies within distance r of the query q, return a point within distance c·r
[Figure: the query q with circles of radius r and c·r around it]

Approximate algorithms
• 1998, Indyk & Motwani – "Towards Removing the Curse of Dimensionality" – LSH
• 1998, Arya et al. – ANN – BBD trees
• 2006, David Nister & Henrik Stewenius – "Scalable Recognition with a Vocabulary Tree" – k-means
• 2009, Marius Muja & David G. Lowe – FLANN

Complexity summary (partial)
[Table: preprocessing time and space vs. query time for exhaustive search, LSH, vocabulary tree, and ANN]

And back to present…
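The baseline in the comparison above, exhaustive search, answers a query with a single linear scan over all n points, costing O(n·d) time per query. A minimal sketch (the function name is illustrative, not from the talk):

```python
import math

def nearest_neighbor(points, q):
    """Exhaustive (exact) nearest-neighbor search: O(n*d) per query."""
    def dist(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    return min(points, key=lambda p: dist(p, q))

P = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]
print(nearest_neighbor(P, (0.9, 0.8)))  # → (1.0, 1.0)
```

All of the approximate methods that follow exist to beat this linear scan when n and d are both large.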
Motivation revisited
• We want to avoid an exponential dependency on the dimension
• On query, we want to avoid a dependency on the dataset size
• Our solution: Random Grids

Theorem (the only one in the PPT)
• If p and q are two points at distance at most 1 in d-dimensional Euclidean space, and we impose a randomly rotated and shifted grid of cell size w on this space, then the probability of capturing both p and q in the same cell is at least e^(−c·√d/w), for a constant c, for sufficiently large w.

Intuition
[Figures: the same pair p, q under several random grids – in some of them both points fall into a single cell]

Basic algorithm structure
Preprocessing:
• Set the parameters (cell size w, number of grids m, candidate budget k)
• Create m copies of the point set P, each randomly rotated and shifted
• Index the points of each copy by their grid cell, using a hash table
On query(q):
• Rotate and shift q m times and look it up in the corresponding hash tables
• From all the points found, check k randomly chosen points and return the nearest neighbor among them

Performance
[Preprocessing time/space and query-time bounds]

Practical algorithm
• For a specific dataset – set the desired precision
• Learn w, m, k and build the data structure
• Upon query – a Map-Reduce method

Experimental settings
• Data: 1M SURF descriptors (dim = 64), extracted from 4,000 images
• Fair comparison – auto-tuning to get the best results; a fixed target precision for all algorithms
• Metrics
  – Runtime is computed over multiple queries
  – Accuracy – see next slide

Accuracy metric
[Figure: the points reported within radius R of the query vs. the true near neighbors; accuracy is the fraction of the true neighbors that are reported]

Results – runtime (RRS)
Index dataset = query set. Precision = 0.98.

Results – accuracy (RRS)
Index dataset = query set. Precision = 0.98, radius = 0.08.

Results – runtime (NN)
r = 0.3 is fixed to approximate the 1-NN. Probability of report success = 0.9.
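The preprocessing and query steps described above can be sketched in a few dozen lines. This is a toy, dependency-free version: it uses random shifts only (omitting the random rotation the talk describes), and the class and parameter names (`RandomGridsIndex`, `w`, `m`, `k`) are illustrative rather than taken from the paper's code:

```python
import math
import random

class RandomGridsIndex:
    """Toy index in the spirit of the talk's basic algorithm: m randomly
    shifted grids of cell size w, one hash table per grid copy.
    (Random rotations are omitted to keep the sketch dependency-free.)"""

    def __init__(self, points, w=1.0, m=8, k=50, seed=0):
        self.rng = random.Random(seed)
        self.points, self.w, self.k = points, w, k
        d = len(points[0])
        # Preprocessing: draw one random shift per grid copy, then index
        # every point by its grid cell in that copy's hash table.
        self.shifts = [[self.rng.uniform(0, w) for _ in range(d)]
                       for _ in range(m)]
        self.tables = []
        for shift in self.shifts:
            table = {}
            for i, p in enumerate(points):
                table.setdefault(self._cell(p, shift), []).append(i)
            self.tables.append(table)

    def _cell(self, p, shift):
        # Integer grid coordinates of p under this copy's random shift.
        return tuple(math.floor((x + s) / self.w) for x, s in zip(p, shift))

    def query(self, q):
        # Look q up in every grid's hash table and pool the candidates.
        cand = set()
        for shift, table in zip(self.shifts, self.tables):
            cand.update(table.get(self._cell(q, shift), []))
        if not cand:
            return None
        # Check at most k randomly chosen candidates exactly,
        # and return the nearest of them.
        picked = self.rng.sample(sorted(cand), min(self.k, len(cand)))
        best = min(picked, key=lambda i: sum((a - b) ** 2
                                             for a, b in zip(self.points[i], q)))
        return self.points[best]

P = [(0.0, 0.0), (5.0, 5.0), (5.1, 5.0)]
idx = RandomGridsIndex(P, w=1.0, m=8)
print(idx.query((5.08, 5.0)))
```

The point of using m independent grids is that the probability a close pair shares a cell in none of them decays exponentially in m, so the query above finds one of the two nearby points with near-certainty while never scanning the full dataset.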
Results – accuracy (NN)
r = 0.3 is fixed to approximate the 1-NN. Probability of report success = 0.9.

Discussion
• Pros:
  – Very good runtime results, without harming accuracy
  – Intuitive to parallelize
  – Fits the data well
• Cons:
  – Explanations of the graphs are missing
  – The dependency on the approximation factor c is missing from the stated bounds
  – …

Questions?

Thank you for listening!

Backup

Locality-Sensitive Hashing Scheme Based on p-Stable Distributions

Tree methods in high dimensions

Performance
[Table: preprocessing space, preprocessing time and query time]

People
• Andoni – Microsoft
• Indyk – MIT
• Nister – Microsoft
• Motwani – Google-related

Google Images database size

Table comparison