five-minute talk - Ilya Razenshteyn

Optimal Data-Dependent
Hashing for
Approximate Near Neighbors
Ilya Razenshteyn (CSAIL MIT)
Alexandr Andoni (Simons Institute)
Approximate Near Neighbors (ANN)
• Dataset: n points in d dimensions
• Query: a point within 1 from a data point
• Want: a data point within 2 from the query
• Regime: d ≈ log n
Locality-Sensitive Hashing (LSH)
• Hash family on Rd: points at distance ≤ 1 collide more often (with
probability ≥ p1) than points at distance ≥ 2 (with probability ≤ p2)
• (Indyk, Motwani 1998): LSH can be used to solve ANN with space
O(n1+ρ) and query time O(nρ), where ρ depends on the gap between
p1 and p2
• Question: what is the best ρ?
Bounds on LSH
• (Indyk, Motwani 1998): ρ = 1/2 (defined LSH and first constructions)
• (Andoni, Indyk 2006): ρ = 1/4 (better LSH construction)
• (O’Donnell, Wu, Zhou 2011): ρ = 1/4 is optimal!
• Can one improve upon LSH for ANN?
Beyond LSH
• (Andoni, Indyk, Nguyen, R 2014), (Andoni, R 2015): ρ = 1/7 for ANN
(the best LSH gives only ρ = 1/4)
• The main idea
LSH is data-oblivious: good collision probabilities for every pair from Rd
For ANN one of the points may be assumed to lie in the dataset
Data-dependent hashing: a hash family depends on the points
Achieve improvement for every dataset
• (Andoni, R 2015): ρ = 1/7 is optimal for data-dependent hashing
• See the arXiv preprint (intricate proof, but very simple algorithm!)
Open problems
• Dynamic case
• Instance-optimal data-dependent hashing

similar documents