Report

Optimal Data-Dependent Hashing for Approximate Near Neighbors Ilya Razenshteyn (CSAIL MIT) Alexandr Andoni (Simons Institute) Approximate Near Neighbors (ANN) • Dataset: n points in d dimensions • Query: a point within 1 from a data point • Want: a data point within 2 from the query • Regime: d ≈ log n Locality-Sensitive Hashing (LSH) • Hash family on Rd: points at distance ≤ 1 collide more often (with probability ≥ p1) than points at distance ≥ 2 (with probability ≤ p2) • (Indyk, Motwani 1998): LSH can be used to solve ANN with space O(n1+ρ) and query time O(nρ), where ρ depends on the gap between p1 and p2 • Question: what is the best ρ? Bounds on LSH • (Indyk, Motwani 1998): ρ = 1/2 (defined LSH and first constructions) • (Andoni, Indyk 2006): ρ = 1/4 (better LSH construction) • (O’Donnell, Wu, Zhou 2011): ρ = 1/4 is optimal! • Can one improve upon LSH for ANN? Beyond LSH • (Andoni, Indyk, Nguyen, R 2014), (Andoni, R 2015): ρ = 1/7 for ANN (the best LSH gives only ρ = 1/4) • The main idea • • • • LSH is data-oblivious: good collision probabilities for every pair from Rd For ANN one of the points may be assumed to lie in the dataset Data-dependent hashing: a hash family depends on the points Achieve improvement for every dataset • (Andoni, R 2015): ρ = 1/7 is optimal for data-dependent hashing • See the arXiv preprint (intricate proof, but very simple algorithm!) Open problems • Dynamic case • Instance-optimal data-dependent hashing