Slides - University of New South Wales

```NICTA Machine Learning Research Group Seminar
APPROXIMATE NEAREST
NEIGHBOR QUERIES WITH A
TINY INDEX
6/26/14
Wei Wang, University of New South Wales
1
Outline
2



Overview of Our Research
SRS: c-Approximate Nearest Neighbor with a tiny
index [PVLDB 2015]
Conclusions
NICTA Machine Learning Research Group Seminar
6/26/14
Research Projects
3




Similarity Query Processing
Keyword search on (Semi-) Structured Data
Graph
Succinct Data Structures
NICTA Machine Learning Research Group Seminar
6/26/14
NN and c-ANN Queries
4
c-ANN = x
x
D = {a, b, x}
aka. (1+ε)-ANN
NN = a

Definitions
A set of points, D = ∪i=1n {Oi}, in d-dimensional Euclidean
space (d is large, e.g., hundreds)
 Given a query point q, find the closest point, O*, in D
 Relaxed version:

Return a c-ANN point: i.e., its distance to q is at most c*Dist(O*, q)
 May return a c-ANN point with at least constant probability
NICTA Machine Learning Research Group Seminar 6/26/14

Applications and Challenges
5

Applications
 Feature
vectors: Data Mining, Multimedia DB
 Fundamental geometric problem: “post-office problem”
 Quantization in coding/compression
…

Challenges
 Curse
of Dimensionality / Concentration of Measure
 Hard
 Large
to find algorithms sub-linear in n and polynomial in d
data size: 1KB for a single point with 256 dims
NICTA Machine Learning Research Group Seminar
6/26/14
Existing Solutions
6

NN:
Linear scan is (practically) the best approach using linear space & time
O(d5log(n)) query time, O(n2d+δ) space
 O(dn1-ε(d)) query time, O(dn) space
 Linear scan: O(dn/B) I/Os, O(dn) space


(1+ε)-ANN
LSH is the best approach using sub-quadratic space
O(log(n) + 1/ε(d-1)/2) query time, O(n*log(1/ε)) space
 Probabilistic test  remove exponential dependency on d

Fast JLT: O(d*log(d) + ε-3log2(n)) query time, O(nmax(2, ε^-2)) space
 LSH-based: Õ(dnρ+o(1)) query time, Õ(n1+ρ+o(1) + nd) space


ρ= 1/(1+ε) + oc(1)
NICTA Machine Learning Research Group Seminar
6/26/14
Approximate NN for Multimedia
Retrieval
7





Cover-tree
Spill-tree
Reduce to NN search with Hamming distance
Dimensionality reduction (e.g., PCA)
Quantization-based approaches (e.g., CK-Means)
NICTA Machine Learning Research Group Seminar
6/26/14
LSH is the best approach using sub-quadratic space
Locality Sensitive Hashing (LSH)
8

Equality search
Index: store o into bucket h(o)
 Query: retrieve every o in the bucket
h(q), verify if o = q


LSH

∀h∈LSH-family, Pr[ h(q) = h(o) ] ∝
1/Dist(q, o)

h :: Rd  Z
 technically,
dependent on r
 “Near-by” points (blue) have more
chance of colliding with q than “faraway” points (red)
NICTA Machine Learning Research Group Seminar
6/26/14
LSH: Indexing & Query Processing
9

Index

For a fixed r
sig(o) = ⟨h1(o), h2(o), …, hk(o)⟩
 store o into bucket sig(o)



Reduce Query Cost
Iteratively increase r
Query

Const
Succ.
Prob.
Search with a fixed r

Retrieve and “verify” points in
the bucket sig(q)


Repeat this L times (boosting)
Galloping search to find the first
Incurs additional cost +
good r
2
only c quality guarantee
NICTA Machine Learning Research Group Seminar
6/26/14
Locality Sensitive Hashing (LSH)
10

Standard LSH
 c2-ANN

 binary search on R(ci, ci+1)-NN problems
LSH on external memory
O((dn/B)0.5) query,
O((dn/B)1.5) space
 LSB-forest [SIGMOD’09, TODS’10]:
different reduction from c2-ANN to a R(ci, ci+1)-NN
problem
O(n*log(n)/B) query,
A
 C2LSH [SIGMOD’12]:
O(n*log(n)/B) space
 Do
not use composite hash keys
 Perform fine-granular counting number of collisions in m LSH
projections
SRS
O(n/B) query,
(Ours) O(n/B) space
NICTA Machine Learning Research Group Seminar
6/26/14
Weakness of Ext-Memory LSH Methods
11

Existing methods uses super-linear space
 Thousands
(or more) of hash tables needed if rigorous
 People resorts to hashing into binary code (and using
Hamming distance) for multimedia retrieval

Can only handle c, where c = x2, for integer x ≥ 2
 To


enable reusing the hash table (merging buckets)
Valuable information lost (due to quantization)
Update? (changes to n, and c)
Dataset Size
Audio, 40MB
LSB-forest
1500 MB
C2LSH+
127 MB
NICTA Machine Learning Research Group Seminar
SRS (Ours)
2 MB
6/26/14
SRS: Our Proposed Method
12

Solving c-ANN queries with O(n) query time and O(n)
space with constant probability
Constants hidden in O() is very small
 Early-termination condition is provably effective


Small index
 Rich-functionality
 Simple


Central idea:
c-ANN query in d dims  kNN query in m-dims with
filtering
 Model the distribution of m “stable random projections”

NICTA Machine Learning Research Group Seminar
6/26/14
2-stable Random Projection
13



Let D be the 2-stable random projection =
standard Normal distribution N(0, 1)
For two i.i.d. random variables A ~ D, B ~ D, then
x*A + y*B ~ (x2+y2)1/2 * D
Illustration
V
r1
V
r
2 Seminar
NICTA Machine Learning Research Group
6/26/14
Dist(O) and ProjDist(O) and
Their Relationship
Proj(O)
14
O
O
Proj(Q)
r1
Q
Q
O in d dims
m 2-stable random
projections
r2



z1≔⟨V, r1⟩ ~ N(0, ǁvǁ)
z2≔⟨V, r2⟩ ~ N(0, ǁvǁ)
 z12+z22 ~ ǁvǁ2 *χ2m


Dist(O)
(z1, … zm) in m dims
ProjDist(O)
i.e., scaled Chi-squared distribution of m degrees of freedom
Ψm(x): cdf of the standardχ2m distribution
NICTA Machine Learning Research Group Seminar
6/26/14
LSH-like Property
15

Intuitive idea:
If Dist(O1) ≪ Dist(O2) then ProjDist(O1) < ProjDist(O2) with
high probability
 But the inverse is NOT true

NN object in the projected space is most likely not the NN object
in the original space with few projections, as
 Many far-away objects projected before the NN/cNN objects
 But we can bound the expected number of such cases! (say T)


Solution
Perform incremental k-NN search on the projected space till
accessing T objects
 + Early termination test

NICTA Machine Learning Research Group Seminar
6/26/14
Indexing
16

Finding the minimum m

Input



Output






n, c,
T ≔ max # of points to access by the algorithm
m : # of 2-stable random projections
T’ ≤ T: a better bound on T
m = O(n/T). We use T = O(n), so m = O(1) to achieve linear space
index
Generate m 2-stable random projections  n projected
points in a m-dimensional space
Index these projections using any index that supports
incremental kNN search, e.g., R-tree
Space cost: O(m * n) = O(n)
NICTA Machine Learning Research Group Seminar
6/26/14
SRS-αβ(T, c, pτ)
17
Early-termination test:
c 2 * ProjDist2 (ok ) 
m 
 p
2
 Dist (omin ) 
Compute proj(Q)
 Do incremental kNN search
from proj(Q)

// stopping condition α
for k = 1 to T

 Compute
Dist(Ok)
 Maintain Omin = argmin1≤i≤k Dist(Oi)
 If early-termination test (c, pτ) = TRUE
// stopping condition β
 BREAK

Return Omin
c = 4, d = 256, m = 6, T = 0.00242n, B = 1024, pτ=0.18
Index = 0.0059n, Query = 0.0084n, succ prob = 0.13
Main Theorem:
NICTA
Machine
Research p
Group
Seminar
SRS-αβ returns a c-NN
point
withLearning
probability
with 6/26/14
O(n) I/O cost
τ-f(m,c)
Variations of SRS-αβ(T, c, pτ)
18
Compute proj(Q)
 Do incremental kNN search from proj(Q)
for k = 1 to T
// stopping condition α
Compute Dist(Ok)
Maintain Omin = argmin1≤i≤k Dist(Oi)
If early-termination test (c, pτ) = TRUE
// stopping condition β






1.
2.
3.
BREAK
Return Omin
SRS-α
SRS-β
SRS-αβ(T, c’, pτ)
Better quality; query cost is O(T)
Best quality; query cost bounded by O(n); handles c = 1
Better quality; query cost bounded by O(T)
All
with
success
probability
least p6/26/14
NICTA
Machine
Learning
Research GroupatSeminar
τ
Other Results
19



Can be easily extended to support top-k c-ANN
queries (k > 1)
No previous known guarantee on the correctness of
returned results
We guarantee the correctness with probability at
least pτ, if SRS-αβ stops due to early-termination
condition
 ≈100%
in practice (97% in theory)
NICTA Machine Learning Research Group Seminar
6/26/14
Analysis
20
NICTA Machine Learning Research Group Seminar
6/26/14
Stopping Condition α
21



“near” point: the NN point  its distance ≕ r
“far” points: points whose distance > c * r
Then for any κ > 0 and any o:
Pr[ProjDist(o)≤κ*r | o is a near point] ≥ ψm(κ2)
 Pr[ProjDist(o)≤κ*r | o is a far point] ≤ ψm(κ2/c2)





Both because ProjDist2(o)/Dist2(o) ~ χ2m
Pr[the NN point projected beforeκ*r] ≥ ψm(κ2)
Pr[# of bad points projected before κ*r < T]
> (1 - ψm(κ2/c2)) * (n/T)
Choose κ such that P1 + P2 – 1 > 0

Feasible due to good concentration bound for χ2m
NICTA Machine Learning Research Group Seminar
6/26/14
P1
P2
Choosing κ
22




κ*r
NICTA Machine Learning Research Group Seminar
Let c = 4
Mode = m – 2
Blue: 4
Red: 4*(c2) = 64
6/26/14
ProjDist(OT): Case I
Consider cases where both
conditions hold (re. near and far
points)  P1 + P2 – 1 probability
23
ProjDist(o) in m-dims
ProjDist(OT)
NICTA Machine Learning Research Group Seminar
Omin = the NN point
6/26/14
ProjDist(OT): Case II
Consider cases where both
conditions hold (re. near and far
points)  P1 + P2 – 1 probability
24
ProjDist(o) in m-dims
ProjDist(OT)
Omin = a cNN point
NICTA Machine Learning Research Group Seminar
6/26/14
Early-termination Condition (β)
25

Omit the proof here
 Also
relies on the fact that the squared sum of m
projected distances follows a scaled χ2m distribution

Key to
 Handle
the case where c = 1
 Returns
the NN point with guaranteed probability
 Impossible to handle by LSH-based methods
 Guarantees
the correctness of top-k cANN points
returned when stopped by this condition
 No
such guarantee by any previous method
NICTA Machine Learning Research Group Seminar
6/26/14
Experiment Setup
26

Algorithms
 LSB-forest
[SIGMOD’09, TODS’10]
 C2LSH [SIGMOD’12]
 SRS-* [VLDB’15]


Data
Measures
 Index
size, query cost, result quality, success probability
NICTA Machine Learning Research Group Seminar
6/26/14
Datasets
27
5.6PB
369GB
16GB
NICTA Machine Learning Research Group Seminar
6/26/14
Tiny Image Dataset (8M pts, 384 dims)
28



Fastest: SRS-αβ,
Slowest: C2LSH
 Quality the other
way around
SRS-αhas comparable
quality with C2LSH yet
has much lower cost.
SRS-* dominates LSBforest
NICTA Machine Learning Research Group Seminar
6/26/14
Approximate Nearest Neighbor
29



Empirically better than
the theoretic guarantee
With 15% I/Os of
linear scan, returns NN
with probability 71%
With 62% I/Os of
linear scan, returns NN
with probability 99.7%
NICTA Machine Learning Research Group Seminar
6/26/14
Large Dataset (0.45 Billion)
30
NICTA Machine Learning Research Group Seminar
6/26/14
Summary
31



c-ANN queries in arbitrarily high dim space  kNN
query in low dim space
Our index size is approximately d/m of the size of
the data file
Opens up a new direction in c-ANN queries in highdimensional space
 Find
efficient solution to kNN problem in 6-10
dimensional space
NICTA Machine Learning Research Group Seminar
6/26/14
Q&A
32
Similarity Query Processing Project Homepage:
http://www.cse.unsw.edu.au/~weiw/project/simjoin.html
NICTA Machine Learning Research Group Seminar
6/26/14
```