### K-D Trees and Hashing

#### Nearest neighbor model

A nearest neighbor model answers queries of the form NN(k, xq): given a query point xq, return its k nearest neighbors. As a running example, represent news articles as word-count vectors:

| | "Baseball" | "Congress" | "Police" | "House" | "Gate" | "Money" | . . . |
|---|---|---|---|---|---|---|---|
| Article1 | 0 | 0 | 2 | 7 | 9 | 12 | . . . |
| Article2 | 0 | 0 | 41 | 6 | 7 | 2 | . . . |
| Article3 | 2 | 1 | 0 | 0 | 4 | 9 | . . . |
| Article4 | 0 | 6 | 0 | 0 | 0 | 14 | . . . |
| Article5 | 0 | 1 | 0 | 1 | 4 | 12 | . . . |
| Article6 | 8 | 2 | 0 | 2 | 1 | 5 | . . . |
| … | … | … | … | … | … | … | . . . |
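The NN(k, xq) query over these word-count vectors can be sketched as a brute-force scan. This is a minimal illustration, not from the slides: the `nn` helper and the choice of Euclidean distance are assumptions, and the truncated ". . ." columns are omitted.

```python
import math

# Word-count vectors from the table above (columns: Baseball, Congress,
# Police, House, Gate, Money); the elided "..." columns are left out.
articles = {
    "Article1": [0, 0, 2, 7, 9, 12],
    "Article2": [0, 0, 41, 6, 7, 2],
    "Article3": [2, 1, 0, 0, 4, 9],
    "Article4": [0, 6, 0, 0, 0, 14],
    "Article5": [0, 1, 0, 1, 4, 12],
    "Article6": [8, 2, 0, 2, 1, 5],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn(k, xq):
    """Return the k articles whose count vectors are closest to the query xq."""
    ranked = sorted(articles, key=lambda name: euclidean(articles[name], xq))
    return ranked[:k]

# Querying with Article5's own vector returns Article5 first (distance 0).
print(nn(2, articles["Article5"]))
```

A brute-force scan like this is linear in the number of stored points per query, which is exactly the cost that k-d trees and LSH are introduced to reduce.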
#### Dimensions

- Nearest neighbor lookup works well in low-dimensional spaces.
- In high dimensions it suffers from the "curse of dimensionality": as the dimension increases, the volume of the space grows very fast, so the available data becomes sparse.
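The sparsity claim can be observed numerically: as the dimension grows, the nearest random point is barely closer to a query than the farthest one, so "nearest" loses its discriminative power. A small sketch (the `distance_contrast` helper and its parameters are illustrative, not from the slides):

```python
import random

def distance_contrast(dim, n_points=500, seed=0):
    """(max - min) distance from a random query to random points,
    relative to the min distance. Small values mean the nearest point
    is barely closer than the farthest one."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum((a - b) ** 2 for a, b in zip(query, p)) ** 0.5)
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 2))
```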
#### Building the tree

A k-d tree is a balanced binary tree over data with k dimensions.

Building the tree:
- Given a set of examples, the root node is chosen to be the median of those examples along the ith dimension.
- Half of the examples go to the left branch and the other half go to the right branch of the tree.
- This is repeated recursively until fewer than two examples are left.

How to choose which dimension to split on?
- Use dimension j mod k at the jth level of the tree (so any dimension can be used for splitting several times), or
- Split on the dimension that has the widest spread of values.
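The construction steps above can be sketched as follows, using the j mod k rule for choosing the split dimension. The `build_kdtree` name and the dict-based node representation are assumptions; the slides do not prescribe an implementation.

```python
# Minimal k-d tree build: split at the median of dimension (depth mod k),
# recurse until fewer than two examples remain.
def build_kdtree(points, depth=0):
    if len(points) < 2:          # leaf: zero or one example left
        return {"leaf": points}
    k = len(points[0])
    dim = depth % k              # cycle through the dimensions by level
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2
    return {
        "dim": dim,
        "point": points[mid],                        # median example at this node
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree["dim"], tree["point"])  # root splits on dimension 0 at the median x
```

Because each level takes the median, the two branches stay (nearly) equal in size, which is what keeps the tree balanced.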
Figure: recursive splitting of the space. First the whole cell is split by the red line; the subcells created by red are split further by green; all four resulting subcells are then split by blue.
#### Lookup

- Exact lookup:
  - The same as lookup in a binary search tree, except that you must keep track of which dimension divides the current level of the tree.
- Nearest-neighbor lookup:
  - What if the query point is very close to a dividing boundary? The true nearest neighbor could lie on the other side.
  - Compute the distance from the query to the dividing boundary.
  - If the query is closer to the boundary than to the best point found so far, check both the left and right branches of that boundary.
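The lookup with the boundary check can be sketched as below. The build step is repeated so the example is self-contained, and the helper names (`build`, `nearest`, `dist`) are illustrative, not from the slides.

```python
import math

def build(points, depth=0):
    """Median-split k-d tree, cycling through dimensions by level."""
    if len(points) < 2:
        return {"leaf": points}
    dim = depth % len(points[0])
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2
    return {"dim": dim, "point": points[mid],
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(node, xq, best=None):
    if "leaf" in node:
        for p in node["leaf"]:
            if best is None or dist(p, xq) < dist(best, xq):
                best = p
        return best
    if best is None or dist(node["point"], xq) < dist(best, xq):
        best = node["point"]
    dim = node["dim"]
    # Descend into the branch on the query's side of the boundary first.
    if xq[dim] < node["point"][dim]:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, xq, best)
    # If the query is closer to the dividing boundary than to the best
    # point found so far, the other branch may hold something closer.
    if abs(xq[dim] - node["point"][dim]) < dist(best, xq):
        best = nearest(far, xq, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
print(nearest(tree, (9, 2)))  # → (8, 1)
```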
#### LSH

Locality-sensitive hashing (LSH) relies on the idea that if two points are close together in n-dimensional space, their projections onto a lower-dimensional space will be close too (e.g., a 1-D projection of 2-D points, or a 2-D projection of 3-D points).
#### Projections and hash buckets

- Create multiple (l) random projections to build multiple hash tables g1(x), …, gl(x), each mapping a point to a numbered bin.
- Given a query point xq:
  - Fetch the set of points from bin gk(xq) for every k = 1, …, l.
  - Take the union of these sets to form the candidate set C.
  - Compute the actual distance from xq to each point in C and keep the nearest ones.
- With high probability, the true nearest points show up in at least one of the bins, mixed with some far-away points (which are then ignored).
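The bucket-union procedure might look like the following sketch, using sign-of-random-projection hashes as the tables g1, …, gl. The helper names and the parameters `l` and `n_bits` are assumptions, not from the slides.

```python
import random
from collections import defaultdict

def make_table(dim, n_bits, rng):
    """One hash function g(x): n_bits random hyperplanes; the bucket key
    is the pattern of projection signs."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def g(x):
        return tuple(int(sum(w * xi for w, xi in zip(p, x)) >= 0)
                     for p in planes)
    return g

def build_index(points, dim, l=4, n_bits=6, seed=0):
    """Build l hash tables, each bucketing every stored point."""
    rng = random.Random(seed)
    tables = []
    for _ in range(l):
        g = make_table(dim, n_bits, rng)
        buckets = defaultdict(list)
        for p in points:
            buckets[g(p)].append(p)
        tables.append((g, buckets))
    return tables

def query(tables, xq):
    # Union the bins g_k(xq) over all l tables to form the candidate set C,
    candidates = set()
    for g, buckets in tables:
        candidates.update(buckets[g(xq)])
    # then rank the candidates by their actual distance to xq.
    return sorted(candidates,
                  key=lambda p: sum((a - b) ** 2 for a, b in zip(p, xq)))

rng = random.Random(42)
pts = [tuple(rng.random() for _ in range(8)) for _ in range(100)]
tables = build_index(pts, dim=8)
cands = query(tables, pts[3])  # pts[3] hashes to its own bucket everywhere
```

Note the trade-off: more tables (larger l) raise the probability of catching the true nearest neighbors, at the cost of a larger candidate set to re-rank.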
As an example of the speedup: starting from 13 million images with 512-dimensional feature vectors, LSH narrows the search down to a few thousand candidates.