### SIFT

2011.4.14
Reporter: Fei-Fei Chen
• Wide-baseline matching
• Object recognition
• Texture recognition
• Scene classification
• Robot wandering
• Motion tracking
• Change in illumination
• 3D camera viewpoint
• etc.
[Figure: matching against more than 5000 images under a change in viewing angle; 22 correct matches.]
[Figure: matching against more than 5000 images under a change in viewing angle plus a scale change.]
• Find corresponding features across two or more views.
• The elements to be matched are image patches of fixed size.
• Task: find the best (most similar) patch in a second image.
• Intuition: a very distinctive patch is a good patch for matching.
• Intuition: a patch that is not very distinctive is a bad patch for matching.
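The patch-matching task and the intuition about distinctiveness can be sketched as follows. This is a minimal illustration, assuming sum of squared differences (SSD) as the similarity measure, which the slides do not specify; the names `ssd`, `best_match`, and `count_perfect_matches` are hypothetical.

```python
def ssd(patch_a, patch_b):
    """Sum of squared differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def crop(image, top, left, size):
    """Cut a size x size patch out of a 2D list."""
    return [row[left:left + size] for row in image[top:top + size]]


def best_match(patch, image):
    """Return (score, top, left) of the most similar patch in image."""
    size = len(patch)
    best = None
    for top in range(len(image) - size + 1):
        for left in range(len(image[0]) - size + 1):
            score = ssd(patch, crop(image, top, left, size))
            if best is None or score < best[0]:
                best = (score, top, left)
    return best


def count_perfect_matches(patch, image):
    """Count positions where the patch matches exactly (SSD == 0)."""
    size = len(patch)
    return sum(1
               for top in range(len(image) - size + 1)
               for left in range(len(image[0]) - size + 1)
               if ssd(patch, crop(image, top, left, size)) == 0)


second_image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 2, 8, 0],
    [0, 0, 0, 0, 0],
]
distinctive_patch = [[9, 1], [2, 8]]  # textured: matches in one place only
flat_patch = [[0, 0], [0, 0]]         # flat: matches in many places
```

The distinctive patch has a single best location, while the flat patch matches equally well at several positions, so its match is ambiguous.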
• Good features are, intuitively, junctions of contours (corners).
• They are generally more stable under a change of viewpoint.
• Intuitively, they show large variations in the neighborhood of the point in all directions.
• They are good features to match!
• Detection of scale-space extrema
• Accurate keypoint localization
• Orientation assignment
• Keypoint descriptor
The earlier stages make up the detector; the final stage is the descriptor.
• For scale invariance, search for stable features across all possible scales using a continuous function of scale, known as scale space.
• SIFT uses the DoG filter for scale space because it is efficient and as stable as the scale-normalized Laplacian of Gaussian.
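Convolution with a variable-scale Gaussian:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)}$$

Difference-of-Gaussian (DoG) filter:

$$G(x, y, k\sigma) - G(x, y, \sigma)$$

Convolution with the DoG filter:

$$D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma)$$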
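σ doubles for the next octave. Adjacent scales within an octave differ by the factor $k = 2^{1/s}$, where s is the number of intervals per octave.
Dividing scale space into octaves is for efficiency only.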
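The scale schedule within one octave can be sketched as below. This is an illustration only: the base scale `sigma0 = 1.6` and the count of `s + 3` blurred images per octave are assumptions taken from common SIFT practice rather than from the slides, and `octave_sigmas` is a hypothetical name.

```python
def octave_sigmas(sigma0=1.6, s=3):
    """Blur levels for one octave, with k = 2**(1/s) between adjacent scales.

    Assumption: s + 3 Gaussian images per octave, so that s + 2 DoG images
    are available and extrema can be searched over s complete intervals.
    """
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** i for i in range(s + 3)]


sigmas = octave_sigmas()
# After s steps the scale has doubled, which is where the next octave starts.
next_octave_sigma = sigmas[3]
```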
A sample point X is selected if it is larger or smaller than all 26 neighbors (8 in its own scale and 9 in each of the adjacent scales above and below).
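The 26-neighbor test can be sketched as follows, assuming the DoG stack is stored as a nested list indexed `dog[scale][row][col]` and that the point is not on the border; `is_extremum` is a hypothetical helper name.

```python
def is_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly larger or strictly smaller
    than all 26 neighbors in the 3x3x3 block around it."""
    center = dog[s][y][x]
    neighbors = [dog[s + ds][y + dy][x + dx]
                 for ds in (-1, 0, 1)
                 for dy in (-1, 0, 1)
                 for dx in (-1, 0, 1)
                 if (ds, dy, dx) != (0, 0, 0)]
    is_max = all(center > n for n in neighbors)
    is_min = all(center < n for n in neighbors)
    return is_max or is_min
```

Strict comparisons are a design choice here: a point that merely ties with a neighbor is not reported.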
• Reject (1) points with low contrast (flat regions) and (2) points poorly localized along an edge.
• Fit a 3D quadratic function to locate the sub-pixel extremum.
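[Figure: 1D example with samples 1, 6, 5 at x = -1, 0, +1 and the fitted parabola, whose peak lies at x = 1/3.]

$$f(x) \approx 6 + 2x + \frac{-6}{2}x^2 = 6 + 2x - 3x^2$$

$$f'(x) = 2 - 6x = 0 \;\Rightarrow\; \hat{x} = \frac{1}{3}$$

$$f(\hat{x}) = 6 + 2\cdot\frac{1}{3} - 3\left(\frac{1}{3}\right)^2 = 6\tfrac{1}{3}$$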
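The 1D example can be reproduced numerically. The sketch below assumes central finite differences for the first and second derivatives at the middle sample; `fit_parabola` is a hypothetical name.

```python
def fit_parabola(f_minus, f_zero, f_plus):
    """Fit a parabola through samples at x = -1, 0, +1.

    Returns (x_hat, f_hat): the offset of the extremum from x = 0
    and the interpolated value there.
    """
    d1 = (f_plus - f_minus) / 2.0          # first derivative at 0
    d2 = f_plus - 2.0 * f_zero + f_minus   # second derivative at 0
    x_hat = -d1 / d2
    f_hat = f_zero + 0.5 * d1 * x_hat
    return x_hat, f_hat


# Samples 1, 6, 5 give d1 = 2 and d2 = -6, i.e. f(x) = 6 + 2x - 3x^2.
x_hat, f_hat = fit_parabola(1.0, 6.0, 5.0)
```

For these samples the extremum lies at an offset of 1/3 with value 6 1/3, matching the worked equations.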
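• Taylor series of several variables
• Two variables:

$$f(x, y) \approx f(0, 0) + \left(\frac{\partial f}{\partial x}x + \frac{\partial f}{\partial y}y\right) + \frac{1}{2}\left(\frac{\partial^2 f}{\partial x \partial x}x^2 + 2\frac{\partial^2 f}{\partial x \partial y}xy + \frac{\partial^2 f}{\partial y \partial y}y^2\right)$$

$$f\!\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) \approx f\!\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}\right) + \begin{bmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} + \frac{1}{2}\begin{bmatrix} x & y \end{bmatrix}\begin{bmatrix} \dfrac{\partial^2 f}{\partial x \partial x} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial x \partial y} & \dfrac{\partial^2 f}{\partial y \partial y} \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$$

$$f(\mathbf{x}) \approx f(\mathbf{0}) + \frac{\partial f}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^2 f}{\partial \mathbf{x}^2}\mathbf{x}$$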
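• Taylor expansion in matrix form: x is a vector and f maps x to a scalar.

Gradient:

$$\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

Hessian matrix (often symmetric):

$$\frac{\partial^2 f}{\partial \mathbf{x}^2} = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$

Expansion around the sample point, and the extremum obtained by setting its derivative to zero:

$$f(\mathbf{x}) = f + \frac{\partial f}{\partial \mathbf{x}}^{T}\mathbf{x} + \frac{1}{2}\mathbf{x}^{T}\frac{\partial^2 f}{\partial \mathbf{x}^2}\mathbf{x}$$

$$\hat{\mathbf{x}} = -\left(\frac{\partial^2 f}{\partial \mathbf{x}^2}\right)^{-1}\frac{\partial f}{\partial \mathbf{x}}$$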
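• x is a 3-vector: the offset in (x, y, σ) from the sample point.
• Remove the sample point if the offset is larger than 0.5 in any dimension (the extremum then lies closer to a different sample point).
• Throw out low-contrast points, where the interpolated value at the extremum has magnitude below 0.03 (with pixel values scaled to [0, 1]).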
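The refinement and the two rejection rules can be sketched in plain Python. The sketch assumes the gradient and Hessian at the sample point have already been estimated (for example by finite differences), solves the 3x3 system with Cramer's rule, and follows the slide's rule of dropping points whose offset exceeds 0.5; `det3`, `solve3`, and `refine` are hypothetical names.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve a x = b for a 3x3 system using Cramer's rule."""
    d = det3(a)
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det3(m) / d)
    return solution


def refine(value, grad, hess, contrast_threshold=0.03):
    """Return (offset, refined_value, keep) for one candidate keypoint.

    offset        : x_hat = -H^{-1} g, in (x, y, sigma)
    refined_value : value + 0.5 * g^T x_hat
    keep          : False if any |offset| > 0.5 or |refined_value| < threshold
    """
    offset = solve3(hess, [-g for g in grad])
    refined = value + 0.5 * sum(g * o for g, o in zip(grad, offset))
    keep = (all(abs(o) <= 0.5 for o in offset)
            and abs(refined) >= contrast_threshold)
    return offset, refined, keep
```

Cramer's rule is used only to keep the example free of dependencies; a singular Hessian would raise a division error and needs separate handling in practice.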
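Hessian matrix of the DoG image D at the keypoint location:

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}, \qquad \mathrm{Tr}(H) = D_{xx} + D_{yy} = \alpha + \beta, \qquad \mathrm{Det}(H) = D_{xx}D_{yy} - D_{xy}^2 = \alpha\beta$$

Let $\alpha = r\beta$, where α and β are the larger and smaller eigenvalues of H:

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(r + 1)^2}{r}$$

Keep the points with

$$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r + 1)^2}{r}, \qquad r = 10$$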
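The edge test needs only the trace and determinant of the 2x2 Hessian, so the eigenvalues never have to be computed. A minimal sketch, with `passes_edge_test` as a hypothetical name and the second derivatives assumed to be given:

```python
def passes_edge_test(dxx, dyy, dxy, r=10.0):
    """Keep a keypoint only if Tr(H)^2 / Det(H) < (r + 1)^2 / r.

    A non-positive determinant means the principal curvatures have
    different signs (or one is zero), so the point is rejected.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False
    return trace * trace / det < (r + 1.0) ** 2 / r
```

With r = 10 the threshold is 12.1: a blob with equal curvatures scores 4 and passes, while an edge with a 20:1 curvature ratio scores about 22 and is rejected.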
• Assigning a consistent orientation to each keypoint makes the keypoint descriptor orientation invariant.
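• For a keypoint, L is the Gaussian-smoothed image with the closest scale. Its gradient (Lx, Ly), computed from pixel differences, gives the magnitude m and the orientation θ:

$$L_x = L(x + 1, y) - L(x - 1, y), \qquad L_y = L(x, y + 1) - L(x, y - 1)$$

$$m(x, y) = \sqrt{L_x^2 + L_y^2}, \qquad \theta(x, y) = \tan^{-1}\!\left(\frac{L_y}{L_x}\right)$$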
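The gradient magnitude and orientation at one pixel can be computed as below. The sketch assumes L is a nested list indexed `L[row][col]` and uses `atan2` so that the angle covers the full range from 0 to 2π; `gradient` is a hypothetical name.

```python
import math


def gradient(L, y, x):
    """Gradient magnitude m and orientation theta (in [0, 2*pi)) at (x, y),
    from pixel differences of the smoothed image L."""
    lx = L[y][x + 1] - L[y][x - 1]
    ly = L[y + 1][x] - L[y - 1][x]
    m = math.hypot(lx, ly)
    theta = math.atan2(ly, lx) % (2.0 * math.pi)
    return m, theta
```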
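• Orientation histogram: 36 bins covering 360° (0 to 2π). Each sample is weighted by its gradient magnitude m and by a Gaussian falloff with σ = 1.5 × the scale of the keypoint.
• The highest peak is the keypoint orientation. Its accurate position is determined by fitting a parabola to the three histogram values closest to the peak.
• Any local peak within 80% of the highest peak creates an additional orientation for the keypoint.
• About 15% of keypoints have multiple orientations, and they contribute a lot to stability.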
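The histogram and peak selection can be sketched as follows. The sketch assumes each sample arrives as a `(magnitude, theta, weight)` triple, with the Gaussian weight already evaluated by the caller, and interpolates each peak with a parabola through three adjacent bins; both function names are hypothetical.

```python
import math


def orientation_histogram(samples, num_bins=36):
    """Accumulate magnitude * weight into num_bins bins over [0, 2*pi)."""
    hist = [0.0] * num_bins
    for magnitude, theta, weight in samples:
        b = int(theta / (2.0 * math.pi) * num_bins) % num_bins
        hist[b] += magnitude * weight
    return hist


def dominant_orientations(hist, ratio=0.8):
    """Angles of all local peaks that reach ratio * the highest peak."""
    n = len(hist)
    highest = max(hist)
    angles = []
    for i, h in enumerate(hist):
        left, right = hist[(i - 1) % n], hist[(i + 1) % n]
        if h > left and h > right and h >= ratio * highest:
            # Vertex of the parabola through (i-1, left), (i, h), (i+1, right).
            offset = 0.5 * (left - right) / (left - 2.0 * h + right)
            # Bin i covers [i, i + 1) bin widths, so its center is i + 0.5.
            angles.append(((i + 0.5 + offset) * 2.0 * math.pi / n)
                          % (2.0 * math.pi))
    return angles
```

The histogram is treated as circular, so the first and last bins are neighbors.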
• Thresholded image gradients are sampled over a 16x16 array of locations in scale space, weighted by a Gaussian with σ = 0.5 × the width of the descriptor window.
• Create an array of orientation histograms (relative to the keypoint orientation).
• 8 orientations x 4x4 histogram array = 128 dimensions.
• Normalize the vector to reduce the effect of intensity changes, clip values larger than 0.2, then renormalize.
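The final normalization step can be sketched as below, assuming the 128 histogram values have already been collected into a flat list and that "normalize" means scaling to unit Euclidean length; `normalize_descriptor` is a hypothetical name.

```python
import math


def normalize_descriptor(vec, clip=0.2):
    """Normalize to unit length, clip large entries, then renormalize."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    unit = [v / norm for v in vec]
    clipped = [min(v, clip) for v in unit]
    norm2 = math.sqrt(sum(v * v for v in clipped))
    return [v / norm2 for v in clipped]
```

The first normalization cancels a uniform scaling of the gradients, and the clipping limits the influence of a few very large gradient magnitudes.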
• Detection of scale-space extrema: for scale invariance
• Accurate keypoint localization: remove unstable feature points
• Orientation assignment: for rotation invariance
• Keypoint descriptor: for illumination invariance
• Image scale invariance.
• Image rotation invariance.
• Robust matching across a substantial range of
(1) affine distortion,
(2) change in 3D viewpoint,