### Bag of feature models - Discriminative

Recognizing People, Objects, & Actions (790-133)
Tamara Berg
Object Recognition – BoF models
Topic Presentations
• Hopefully you have met your topic presentation group members?
• Group 1 – see me to run through slides this week, or Monday at the latest (I'm traveling Thurs/Friday).
• Go to groups.google.com and search for 790-133 (sorted by date). Use this group to post/answer questions related to the class.
Bag-of-features models
[Figure: an object decomposed into a bag of 'features']
source: Svetlana Lazebnik
Exchangeability
• De Finetti's theorem of exchangeability (the "bag of words" theorem): the joint probability distribution underlying the data is invariant to permutation:

  p(x_1, x_2, ..., x_N) = ∫ p(θ) [ ∏_{i=1}^{N} p(x_i | θ) ] dθ
Origin 2: Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983)
• Example: US Presidential Speeches Tag Cloud, http://chir.ag/phernalia/preztags/
source: Svetlana Lazebnik
Bag of words for text
• Represent documents as a "bag of words"

Example
• Doc1 = "the quick brown fox jumped"
• Doc2 = "brown quick jumped fox the"
Would a bag of words model represent these two documents differently?
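The question above can be checked directly. A minimal sketch in Python using the two example documents: since a bag-of-words model keeps only word counts, both documents get identical representations.

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words keeps only (word, count) pairs; word order is discarded.
bag1 = Counter(doc1.split())
bag2 = Counter(doc2.split())

print(bag1 == bag2)  # True: the model cannot distinguish the two documents
```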
Bag of words for images
 Represent images as a “bag of features”
Bag of features: outline
1. Extract features
2. Learn "visual vocabulary"
3. Represent images by frequencies of "visual words"
source: Svetlana Lazebnik
2. Learning the visual vocabulary
[Figure: extracted feature descriptors are clustered; the cluster centers form the visual vocabulary]
Slide credit: Josef Sivic
K-means clustering (reminder)
• Want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:

  D(X, M) = ∑_{k=1}^{K} ∑_{i ∈ cluster k} ‖x_i − m_k‖²

Algorithm:
• Randomly initialize K cluster centers
• Iterate until convergence:
  – Assign each data point to the nearest center
  – Recompute each cluster center as the mean of all points assigned to it
source: Svetlana Lazebnik
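The two iterated steps above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not the lecture's code; the function name `kmeans` and its parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Toy k-means: X is (n, d); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K cluster centers from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```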
Example visual vocabulary
[Figure: example visual words learned from image patches (Fei-Fei et al. 2005)]
Image Representation
• For a query image:
  – Extract features
  – Associate each feature with the nearest cluster center (visual word) in the visual vocabulary
  – Accumulate visual word frequencies over the image

3. Image representation
[Figure: histogram of visual word frequencies over the codewords]
source: Svetlana Lazebnik
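The steps above (assign each feature to its nearest visual word, then accumulate frequencies) amount to building a normalized histogram. A minimal sketch, assuming the descriptors and vocabulary are given as NumPy arrays; the function name is hypothetical.

```python
import numpy as np

def bag_of_features(descriptors, vocabulary):
    """Histogram of visual-word frequencies for one image.

    descriptors: (n, d) local feature descriptors (e.g. SIFT).
    vocabulary:  (k, d) cluster centers (visual words).
    """
    # Associate each descriptor with the nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # Accumulate word frequencies over the image, then normalize.
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```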
4. Image classification
[Figure: a codeword frequency histogram classified as "CAR"]
Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
source: Svetlana Lazebnik
Image Categorization
• What is this? Choose from many categories (e.g. helicopter).
• Approaches:
  – SVM / Naïve Bayes – Csurka et al. (Caltech 4/7)
  – Nearest Neighbor – Berg et al. (Caltech 101)
  – Kernel + SVM – Grauman et al. (Caltech 101)
  – Multiple Kernel Learning + SVMs – Varma et al. (Caltech 101)
  – …
Visual Categorization with Bags of Keypoints
Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray
Data
• Images in 7 classes: faces, buildings, trees, cars, phones, bikes, books
• Caltech 4 dataset: faces, airplanes, cars (rear and side), motorbikes, background
Method
Steps:
– Detect and describe image patches.
– Assign patch descriptors to a set of predetermined clusters (a visual vocabulary).
– Construct a bag of keypoints, which counts the number of patches assigned to each cluster.
– Apply a classifier (SVM or Naïve Bayes), treating the bag of keypoints as the feature vector.
– Determine which category or categories to assign to the image.
Bag-of-Keypoints Approach
Interesting Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier
[Figure: example bag-of-keypoints vector, e.g. (0.1, 0.5, ..., 1.5)ᵀ]
Slide credit: Yun-hsueh Liu
SIFT Descriptors
Slide credit: Yun-hsueh Liu
Bag of Keypoints (1)
• Construction of a vocabulary
  – K-means clustering → find "centroids" (on all the descriptors found across all the training images)
  – Define a "vocabulary" as the set of "centroids", where every centroid represents a "word"
Slide credit: Yun-hsueh Liu
Bag of Keypoints (2)
• Histogram
  – Counts the number of occurrences of each visual word in each image
Slide credit: Yun-hsueh Liu
Multi-class Classifier
• In this paper, classification is based on conventional machine learning approaches:
  – Support Vector Machine (SVM)
  – Naïve Bayes
Slide credit: Yun-hsueh Liu
SVM

Reminder: Linear SVM
• Discriminant function: g(x) = wᵀx + b
• Maximize the margin: minimize (1/2)‖w‖² subject to y_i (wᵀx_i + b) ≥ 1
• The training points that lie on the margin are the support vectors.
[Figure: margin and support vectors for a linear decision boundary in 2D]
Slide credit: Jinwei Gu
Nonlinear SVMs: The Kernel Trick
• With a feature mapping φ, the discriminant function becomes:

  g(x) = wᵀφ(x) + b = ∑_{i ∈ SV} α_i φ(x_i)ᵀφ(x) + b

• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

  K(x_i, x_j) = φ(x_i)ᵀφ(x_j)

Slide credit: Jinwei Gu
Nonlinear SVMs: The Kernel Trick
• Examples of commonly-used kernel functions:
  – Linear: K(x_i, x_j) = x_iᵀx_j
  – Polynomial: K(x_i, x_j) = (1 + x_iᵀx_j)^p
  – Gaussian (Radial Basis Function, RBF): K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²)
  – Sigmoid: K(x_i, x_j) = tanh(β₀ x_iᵀx_j + β₁)
Slide credit: Jinwei Gu
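The four kernels listed above are one-liners in NumPy. A hedged sketch; the parameter names `p`, `sigma`, `beta0`, `beta1` follow the formulas above and their defaults are arbitrary.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf(xi, xj, sigma=1.0):
    # Gaussian kernel: similarity decays with squared Euclidean distance.
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```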
SVM for image classification
• Train k binary 1-vs-all SVMs (one per class)
• For a test instance, evaluate with each classifier
• Assign the instance to the class with the largest SVM output
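A sketch of the 1-vs-all scheme, using a tiny Pegasos-style subgradient solver as a stand-in for a real SVM package. The solver, its parameters, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=500):
    """Tiny Pegasos-style subgradient solver for a binary linear SVM.
    X: (n, d); y: labels in {-1, +1}. Bias folded in as an extra feature.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in range(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)            # decreasing step size
            margin = y[i] * (w @ Xb[i])
            w *= (1.0 - eta * lam)           # shrink (regularization gradient)
            if margin < 1:                   # hinge loss is active
                w += eta * y[i] * Xb[i]
    return w

def one_vs_all_predict(X, ws):
    """Assign each instance to the class whose SVM output is largest."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    scores = Xb @ np.array(ws).T             # (n, k) raw SVM outputs
    return scores.argmax(axis=1)
```

Training k such classifiers, one per class with +1 for that class and −1 for the rest, and taking the argmax of the raw outputs implements the three bullets above.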
Naïve Bayes

Naïve Bayes Model
[Figure: graphical model with class node C and feature nodes F_1, F_2, ..., F_n]
C – class; F – features

  P(C, F_1, F_2, ..., F_n) = P(C) ∏_i P(F_i | C)

We only specify (parameters):
• P(C) – prior over class labels
• P(F_i | C) – how each feature depends on the class
Example:
Slide from Dan Klein
[Figures: spam/ham example]
• Percentage of documents in the training set labeled as spam/ham.
• In the documents labeled as spam, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
• In the documents labeled as ham, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
Classification
Choose the class that maximizes P(C, W_1, ..., W_n) = P(C) ∏_i P(W_i | C):

  c* = argmax_{c ∈ C} P(c) ∏_i P(W_i | c)
Classification
• In practice:
  – Multiplying lots of small probabilities can result in floating point underflow.
  – Since log(xy) = log(x) + log(y), we can sum logs instead of multiplying probabilities.
  – Since log is a monotonic function, the class with the highest score does not change.
  – So, what we usually compute in practice is:

    c_MAP = argmax_{c ∈ C} [ log P(c) + ∑_i log P(W_i | c) ]
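The log-space rule above is a one-liner per class. A toy sketch; the parameter dictionaries are hypothetical, and a real model would estimate them from training data.

```python
import math

def nb_classify(words, priors, cond):
    """Log-space Naive Bayes: argmax_c [log P(c) + sum_i log P(w_i | c)].

    priors: {class: P(class)}; cond: {class: {word: P(word | class)}}.
    Summing logs avoids underflow from multiplying many small probabilities.
    """
    best, best_score = None, -math.inf
    for c, p in priors.items():
        score = math.log(p) + sum(math.log(cond[c][w]) for w in words)
        if score > best_score:
            best, best_score = c, score
    return best
```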
Naïve Bayes on images
• Same model, with visual words as features:

  P(C, F_1, F_2, ..., F_n) = P(C) ∏_i P(F_i | C)
Naïve Bayes Parameters
Problem: categorize images as one of k object classes using a Naïve Bayes classifier:
• Classes: object categories (face, car, bicycle, etc.)
• Features: images represented as histograms of visual words; the F_i are visual words.
• P(C) is treated as uniform.
• P(F_i | C) is learned from training data (images labeled with their category): the probability of a visual word given an image category.
Multi-class Classifier – Naïve Bayes (1)
• Let V = {v_t}, t = 1, ..., N, be a visual vocabulary, in which each v_t represents a visual word (cluster center) from the feature space.
• A set of labeled images I = {I_i}.
• Denote by C_j, j = 1, ..., M, our classes.
• N(t, i) = number of times v_t occurs in image I_i.
• Compute P(C_j | I_i).
Slide credit: Yun-hsueh Liu
Multi-class Classifier – Naïve Bayes (2)
• Goal: find the maximum probability class C_j.
• To avoid zero probabilities, use Laplace smoothing when estimating the word likelihoods:

  P(v_t | C_j) = (1 + ∑_{I_i ∈ C_j} N(t, i)) / (N + ∑_{s=1}^{N} ∑_{I_i ∈ C_j} N(s, i))

Slide credit: Yun-hsueh Liu
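The Laplace-smoothed estimate can be computed directly from per-image word-count histograms. A minimal sketch assuming NumPy arrays for counts and labels; the function name `train_nb` is illustrative.

```python
import numpy as np

def train_nb(counts, labels, n_classes):
    """Laplace-smoothed estimates of P(word | class) from visual-word counts.

    counts: (n_images, n_words) histogram matrix; labels: (n_images,).
    Returns cond: (n_classes, n_words), each row a valid distribution.
    """
    n_words = counts.shape[1]
    cond = np.empty((n_classes, n_words))
    for c in range(n_classes):
        total = counts[labels == c].sum(axis=0)          # word counts in class c
        cond[c] = (1.0 + total) / (n_words + total.sum())  # add-one smoothing
    return cond
```

With add-one smoothing, even a visual word never observed in a class gets a small nonzero probability, so a single unseen word cannot zero out the whole product.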
Results
[Result figures: classification results on the 7-class dataset and on Dataset 2]
Thoughts?
• Pros?
• Cons?

Related BoF models
pLSA, LDA, …
pLSA
[Figure: pLSA graphical model (document → topic → word)]
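The graphical model above corresponds to the standard pLSA factorization, in which each document is modeled as a mixture over latent topics:

```latex
P(w_i \mid d_j) = \sum_{k=1}^{K} P(w_i \mid z_k)\, P(z_k \mid d_j)
```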
pLSA on images

Discovering objects and their location in images
Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, William T. Freeman
• Documents – images
• Words – visual words (vector-quantized SIFT descriptors)
• Topics – object categories
• Images are modeled as a mixture of topics (objects).
Goals
They investigate three areas:
– (i) Topic discovery, where categories are discovered by pLSA clustering on all available images.
– (ii) Classification of unseen images, where topics corresponding to object categories are learnt on one set of images, and then used to determine the object categories present in another set.
– (iii) Object detection, where the goal is to determine the location and approximate segmentation of the object(s) in each image.
(i) Topic Discovery
[Figure: most likely words for 4 learnt topics (face, motorbike, airplane, car)]

(ii) Image Classification
[Figure: confusion table for unseen test images against pLSA trained on images containing four object categories, but no background images]
[Figure: confusion table for pLSA trained with background images included; performance is not quite as good]

(iii) Topic Segmentation
Visual words are assigned to a topic when P(z_k | w_i, d_j) > 0.8.
[Figures: example topic segmentations]