Adaboost_20110510

Classifier
• Simplest classifier
Adaboost: Agenda
• AdaBoost (Adaptive Boosting; Y. Freund and R. Schapire, ICML 1996)
• Supervised classifier
• Ensemble of classifiers
– Combine many low-accuracy classifiers (weak learners) to create a high-accuracy classifier (strong learner)
Example 1
Adaboost: Example (1/10) through (10/10): a step-by-step worked example (figure-only slides; images omitted)
Adaboost
• Strong classifier = linear combination of T weak classifiers
(1) Design of the weak classifier h_t(x)
(2) Weight for each classifier (hypothesis weight) α_t
(3) Update the weight of each example (example distribution) D_t(i)
• Weak classifier: < 50% error over any distribution
Adaboost: Terminology (1/2)
Adaboost: Terminology (2/2)
Adaboost: Framework (figure slides; images omitted)
Adaboost (recap): next, (1) the design of the weak classifier h_t(x).
Adaboost: Design of weak classifier (1/2)
Adaboost: Design of weak classifier (2/2)
• Select the weak classifier with the smallest weighted error:
  h_t = argmin_{h_j ∈ H} ε_j,  where ε_j = Σ_{i=1}^{m} D_t(i) [y_i ≠ h_j(x_i)]
• Prerequisite: ε_t < 1/2
Adaboost (recap): next, (2) the hypothesis weight α_t.
Adaboost: Hypothesis weight (1/2)
• How to set α_t? Bound the training error. Define
  δ_i = 1 if y_i ≠ H(x_i), 0 otherwise
Since H(x) = sign(f(x)), a training error means y_i f(x_i) ≤ 0, so
  δ_i ≤ exp(−y_i f(x_i))
Therefore
  training error(H) = (1/N) Σ_i δ_i ≤ (1/N) Σ_i exp(−y_i f(x_i)) = Σ_i D_{T+1}(i) Π_t Z_t = Π_t Z_t
using D_{T+1}(i) = exp(−y_i f(x_i)) / (N Π_t Z_t).
Adaboost: Hypothesis weight (2/2)
Minimizing Z_t gives
  α_t = (1/2) ln((1 − ε_t) / ε_t)
Adaboost (recap): next, (3) updating the example distribution.
Adaboost: Update example distribution (Reweighting)
• y · h(x) = +1 (correctly classified): weight decreases
• y · h(x) = −1 (misclassified): weight increases
In this way, AdaBoost "focuses on" the informative or "difficult" examples.
Summary
For t = 1, …, T: pick the lowest-error weak classifier h_t, set α_t = (1/2) ln((1 − ε_t) / ε_t), and reweight the examples.
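The per-round recipe above (pick the lowest-weighted-error weak learner, set α_t = ½ ln((1 − ε_t)/ε_t), reweight, renormalize) can be sketched in code. Below is a minimal pure-Python version with one-dimensional threshold stumps as the weak learners; the stump family, the data, and all function names are illustrative, not from the slides:

```python
import math

def train_adaboost(xs, ys, T):
    """AdaBoost with 1-D threshold stumps as weak learners; labels in {-1, +1}."""
    n = len(xs)
    D = [1.0 / n] * n                              # example distribution D_1
    thresholds = sorted(set(xs)) + [max(xs) + 1]   # candidate stump thresholds
    ensemble = []                                  # list of (alpha_t, h_t)
    for _ in range(T):
        # (1) weak classifier with the smallest weighted error over D_t
        best = None
        for th in thresholds:
            for s in (+1, -1):
                h = lambda x, th=th, s=s: s if x < th else -s
                err = sum(D[i] for i in range(n) if h(xs[i]) != ys[i])
                if best is None or err < best[0]:
                    best = (err, h)
        eps, h = best
        eps = max(eps, 1e-12)                      # guard against a perfect stump
        # (2) hypothesis weight
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # (3) reweight the examples and renormalize (divide by Z_t)
        D = [D[i] * math.exp(-alpha * ys[i] * h(xs[i])) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def strong_classify(ensemble, x):
    """Sign of the weighted vote of the weak classifiers."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

For instance, three rounds fit xs = [0, 1, 2, 3, 4, 5] with labels [+1, +1, −1, −1, +1, +1], a pattern no single stump can separate.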
Example 2
Example (1/5): Original training set: equal weights for all training samples
(Taken from "A Tutorial on Boosting" by Yoav Freund and Rob Schapire)
Example (2/5): Round 1
Example (3/5): Round 2
Example (4/5): Round 3
Example (5/5)
Example 3
(Each round computes α_t = (1/2) ln((1 − ε_t) / ε_t); the per-round figures are not reproduced here.)
Example 4
Adaboost: (figure slides; images omitted)
Application
Discussion
Discrete Adaboost (DiscreteAB) (Friedman's wording)
Discrete Adaboost (DiscreteAB) (Freund and Schapire's wording)
Adaboost with Confidence-Weighted Predictions (RealAB)
Adaboost Variants Proposed by Friedman
• LogitBoost
• GentleBoost
Robust Real-time Object Detection
Keywords: feature extraction, integral image, AdaBoost, cascade
Outline
1. Introduction
2. Features
2.1 Features Extraction
2.2 Integral Image
3. AdaBoost
3.1 Training Process
3.2 Testing Process
4. The Attentional Cascade
5. Experimental Results
6. Conclusion
7. Reference
1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection.
The frontal face detection system achieves:
1. High detection rates
2. Low false positive rates
Three main contributions:
1. Integral image
2. AdaBoost: selecting a small number of important features
3. Cascaded structure
2. Features
• Based on simple feature values.
• Reasons:
1. A knowledge-based system is difficult to learn from a finite quantity of training data.
2. Much faster than an image-based system.
Note: Feature-based: uses extracted features such as eye and nose patterns.
Knowledge-based: uses rules about facial features.
Image-based: uses face segments and predefined face patterns.
[3] A. S. S. Mohamed, Ying Weng, S. S. Ipson, and Jianmin Jiang, "Face Detection based on Skin Color in Image by Neural Networks", ICIAS 2007, pp. 779-783, 2007.
2.1 Feature Extraction (1/2)
• Filter: e.g. a Haar-like filter (the filter type).
• Feature:
a. the coordinates of the pattern (e.g. eye, nose)
b. the size of the pattern
• Feature value:
Feature value = Filter ⊗ Feature (convolution)
2.1 Feature Extraction (2/2)
• Haar-like filter: the sum of the pixels within the white rectangles is subtracted from the sum of the pixels within the grey rectangles.
Figure 1: Haar-like filter types applied within a 24×24 feature window to produce a feature value (image omitted)
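The two-rectangle feature value just described (grey sum minus white sum) can be computed directly on a pixel grid. A small sketch; the image values, the rectangle placement, and the function names are invented for illustration:

```python
def pixel_sum(img, x0, y0, w, h):
    """Sum of pixels in the w-by-h rectangle with top-left corner (x0, y0)."""
    return sum(img[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))

def two_rect_feature(img, x0, y0, w, h):
    """Horizontal two-rectangle Haar feature spanning a 2w-by-h region.

    Per the slides, the white (left) sum is subtracted from the grey
    (right) sum, so the value returned is grey - white.
    """
    white = pixel_sum(img, x0, y0, w, h)
    grey = pixel_sum(img, x0 + w, y0, w, h)
    return grey - white
```

A strong edge between the two halves of the region yields a large magnitude; a flat region yields a value near zero.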
2.2 Integral Image (1/6)
• Integral image:
1. Rectangle features
2. Computed very rapidly
• ii(x, y): sum of the pixels above and to the left of (x, y):
  ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y')
2.2 Integral Image (2/6)
Known: A, B, C, D denote the sums of the pixels within rectangles A, B, C, D.
The integral image value at location 1 is A, at location 2 is A + B, at location 3 is A + C, and at location 4 is A + B + C + D.
Figure 2: Integral image (image omitted)
Q: The sum of the pixels within rectangle D = ?
A: 4 = A + B + C + D
  ⇒ D = 4 − (A + B) − (A + C) + A = 4 − 2 − 3 + 1
The sum within D can be computed as 4 + 1 − (2 + 3).
2.2 Integral Image (3/6)
(Figure: the sum of the pixels within a rectangle, read off the integral image; image omitted)
2.2 Integral Image (4/6)
• The integral image is obtained with the following pair of recurrences:
  s(x, y) = s(x, y − 1) + i(x, y)      (1)
  ii(x, y) = ii(x − 1, y) + s(x, y)    (2)
where i(x, y) is the original image, s(x, y) is the cumulative row sum, and ii(x, y) = Σ_{x' ≤ x, y' ≤ y} i(x', y') is the integral image, with s(x, −1) = 0 and ii(−1, y) = 0.
2.2 Integral Image (5/6)
(Figure: a worked 3×3 example showing the original image i(x, y), the cumulative row sum s(x, y), and the integral image ii(x, y), computed with recurrences (1) and (2); image omitted. The bottom-right entries are s(2, 2) = 8 and ii(2, 2) = 18.)
2.2 Integral Image (6/6)
For the 3×3 example, the cumulative row sums follow s(x, y) = s(x, y − 1) + i(x, y):
  s(0, 0) = 0 + 1 = 1
  s(1, 0) = 0 + 1 = 1
  s(2, 0) = 0 + 4 = 4
  s(0, 1) = s(0, 0) + i(0, 1) = 1 + 2 = 3
  s(1, 1) = s(1, 0) + i(1, 1) = 1 + 2 = 3
  …
  s(2, 2) = s(2, 1) + i(2, 2) = 7 + 1 = 8
and the integral image follows ii(x, y) = ii(x − 1, y) + s(x, y):
  ii(0, 0) = 0 + s(0, 0) = 0 + 1 = 1
  ii(1, 0) = ii(0, 0) + s(1, 0) = 1 + 1 = 2
  ii(2, 0) = ii(1, 0) + s(2, 0) = 2 + 4 = 6
  ii(0, 1) = 0 + s(0, 1) = 0 + 3 = 3
  ii(1, 1) = ii(0, 1) + s(1, 1) = 3 + 3 = 6
  …
  ii(2, 2) = ii(1, 2) + s(2, 2) = 10 + 8 = 18
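Recurrences (1) and (2) and the four-corner rectangle sum translate directly to code. A minimal pure-Python sketch (function names and the choice of img[y][x] indexing are my own):

```python
def integral_image(img):
    """Build ii(x, y) = sum over x' <= x, y' <= y of img, via the recurrences
    s(x, y) = s(x, y-1) + i(x, y) and ii(x, y) = ii(x-1, y) + s(x, y)."""
    h, w = len(img), len(img[0])
    s = [[0] * w for _ in range(h)]   # cumulative row sums
    ii = [[0] * w for _ in range(h)]  # integral image
    for y in range(h):
        for x in range(w):
            s[y][x] = (s[y - 1][x] if y > 0 else 0) + img[y][x]
            ii[y][x] = (ii[y][x - 1] if x > 0 else 0) + s[y][x]
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum over the rectangle with inclusive corners (x0, y0)..(x1, y1),
    using the slides' corner identity D = 4 - 2 - 3 + 1."""
    total = ii[y1][x1]                                  # corner 4
    if y0 > 0:
        total -= ii[y0 - 1][x1]                         # corner above (2)
    if x0 > 0:
        total -= ii[y1][x0 - 1]                         # corner left (3)
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]                     # corner 1, added back
    return total
```

After the single build pass, any rectangle sum costs four array reads, which is what makes the Haar feature values cheap to evaluate at every position and scale.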
3. AdaBoost (1/2)
• AdaBoost (Adaptive Boosting) is a machine learning algorithm.
• AdaBoost works by choosing and combining weak classifiers to form a more accurate strong classifier.
• Weak classifier (a feature value compared with a threshold θ over the image set):
  h(X) = positive if filter(X) < θ, negative otherwise
3. AdaBoost (2/2)
• Subsequent classifiers are tweaked in favor of those instances misclassified by previous classifiers. [4]
• The goal is to minimize the number of features that need to be computed for a new image, while still achieving high identification rates.
[4] AdaBoost - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/AdaBoost
3.1 Training Process - Flowchart
1. Input: a training image set X: l face images (24×24) and m non-face images (24×24).
2. Feature extraction: using Haar-like filters, extract N feature values per image; each feature value serves as a candidate threshold θ.
3. AdaBoost algorithm:
  3-0. Image weight initialization:
    w_i = 1/(2l) for positive images, 1/(2m) for negative images, i = 1, …, n
  3-1. Normalize image weights:
    w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}, i = 1, …, n
  3-2. Error calculation:
    ε_{i,j} = Σ_{k=1}^{n} w_{t,k} |h_{i,j}(x_k) − y_k|, i = 1, …, n, j = 1, …, N
  3-3. Select the weak classifier h_t with the lowest error ε_t.
  3-4. Image weight adjusting:
    w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise
4. Output: a strong classifier built from the T weak classifiers and their weights:
  H_T(x) = Σ_{t=1}^{T} α_t h_t(x)
3.1 Training Process - Input
1. Input: The training data set is denoted X. It holds l positive images and m negative images, n = l + m in total:
  X = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)},  y_i = 1 for positive/face, 0 for negative/non-face
f_j(x_i) denotes the j-th feature value of image x_i.
3.1 Training Process - Feature Extraction (1/2)
2. Feature extraction using Haar-like filters:
Convolving each of the n images with the Haar-like filters yields the candidate feature values (n × N values in total).
Note: f is the number of filter types; each image yields N candidate feature values, with N = 4f.
3.1 Training Process - Feature Extraction (2/2)
Define weak classifiers h_{i,j}:
  h_{i,j}(x_k) = 1 if p_{i,j} f_j(x_k) < p_{i,j} θ_{i,j}, 0 otherwise
where
  θ_{i,j}: the j-th local feature value of image i, used as a candidate threshold
  f_j(x_k): the j-th local feature value of image k
  Polarity: p_{i,j} = 1 if ε_{i,j} ≤ 0.5, −1 otherwise
Ex: 3 face and 1 non-face images, with values extracted by the 5th feature, give candidate thresholds θ_{1,5}, θ_{2,5}, θ_{3,5}, θ_{4,5}, weak classifiers h_{1,5}, h_{2,5}, h_{3,5}, h_{4,5}, and errors ε_{1,5}, ε_{2,5}, ε_{3,5}, ε_{4,5}.
3.1 Training Process - AdaBoost Algorithm (1/4)
3. AdaBoost algorithm:
3-0. Image weight initialization:
  w_i = 1/(2l) for positive/face images, 1/(2m) for negative/non-face images, i = 1, …, n
where l is the number of positive images and m is the number of negative images.
3.1 Training Process - AdaBoost Algorithm (2/4)
Iterate for t = 1, …, T (T: the number of weak classifiers):
3-1. Normalize the image weights over the training data set X:
  w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}, i = 1, …, n
3-2. Error rate calculation for each candidate weak classifier:
  ε_{i,j} = Σ_{k=1}^{n} w_{t,k} |h_{i,j}(x_k) − y_k|, i = 1, …, n, j = 1, …, N
3-3. Select the weak classifier h_t with the lowest error rate ε_t.
3-4. Image weight adjusting:
  w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise,
  where β_t = ε_t / (1 − ε_t)
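Steps 3-0 through 3-4 can be sketched as a training loop over precomputed feature values. Everything below (the data layout, the stump parameterization, the names) is an illustrative reconstruction of the slides' pseudocode, with small per-image feature vectors standing in for the Haar feature values:

```python
import math

def train_adaboost_vj(feats, ys, T):
    """AdaBoost per steps 3-0 .. 3-4; ys in {0, 1}, feats[i][j] = the j-th
    feature value of image i. Returns a list of (alpha_t, (j, theta, p))."""
    n = len(ys)
    l = sum(ys)                                # number of positive images
    m = n - l                                  # number of negative images
    # 3-0: image weight initialization
    w = [1.0 / (2 * l) if y == 1 else 1.0 / (2 * m) for y in ys]
    strong = []
    for _ in range(T):
        total = sum(w)
        w = [wi / total for wi in w]           # 3-1: normalize
        best = None
        for j in range(len(feats[0])):         # 3-2: weighted error of every
            for theta in sorted(set(f[j] for f in feats)):  # candidate stump
                for p in (+1, -1):
                    h = [1 if p * f[j] < p * theta else 0 for f in feats]
                    err = sum(w[i] * abs(h[i] - ys[i]) for i in range(n))
                    if best is None or err < best[0]:
                        best = (err, (j, theta, p), h)
        eps, params, h = best                  # 3-3: lowest-error classifier
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard the beta computation
        beta = eps / (1 - eps)
        strong.append((math.log(1 / beta), params))
        # 3-4: shrink the weights of correctly classified images by beta
        w = [w[i] * beta if h[i] == ys[i] else w[i] for i in range(n)]
    return strong

def classify_vj(strong, feat):
    """H_T(x): compare the weighted vote with half the total alpha weight."""
    vote = sum(a for a, (j, th, p) in strong if p * feat[j] < p * th)
    return 1 if vote >= 0.5 * sum(a for a, _ in strong) else 0
```

Unlike the {−1, +1} formulation earlier in the document, this variant keeps labels in {0, 1} and uses β_t = ε_t/(1 − ε_t) with α_t = log(1/β_t), matching the slides' flowchart.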
3.1 Training Process - Output (1/2)
  H_T(x) = positive/face if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, negative/non-face otherwise
where (1/2) Σ_t α_t is the threshold and α_t = log(1/β_t) is the weak classifier weight.
3.1 Training Process - Output (2/2)
Weak classifier weight: α_t = log((1 − ε_t)/ε_t), equivalently log(1/β_t).
Even for equal changes in ε, the resulting change in classifier weight differs greatly between the range ε ≈ 0.001 to 0.1 and the range ε ≈ 0.1 to 0.5; the log is taken to narrow the spread between the weights.
(Figs. B and C: α as a function of ε; images omitted.)
Sample values:
  ε = 0.001: (1 − ε)/ε = 999,  α ≈ 2.99
  ε = 0.005: (1 − ε)/ε = 199,  α ≈ 2.29
  ε = 0.101: (1 − ε)/ε = 8.9,  α ≈ 0.94
  ε = 0.105: (1 − ε)/ε = 8.52, α ≈ 0.93
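The α values in the table can be reproduced. The slides never state the log base, but values such as 2.99 for ε = 0.001 match a base-10 log, so that is assumed here:

```python
import math

def classifier_weight(eps, log=math.log10):
    """alpha = log((1 - eps) / eps); the table's 2.99 / 2.29 entries suggest
    a base-10 log (the slides do not state the base explicitly)."""
    return log((1 - eps) / eps)
```

For example, classifier_weight(0.005) is about 2.30 while classifier_weight(0.105) is about 0.93: an equal step in ε near zero moves α far more than the same step near 0.5.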
AdaBoost Algorithm - Image Weight Adjusting Example
If ε_{1,3} = 0.167:
  β = ε_{1,3} / (1 − ε_{1,3}) = 0.167 / (1 − 0.167) ≈ 0.2
At t = 1, the six image weights W1 to W6 are all 0.167, and the selected classifier gets images 1-4 and 6 right (O) and image 5 wrong (X).
Update (w_{t+1,i} = w_{t,i} β_t if x_i is classified correctly, w_{t,i} otherwise):
  correct images: 0.167 × 0.2 ≈ 0.033;  misclassified image: 0.167
Normalize: correct images → 0.1, misclassified image → 0.5.
After normalization, the weight of the misclassified image is relatively increased.
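The 0.1 and 0.5 figures in this example are easy to check numerically:

```python
eps = 0.167
beta = eps / (1 - eps)                    # about 0.2
w = [0.167] * 6                           # W1..W6 at t = 1
correct = [True, True, True, True, False, True]   # image 5 was misclassified
w = [wi * beta if ok else wi for wi, ok in zip(w, correct)]
total = sum(w)
w = [wi / total for wi in w]              # normalize
# correctly classified images land near 0.1, the misclassified one near 0.5
```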
3.2 Testing Process - Flowchart
1. Extract sub-windows: downsample the test image (e.g. 360×420 → 228×336 → … → 24×28), collecting roughly 100,000 24×24 sub-windows.
2. Strong classifier detection: load the T weak classifiers h_1, …, h_T; for each 24×24 sub-window,
  H_T(x) = positive if Σ_{t=1}^{T} α_t h_t(x) ≥ (1/2) Σ_{t=1}^{T} α_t, negative otherwise
3. Merge results: over all sub-windows, average the coordinates of the accepted windows to produce the result image; rejected windows are discarded.
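Step 1, scanning a fixed 24×24 window over a pyramid of downsampled image sizes, might look like the sketch below. The scale factor and stride are assumptions; the slides give only the window size and the rough sub-window count:

```python
def pyramid_subwindows(width, height, win=24, scale=1.25, stride=1):
    """Yield (level, x, y) for every win-by-win sub-window position over a
    pyramid of progressively downsampled image sizes."""
    level = 0
    while width >= win and height >= win:
        for y in range(0, height - win + 1, stride):
            for x in range(0, width - win + 1, stride):
                yield level, x, y
        width = int(width / scale)      # next, coarser pyramid level
        height = int(height / scale)
        level += 1
```

With dense strides on a 360×420 image this enumerates on the order of the 100,000 sub-windows the slides mention, which is why per-window cost (and the cascade of the next section) matters so much.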
4. The Attentional Cascade (1/4)
• Advantage: reduces testing computation time.
• Method: a cascade of stages.
• Idea: reject as many negatives as possible at the earliest stages; more complex classifiers are used in later stages.
• The detection process is that of a degenerate decision tree, a so-called "cascade".
Figure 4: Cascade structure (stages branching into true/false positives and true/false negatives; image omitted)
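The early-reject idea fits in a few lines. The stage representation here, a score function plus a threshold per stage, is a hypothetical simplification for illustration:

```python
def cascade_classify(stages, x):
    """Run sub-window x through the cascade; reject at the first stage whose
    strong-classifier score falls below that stage's threshold."""
    for score_fn, threshold in stages:
        if score_fn(x) < threshold:
            return False        # rejected: no later stage is evaluated
    return True                 # survived every stage: accepted
```

Because most sub-windows fail an early, cheap stage, the expensive later stages run only on the rare promising windows.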
4. The Attentional Cascade (2/4)
(Figure: per-stage thresholds θ for Stage 1, Stage 2, and Stage 3; image omitted)
4. The Attentional Cascade (3/4)
• True positive rate (detection rate) = True Positives / (True Positives + False Negatives)
• False positive rate (FP) = False Positives / (False Positives + True Negatives)
• True negative rate = True Negatives / (False Positives + True Negatives)
• False negative rate (FN) = False Negatives / (True Positives + False Negatives)
FP + FN ⇒ error rate
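The four rates follow directly from the confusion counts:

```python
def rates(tp, fp, tn, fn):
    """Detection, false positive, true negative, and false negative rates,
    as defined above, from raw confusion-matrix counts."""
    return {
        "detection_rate": tp / (tp + fn),        # true positive rate
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
    }
```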
4. The Attentional Cascade (4/4)
• Training a cascade of classifiers involves a tradeoff between two goals:
1. Higher detection rates
2. Lower false positive rates
• More features achieve higher detection rates and lower false positive rates, but such classifiers require more time to compute.
• Define an optimization framework over:
1. the number of stages
2. the number of features in each stage
3. the strong classifier threshold in each stage
4. The Attentional Cascade - Algorithm (1/3): overview (image omitted)
4. The Attentional Cascade - Algorithm (2/3)
• f: maximum acceptable false positive rate per stage (the largest fraction of negatives that may be classified as positive).
• d: minimum acceptable detection rate per stage (the smallest fraction of positives that must be detected).
• F_target: target overall false positive rate (the final allowable false positive rate).
• Initial values:
  P: total positive images; N: total negative images
  f = 0.5, d = 0.9999, F_target = 10^−6
  F_0 = 1.0, D_0 = 1.0
  Threshold = 0.5 (the AdaBoost threshold), Threshold_EPS = 10^−4 (the threshold adjustment step)
  i = 0 (the number of cascade stages)
4. The Attentional Cascade - Algorithm (3/3)
Notation: f: maximum acceptable false positive rate; d: minimum acceptable detection rate; F_target: target overall false positive rate; P, N: the positive and negative image sets; i: the number of cascade stages; F_i: false positive rate at the i-th stage; n_i: the number of features at the i-th stage; D_i: detection rate at the i-th stage.
Iterate:
While (F_i > F_target)
{
  i = i + 1                          // add a stage
  n_i = 0, F_i = F_{i−1}
  While (F_i > f × F_{i−1})
  {
    n_i = n_i + 1                    // add a feature
    (F_i, D_i) = AdaBoost(P, N, n_i) // get new D_i, F_i
    While (D_i < d × D_{i−1})
    {
      Threshold = Threshold − Threshold_EPS
      D_i = re-compute the current strong classifier's detection rate
            with the lowered Threshold (this also affects F_i)
    }                                // lowering the Threshold raises D_i and F_i
  }
  If (F_i > F_target)
    N = the false detections of the current cascaded detector on the
        negative images (so the set shrinks to about F_i × N)
}
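The nested loops above can be rendered as a Python skeleton. `run_adaboost` is a stand-in for the AdaBoost(P, N, n_i) call, and the per-stage threshold lowering against `d` is only noted in a comment, so this is a structural sketch rather than a working trainer:

```python
def train_cascade(run_adaboost, f=0.5, d=0.9999, f_target=1e-6):
    """Cascade-training skeleton per the pseudocode: add stages until the
    overall false positive rate drops below f_target; within each stage,
    add features until the stage's FP rate is at most f times the previous
    stage's. Threshold lowering to reach d times the previous detection
    rate is omitted here (it belongs inside the inner loop)."""
    F_prev = 1.0        # F_0
    i = 0               # number of stages so far
    stages = []
    while F_prev > f_target:
        i += 1                               # add a stage
        n_i, F_i = 0, F_prev
        while F_i > f * F_prev:
            n_i += 1                         # add a feature
            F_i, _d_i = run_adaboost(n_i)    # new stage rates, n_i features
            # a full implementation lowers the stage threshold here until
            # the stage detection rate reaches d * (previous stage's D)
        stages.append((i, n_i, F_i))
        F_prev = F_i
    return stages
```

Since each stage multiplies the overall false positive rate by at most f = 0.5, roughly 20 stages suffice to reach a 10^−6 overall rate.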
5. Experimental Results (1/3)
• Face training set:
  • Extracted from the World Wide Web.
  • Uses face and non-face training images.
  • Consists of 4916 hand-labeled faces.
  • Scaled and aligned to a base resolution of 24 by 24 pixels.
  • The non-face sub-windows come from 9544 images which were manually inspected and found to contain no faces.
Fig. 5: Examples of the frontal upright face images used for training (image omitted)
5. Experimental Results (2/3)
• In the cascade training:
  • 4916 training faces.
  • 10,000 non-face sub-windows.
  • The AdaBoost training procedure.
• Evaluated on the MIT+CMU test set:
  • On average, 10 features per sub-window are evaluated.
  • This is possible because the large majority of sub-windows are rejected by the first or second stage of the cascade.
  • On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds.
5. Experimental Results (3/3)
Fig. 6: The ROC curve (receiver operating characteristic) is created by adjusting the threshold θ of the final stage classifier from −∞ to +∞ (image omitted).
Reference
[1] P. Viola and M. Jones, "Rapid Object Detection Using A Boosted Cascade of Simple Features", Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[2] P. Viola and M. Jones, "Robust Real-time Object Detection", International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] A. S. S. Mohamed, Ying Weng, S. S. Ipson, and Jianmin Jiang, "Face Detection based on Skin Color in Image by Neural Networks", ICIAS 2007, pp. 779-783, 2007.
[4] AdaBoost - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/AdaBoost