### slides - Yisong Yue

An Introduction to Ensemble Methods
Boosting, Bagging, Random Forests and More
Yisong Yue

Supervised Learning
• Goal: learn predictor h(x)
  – High accuracy (low error)
  – Using training data {(x1,y1),…,(xn,yn)}

Example linear predictor: h(x) = sign(w^T x − b), with w = [1, 1]^T and b = −10.
| Person | Age | Male? | Height > 55” |
|--------|-----|-------|--------------|
| Alice  | 14  | 0     | 1            |
| Bob    | 10  | 1     | 1            |
| Carol  | 13  | 0     | 1            |
| Dave   | 8   | 1     | 0            |
| Erin   | 11  | 0     | 0            |
| Frank  | 9   | 1     | 1            |
| Gena   | 8   | 0     | 0            |

x = [age, 1[gender = male]]^T
y = 1 if height > 55”, 0 if height ≤ 55”
[Figure: decision trees fit to the table above. The root tests Male?; the Yes branch then tests Age > 9? (Yes → 1, No → 0) and the No branch tests Age > 10? (Yes → 1, No → 0).]
Splitting Criterion: entropy reduction (info gain)
Complexity: number of leaves, size of leaves
Outline
• Ensemble methods that minimize variance
– Bagging
– Random Forests
• Ensemble methods that minimize bias
– Boosting
– Ensemble Selection
Generalization Error
• “True” distribution: P(x,y)
  – Unknown to us
• Train: h(x) ≈ y
  – Using training data S = {(x1,y1),…,(xn,yn)}
  – Sampled from P(x,y)
• Generalization Error:
  – L(h) = E(x,y)~P(x,y)[ f(h(x),y) ]
  – E.g., f(a,b) = (a − b)^2
[Figure: a large sample drawn from the “true” distribution P(x,y) (James, Jessica, Alice, Amy, Bob, Xavier, Cathy, Carol, Eugene, Rafael, Dave, Peter, Henry, Erin, Rose, Iain, Paulo, Margaret, Frank, Jill, Leon, Sarah, Gena, Patrick, …), with the 7-person training set (Alice, …, Gena) shown alongside; its labels are marked y and the model’s outputs marked h(x).]

Generalization Error:
L(h) = E(x,y)~P(x,y)[ f(h(x),y) ]
• Treat h(x|S) as a random function
  – Depends on training data S
• L = ES[ E(x,y)~P(x,y)[ f(h(x|S),y) ] ]
  – Expected generalization error
  – Over the randomness of S

• Squared loss: f(a,b) = (a − b)^2
• Consider one data point (x,y)
• Notation:
  – Z = h(x|S) − y
  – ž = ES[Z]
  – Z − ž = h(x|S) − ES[h(x|S)]

ES[(Z − ž)^2] = ES[Z^2 − 2Zž + ž^2]
             = ES[Z^2] − 2·ES[Z]·ž + ž^2
             = ES[Z^2] − ž^2

Expected Error:
ES[f(h(x|S),y)] = ES[Z^2] = ES[(Z − ž)^2] + ž^2
where ES[(Z − ž)^2] is the Variance and ž^2 is the (squared) Bias.

Bias/Variance for all (x,y) is the expectation over P(x,y).
Can also incorporate measurement noise.
(A similar flavor of analysis applies to other loss functions.)
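The decomposition can be checked numerically. In this minimal sketch the trained predictor h(x|S) at one test point is simulated directly as a random quantity (an assumption for illustration; its randomness stands in for the randomness of the training set S):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed test point (x, y); h(x|S) is a deliberately biased estimator
# of y = 1.0 (mean 1.3, std 0.5) across many simulated training sets S.
y = 1.0
n_datasets = 100_000
h = 1.3 + 0.5 * rng.standard_normal(n_datasets)   # h(x|S) across draws of S

Z = h - y                              # Z = h(x|S) - y
z_bar = Z.mean()                       # ž = ES[Z]
expected_error = (Z ** 2).mean()       # ES[(h(x|S) - y)^2]
variance = ((Z - z_bar) ** 2).mean()   # ES[(Z - ž)^2]
bias_sq = z_bar ** 2                   # ž^2

# The identity holds exactly (up to float rounding):
print(expected_error, variance + bias_sq)
```

Note the identity is algebraic, so it holds for the sample moments themselves, not just in the limit.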
Example

[Figure: a toy 1-D regression problem (y vs. x on [0, 100]). Several fits h(x|S), each trained on a different sample S, are overlaid; the spread of the fits illustrates the Variance, and the offset of their average from the truth illustrates the Bias.]

ES[(h(x|S) − y)^2] = ES[(Z − ž)^2] + ž^2
(Expected Error = Variance + Bias^2, where Z = h(x|S) − y and ž = ES[Z])
Outline
• Ensemble methods that minimize variance
– Bagging
– Random Forests
• Ensemble methods that minimize bias
– Boosting
– Ensemble Selection
Bagging
• Goal: reduce variance

• Ideal setting: many training sets S’
  – Each S’ sampled independently from P(x,y)
  – Train a model using each S’
  – Average predictions

[Figure: several independent training sets S’ drawn from the population P(x,y), with L(h) = E(x,y)~P(x,y)[ f(h(x),y) ].]

ES[(h(x|S) − y)^2] = ES[(Z − ž)^2] + ž^2
(Expected Error = Variance + Bias^2, where Z = h(x|S) − y and ž = ES[Z])

Variance reduces linearly in the number of models averaged; bias is unchanged.

“Bagging Predictors” [Leo Breiman, 1994]
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
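A quick simulation of why averaging independently trained models shrinks variance linearly. The "models" here are stand-ins that predict the truth plus independent zero-mean noise (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Each independently trained model's prediction at a fixed x is the true
# value plus zero-mean noise with variance sigma^2. Averaging k such
# independent models divides the variance by k; the bias is untouched.
sigma2 = 4.0
k = 16
n_trials = 200_000

single = rng.normal(0.0, np.sqrt(sigma2), size=n_trials)
averaged = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, k)).mean(axis=1)

var_single = single.var()   # ~ sigma^2
var_avg = averaged.var()    # ~ sigma^2 / k
print(var_single, var_avg)
```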
Bagging
• Goal: reduce variance

• In practice: resample S’ with replacement from S
  – Train a model using each S’
  – Average predictions

[Figure: bootstrap samples S’ drawn with replacement from the training set S.]

ES[(h(x|S) − y)^2] = ES[(Z − ž)^2] + ž^2
(Expected Error = Variance + Bias^2, where Z = h(x|S) − y and ž = ES[Z])

Variance reduces sub-linearly (because the S’ are correlated); bias often increases slightly.

“Bagging Predictors” [Leo Breiman, 1994]
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

Bagging = Bootstrap Aggregation
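A minimal sketch of the practical procedure: bootstrap resample, fit, average. The base model (a regression stump) and the noisy-step dataset are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D regression data: a noisy step function.
x = np.linspace(0, 1, 200)
y = (x > 0.5).astype(float) + 0.3 * rng.standard_normal(x.size)

def fit_stump(xs, ys):
    """Best single-threshold regression stump: (threshold, left mean, right mean)."""
    best = None
    for t in np.unique(xs):
        left, right = ys[xs <= t], ys[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, xs):
    t, lo, hi = stump
    return np.where(xs <= t, lo, hi)

# Bagging: B bootstrap resamples of S (with replacement), one stump each,
# predictions averaged.
B = 50
stumps = []
for _ in range(B):
    idx = rng.integers(0, x.size, size=x.size)   # sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

bagged_pred = np.mean([predict_stump(s, x) for s in stumps], axis=0)
single_pred = predict_stump(fit_stump(x, y), x)

true = (x > 0.5).astype(float)
mse_single = ((single_pred - true) ** 2).mean()
mse_bagged = ((bagged_pred - true) ** 2).mean()
print(mse_single, mse_bagged)
```

The bagged prediction is a smooth average of 50 slightly different thresholds, whereas the single stump commits to one hard threshold fit to the noise.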
[Figure: bias/variance of a single DT vs. a bagged DT (arrow indicates “better”): bagging reduces variance while leaving bias nearly unchanged.]
“An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants”
Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)
Random Forests
• Goal: reduce variance
  – Bagging can only do so much
  – Resampling training data asymptotes
• Random Forests: sample data & features!
  – Sample S’
  – Train DT: at each node, sample a subset of features (typically sqrt of the total number)
  – Average predictions
• Sampling features further de-correlates the trees

“Random Forests – Random Features” [Leo Breiman, 1997]
http://oz.berkeley.edu/~breiman/random-forests.pdf
[Figure: average performance over many datasets (arrow indicates “better”); Random Forests perform the best.]
“An Empirical Evaluation of Supervised Learning in High Dimensions”
Caruana, Karampatziakis & Yessenalina, ICML 2008
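The per-node feature sampling that distinguishes Random Forests from plain bagging can be sketched in isolation. Function and variable names here are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(3)

# At every tree node, only a random subset of sqrt(d) features is
# considered for the split.
d = 16                # total number of features
m = int(np.sqrt(d))   # features considered per node (the sqrt rule)

def candidate_features(n_features, n_sample, rng):
    """Sample the feature subset one node is allowed to split on."""
    return rng.choice(n_features, size=n_sample, replace=False)

# Simulate the subsets seen by 1000 nodes across a forest.
subsets = [candidate_features(d, m, rng) for _ in range(1000)]

# Any single node sees only m of the d features, which de-correlates the
# trees beyond what resampling the data alone achieves; across many nodes
# every feature still gets considered somewhere.
seen = {int(f) for s in subsets for f in s}
print(len(subsets[0]), len(seen))
```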
Structured Random Forests
• DTs normally train on binary labels y = 0/1
  – Must define information gain for structured labels
• Edge detection:
  – E.g., the structured label is a 16x16 image patch
  – Map structured labels to another space
    • where entropy is well defined
“Structured Random Forests for Fast Edge Detection”
Dollár & Zitnick, ICCV 2013
[Figure: input image, ground-truth edges, and RF prediction side by side.]
Comparable accuracy vs. the state of the art; orders of magnitude faster.
Outline
• Ensemble methods that minimize variance
– Bagging
– Random Forests
• Ensemble methods that minimize bias
– Boosting
– Ensemble Selection
Gradient Boosting
h(x) = h1(x) + h2(x) + … + hn(x)
  h1(x) trained on S’ = {(x, y)}
  h2(x) trained on S’ = {(x, y − h1(x))}
  …
  hn(x) trained on S’ = {(x, y − h1(x) − … − hn−1(x))}
http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
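The residual-fitting recursion above can be sketched directly, with a regression stump as the (assumed) base learner and a toy sine target:

```python
import numpy as np

# Each new model hk is fit to the current residuals y - h(x), and the
# ensemble is their plain sum. A minimal sketch, not a production
# implementation.
x = np.linspace(-1, 1, 200)
y = np.sin(3 * x)

def fit_stump(xs, ys):
    """Best single-threshold regression stump, returned as a callable."""
    best = None
    for t in xs:
        left, right = ys[xs <= t], ys[xs > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

n_rounds = 30
models = []
pred = np.zeros_like(y)
errors = []
for _ in range(n_rounds):
    h_k = fit_stump(x, y - pred)   # fit the current residuals
    models.append(h_k)
    pred = pred + h_k(x)
    errors.append(((y - pred) ** 2).mean())

# Training error decreases as weak models are added.
print(errors[0], errors[-1])
```

Each stump only has to fix what the sum of the previous stumps still gets wrong, so a pile of very weak models ends up fitting a smooth curve.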
• Learn w so that h(x) = w^T x
• Coordinate descent
  – Init w = 0
  – Choose dimension with highest gain
    • Set that component of w
  – Repeat

Example with a 10-dimensional w, one coordinate set per round:
w = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]^T
→ w = [0, 0, +3, 0, 0, 0, 0, 0, 0, 0]^T
→ w = [0, 0, +3, 0, 0, 0, 0, −1.5, 0, 0]^T
→ w = [+2.1, 0, +3, 0, 0, 0, 0, −1.5, 0, 0]^T
→ w = [+2.1, 0, +3, 0, 0, −0.9, 0, −1.5, 0, 0]^T
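The same procedure is easy to write down for squared loss. The sparse target below is illustrative; its nonzero values echo the slide's example:

```python
import numpy as np

rng = np.random.default_rng(5)

# Coordinate descent for h(x) = w^T x: start from w = 0, pick the
# coordinate whose optimal update gives the highest gain (largest loss
# reduction), set that component, repeat.
n, d = 100, 10
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[[0, 2, 5, 7]] = [2.1, 3.0, -0.9, -1.5]   # sparse target, as in the slide
y = X @ w_true

w = np.zeros(d)
for _ in range(300):
    r = y - X @ w                                # current residual
    gains, steps = np.zeros(d), np.zeros(d)
    for j in range(d):
        xj = X[:, j]
        steps[j] = xj @ r / (xj @ xj)            # optimal update for coordinate j
        gains[j] = (xj @ r) ** 2 / (xj @ xj)     # proportional to its loss reduction
    j = int(np.argmax(gains))
    w[j] += steps[j]

final_mse = ((y - X @ w) ** 2).mean()
print(np.round(w, 2), final_mse)
```

Most coordinates stay at 0 because updating them yields little gain; the procedure recovers the sparse w.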
Boosting = coordinate descent in “function space” (the space of all possible DTs), with the weights restricted to 0, 1, 2, …
h(x) = h1(x) + h2(x) + … + hn(x)
[Figure: each round adds one more tree chosen from the space of all possible DTs.]
AdaBoost
h(x) = a1·h1(x) + a2·h2(x) + … + an·hn(x)
  h1(x) trained on S’ = {(x, y, u1)}
  h2(x) trained on S’ = {(x, y, u2)}
  h3(x) trained on S’ = {(x, y, u3)}
  …
u – weighting on data points
a – weights of the linear combination
Stop when validation performance plateaus (will discuss later)
[AdaBoost loop: start from the initial distribution over the data; train a model; compute the error of the model; compute the coefficient of the model; update the distribution; repeat; the final prediction is the weighted average.]
Theorem: training error drops exponentially fast.

Boosting often uses weak models, e.g. “shallow” decision trees, which have lower variance.
[Figure: bias/variance of a single DT vs. Bagging vs. Boosting (arrow indicates “better”).]
“An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants”
Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)
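The loop can be sketched as follows; the toy dataset and the threshold-stump weak learner are illustrative choices, not from the slides:

```python
import numpy as np

# AdaBoost: keep a distribution u over the data, train a weak model on
# the weighted data, compute its weighted error and coefficient a,
# reweight the data, and combine with a weighted sum.
x = np.array([-3., -2., -1., 1., 2., 3.])
y = np.array([1, 1, -1, -1, 1, 1])        # labels in {-1,+1}; no single stump fits this

def stump_pred(t, s, xs):
    return s * np.sign(xs - t)            # thresholds are offset from the data, so never 0

def best_stump(xs, ys, u):
    """Stump (threshold t, sign s) with minimum u-weighted error."""
    best = None
    for t in np.concatenate([xs - 0.5, xs + 0.5]):
        for s in (1, -1):
            err = u[stump_pred(t, s, xs) != ys].sum()
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

u = np.ones(len(x)) / len(x)              # initial (uniform) distribution
stumps, alphas = [], []
for _ in range(3):                        # three rounds suffice on this toy set
    err, t, s = best_stump(x, y, u)
    a = 0.5 * np.log((1 - err) / err)     # coefficient of this model
    u = u * np.exp(-a * y * stump_pred(t, s, x))   # update distribution
    u = u / u.sum()
    stumps.append((t, s))
    alphas.append(a)

# Final model: weighted average of the weak models
F = sum(a * stump_pred(t, s, x) for (t, s), a in zip(stumps, alphas))
train_acc = float((np.sign(F) == y).mean())
print(train_acc)   # → 1.0
```

Note how the distribution update concentrates weight on the points the last stump got wrong, forcing the next stump to attend to them.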
Ensemble Selection

[Figure: the population P(x,y); the available data S is split into a training part S’ and a validation part V’ = {Alice, Bob, Carol, Dave, Erin, Frank, Gena}, each scored with L(h) = E(x,y)~P(x,y)[ f(h(x),y) ].]

H = {2000 models trained using S’}

Maintain the ensemble model as a combination of models from H:
h(x) = h1(x) + h2(x) + … + hn(x) + hn+1(x)   (denote the newly added model hn+1)
• Add the model from H that maximizes performance on V’
• Repeat

Models are trained on S’.
The ensemble is built to optimize V’.

“Ensemble Selection from Libraries of Models”
Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004

Ensemble Selection often outperforms more homogeneous sets of models.
Reduces overfitting by building the model using a validation set.

Ensemble Selection won KDD Cup 2009
http://www.niculescu-mizil.org/papers/KDDCup09.pdf
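The greedy loop can be sketched with a shrunken, synthetic library: 20 noisy predictors stand in for the slide's 2000 trained models, and each "model" is just its vector of predictions on V', which is all the procedure needs. All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

n_val = 200
y_val = rng.normal(size=n_val)                        # validation targets
library = [y_val + rng.normal(scale=1.0, size=n_val)  # truth + independent noise
           for _ in range(20)]

ensemble = []                 # chosen model indices (selection with replacement)
ens_pred = np.zeros(n_val)
losses = []
for _ in range(15):
    best_j, best_loss = None, None
    for j, h in enumerate(library):
        # Loss on V' if model j were added (ensemble = plain average)
        cand = (ens_pred * len(ensemble) + h) / (len(ensemble) + 1)
        loss = ((cand - y_val) ** 2).mean()
        if best_loss is None or loss < best_loss:
            best_j, best_loss = j, loss
    ensemble.append(best_j)
    ens_pred = (ens_pred * (len(ensemble) - 1) + library[best_j]) / len(ensemble)
    losses.append(best_loss)

print(losses[0], losses[-1])
```

Selection is done with replacement, so a strong model can be added repeatedly, which effectively up-weights it in the average.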
| Method | Minimize Bias? | Minimize Variance? |
|--------|----------------|--------------------|
| Bagging | Complex model class (deep DTs). Does not work for simple models. | Bootstrap aggregation (resampling training data) |
| Random Forests | Complex model class (deep DTs). Only for decision trees. | Bootstrap aggregation + bootstrapping features |
| Boosting | Optimize training performance. | Simple model class (shallow DTs) |
| Ensemble Selection | Optimize validation performance. Determines which models to include. | Optimize validation performance. Pre-specified dictionary of models learned on the training set. |
…and many other ensemble methods as well.
• State-of-the-art prediction performance
– Won Netflix Challenge
– Won numerous KDD Cups
– Industry standard
“An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants” Bauer & Kohavi,
Machine Learning, 36, 105–139 (1999)
“Bagging Predictors” Leo Breiman, Tech Report #421, UC Berkeley, 1994,
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
“An Empirical Comparison of Supervised Learning Algorithms” Caruana & Niculescu-Mizil, ICML 2006
“An Empirical Evaluation of Supervised Learning in High Dimensions” Caruana, Karampatziakis & Yessenalina,
ICML 2008
“Ensemble Methods in Machine Learning” Thomas Dietterich, Multiple Classifier Systems, 2000
“Ensemble Selection from Libraries of Models” Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
“Getting the Most Out of Ensemble Selection” Caruana, Munson, & Niculescu-Mizil, ICDM 2006