### Lecture 7: Tree and Forest

**Ensemble Learning (2): Tree and Forest**

Outline:
- Classification and Regression Tree (CART)
- Bagging of trees
- Random Forest
#### Motivation

To estimate a complex response surface or class boundary.

Reminder:
- SVM achieves this goal by the kernel trick.
- Boosting achieves this goal by combining weak classifiers.
- A classification tree achieves this goal by generating a complex surface/boundary within a single, very flexible model.
- Bagged classification trees and random forests achieve this goal in a manner similar to boosting.
#### Classification Tree

(Figure: an example classification tree.)
#### Classification Tree: Issues

- How many splits should be allowed at a node?
- Which property should be queried at a node?
- When should we stop splitting a node and declare it a "leaf"?
- How should the size of the tree be adjusted? Tree size <-> model complexity: too large a tree overfits; too small a tree fails to capture the underlying structure.
- How should the classification decision be assigned at each leaf?
- How should missing data be handled?
#### Classification Tree

(Figure: a binary split.)
#### Classification Tree: Node Impurity

To decide which split criterion to use, we need a measurement of node impurity. Let $P(\omega_j)$ denote the fraction of samples at node $N$ belonging to class $\omega_j$.

Entropy:
$$i(N) = -\sum_j P(\omega_j)\,\log_2 P(\omega_j)$$

Misclassification:
$$i(N) = 1 - \max_j P(\omega_j)$$

Gini impurity:
$$i(N) = \sum_{i \neq j} P(\omega_i)\,P(\omega_j) = 1 - \sum_j P(\omega_j)^2$$

(The Gini impurity is the expected error rate if the class label is randomly permuted.)
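As a concrete illustration, the three impurity measures can be sketched in a few lines of Python (the function names are illustrative; each takes the vector of class probabilities $P(\omega_j)$ at a node):

```python
import numpy as np

def entropy(p):
    """Entropy impurity: i(N) = -sum_j P(w_j) * log2 P(w_j)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

def misclassification(p):
    """Misclassification impurity: i(N) = 1 - max_j P(w_j)."""
    return 1.0 - np.max(p)

def gini(p):
    """Gini impurity: i(N) = 1 - sum_j P(w_j)^2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)
```

All three vanish at a pure node and are maximized by the uniform distribution; e.g. for a two-class node with $P = (0.5, 0.5)$, the entropy is 1 bit and the Gini impurity is 0.5.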
#### Classification Tree: Growing the Tree

Greedy search: at every step, choose the query that decreases the impurity as much as possible:
$$\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R),$$
where $P_L$ is the fraction of samples at $N$ sent to the left child $N_L$.

For a real-valued predictor, one may use gradient descent to find the optimal cut value.
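A minimal sketch of this greedy search for a single real-valued predictor, using the Gini impurity (`best_split` is an illustrative helper; an exhaustive scan over midpoints stands in for gradient descent):

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Scan all midpoints of a real-valued predictor x and return the cut
    maximizing di(N) = i(N) - P_L * i(N_L) - (1 - P_L) * i(N_R)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    parent = gini(y)
    best_gain, best_cut = 0.0, None
    for k in range(1, len(x)):
        if x[k] == x[k - 1]:
            continue                  # identical values: no threshold here
        p_left = k / len(x)
        gain = parent - p_left * gini(y[:k]) - (1 - p_left) * gini(y[k:])
        if gain > best_gain:
            best_gain, best_cut = gain, (x[k - 1] + x[k]) / 2
    return best_cut, best_gain
```

For `x = [1, 2, 3, 4]` with labels `[0, 0, 1, 1]`, the cut 2.5 separates the classes perfectly, so the entire parent impurity of 0.5 is recovered as gain.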
When to stop?
- Stop when the reduction in impurity is smaller than a threshold.
- Stop when the leaf node is too small.
- Stop when a global criterion is met, e.g. minimize
  $$\alpha \cdot \text{Size} + \sum_{\text{leaf nodes}} i(N).$$
- Hypothesis testing.
- Cross-validation.
- Fully grow the tree and then prune.
#### Classification Tree: Pruning

Pruning the tree:
- Merge leaves when the loss of impurity is not severe.
- Cost-complexity pruning allows elimination of an entire branch in a single step.

With a loss function $\lambda_{ij}$ (the loss of deciding class $\omega_j$ when the truth is $\omega_i$), the Gini impurity generalizes to
$$i(N) = \sum_{ij} \lambda_{ij}\, P(\omega_i)\, P(\omega_j).$$

Assigning a class label to a leaf:
- No prior: take the class with the highest frequency at the node.
- With a prior: weigh the frequencies by the prior.
- With a loss function: …
Always minimize the Bayes error.
#### Classification Tree: Other Considerations

- Choice of features.
- Multivariate trees.

(Figure: example of error rate vs. tree size.)
#### Regression Tree

(Figure: a complex surface estimated by CART.)

Model the response as region-wise constant:
$$f(x) = \sum_m c_m\, \mathbf{1}(x \in R_m).$$

Because the full search over partitions is computationally difficult, use a greedy algorithm. Consider splitting variable $j$ at split point $s$, and seek the $j$ and $s$ that minimize the RSS:
$$\min_{j,\,s}\ \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big].$$

Simply scan through all $j$ and the range of $x_j$ to find $s$. After the partition, treat each region as a separate dataset and iterate.
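The greedy scan over $(j, s)$ can be sketched as follows (`best_regression_split` is an illustrative name; each candidate region is scored by the RSS around its own mean, which is the optimal constant for that region):

```python
import numpy as np

def rss(y):
    """Residual sum of squares around the region mean (the optimal c)."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_regression_split(X, y):
    """Scan every variable j and every split point s of x_j; return the
    (j, s) minimizing RSS(R1) + RSS(R2)."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        order = np.argsort(X[:, j])
        xj, yj = X[order, j], y[order]
        for k in range(1, len(xj)):
            if xj[k] == xj[k - 1]:
                continue              # identical values: no threshold here
            total = rss(yj[:k]) + rss(yj[k:])
            if total < best_rss:
                best_j, best_s, best_rss = j, (xj[k - 1] + xj[k]) / 2, total
    return best_j, best_s, best_rss
```

After this split, the same routine is applied recursively to the data falling in each of the two regions.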
#### Regression Tree: Tree Size

Tree size <-> model complexity: too large a tree overfits; too small a tree fails to capture the underlying structure.

How to tune?
- Grow the tree until the RSS reduction becomes too small. This is too greedy and "short-sighted": a seemingly useless split may enable a very good split further down.
- Grow until the leaves are too small, then prune the tree using cost-complexity pruning.
#### Bootstrapping

Directly assess uncertainty from the training data.

Basic idea: assuming the empirical distribution of the data approaches the true underlying density, resampling from it (with replacement) gives us an idea of the uncertainty caused by sampling.
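A toy sketch of the idea (the data, sample size, and number of resamples below are arbitrary choices): the bootstrap standard error of the sample mean should come out close to the classical $s/\sqrt{N}$ formula.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # toy training sample

# Resample from the data with replacement, recompute the statistic,
# and read the sampling uncertainty off the resampled values.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
])
se_boot = boot_means.std()                        # bootstrap standard error
se_classical = data.std() / np.sqrt(len(data))    # s / sqrt(N)
```

The bootstrap shines precisely when no such closed-form error formula exists for the statistic of interest.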
#### Bagging

"Bootstrap aggregation":
- Resample the training dataset.
- Build a prediction model on each resampled dataset.
- Average the predictions:
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$

This is a Monte Carlo estimate of $E_{\hat{P}}\, \hat{f}^{*}(x)$, where $\hat{P}$ is the empirical distribution putting equal probability $1/N$ on each of the data points.

Bagging only differs from the original estimate $\hat{f}(x)$ when $f(\cdot)$ is a non-linear or adaptive function of the data. When $f(\cdot)$ is a linear function of the data, the bagged estimate agrees with $\hat{f}(x)$ as $B \to \infty$.

A tree is a perfect candidate for bagging: each bootstrap tree will differ in structure.
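A minimal sketch of bagging, using one-split "stump" classifiers on a 1-D toy problem (all names here are illustrative, and real bagged trees would be grown much deeper):

```python
import numpy as np

def fit_stump(x, y):
    """Fit a single-split classifier by minimizing training error."""
    best, best_err = None, np.inf
    for cut in np.unique(x):
        for left, right in ((0, 1), (1, 0)):
            err = np.mean(np.where(x <= cut, left, right) != y)
            if err < best_err:
                best_err, best = err, (cut, left, right)
    return best

def predict_stump(stump, x):
    cut, left, right = stump
    return np.where(x <= cut, left, right)

def bag_stumps(x, y, B=25, seed=0):
    """Bagging: one stump per bootstrap resample, majority vote at the end."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(B):
        idx = rng.integers(0, len(x), size=len(x))   # bootstrap resample
        stumps.append(fit_stump(x[idx], y[idx]))
    return stumps

def predict_bagged(stumps, x):
    votes = np.mean([predict_stump(s, x) for s in stumps], axis=0)
    return (votes > 0.5).astype(int)
```

Each stump sees a different bootstrap sample, so the fitted cut values differ across stumps; the averaged vote smooths over that variability.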
#### Bagging Trees

(Figure: bagged trees are of different structure.)

(Figure: error curves.)

(Figure: failure in bagging a single-level tree.)
#### Random Forest

Bagging can be seen as a method to reduce the variance of an estimated prediction function. It mostly helps high-variance, low-bias classifiers. Comparatively, boosting builds weak classifiers one by one, allowing the collection to evolve in the right direction.

A random forest is a substantial modification of bagging: build a collection of de-correlated trees.
- Similar performance to boosting.
- Simpler to train and tune than boosting.
#### Random Forest: Intuition

The intuition: the average of random variables.
- $B$ i.i.d. random variables, each with variance $\sigma^2$: the mean has variance $\frac{\sigma^2}{B}$.
- $B$ identically distributed (but not independent) random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$: the mean has variance $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$.

Bagged trees are identically distributed but correlated samples. A random forest aims at reducing the correlation $\rho$ in order to reduce the variance. This is achieved by random selection of variables: at each split, only a random subset of the predictors is considered.
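The two variance formulas can be checked directly: build the covariance matrix of $B$ identically distributed variables with pairwise correlation $\rho$ and compute the variance of their mean as $\frac{1}{B^2}\mathbf{1}^\top \Sigma\, \mathbf{1}$ (the numbers below are arbitrary):

```python
import numpy as np

B, sigma2, rho = 10, 4.0, 0.3

# Covariance matrix: sigma2 on the diagonal, rho * sigma2 off it.
cov = np.full((B, B), rho * sigma2)
np.fill_diagonal(cov, sigma2)

# Var(mean) = (1/B^2) * 1' Sigma 1
w = np.full(B, 1.0 / B)
var_mean = w @ cov @ w

# Closed form from the slide: rho*sigma2 + (1 - rho)/B * sigma2
closed_form = rho * sigma2 + (1 - rho) / B * sigma2
```

As $B \to \infty$ the second term vanishes but the first does not: the correlation $\rho$ puts a floor on the achievable variance, which is exactly what random feature selection attacks.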
#### Random Forest: Examples

(Figures: examples comparing RF to boosted trees.)
#### Random Forest: Out-of-Bag Error

A benefit of RF: the out-of-bag (OOB) samples provide a built-in estimate of the cross-validation error. For sample $i$, compute its RF error using only the trees built from bootstrap samples in which sample $i$ did not appear. The OOB error rate is close to the $N$-fold cross-validation error rate.

Unlike many other nonlinear estimators, an RF can be fit in a single sequence: stop growing the forest when the OOB error stabilizes.
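A minimal sketch of the OOB bookkeeping, using single-split trees where each tree sees only a random subset of the features (all helper names are illustrative; a real forest would grow full trees):

```python
import numpy as np

def fit_stump(X, y, feats):
    """Fit a one-split tree restricted to the random feature subset `feats`
    (the random-forest trick that de-correlates the trees)."""
    best, best_err = None, np.inf
    for j in feats:
        for cut in np.unique(X[:, j])[:-1]:
            for left, right in ((0, 1), (1, 0)):
                err = np.mean(np.where(X[:, j] <= cut, left, right) != y)
                if err < best_err:
                    best_err, best = err, (j, cut, left, right)
    return best

def predict_stump(stump, X):
    j, cut, left, right = stump
    return np.where(X[:, j] <= cut, left, right)

def random_forest(X, y, B=40, m=1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    trees, bags = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)               # bootstrap sample
        feats = rng.choice(p, size=m, replace=False)   # random feature subset
        trees.append(fit_stump(X[idx], y[idx], feats))
        bags.append(set(idx.tolist()))                 # remember who was in-bag
    return trees, bags

def oob_error(trees, bags, X, y):
    """For sample i, vote only over trees whose bootstrap sample skipped i."""
    wrong, counted = 0, 0
    for i in range(len(y)):
        votes = [predict_stump(t, X[i:i + 1])[0]
                 for t, bag in zip(trees, bags) if i not in bag]
        if votes:
            counted += 1
            wrong += int((np.mean(votes) > 0.5) != y[i])
    return wrong / counted
```

Since each sample is out-of-bag for roughly $e^{-1} \approx 37\%$ of the trees, every sample gets an honest prediction without any held-out data.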
#### Random Forest: Variable Importance

Variable importance: find the most relevant predictors.

At every split of every tree, a variable contributes an improvement to the impurity measure. Accumulating the reduction of $i(N)$ over all splits for each variable gives a measure of the relative importance of the variables: the predictors that appear most often at split points and lead to the largest reductions in impurity are the important ones.

Another method: permute the values of a predictor in the OOB samples at every tree; the resulting decrease in prediction accuracy is also a measure of importance. Accumulate it over all trees.
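A sketch of the permutation method (here the "fitted forest" is replaced by a stand-in rule that uses only feature 0, so the other two columns should receive zero importance; all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)     # only feature 0 drives the label

def model(X):
    """Stand-in for a fitted forest: a perfect rule on feature 0."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, n_rounds=20, seed=1):
    """Permute one predictor at a time; the drop in accuracy relative to
    the unpermuted data measures that predictor's importance."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(X) == y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_rounds):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break x_j's link to y
            drops.append(base - np.mean(predict(Xp) == y))
        imp[j] = np.mean(drops)
    return imp

imp = permutation_importance(model, X, y)
```

In a real forest, the permutation is done separately on each tree's OOB samples and the decreases are accumulated over the trees; the loop above shows the core of the computation.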