Data Mining: Decision Trees

```Flisha Fernandez
DATA MINING
1
WHAT IS DATA MINING?
Data Mining is the discovery of hidden
knowledge, unexpected patterns and new rules in
large databases
 Automating the process of searching patterns in
the data

Emily Thomas, State University of New
York
2
OBJECTIVE OF DATA MINING

Any of four types of relationships are sought:
Classes – stored data is used to locate data in
predetermined groups
 Clusters – data items grouped according to logical
relationships or consumer preferences
 Associations – Walmart example (beer and diapers)
 Sequential Patterns – data mined to anticipate
behavior patterns and trends

anderson.ucla.edu
3
DATA MINING TECHNIQUES
Rule Induction – extraction of if-then rules
 Nearest Neighbors
 Artificial Neural Networks – models that learn
 Clustering
 Genetic Algorithms – concept of natural evolution
 Factor Analysis
 Exploratory
 Stepwise Regression
 Data Visualization – usage of graphic tools
 Decision Trees

Emily Thomas, State University of New
York
4
CLASSIFICATION
A major data mining operation
 Given one attribute (e.g. Wealth), try to to predict
the value of new people’s wealths by means of
some of the other available attributes
 Applies to categorical outputs

Flisha Fernandez
Categorical attribute: an attribute which takes
on two or more discrete values, also knows as a
symbolic attribute
 Real attribute: a column of real numbers, also
known as a continuous attribute

5
DATASET
real
symbol
classification
Kohavi 1995
6
Flisha Fernandez
DECISION TREES
7
DECISION TREES
Also called classification trees or regression trees
 Based on recursive partitioning of the sample
space
 Tree-shaped structures that represent sets of
decisions/data which generate rules for the
classification of a (new, unclassified) dataset
 The resulting classification tree becomes an input
for decision making

anderson.ucla.edu / Wikipedia
8
DECISION TREES

A Decision Tree is where:
The nonleaf nodes are labeled with attributes
 The arcs out of a node labeled with attribute A
are labeled with each of the possible values
of the attribute A
 The leaves of the tree are labeled with classifications

Learning Decision Trees
9
ADVANTAGES
Simple to understand and interpret
 Requires little data preparation (other
techniques need data normalization, dummy
variables, etc.)
 Can support both numerical and categorical data
 Uses white box model (explained by boolean
logic)
 Reliable – possible to validate model using
statistical tests
 Robust – large amounts of data can be analyzed
in a short amount of time

Wikipedia
10
Flisha Fernandez
DECISION TREE EXAMPLE
11
DAVID’S DEBACLE


Objective: Come up with optimized staff schedule
From Wikipedia

Status Quo: Sometimes people play golf,
sometimes they do not
Means: Predict when people will play golf, when
they will not
12
DAVID’S DATASET
Temp
Humidity
Windy
Play
Sunny
85
85
No
No
Sunny
80
90
Yes
No
Overcast
83
78
No
Yes
Rain
70
96
No
Yes
Rain
68
80
No
Yes
Rain
65
70
Yes
No
Overcast
64
65
Yes
Yes
Sunny
72
95
No
No
Sunny
69
70
No
Yes
Rain
75
90
No
Yes
Sunny
75
70
Yes
Yes
Overcast
72
90
Yes
Yes
Overcast
81
75
No
Yes
Rain
71
80
Yes
No
Quinlan 1989
Outlook
13
DAVID’S DIAGRAM
Play: 9
Don’t: 5
From Wikipedia
outlook?
SUNNY
OVERCAST
RAIN
Play: 2
Don’t: 3
Play: 4
Don’t: 0
Play: 3
Don’t: 2
humid?
HUMIDITY <=
70
Play: 2
Don’t: 0
windy?
HUMIDITY > 70
WINDY
NOT WINDY
Play: 0
Don’t: 3
Play: 0
Don’t: 2
Play: 3
Don’t: 0
14
DAVID’S DECISION TREE
Arcs Labeled with
Possible Values
Root Node
Non-leaf Nodes
Labeled with
Attributes
Outlook?
Flisha Fernandez
Sunny
Rain
Overcast
Humidity?
Normal
Play
Windy?
High
Don’t Play
Play
False
Play
True
Don’t Play
15
Leaves Labeled with Classifications
DAVID’S DECISION

Dismiss staff when it is
Sunny AND Hot
 Rainy AND Windy

Hire extra staff when it is
Cloudy
 Sunny AND Not So Hot
 Rainy AND Not Windy

From Wikipedia

16
Flisha Fernandez
DECISION TREE INDUCTION
17
DECISION TREE INDUCTION ALGORITHM

Basic Algorithm (Greedy Algorithm)




Simon Fraser University, Canada

Tree is constructed in a top-down recursive divide-andconquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they
are discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a certain
measure (e.g., information gain)
18
ISSUES

How should the attributes be split?
What is the appropriate root?
 What is the best split?

Decision Tree Learning
When should we stop splitting?
 Should we prune?

19
TWO PHASES

Tree Growing (Splitting)
Splitting data into progressively smaller subsets
 Analyzing the data to find the independent variable
(such as outlook, humidity, windy) that when used as
a splitting rule will result in nodes that are most
different from each other with respect to the
dependent variable (play)

Tree Pruning
DBMS Data Mining Solutions Supplement

20
SPLITTING
Flisha Fernandez
21
SPLITTING ALGORITHMS
Random
 Information Gain
 Information Gain Ratio
 GINI Index

DBMS Data Mining Solutions Supplement
22
DATASET
AIXploratorium
23
RANDOM SPLITTING
AIXploratorium
24
RANDOM SPLITTING

Disadvantages
Trees can grow huge
 Hard to understand
 Less accurate than smaller trees

AIXploratorium
25
SPLITTING ALGORITHMS
Random
 Information Gain
 Information Gain Ratio
 GINI Index

DBMS Data Mining Solutions Supplement
26
pi  ni
E ( A)  
I ( pi , ni )
i 1 p  n

INFORMATION ENTROPY
A measure of the uncertainty associated with a
random variable
 A measure of the average information content the
recipient is missing when they do not know the
value of the random variable
 A long string of repeating characters has an
entropy of 0, since every character is predictable
 Example: Coin Toss

Wikipedia
Independent fair coin flips have an entropy of 1 bit
per flip
 A double-headed coin has an entropy of 0 – each toss
of the coin delivers no information

27
INFORMATION GAIN
A good measure for deciding the relevance of an
attribute
 The value of Information Gain is the reduction in
the entropy of X achieved by learning the state of
the random variable A
 Can be used to define a preferred sequence
(decision tree) of attributes to investigate to most
rapidly narrow down the state of X
 Used by ID3 and C4.5

Wikipedia
28
CALCULATING INFORMATION GAIN


Ex. Attribute Thread = New, Skips = 3, Reads = 7
-0.3 * log 0.3 - 0.7 * log 0.7
 = 0.881 (using log base 2)


Flisha Fernandez
First compute information content
p
p
n
n
I ( p, n)  
log2

log2
pn
pn pn
pn
Ex. Attribute Thread = Old, Skips = 6, Reads = 2
-0.6 * log 0.6 - 0.2 * log 0.2
 = 0.811 (using log base 2)


Information Gain is...
Of 18, 10 threads are new and 8 threads are old
 1.0 - (10/18)*0.881 + (8/18)*0.811
 0.150

Gain( A)  I ( p, n)  E( A)
29
INFORMATION GAIN
AIXploratorium
30
TEST DATA
Whe
n
Fred
Joe
Starts Offense
Joe
Defense
Opp
C
Outcom
e
Away
9pm
No
Forward
Tall
??
Center
AIXploratorium
Wher
e
31
DRAWBACKS OF INFORMATION GAIN

Prefers attributes with many values (real
attributes)
Wikipedia / AIXploratorium
Prefers AudienceSize {1,2,3,..., 150, 151, ..., 1023,
1024, ... }
 But larger attributes are not necessarily better


Example: credit card number
Has a high information gain because it uniquely
identifies each customer
 But deciding how to treat a customer based on their
credit card number is unlikely to generalize
customers we haven't seen before

32
SPLITTING ALGORITHMS
Random
 Information
 Information Gain Ratio
 GINI Index

DBMS Data Mining Solutions Supplement
33
INFORMATION GAIN RATIO
Works by penalizing multiple-valued attributes
 Gain ratio should be


Large when data is evenly spread
Small when all data belong to one branch
http://www.it.iitb.ac.in/~sunita


Gain ratio takes number and size of branches
into account when choosing an attribute

It corrects the information gain by taking the
intrinsic information of a split into account (i.e. how
much info do we need to tell which branch an
instance belongs to)
34
DATASET
ID
Humidit
y
Windy Play?
A
sunny
hot
high
false
No
B
sunny
hot
high
true
No
C
overcast
hot
high
false
Yes
D
rain
mild
high
false
Yes
E
rain
cool
normal
false
Yes
F
rain
cool
normal
true
No
G
overcast
cool
normal
true
Yes
H
sunny
mild
high
false
No
I
sunny
cool
normal
false
Yes
J
rain
mild
normal
false
Yes
K
sunny
mild
normal
true
Yes
L
overcast
mild
high
true
Yes
M
overcast
hot
normal
false
Yes
N
rain
mild
high
true
No
http://www.it.iitb.ac.in/~sunita
Outlook
Temperatur
e
35
CALCULATING GAIN RATIO

Intrinsic information: entropy of distribution of
instances into branches

Gain ratio (Quinlan’86) normalizes info gain by
GainRatio(S, A) 
Gain(S, A)
.
IntrinsicInfo(S, A)
http://www.it.iitb.ac.in/~sunita
|S |
|S |
IntrinsicInfo(S, A)   i log i .
|S| 2 | S |
36
COMPUTING GAIN RATIO

Example: intrinsic information for ID code


Importance of attribute decreases as intrinsic
information gets larger
Example of gain ratio:
gain(" Attribute" )
gain_ratio (" Attribute" ) 
intrinsic_ info(" Attribute" )
gain_ratio (" ID_code" ) 
0.940 bits
 0.246
3.807 bits
http://www.it.iitb.ac.in/~sunita
info([1,1,,1)  14 (1 / 14 log1 / 14)  3.807 bits
37
DATASET
ID
Humidit
y
Windy Play?
A
sunny
hot
high
false
No
B
sunny
hot
high
true
No
C
overcast
hot
high
false
Yes
D
rain
mild
high
false
Yes
E
rain
cool
normal
false
Yes
F
rain
cool
normal
true
No
G
overcast
cool
normal
true
Yes
H
sunny
mild
high
false
No
I
sunny
cool
normal
false
Yes
J
rain
mild
normal
false
Yes
K
sunny
mild
normal
true
Yes
L
overcast
mild
high
true
Yes
M
overcast
hot
normal
false
Yes
N
rain
mild
high
true
No
http://www.it.iitb.ac.in/~sunita
Outlook
Temperatur
e
38
INFORMATION GAIN RATIOS
Outlook
Temperature
0.693
Info:
0.911
Gain: 0.940-0.693
0.247
Gain: 0.940-0.911
0.029
Split info: info([5,4,5])
1.577
Split info: info([4,6,4])
1.362
Gain ratio: 0.247/1.577
0.156
Gain ratio: 0.029/1.362
0.021
Humidity
Windy
Info:
0.788
Info:
0.892
Gain: 0.940-0.788
0.152
Gain: 0.940-0.892
0.048
Split info: info([7,7])
1.000
Split info: info([8,6])
0.985
Gain ratio: 0.152/1
0.152
Gain ratio: 0.048/0.985
0.049
http://www.it.iitb.ac.in/~sunita
Info:
39
MORE ON GAIN RATIO
“Outlook” still comes out top
 However, “ID code” has greater gain ratio


Standard fix: ad hoc test to prevent splitting on that
type of attribute
Problem with gain ratio: it may overcompensate


May choose an attribute just because its intrinsic
information is very low
Standard fix:
http://www.it.iitb.ac.in/~sunita

First, only consider attributes with greater than average
information gain
 Then, compare them on gain ratio

40
INFORMATION GAIN RATIO

Below is the decision tree
AIXploratorium
41
TEST DATA
Whe
n
Fred
Joe
Starts Offense
Joe
Defense
Opp
C
Outcom
e
Away
9pm
No
Forward
Tall
??
Center
AIXploratorium
Wher
e
42
SPLITTING ALGORITHMS
Random
 Information
 Information Gain Ratio
 GINI Index

DBMS Data Mining Solutions Supplement
43
GINI INDEX
 All
Flisha Fernandez
attributes are assumed continuousvalued
 Assumes there exist several possible split
values for each attribute
 May need other tools, such as clustering, to
get the possible split values
 Can be modified for categorical attributes
 Used in CART, SLIQ, SPRINT
44
GINI INDEX

If a data set T contains examples from n classes, gini
n
index, gini(T) is defined as
2
where pj is the relative frequency of class j in node T

If a data set T is split into two subsets T1 and T2 with
sizes N1 and N2 respectively, the gini index of the
split data contains examples from n classes, the gini
index gini(T) is defined as
[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */
gini (T ) 1  p j
j 1
N1 gini( )  N 2 gini( )
(
T
)

gini split
T1
T2
N
N

The attribute that provides the smallest ginisplit(T) is
chosen to split the node
45
GINI INDEX

[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */

Maximum (1 - 1/nc) when records are equally
distributed among all classes, implying least
interesting information
Minimum (0.0) when all records belong to one
class, implying most interesting information
C1
C2
0
6
Gini=0.000
C1
C2
1
5
Gini=0.278
C1
C2
2
4
Gini=0.444
C1
C2
3
3
Gini=0.500
46
EXAMPLES FOR COMPUTING GINI
GINI(t )  1  [ p( j | t )]2
C1
C2
0
6
C1
C2
1
5
P(C1) = 1/6
C1
C2
2
4
P(C1) = 2/6
P(C1) = 0/6 = 0
P(C2) = 6/6 = 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */
j
P(C2) = 5/6
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
Gini = 1 –
P(C2) = 4/6
(2/6)2 –
(4/6)2
= 0.444
47
STOPPING RULES
Pure nodes
 Maximum tree depth, or maximum number of
nodes in a tree

Because of overfitting problems
Minimum number of elements in a node
considered for splitting, or its near equivalent
 Minimum number of elements that must be in a
new node
 A threshold for the purity measure can be
imposed such that if a node has a purity value
higher than the threshold, no partitioning will be
attempted regardless of the number of
observations

DBMS Data Mining Solutions Supplement
/ AI Depot

48
OVERFITTING
The generated tree may overfit the training data
 Too many branches, some may reflect
anomalies due to noise or outliers
 Result is in poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do
not split a node if this would result in the
goodness measure falling below a threshold


Flisha Fernandez

Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully
grown” tree—get a sequence of progressively
pruned trees

Use a set of data different from the training data to decide
which is the “best pruned tree”
49
PRUNING
Used to make a tree more general, more accurate
 Removes branches that reflect noise

DBMS Data Mining Solutions Supplement
50
ORDER OF PRUNING

accuracy and size of node
 accuracy gain is weighted by share of sample
 small nodes tend to get removed before large ones

If several nodes have same contribution they all prune
away simultaneously
 Hence more than two terminal nodes could be cut off in one
Flisha Fernandez
Prune away "weakest link" — the nodes that add least to
overall accuracy of the tree
 contribution to overall tree a function of both increase in
pruning

Sequence determined all the way back to root node
 need to allow for possibility that entire tree is bad
 if target variable is unpredictable we will want to prune back
to root . . . the no model solution
51
CROSS-VALIDATION

Of the training sample, give learner only a subset

Gives us a measure of the quality of the decision
trees produced based on their splitting algorithm
 Remember that more examples lead to better
estimates

AIXploratorium
Ex: Training on the first 15 games, testing on the last
5
52
Flisha Fernandez
ALGORITHMS
53
DECISION TREE METHODS

ID3 and C4.5 Algorithms

Classification and Regression Trees (CART)


Chi Square Automatic Interaction Detection
(CHAID)


Segments a dataset using 2-way splits
anderson.ucla.edu

Developed by Ross Quinlan
Segments a dataset using chi square tests to create
multi-way splits
And many others
54
ID3 ALGORITHM
Developed by Ross Quinlan
 Iterative Dichotomiser 3
 Based on Occam’s Razor
 Prefers smaller decision trees over larger ones
 Does not always produce the smallest tree, is
heuristic
 Summarized as follows:

Wikipedia
Take all unused attributes and count their entropy
concerning test samples
 Choose attribute for which entropy is minimum
 Make node containing that attribute

55
ID3 ALGORITHM







A = The Attribute that best classifies examples
Decision Tree attribute for Root = A
For each possible value, vi, of A,






Wikipedia

Create a root node for the tree
If all examples are positive, return single-node tree root, with label =
+
If all examples are negative, return single-node tree root, with label =
If number of predicting attributes is empty, then return single node
tree root, with label = most common value of the target attribute in
the examples
Otherwise Begin
Add a new tree branch below Root, corresponding to the test A = vi.
Let Examples(vi), be the subset of examples that have the value vi for A
If Examples(vi) is empty
 Then below this new branch add a leaf node with label = most common target
value in the examples
Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute,
Attributes – {A})
End
Return Root
56
EXAMPLE DATA
Color
Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
Small
Red
Wedge
No
Large
Red
Wedge
No
Large
Red
Pillar
No
http://www.cse.unsw.edu.au/
Size
57
CHOOSING ATTRIBUTES
The order in which attributes are chosen
determines how complicated the tree is
 ID3 uses information theory to determine the
most informative attribute

A measure of the information content of a message is
the inverse of the probability of receiving the
message:


information1(M) = 1/probability(M)
http://www.cse.unsw.edu.au/

Taking logs (base 2) makes information correspond to
the number of bits required to encode a message:

information(M) = -log2(probability(M))
58
INFORMATION
The information content of a message should be
related to the degree of surprise in receiving the
message
 Messages with a high probability of arrival are
not as informative as messages with low
probability
 Learning aims to predict accurately, i.e. reduce
surprise
 Probabilities are multiplied to get the probability
of two or more things both/all happening. Taking
logarithms of the probabilities allows information
to be added instead of multiplied

http://www.cse.unsw.edu.au/
59
ENTROPY
Different messages have different probabilities of
arrival
 Overall level of uncertainty (termed entropy) is:


-Σi Pi log2Pi
Frequency can be used as a probability estimate

e.g. if there are 5 positive examples and 3 negative
examples in a node the estimated probability of
positive is 5/8 = 0.625
http://www.cse.unsw.edu.au/

60
LEARNING
Learning tries to reduce the information content
of the inputs by mapping them to fewer outputs
 Hence we try to minimize entropy

http://www.cse.unsw.edu.au/
61
SPLITTING CRITERION
Work out entropy based on distribution of classes
 Trying splitting on each attribute
 Work out expected information gain for each
attribute
 Choose best attribute

http://www.cse.unsw.edu.au/
62
EXAMPLE DATA
Color
Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
Small
Red
Wedge
No
Large
Red
Wedge
No
Large
Red
Pillar
No
http://www.cse.unsw.edu.au/
Size
63
EXAMPLE
Initial decision tree is one node with all
examples.
 There are 4 positive examples and 3 negative
examples


http://www.cse.unsw.edu.au/
i.e. probability of positive is 4/7 = 0.57; probability of
negative is 3/7 = 0.43
 Entropy for Examples is: – (0.57 * log 0.57) – (0.43 *
log 0.43) = 0.99

Evaluate possible ways of splitting
64
EXAMPLE
Size
Color
Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
No
No
No
There are two large positives examples and two large
negative examples.
 The probability of positive is 0.5
 The entropy for Large is: – (0.5 * log 0.5) – (0.5 * log
0.5) = 1

http://www.cse.unsw.edu.au/
Small
Red
Wedge
Try split on size which
Large
Red
Wedge
has three values:
Large
Red
Pillar
large, medium and small
 There are four instances with size = large.

65
EXAMPLE

There is one small
positive and one small
negative


Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
Small
Red
Wedge
No
Large
Red
Wedge
No
Large
Red
Pillar
No
Size
Color
Shape
Class
Medium
Blue
Brick
Yes
Small
Red
Sphere
Yes
Large
Green
Pillar
Yes
Large
Green
Sphere
Yes
Small
Red
Wedge
No
Large
Red
Wedge
No
Large
Red
Pillar
No
Entropy for Small is:
– (0.5 * log 0.5) – (0.5 * log 0.5) = 1
There is only one
medium positive and
no medium negatives

Color
Entropy for Medium is 0
Expected information
for a split on Size is:
http://www.cse.unsw.edu.au/

Size
66
EXAMPLE

The expected information gain for Size is:

Checking the Information Gains on color and
shape:
Color has an information gain of 0.52
 Shape has an information gain of 0.7

Therefore split on shape
 Repeat for all subtrees

http://www.cse.unsw.edu.au/

0.99 – 0.86 = 0.13
67
OUTPUT TREE
http://www.cse.unsw.edu.au/
68
WINDOWING

ID3 can deal with very large data sets by
performing induction on subsets or windows onto
the data
2.
3.
4.

Select a random subset of the whole set of training
instances
Use the induction algorithm to form a rule to
explain the current window
Scan through all of the training instances looking
for exceptions to the rule
Add the exceptions to the window
http://www.cse.unsw.edu.au/
1.
Repeat steps 2 to 4 until there are no exceptions
left
69
NOISY DATA




http://www.cse.unsw.edu.au/

Frequently, training data contains "noise" (examples
which are misclassified)
In such cases, one is like to end up with a part of the
decision tree which considers say 100 examples, of
which 99 are in class C1 and the other is apparently
in class C2
If there are any unused attributes, we might be able
to use them to elaborate the tree to take care of this
one case, but the subtree we would be building would
in fact be wrong, and would likely misclassify real
data
Thus, particularly if we know there is noise in the
training data, it may be wise to "prune" the decision
tree to remove nodes which, statistically speaking,
seem likely to arise from noise in the training data
A question to consider: How fiercely should we prune?
70
PRUNING ALGORITHM
Approximate expected error assuming that we
prune at a particular node
 Approximate backed-up error from children
assuming we did not prune
 If expected error is less than backed-up error,
prune

http://www.cse.unsw.edu.au/
71
EXPECTED ERROR
If we prune a node, it becomes a leaf labelled, C
 Laplace error estimate:

http://www.cse.unsw.edu.au/
S is the set of examples in a node
 k is the number of classes
 N examples in S
 C is the majority class in S
 n out of N examples in S belong to C

72
BACKED-UP ERROR

Probabilities can be estimated by relative
frequencies of attribute values in sets of
examples that fall into child nodes
http://www.cse.unsw.edu.au/

Let children of Node be Node1, Node2, etc
73
PRUNING EXAMPLE
http://www.cse.unsw.edu.au/
74
ERROR CALCULATION FOR PRUNING
Left child of b has class frequencies [3, 2], error = 0.429

Right child has error of 0.333

Static error estimate E(b) is 0.375, calculated using the
Laplace error estimate formula, with N=6, n=4, and k=2

Backed-up error is:


http://www.cse.unsw.edu.au/

(5/6 and 1/6 because there are 4+2=6 examples handled by node b, of
which 3+2=5 go to the left subtree and 1 to the right subtree
Since backed-up estimate of 0.413 is greater than static
estimate of 0.375, we prune the tree and use the static error of
0.375
75
C4.5 ALGORITHM
An extension of the C4.5 algorithm
 Statistical classifier
 Uses concept of Information Entropy

Wikpedia
76
C4.5 REFINEMENTS OVER ID3
Splitting criterion is Information Gain Ratio
 Post-pruning after induction of trees, e.g. based
on test sets, in order to increase accuracy
 Allows for attributes that have a whole range of
discrete or continuous values
 Handles training data with missing attribute
values by replacing them with the most common
or the most probable value

informatics.sussex.ac.uk
77
```