### Clustering 2 (Chap 7)

```
Clustering and Probability (Chap 7)
Review from Last Lecture
- Defined the K-means problem for formalizing the notion of clustering.
- Discussed the K-means algorithm.
- Noted that the K-means algorithm was “quite good” at discovering “concepts” from data (based on features).
- Noted the important distinction between “attributes” and “features”.
Example of K-means-1
             Measure 1   Measure 2
Patient 1    1           1
Patient 2    2           1
Patient 3    3           4
Patient 4    4           5
Let the initial centroids be C1 = (1,1) and C2 = (2,1).
Example of K-means-2
             Measure 1   Measure 2   Dist to C1 (1,1)   Dist to C2 (2,1)   Nearest
Patient 1    1           1           0                  1                  C1
Patient 2    2           1           1                  0                  C2
Patient 3    3           4           3.6                3.16               C2
Patient 4    4           5           5                  4.47               C2
Updated centroids: C1 = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3) = (3, 3.33)
Example of K-means-3
             Measure 1   Measure 2   Dist to C1 (1,1)   Dist to C2 (3,3.33)   Nearest
Patient 1    1           1           0                  3.07                  C1
Patient 2    2           1           1                  2.54                  C1
Patient 3    3           4           3.6                0.67                  C2
Patient 4    4           5           5                  1.94                  C2
Updated centroids: C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5)
Example of K-means-4
             Measure 1   Measure 2   Dist to C1 (1.5,1)   Dist to C2 (3.5,4.5)   Nearest
Patient 1    1           1           0.5                  4.3                    C1
Patient 2    2           1           0.5                  3.8                    C1
Patient 3    3           4           3.35                 0.71                   C2
Patient 4    4           5           4.72                 0.71                   C2
Updated centroids: C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5)
The centroids are unchanged, so the algorithm has converged.
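Example: 2 Clusters
[Figure: four points A(-1,2), B(1,2), C(-1,-2), D(1,-2) at the corners of a rectangle of width 2 and height 4 centred at (0,0)]
K-means Problem: the solution is the centroids (0,2) and (0,-2), and the clusters are {A,B} and {C,D}.
K-means Algorithm: if the initial centroids are (-1,0) and (1,0), then {A,C} and {B,D} end up as the two clusters. The algorithm can therefore converge to a clustering that does not solve the K-means problem.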
Several other issues regarding clustering
- How do you select the initial centroids?
- How do you select the right number of clusters?
- How do you deal with non-Euclidean distance/similarity measures?
- Other approaches (hierarchical, spectral, etc.).
- The curse of high dimensionality.
Question
S-Length   S-Width   P-Length   P-Width   Flower
Small      Medium    Small      Medium    A (Setosa)
Medium     Medium    Large      Large     O (Versicolor)
Medium     Small     Small      Large     I (Virginica)
Large      Large     Medium     Small     A
Large      Small     Medium     Small     ?
What should the “prediction” be for the flower?
Prediction and Probability
- When we make predictions we should attach “probabilities” to them.
- Examples:
  - 20% chance it will rain tomorrow.
  - 50% chance that the tumor is malignant.
  - 60% chance that the stock market will fall by the end of the week.
  - 30% chance that the next president of the United States will be a Democrat.
  - 0.1% chance that the user will click on a banner ad.
- How do we assign probabilities to complex events? By using smart data algorithms, and counting.
Probability Basics
- Probability is a deep topic, but in most cases the rules are straightforward to apply.
- Terminology
  - Experiment
  - Sample Space
  - Events
  - Probability
- Rules of probability
  - Conditional Probability
  - Bayes Rule
Probability: Sample Space
- Consider an experiment and let S be the space of possible outcomes.
- Examples:
  - Experiment is tossing a coin: S = {h, t}
  - Experiment is rolling a pair of dice: S = {(1,1), (1,2), …, (6,6)}
  - Experiment is a race between three cars 1, 2 and 3: S = {(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)}
Probabilities
- Let the sample space be S = {1, 2, …, m}.
- Consider numbers p_i ≥ 0, i = 1, 2, …, m, with Σ_i p_i = 1.
- p_i is the probability that the outcome of the experiment is i.
- Suppose we toss a fair coin. The sample space is S = {h, t}. Then p_h = 0.5 and p_t = 0.5.
Probability
- Experiment: will it rain or not in Sydney? S = {rain, no-rain}
  - P_rain = 138/365 = 0.38; P_no-rain = 227/365 = 0.62
- How to assign probabilities is a deep philosophical problem.
  - What is the probability that the “green object standing outside my house is a burglar dressed in green”?
Probability
- An event A is a set of possible outcomes of the experiment. Thus A is a subset of S.
- Let A be the event of getting a seven when we roll a pair of dice.
  - A = {(1,6), (6,1), (2,5), (5,2), (4,3), (3,4)}
  - P(A) = 6/36 = 1/6
- In general, P(A) = Σ_{i ∈ A} p_i
Probability
- The sample space S and events are “sets”.
- P(S) = 1
- P(Φ) = 0
- Union: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- Often P(A ∩ B) is written as P(AB) or P(A, B).
- Complement: P(A^c) = 1 - P(A)
Example
- Suppose the probability of rain today is 0.4, the probability of rain tomorrow is also 0.4, and the probability of rain on both days is 0.1. What is the probability that it does not rain on either day?
- S = {(R,N), (R,R), (N,N), (N,R)}
- Let A be the event that it rains today and B the event that it rains tomorrow. Then
  - A = {(R,N), (R,R)}; B = {(N,R), (R,R)}
- Rain on at least one of the two days: P(A ∪ B) = 0.4 + 0.4 - 0.1 = 0.7
- No rain on either day: 1 - 0.7 = 0.3
Conditional Probability
- One of the most important concepts in all of Data Mining and Machine Learning.
- P(A|B) = P(AB)/P(B), assuming P(B) is not equal to 0.
- This is the conditional probability of A given that B has occurred.
- Example: the probability that it will rain tomorrow given that it has rained today, with A = rain today and B = rain tomorrow as in the previous example:
  - P(B|A) = P(AB)/P(A) = 0.1/0.4 = 1/4 = 0.25
- In general, P(A|B) is not equal to P(B|A).
We need conditional probability to answer…
S-Length   S-Width   P-Length   P-Width   Flower
Small      Medium    Small      Medium    A (Setosa)
Medium     Medium    Large      Large     O (Versicolor)
Medium     Small     Small      Large     I (Virginica)
Large      Large     Medium     Small     A
Large      Small     Medium     Small     ?
What should the “prediction” be for the flower?
Bayes Rule
- P(A|B) = P(AB)/P(B); P(B|A) = P(BA)/P(A)
- Now P(AB) = P(BA)
- Thus P(A|B)P(B) = P(B|A)P(A)
- Thus P(A|B) = [P(B|A)P(A)]/[P(B)]
  - This is called Bayes Rule.
  - It is the basis of almost all prediction.
  - Recent theories hypothesize that human memory and action are Bayes Rule in action.
Bayes Rule
P(A | B) = P(B | A) P(A) / P(B)
  Posterior: P(A | B); Prior: P(A)
P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)
Bayes Rule: Example
The ASX market goes up on 60% of the days of a year. On the other 40% of days it stays the same or goes down. On a day when the ASX is up, there is a 50% chance that the Shanghai Index is up. On other days there is a 30% chance that Shanghai goes up. Suppose the Shanghai market is up. What is the probability that the ASX was up?
Define A1 as “ASX is up”; A2 as “ASX is not up”
Define S1 as “Shanghai is up”; S2 as “Shanghai is not up”
We want to calculate P(A1|S1).
P(A1) = 0.6; P(A2) = 0.4
P(S1|A1) = 0.5; P(S1|A2) = 0.3
P(S2|A1) = 1 - P(S1|A1) = 0.5
P(S2|A2) = 1 - P(S1|A2) = 0.7
Bayes Rule: Example
P(A1|S1) = P(S1|A1)P(A1)/P(S1)
How do we calculate P(S1)?
Bayes Rule: Example
- P(S1) = P(S1,A1) + P(S1,A2)   [Key Step]
        = P(S1|A1)P(A1) + P(S1|A2)P(A2)
        = 0.5 × 0.6 + 0.3 × 0.4
        = 0.42
Finally,
P(A1|S1) = P(S1|A1)P(A1)/P(S1)
         = (0.5 × 0.6)/0.42 = 0.71
Example: Iris Flower
- F = Flower; SL = Sepal Length; SW = Sepal Width; PL = Petal Length; PW = Petal Width
- Data = (SL = Large, SW = Small, PL = Medium, PW = Small)
P(F = A | Data) = P(Data | F = A) P(F = A) / P(Data)
P(F = O | Data) = P(Data | F = O) P(F = O) / P(Data)
P(F = I | Data) = P(Data | F = I) P(F = I) / P(Data)
Which flower? Choose the one with the maximum posterior probability.
Example: Iris Flower
- So how do we compute P(Data|F=A)?
- This is a non-trivial question [subject to much research].
- How many times does “Data” appear in the “database” when F=A?
  P(Data | F = A) = #(Data, F = A) / #(F = A)
- In this case “Data” is a 4-dimensional “data vector.” Each component takes 3 values (small, medium, large). Thus the number of combinations is 3^4 = 81.
Example: Iris Flower
- Conditional Independence
  P(Data|F=A) = P(SL=Large, SW=Small, PL=Medium, PW=Small | F=A)
              ≈ P(SL=Large|F=A) P(SW=Small|F=A) P(PL=Medium|F=A) P(PW=Small|F=A)
- The above is an assumption made to make the “computation easier.”
  - Surprisingly, evidence suggests that it works reasonably well in practice.
- This prediction method (which exploits conditional independence) is called the “Naïve Bayes Classifier.”
```
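The K-means slides above alternate two steps: assign every point to its nearest centroid, then move each centroid to the mean of its points. The sketch below is one minimal way to reproduce that trace in Python. The function names `kmeans` and `sse`, the rule for empty clusters, and the tie-breaking rule (lowest cluster index wins) are assumptions made for illustration and are not taken from the slides.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid (ties go to the lowest index).
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def sse(points, centroids, labels):
    """K-means objective: sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[k]) ** 2 for p, k in zip(points, labels))

# Patient example from the slides, initial centroids C1 = (1,1), C2 = (2,1).
patients = [(1, 1), (2, 1), (3, 4), (4, 5)]
patient_centroids, patient_labels = kmeans(patients, [(1, 1), (2, 1)])

# Rectangle example: A, B, C, D with initial centroids (-1,0) and (1,0).
rect = [(-1, 2), (1, 2), (-1, -2), (1, -2)]
rect_centroids, rect_labels = kmeans(rect, [(-1, 0), (1, 0)])
stuck_cost = sse(rect, rect_centroids, rect_labels)
best_cost = sse(rect, [(0, 2), (0, -2)], [0, 0, 1, 1])
```

On the rectangle example the algorithm stops immediately at {A,C} and {B,D}, and comparing `stuck_cost` with `best_cost` shows that this clustering has a higher objective value than {A,B} and {C,D}, which is why the choice of initial centroids matters.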
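The ASX/Shanghai calculation above follows a fixed pattern: multiply each prior by its likelihood, sum those products to get P(data), then divide. A minimal sketch of that pattern, assuming hypotheses are stored as dictionary keys (the function name `bayes_posterior` and the key strings are illustrative choices, not part of the slides):

```python
def bayes_posterior(prior, likelihood):
    """Return (posterior, evidence) for a finite set of hypotheses.

    prior[h]      = P(h)
    likelihood[h] = P(data | h)
    evidence      = P(data) = sum over h of P(data | h) P(h)
    """
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
    return posterior, evidence

# Numbers from the slides: A1 = "ASX is up", A2 = "ASX is not up", data = "Shanghai is up" (S1).
prior = {"A1": 0.6, "A2": 0.4}
likelihood_s1 = {"A1": 0.5, "A2": 0.3}
posterior_s1, p_s1 = bayes_posterior(prior, likelihood_s1)
```

Here `p_s1` corresponds to P(S1) in the key step, and `posterior_s1["A1"]` corresponds to P(A1|S1); the two posterior values sum to 1 because A1 and A2 cover every case.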
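The Naïve Bayes recipe on the last slides can be tried on the four-row flower table by counting. One caveat: with only four rows, every class has at least one feature value that never occurs with it, so plain counting gives a score of zero for all three flowers. The sketch below therefore adds a smoothing parameter `alpha` (add-one, or Laplace, smoothing). That smoothing is an assumption introduced here to make the toy example usable and is not part of the slides; the names `TRAIN`, `QUERY`, `LEVELS` and `naive_bayes_scores` are likewise illustrative.

```python
from collections import Counter

LEVELS = ("Small", "Medium", "Large")  # possible values of every feature

# Rows of the table: (S-Length, S-Width, P-Length, P-Width), flower label.
TRAIN = [
    (("Small", "Medium", "Small", "Medium"), "A"),
    (("Medium", "Medium", "Large", "Large"), "O"),
    (("Medium", "Small", "Small", "Large"), "I"),
    (("Large", "Large", "Medium", "Small"), "A"),
]
QUERY = ("Large", "Small", "Medium", "Small")

def naive_bayes_scores(train, query, alpha=1.0):
    """Score each class by P(F) * product over features of P(feature value | F).

    Scores are proportional to the posterior: the common factor P(Data) is
    omitted because it does not change which class is the maximum.
    alpha > 0 adds smoothing (an assumption, see the text above).
    """
    class_counts = Counter(label for _, label in train)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(train)  # prior P(F = label)
        for j, value in enumerate(query):
            matches = sum(1 for x, y in train if y == label and x[j] == value)
            score *= (matches + alpha) / (n_label + alpha * len(LEVELS))
        scores[label] = score
    return scores

unsmoothed = naive_bayes_scores(TRAIN, QUERY, alpha=0.0)
smoothed = naive_bayes_scores(TRAIN, QUERY, alpha=1.0)
prediction = max(smoothed, key=smoothed.get)
```

With `alpha=0.0` every score is zero, which illustrates why raw counts on very small tables are fragile. With `alpha=1.0` the largest score belongs to flower A, although a prediction based on four training rows should not be taken seriously.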