lecture_05

```ECE 8443 – Pattern Recognition
LECTURE 05: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives:
Discrete Features
Maximum Likelihood
• Resources:
D.H.S.: Chapter 3 (Part 1)
D.H.S.: Chapter 3 (Part 2)
J.O.S.: Tutorial
BGSU: Example
A.W.M.: Tutorial
S.P.: Primer
CSRN: Unbiased
A.W.M.: Bias
Discrete Features
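• For problems where features are discrete:
    $$\int p(\mathbf{x}|\omega_j)\,d\mathbf{x} \;\rightarrow\; \sum_{\mathbf{x}} P(\mathbf{x}|\omega_j)$$
• Bayes formula involves probabilities (not densities):
    $$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{p(\mathbf{x})} \;\rightarrow\; P(\omega_j|\mathbf{x}) = \frac{P(\mathbf{x}|\omega_j)\,P(\omega_j)}{P(\mathbf{x})}$$
  where
    $$P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x}|\omega_j)\,P(\omega_j)$$
• Bayes rule remains the same:
    $$\alpha^{*} = \arg\min_{i} R(\alpha_i|\mathbf{x})$$
• The maximum entropy distribution is a uniform distribution:
    $$P(\mathbf{x} = \mathbf{x}_i) = \frac{1}{N}$$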
ECE 8443: Lecture 05, Slide 1
Discriminant Functions For Discrete Features
• Consider independent binary features:
    $$\mathbf{x} = (x_1, \ldots, x_d)^t, \qquad p_i = \Pr[x_i = 1|\omega_1], \qquad q_i = \Pr[x_i = 1|\omega_2]$$
• Assuming conditional independence:
    $$P(\mathbf{x}|\omega_1) = \prod_{i=1}^{d} p_i^{x_i}(1-p_i)^{1-x_i}, \qquad P(\mathbf{x}|\omega_2) = \prod_{i=1}^{d} q_i^{x_i}(1-q_i)^{1-x_i}$$
• The likelihood ratio is:
    $$\frac{P(\mathbf{x}|\omega_1)}{P(\mathbf{x}|\omega_2)} = \prod_{i=1}^{d}\left(\frac{p_i}{q_i}\right)^{x_i}\left(\frac{1-p_i}{1-q_i}\right)^{1-x_i}$$
• The discriminant function is:
    $$g(\mathbf{x}) = \sum_{i=1}^{d}\left[x_i\ln\frac{p_i}{q_i} + (1-x_i)\ln\frac{1-p_i}{1-q_i}\right] + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
    $$= \sum_{i=1}^{d} w_i x_i + w_0 = \sum_{i=1}^{d}\ln\frac{p_i(1-q_i)}{q_i(1-p_i)}\,x_i + \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
Introduction to Maximum Likelihood Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the
prior probabilities, $P(\omega_i)$, and class-conditional densities, $p(\mathbf{x}|\omega_i)$.
• What can we do if we do not have this information?
• What limitations do we face?
• There are two common approaches to parameter estimation: maximum
likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are
fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior
distribution. Observation of the samples converts this to a posterior.
• Bayesian Learning: sharpen the a posteriori density causing it to peak near
the true value.
General Principle
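• I.I.D.: c data sets, $D_1, \ldots, D_c$, where the samples in $D_j$ are drawn independently according to $p(\mathbf{x}|\omega_j)$.
• Assume $p(\mathbf{x}|\omega_j)$ has a known parametric form and is completely determined
by the parameter vector $\theta_j$ (e.g., $p(\mathbf{x}|\omega_j) \sim N(\mu_j, \Sigma_j)$,
where $\theta_j = [\mu_1, \ldots, \mu_d, \sigma_{11}, \sigma_{12}, \ldots, \sigma_{dd}]$).
• $p(\mathbf{x}|\omega_j)$ has an explicit dependence on $\theta_j$: $p(\mathbf{x}|\omega_j, \theta_j)$.
• Use training samples to estimate $\theta_1, \theta_2, \ldots, \theta_c$.
• Functional independence: assume $D_i$ gives no useful information about $\theta_j$ for $i \neq j$.
• This simplifies notation to a single set $D$ of training samples $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ drawn
independently from $p(\mathbf{x}|\theta)$ to estimate $\theta$.
• Because the samples were drawn independently:
    $$p(D|\theta) = \prod_{k=1}^{n} p(\mathbf{x}_k|\theta)$$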
Example of ML Estimation
• $p(D|\theta)$ is called the likelihood of $\theta$ with respect to the data.
• The value of $\theta$ that maximizes this likelihood, denoted $\hat{\theta}$,
is the maximum likelihood estimate (ML) of $\theta$.
• Given several training points:
• Top: candidate source distributions are shown.
• Which distribution is the ML estimate?
• Middle: an estimate of the likelihood of the data as a function of $\theta$ (the mean).
• Bottom: log likelihood.
General Mathematics
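Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$.
Let $\nabla_\theta = \left[\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_p}\right]^t$.
Define:
    $$l(\theta) \equiv \ln p(D|\theta) = \ln\prod_{k=1}^{n} p(\mathbf{x}_k|\theta) = \sum_{k=1}^{n}\ln p(\mathbf{x}_k|\theta)$$
    $$\hat{\theta} = \arg\max_{\theta}\, l(\theta)$$
• The ML estimate is found by solving this equation:
    $$\nabla_\theta l = \nabla_\theta\left[\sum_{k=1}^{n}\ln p(\mathbf{x}_k|\theta)\right] = \sum_{k=1}^{n}\nabla_\theta\ln p(\mathbf{x}_k|\theta) = 0$$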
• The solution to this equation can
be a global maximum, a local
maximum, or even an inflection
point.
• Under what conditions is it a global
maximum?
Maximum A Posteriori Estimation
• A class of estimators – maximum a posteriori (MAP) – maximize $p(D|\theta)\,p(\theta)$, or equivalently $l(\theta) + \ln p(\theta)$,
where $p(\theta)$ describes the prior probability of different parameter values.
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant: under a nonlinear
transformation of the parameter space, the prior density $p(\theta)$ changes and the MAP estimate
need no longer be optimum in the new space. This observation will be useful later in the course.
Gaussian Case: Unknown Mean
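• Consider the case where only the mean, $\theta = \mu$, is unknown:
    $$\sum_{k=1}^{n}\nabla_\theta\ln p(\mathbf{x}_k|\theta) = 0$$
    $$\ln p(\mathbf{x}_k|\mu) = \ln\left[\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right]\right]$$
    $$= -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)$$
which implies:
    $$\nabla_\mu\ln p(\mathbf{x}_k|\mu) = \Sigma^{-1}(\mathbf{x}_k-\mu)$$
because:
    $$\nabla_\mu\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right]$$
    $$= \nabla_\mu\left[-\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right]\right] + \nabla_\mu\left[-\frac{1}{2}(\mathbf{x}_k-\mu)^t\Sigma^{-1}(\mathbf{x}_k-\mu)\right]$$
    $$= \Sigma^{-1}(\mathbf{x}_k-\mu)$$
The first term does not depend on $\mu$, and the gradient of the quadratic form is $-2\Sigma^{-1}(\mathbf{x}_k-\mu)$ because $\Sigma^{-1}$ is symmetric.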
Gaussian Case: Unknown Mean
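• Substituting into the expression for the total likelihood:
    $$\nabla_\theta l = \sum_{k=1}^{n}\nabla_\theta\ln p(\mathbf{x}_k|\theta) = \sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\mu) = 0$$
• Rearranging terms:
    $$\sum_{k=1}^{n}\Sigma^{-1}(\mathbf{x}_k-\hat{\mu}) = 0$$
    $$\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu}) = 0$$
    $$\sum_{k=1}^{n}\mathbf{x}_k - \sum_{k=1}^{n}\hat{\mu} = 0$$
    $$\sum_{k=1}^{n}\mathbf{x}_k - n\hat{\mu} = 0$$
    $$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$$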
• Significance???
Gaussian Case: Unknown Mean and Variance
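• Let $\theta = [\theta_1, \theta_2]^t = [\mu, \sigma^2]^t$. The log likelihood of a SINGLE point is:
    $$\ln p(x_k|\theta) = -\frac{1}{2}\ln[2\pi\theta_2] - \frac{1}{2\theta_2}(x_k-\theta_1)^2$$
    $$\nabla_\theta l = \nabla_\theta\ln p(x_k|\theta) = \begin{bmatrix} \frac{1}{\theta_2}(x_k-\theta_1) \\ -\frac{1}{2\theta_2} + \frac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$$
• The full likelihood leads to:
    $$\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0$$
    $$-\sum_{k=1}^{n}\frac{1}{2\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{2\hat{\theta}_2^2} = 0 \;\Rightarrow\; \sum_{k=1}^{n}(x_k-\hat{\theta}_1)^2 = \sum_{k=1}^{n}\hat{\theta}_2 = n\hat{\theta}_2$$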
Gaussian Case: Unknown Mean and Variance
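• This leads to these equations:
    $$\hat{\theta}_1 = \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k, \qquad \hat{\theta}_2 = \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$$
• In the multivariate case:
    $$\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\mu})(\mathbf{x}_k-\hat{\mu})^t$$
• The true covariance is the expected value of the matrix $(\mathbf{x}-\mu)(\mathbf{x}-\mu)^t$, so the ML estimate,
the sample average of the matrices $(\mathbf{x}_k-\hat{\mu})(\mathbf{x}_k-\hat{\mu})^t$, is a familiar result.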
Convergence of the Mean
• Does the maximum likelihood estimate of the variance converge to the true
value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
    $$E[\hat{\mu}] = E\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$$
• Its variance is:
    $$\mathrm{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E[\hat{\mu}^2] - \mu^2$$
    $$= E\left[\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right)\left(\frac{1}{n}\sum_{j=1}^{n}x_j\right)\right] - \mu^2 = \frac{1}{n^2}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_ix_j]\right) - \mu^2$$
Variance of the ML Estimate of the Mean
• The expected value of $x_ix_j$ will be $\mu^2$ for $i \neq j$ since the two random variables are
independent.
• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
• Hence, in the summation above, we have $n^2-n$ terms with expected value $\mu^2$
and $n$ terms with expected value $\mu^2 + \sigma^2$.
• Thus,
    $$\mathrm{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$$
which implies:
    $$E[\hat{\mu}^2] = \mathrm{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$
• We see that the variance of the estimate goes to zero as n goes to infinity, and
our estimate converges to the true value (error goes to zero).
Summary
• Discriminant functions for discrete features are completely analogous to the
continuous case (end of Chapter 2).
• To develop an optimal classifier, we need reliable estimates of the statistics of
the features.
• In Maximum Likelihood (ML) estimation, we treat the parameters as having
unknown but fixed values.
• Justified many well-known results for estimating parameters (e.g., computing
the mean by summing the observations).
• Biased and unbiased estimators.
• Convergence of the mean and variance estimates.
```
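
The linear discriminant for independent binary features (the "Discriminant Functions For Discrete Features" slide) can be sketched in Python. This is a minimal illustration and not part of the original lecture: the function names and the values chosen for p_i, q_i, and the priors are hypothetical. It builds the weights w_i and the bias w_0, then compares the linear form with a direct evaluation of the log posterior ratio.

```python
import math

def discriminant_weights(p, q, prior1, prior2):
    """Return (w, w0) such that g(x) = sum_i w_i * x_i + w0."""
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q))
    w0 += math.log(prior1 / prior2)
    return w, w0

def g(x, w, w0):
    """Linear discriminant: decide class 1 if g(x) > 0, else class 2."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def log_posterior_ratio(x, p, q, prior1, prior2):
    """Direct evaluation of ln[P(x|w1) P(w1) / (P(x|w2) P(w2))]."""
    total = math.log(prior1 / prior2)
    for xi, pi, qi in zip(x, p, q):
        total += xi * math.log(pi / qi)
        total += (1 - xi) * math.log((1 - pi) / (1 - qi))
    return total

# Hypothetical parameters, chosen only for illustration.
p = [0.8, 0.6, 0.3]   # p_i = Pr[x_i = 1 | class 1]
q = [0.4, 0.5, 0.7]   # q_i = Pr[x_i = 1 | class 2]
w, w0 = discriminant_weights(p, q, 0.5, 0.5)

x = [1, 0, 1]
score = g(x, w, w0)
decision = 1 if score > 0 else 2
print(score, decision)
```

With these assumed values g(x) is negative for x = (1, 0, 1), so the sketch decides for the second class.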
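
The idea on the "Example of ML Estimation" and "General Mathematics" slides, choosing the parameter value that maximizes the log likelihood, can be illustrated numerically. The sketch below is an assumption-laden example rather than the lecture's own procedure: it uses a univariate Gaussian with known variance, synthetic data from a seeded generator, and a brute-force grid search over candidate means (hypothetical names `log_likelihood`, `grid`, `best_mu`). The grid maximizer should land next to the sample mean.

```python
import math
import random

def log_likelihood(mu, samples, sigma=1.0):
    """l(mu) = sum_k ln p(x_k | mu) for a Gaussian with known sigma."""
    n = len(samples)
    const = -0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const - sum((x - mu) ** 2 for x in samples) / (2 * sigma ** 2)

# Synthetic training points (assumed true mean 5.0, sigma 1.0).
rng = random.Random(1)
samples = [rng.gauss(5.0, 1.0) for _ in range(50)]

# Candidate means on a grid from 0.00 to 10.00 in steps of 0.01.
grid = [i / 100 for i in range(0, 1001)]
best_mu = max(grid, key=lambda mu: log_likelihood(mu, samples))

sample_mean = sum(samples) / len(samples)
print(best_mu, sample_mean)
```

Because the log likelihood is a concave quadratic in the mean here, the grid point closest to the sample mean wins; in general a grid search only approximates the solution of the gradient equation.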
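
The closed-form ML estimates from the "Gaussian Case: Unknown Mean and Variance" slides can be written in a few lines. This is a minimal sketch using only the Python standard library; the function names `ml_gaussian_1d` and `ml_gaussian`, the synthetic data, and the four 2-D points are hypothetical. Note the division by n, which is what distinguishes the ML variance estimate from the usual unbiased estimate that divides by n - 1.

```python
import random
import statistics

def ml_gaussian_1d(samples):
    """ML estimates (mu_hat, sigma2_hat) for a univariate Gaussian."""
    n = len(samples)
    mu_hat = sum(samples) / n
    # ML variance estimate: divide by n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in samples) / n
    return mu_hat, sigma2_hat

def ml_gaussian(samples):
    """ML estimates (mu_hat, Sigma_hat) for d-dimensional samples (lists)."""
    n, d = len(samples), len(samples[0])
    mu_hat = [sum(x[i] for x in samples) / n for i in range(d)]
    sigma_hat = [
        [sum((x[i] - mu_hat[i]) * (x[j] - mu_hat[j]) for x in samples) / n
         for j in range(d)]
        for i in range(d)
    ]
    return mu_hat, sigma_hat

# Synthetic univariate data (assumed mean 2.0, standard deviation 3.0).
rng = random.Random(0)
data = [rng.gauss(2.0, 3.0) for _ in range(1000)]
mu_hat, sigma2_hat = ml_gaussian_1d(data)
unbiased = statistics.variance(data)   # divides by n - 1

# Four hypothetical 2-D points at the corners of a square.
points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_vec, sigma_mat = ml_gaussian(points)
print(mu_hat, sigma2_hat, unbiased, mu_vec, sigma_mat)
```

The ML variance equals the unbiased estimate scaled by (n - 1)/n, so the two agree closely for large n.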
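
The statement on the "Maximum A Posteriori Estimation" slide that an ML estimator is a MAP estimator for uniform priors can be checked with a small grid search. This sketch is illustrative only: the data values, the prior parameters, and all function names are assumptions. It maximizes the log likelihood plus the log prior, which has the same maximizer as the likelihood times the prior.

```python
import math

def log_likelihood(mu, samples, sigma=1.0):
    """Sum of Gaussian log densities with known sigma."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in samples
    )

def log_uniform_prior(mu, lo=-10.0, hi=10.0):
    """Log of a uniform prior on [lo, hi]; constant inside the interval."""
    return -math.log(hi - lo) if lo <= mu <= hi else -math.inf

def log_gaussian_prior(mu, m0=0.0, s0=1.0):
    """Log of a Gaussian prior N(m0, s0^2) on the mean."""
    return -0.5 * math.log(2 * math.pi * s0 ** 2) - (mu - m0) ** 2 / (2 * s0 ** 2)

samples = [1.8, 2.4, 2.1, 1.6, 2.6]            # hypothetical observations
grid = [i / 1000 for i in range(-5000, 5001)]  # candidate means in [-5, 5]

ml = max(grid, key=lambda mu: log_likelihood(mu, samples))
map_uniform = max(
    grid, key=lambda mu: log_likelihood(mu, samples) + log_uniform_prior(mu))
map_gauss = max(
    grid, key=lambda mu: log_likelihood(mu, samples) + log_gaussian_prior(mu))
print(ml, map_uniform, map_gauss)
```

The uniform prior only adds a constant, so the maximizer is unchanged, while the assumed Gaussian prior centered at zero pulls the estimate toward the prior mean.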
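
The results on the "Convergence of the Mean" and "Variance of the ML Estimate of the Mean" slides use only independence and the first two moments, so they can be verified exactly on a small discrete distribution. The sketch below is an assumption made for illustration: the three-valued distribution and the name `sample_mean_moments` are hypothetical. It enumerates every possible n-sample outcome with exact rational arithmetic and compares the moments of the sample mean with mu and sigma^2 / n.

```python
from fractions import Fraction
from itertools import product

# Hypothetical discrete distribution (values and probabilities are assumptions).
VALUES = [Fraction(0), Fraction(1), Fraction(3)]
PROBS = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

MU = sum(v * p for v, p in zip(VALUES, PROBS))
SIGMA2 = sum((v - MU) ** 2 * p for v, p in zip(VALUES, PROBS))

def sample_mean_moments(n):
    """Return (E[mu_hat], var[mu_hat]) by enumerating all n-sample outcomes."""
    first = Fraction(0)
    second = Fraction(0)
    for outcome in product(range(len(VALUES)), repeat=n):
        prob = Fraction(1)
        for idx in outcome:
            prob *= PROBS[idx]   # independent draws: probabilities multiply
        mu_hat = sum(VALUES[idx] for idx in outcome) / n
        first += prob * mu_hat
        second += prob * mu_hat ** 2
    # var[mu_hat] = E[mu_hat^2] - (E[mu_hat])^2
    return first, second - first ** 2

for n in (1, 2, 4):
    mean, var = sample_mean_moments(n)
    print(n, mean, var, SIGMA2 / n)
```

Exhaustive enumeration grows exponentially with n, so this is only a check for small n; it does not replace the derivation.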