### EMAlgorithm.ppt

```
EM Algorithm
Jur van den Berg
Kalman Filtering vs. Smoothing
• Dynamics and observation model:
    x_{t+1} = A x_t + w_t,   w_t ~ W_t = N(0, Q)
    y_t     = C x_t + v_t,   v_t ~ V_t = N(0, R)
• Kalman Filter:
  – Compute X_t | Y_0 = y_0, …, Y_t = y_t
  – Real-time, given data so far
• Kalman Smoother:
  – Compute X_t | Y_0 = y_0, …, Y_T = y_T,   0 ≤ t ≤ T
  – Post-processing, given all data
EM Algorithm
    x_{t+1} = A x_t + w_t,   w_t ~ W_t = N(0, Q)
    y_t     = C x_t + v_t,   v_t ~ V_t = N(0, R)
• Kalman smoother:
  – Compute distributions X_0, …, X_t
    given parameters A, C, Q, R, and data y_0, …, y_t.
• EM Algorithm:
  – Simultaneously optimize X_0, …, X_t and A, C, Q, R
    given data y_0, …, y_t.
Probability vs. Likelihood
• Probability: predict unknown outcomes based
on known parameters:
– p(x | q)
• Likelihood: estimate unknown parameters
based on known outcomes:
– L(q | x) = p(x | q)
• Coin-flip example:
– q is probability of “heads” (parameter)
– x = HHHTTH is outcome
Likelihood for Coin-flip Example
• Probability of outcome given parameter:
  – p(x = HHHTTH | q = 0.5) = 0.5^6 ≈ 0.016
• Likelihood of parameter given outcome:
  – L(q = 0.5 | x = HHHTTH) = p(x | q) ≈ 0.016
• Likelihood maximal when q = 4/6 = 0.6666… (4 heads out of 6 flips)
• Likelihood function not a probability density
Likelihood for Cont. Distributions
• Six samples {-3, -2, -1, 1, 2, 3} believed to be
  drawn from some Gaussian N(0, s^2)
• Likelihood of s:
    L(s | {-3, -2, -1, 1, 2, 3}) = p(x = -3 | s) · p(x = -2 | s) · … · p(x = 3 | s)
• Maximum likelihood:
    s = sqrt( ((-3)^2 + (-2)^2 + (-1)^2 + 1^2 + 2^2 + 3^2) / 6 ) ≈ 2.16
Likelihood for Stochastic Model
• Dynamics model:
    x_{t+1} = A x_t + w_t,   w_t ~ W_t = N(0, Q)
    y_t     = C x_t + v_t,   v_t ~ V_t = N(0, R)
• Suppose x_t and y_t are given for 0 ≤ t ≤ T; what
  is the likelihood of A, C, Q and R?
    L(A, C, Q, R | x, y) = p(x, y | A, C, Q, R)
                         = Π_{t=0}^{T-1} p(x_{t+1} | x_t) · Π_{t=0}^{T} p(y_t | x_t)
• Compute log-likelihood: log p(x, y | A, C, Q, R)
Log-likelihood
    log p(x, y | A, C, Q, R)
      = log [ Π_{t=0}^{T-1} p(x_{t+1} | x_t) · Π_{t=0}^{T} p(y_t | x_t) ]
      = Σ_{t=0}^{T-1} log p(x_{t+1} | x_t) + Σ_{t=0}^{T} log p(y_t | x_t) = …
• Multivariate normal distribution N(m, S) has pdf:
    p(x) = (2π)^{-k/2} |S|^{-1/2} exp( -(1/2) (x - m)^T S^{-1} (x - m) )
• From model: x_{t+1} ~ N(A x_t, Q),   y_t ~ N(C x_t, R)
    … = -(1/2) Σ_{t=0}^{T-1} [ log|Q| + (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) ]
        -(1/2) Σ_{t=0}^{T}   [ log|R| + (y_t - C x_t)^T R^{-1} (y_t - C x_t) ] + const
Log-likelihood #2
    -(1/2) Σ_{t=0}^{T-1} [ log|Q| + (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) ]
    -(1/2) Σ_{t=0}^{T}   [ log|R| + (y_t - C x_t)^T R^{-1} (y_t - C x_t) ] + const = …
• a = Tr(a) if a is scalar
• Bring summation inward:
    … = -(T/2) log|Q| - (1/2) Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
        -((T+1)/2) log|R| - (1/2) Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const
Log-likelihood #3
    -(T/2) log|Q| - (1/2) Σ_{t=0}^{T-1} Tr( (x_{t+1} - A x_t)^T Q^{-1} (x_{t+1} - A x_t) )
    -((T+1)/2) log|R| - (1/2) Σ_{t=0}^{T} Tr( (y_t - C x_t)^T R^{-1} (y_t - C x_t) ) + const = …
• Tr(AB) = Tr(BA)
• Tr(A) + Tr(B) = Tr(A + B)
    … = -(T/2) log|Q| - (1/2) Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
        -((T+1)/2) log|R| - (1/2) Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const
Log-likelihood #4
    -(T/2) log|Q| - (1/2) Tr( Q^{-1} Σ_{t=0}^{T-1} (x_{t+1} - A x_t)(x_{t+1} - A x_t)^T )
    -((T+1)/2) log|R| - (1/2) Tr( R^{-1} Σ_{t=0}^{T} (y_t - C x_t)(y_t - C x_t)^T ) + const = …
• Expand:
    l(A, C, Q, R | x, y)
      = -(T/2) log|Q|
        - (1/2) Tr( Q^{-1} Σ_{t=0}^{T-1} [ x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ] )
        - ((T+1)/2) log|R|
        - (1/2) Tr( R^{-1} Σ_{t=0}^{T} [ y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ] )
        + const
Maximize likelihood
• log is a monotone function
  – max log(f(x)) ⇔ max f(x)
• Maximize l(A, C, Q, R | x, y) in turn for A, C, Q and R:
  – Solve ∂l(A, C, Q, R | x, y)/∂A = 0 for A
  – Solve ∂l(A, C, Q, R | x, y)/∂C = 0 for C
  – Solve ∂l(A, C, Q, R | x, y)/∂Q = 0 for Q
  – Solve ∂l(A, C, Q, R | x, y)/∂R = 0 for R
Matrix derivatives
• Defined for scalar functions f : R^{n×m} -> R
• Key identities:
    ∂(x^T A x)/∂x = x^T (A + A^T)        ∂(B^T A B)/∂B = B^T (A + A^T)
    ∂ Tr(AB)/∂A = ∂ Tr(BA)/∂A = ∂ Tr(B^T A^T)/∂A = B^T
    ∂ log|A| / ∂A = A^{-T}
Optimizing A
• Derivative:
    ∂l(A, C, Q, R | x, y)/∂A = -(1/2) Q^{-1} Σ_{t=0}^{T-1} ( -2 x_{t+1} x_t^T + 2 A x_t x_t^T )
• Maximizer:
    A = ( Σ_{t=0}^{T-1} x_{t+1} x_t^T ) ( Σ_{t=0}^{T-1} x_t x_t^T )^{-1}
Optimizing C
• Derivative:
    ∂l(A, C, Q, R | x, y)/∂C = -(1/2) R^{-1} Σ_{t=0}^{T} ( -2 y_t x_t^T + 2 C x_t x_t^T )
• Maximizer:
    C = ( Σ_{t=0}^{T} y_t x_t^T ) ( Σ_{t=0}^{T} x_t x_t^T )^{-1}
Optimizing Q
• Derivative with respect to the inverse:
    ∂l(A, C, Q, R | x, y)/∂Q^{-1} = (T/2) Q^T
      - (1/2) Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )^T
• Maximizer:
    Q = (1/T) Σ_{t=0}^{T-1} ( x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T )
Optimizing R
• Derivative with respect to the inverse:
    ∂l(A, C, Q, R | x, y)/∂R^{-1} = ((T+1)/2) R^T
      - (1/2) Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )^T
• Maximizer:
    R = (1/(T+1)) Σ_{t=0}^{T} ( y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T )
EM-algorithm
    x_{t+1} = A x_t + w_t,   w_t ~ W_t = N(0, Q)
    y_t     = C x_t + v_t,   v_t ~ V_t = N(0, R)
• Initial guesses of A, C, Q, R
• Kalman smoother (E-step):
  – Compute distributions X_0, …, X_T
    given data y_0, …, y_T and A, C, Q, R.
• Update parameters (M-step):
  – Update A, C, Q, R such that
    expected log-likelihood is maximized
• Repeat until convergence (local optimum)
Kalman Smoother
• for (t = 0; t < T; ++t)    // Kalman filter
    x̂_{t+1|t}   = A x̂_{t|t}
    P_{t+1|t}   = A P_{t|t} A^T + Q
    K_{t+1}     = P_{t+1|t} C^T ( C P_{t+1|t} C^T + R )^{-1}
    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} ( y_{t+1} - C x̂_{t+1|t} )
    P_{t+1|t+1} = P_{t+1|t} - K_{t+1} C P_{t+1|t}
• for (t = T - 1; t >= 0; --t)    // Backward pass
    L_t     = P_{t|t} A^T P_{t+1|t}^{-1}
    x̂_{t|T} = x̂_{t|t} + L_t ( x̂_{t+1|T} - x̂_{t+1|t} )
    P_{t|T} = P_{t|t} + L_t ( P_{t+1|T} - P_{t+1|t} ) L_t^T
Update Parameters
• Likelihood is in terms of x, but only the distribution of X is available
    l(A, C, Q, R | x, y)
      = -(T/2) log|Q|
        - (1/2) Tr( Q^{-1} Σ_{t=0}^{T-1} [ x_{t+1} x_{t+1}^T - x_{t+1} x_t^T A^T - A x_t x_{t+1}^T + A x_t x_t^T A^T ] )
        - ((T+1)/2) log|R|
        - (1/2) Tr( R^{-1} Σ_{t=0}^{T} [ y_t y_t^T - y_t x_t^T C^T - C x_t y_t^T + C x_t x_t^T C^T ] )
        + const
• Likelihood function is linear in x_t, x_t x_t^T and x_t x_{t+1}^T
• Expected likelihood: replace them with
    E(X_t | y)           = x̂_{t|T}
    E(X_t X_t^T | y)     = P_{t|T} + x̂_{t|T} x̂_{t|T}^T
    E(X_t X_{t+1}^T | y) = x̂_{t|t} x̂_{t+1|T}^T + L_t ( P_{t+1|T} + ( x̂_{t+1|T} - x̂_{t+1|t} ) x̂_{t+1|T}^T )
• Use the maximizers to update A, C, Q and R.
Convergence
• Convergence is guaranteed to local optimum
• Similar to coordinate ascent
Conclusion
• EM-algorithm to simultaneously optimize
state estimates and model parameters
• Given “training data”, the EM-algorithm can be
used (off-line) to learn the model for
subsequent use in (real-time) Kalman filters
Next time
• Learning from demonstrations
• Dynamic Time Warping
```