### PowerPoint - Ivan Titov

```Learning for Structured Prediction
Linear Methods For Sequence Labeling:
Hidden Markov Models vs Structured Perceptron
Ivan Titov

Last Time: Structured Prediction
x is an input (e.g., a sentence), y is an output (a syntactic tree)
1. Selecting the feature representation $\varphi(x, y)$
   - It should be sufficient to discriminate the correct structure from incorrect ones
   - It should be possible to decode with it (see (3))
2. Learning
   - Which error function to optimize on the training set, for example
     $w \cdot \varphi(x, y^{\star}) - \max_{y' \in Y(x),\, y' \neq y^{\star}} w \cdot \varphi(x, y') > \gamma$
   - How to make it efficient (see (3))
3. Decoding: $y = \arg\max_{y' \in Y(x)} w \cdot \varphi(x, y')$
   - Dynamic programming for simpler representations $\varphi$
   - Approximate search for more powerful ones?
We illustrated all these challenges on the example of dependency parsing

Outline
- Sequence labeling / segmentation problems: settings and example problems
  - Part-of-speech tagging, named entity recognition, gesture recognition
- Hidden Markov Model
  - Standard definition + maximum likelihood estimation
  - General view: as a representative of linear models
- Perceptron and Structured Perceptron
  - Algorithms / motivations
- Decoding with the Linear Model
- Discussion: Discriminative vs. Generative

Sequence Labeling Problems
- Definition:
  - Input: sequences of variable length, $x = (x_1, x_2, \ldots, x_{|x|})$, $x_i \in X$
  - Output: every position is labeled, $y = (y_1, y_2, \ldots, y_{|x|})$, $y_i \in Y$
- Examples:
  - Part-of-speech tagging
      x = John  carried  a   tin  can  .
      y = NP    VBD      DT  NN   NN   .
  - Named-entity recognition, shallow parsing ("chunking"), gesture recognition from video streams, …

Part-of-speech tagging
    x = John  carried  a   tin  can         .
    y = NNP   VBD      DT  NN   NN or MD?   .
- If you just predict the most frequent tag for each word, you will make a mistake here
- In fact, even knowing that the previous word is a noun is not enough
- Labels:
  - NNP - proper singular noun
  - NN - singular noun
  - VBD - verb, past tense
  - MD - modal
  - DT - determiner
  - . - final punctuation
- Consider:
    x = Tin  can  cause  poisoning  …
    y = NN   MD   VB     NN         …
- One needs to model interactions between labels to successfully resolve ambiguities, so this should be tackled as a structured prediction problem

Named Entity Recognition
- [ORG Chelsea], despite their name, are not based in [LOC Chelsea], but in neighbouring [LOC Fulham].
- Not as trivial as it may seem, consider:
  - [PERS Bill Clinton] will not embarrass [PERS Chelsea] at her wedding
    (Chelsea can be a person too!)
  - Tiger failed to make a birdie in the South Course …
    (Is it an animal or a person?)
- Encoding example (BIO-encoding):
    x = Bill    Clinton  embarrassed  Chelsea  at  her  wedding  at  Astor  Courts
    y = B-PERS  I-PERS   O            B-PERS   O   O    O        O   B-LOC  I-LOC

Vision: Gesture Recognition
- Given a sequence of frames in a video, annotate each frame with a gesture type
- Types of gestures: flip back, shrink vertically, expand vertically, double back, point and back, expand horizontally
  [Figures from (Wang et al., CVPR 06)]
- It is hard to predict gestures from each frame in isolation; you need to exploit relations between frames and gesture types

Hidden Markov Models
- We will consider the part-of-speech (POS) tagging example
    John  carried  a   tin  can  .
    NP    VBD      DT  NN   NN   .
- A "generative" model, i.e.:
  - Model: introduce a parameterized model of how both words and tags are generated, $P(x, y \mid \theta)$
  - Learning: use a labeled training set to estimate the most likely parameters $\hat{\theta}$ of the model
  - Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid \hat{\theta})$

Hidden Markov Models
A simplistic state diagram for noun phrases (N - number of tags, M - vocabulary size):
  [Figure: state diagram with the special state \$, a state Det emitting [0.5: a, 0.5: the], a state emitting [0.01: red, 0.01: hungry, …] and a state Noun emitting [0.01: dog, 0.01: herring, …]; the edges carry transition probabilities (0.8, 0.5, 0.2, 0.1, 1.0, …)]
Example: a hungry dog
- States correspond to POS tags
- Words are emitted independently from each POS tag
- Parameters (to be estimated from the training set):
  - Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: [N x N] matrix
    (Stationarity assumption: this probability does not depend on the position t in the sequence)
  - Emission probabilities $P(x^{(t)} \mid y^{(t)})$: [N x M] matrix

Hidden Markov Models
Representation as an instantiation of a graphical model (N - number of tags, M - vocabulary size):
    y(1) = Det  ->  y(2)           ->  y(3) = Noun  ->  y(4)  ->  …
      |               |                  |                |
      v               v                  v                v
    x(1) = a        x(2) = hungry      x(3) = dog       x(4)      …
An arrow means that in the generative story x(4) is generated from some P(x(4) | y(4))
- States correspond to POS tags
- Words are emitted independently from each POS tag
- Parameters (to be estimated from the training set):
  - Transition probabilities $P(y^{(t)} \mid y^{(t-1)})$: [N x N] matrix
    (Stationarity assumption: this probability does not depend on the position t in the sequence)
  - Emission probabilities $P(x^{(t)} \mid y^{(t)})$: [N x M] matrix

Hidden Markov Models: Estimation
- N - the number of tags, M - vocabulary size
- Parameters (to be estimated from the training set):
  - Transition probabilities $a_{j,i} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A - [N x N] matrix
  - Emission probabilities $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$, B - [N x M] matrix
- Training corpus:
  - x(1) = (In, an, Oct., 19, review, of, …. ), y(1) = (IN, DT, NNP, CD, NN, IN, …. )
  - x(2) = (Ms., Haag, plays, Elianti, .), y(2) = (NNP, NNP, VBZ, NNP, .)
  - …
  - x(L) = (The, company, said, …), y(L) = (DT, NN, VBD, NNP, .)
- How to estimate the parameters using maximum likelihood estimation?
  - You can probably guess what these estimates should be

Hidden Markov Models: Estimation
- Parameters (to be estimated from the training set):
  - Transition probabilities $a_{j,i} = P(y^{(t)} = i \mid y^{(t-1)} = j)$, A - [N x N] matrix
  - Emission probabilities $b_{i,k} = P(x^{(t)} = k \mid y^{(t)} = i)$, B - [N x M] matrix
- Training corpus: $(x^{(l)}, y^{(l)})$, l = 1, …, L
- Write down the probability of the corpus according to the HMM:
  $P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{l=1}^{L} P(x^{(l)}, y^{(l)})$
  $= \prod_{l=1}^{L} a_{\$, y_1^{(l)}} \left( \prod_{t=1}^{|x^{(l)}|-1} b_{y_t^{(l)}, x_t^{(l)}} \, a_{y_t^{(l)}, y_{t+1}^{(l)}} \right) b_{y_{|x^{(l)}|}^{(l)}, x_{|x^{(l)}|}^{(l)}} \, a_{y_{|x^{(l)}|}^{(l)}, \$}$
  $= \prod_{i,j=1}^{N} a_{i,j}^{C_T(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{C_E(i,k)}$
  Reading the factors of the second line from left to right:
  - $a_{\$, y_1^{(l)}}$: select the tag for the first word
  - $b_{y_t^{(l)}, x_t^{(l)}}$: draw a word from this state
  - $a_{y_t^{(l)}, y_{t+1}^{(l)}}$: select the next state
  - $b_{y_{|x^{(l)}|}^{(l)}, x_{|x^{(l)}|}^{(l)}}$: draw the last word
  - $a_{y_{|x^{(l)}|}^{(l)}, \$}$: transit into the \$ state
- $C_T(i,j)$ is the number of times tag i is followed by tag j. Here we assume that \$ is a special tag which precedes and succeeds every sentence
- $C_E(i,k)$ is the number of times word k is emitted by tag i

Hidden Markov Models: Estimation
- Maximize:
  $P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \prod_{i,j=1}^{N} a_{i,j}^{C_T(i,j)} \prod_{i=1}^{N} \prod_{k=1}^{M} b_{i,k}^{C_E(i,k)}$
  - $C_T(i,j)$ is the number of times tag i is followed by tag j
  - $C_E(i,k)$ is the number of times word k is emitted by tag i
- Equivalently, maximize the logarithm of this:
  $\log P(\{x^{(l)}, y^{(l)}\}_{l=1}^{L}) = \sum_{i=1}^{N} \left( \sum_{j=1}^{N} C_T(i,j) \log a_{i,j} + \sum_{k=1}^{M} C_E(i,k) \log b_{i,k} \right)$
  subject to the probabilistic constraints: $\sum_{j=1}^{N} a_{i,j} = 1$, $\sum_{k=1}^{M} b_{i,k} = 1$, $i = 1, \ldots, N$
- Or, we can decompose it into 2N optimization tasks:
  - For transitions, $i = 1, \ldots, N$:
    $\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} C_T(i,j) \log a_{i,j}$  s.t.  $\sum_{j=1}^{N} a_{i,j} = 1$
  - For emissions, $i = 1, \ldots, N$:
    $\max_{b_{i,1}, \ldots, b_{i,M}} \sum_{k=1}^{M} C_E(i,k) \log b_{i,k}$  s.t.  $\sum_{k=1}^{M} b_{i,k} = 1$

Hidden Markov Models: Estimation
- For transitions (some i):
  $\max_{a_{i,1}, \ldots, a_{i,N}} \sum_{j=1}^{N} C_T(i,j) \log a_{i,j}$  s.t.  $1 - \sum_{j=1}^{N} a_{i,j} = 0$
- Lagrangian:
  $L(a_{i,1}, \ldots, a_{i,N}, \lambda) = \sum_{j=1}^{N} C_T(i,j) \log a_{i,j} + \lambda \times \left(1 - \sum_{j=1}^{N} a_{i,j}\right)$
- Find critical points of the Lagrangian by solving the system of equations:
  $\frac{\partial L}{\partial \lambda} = 1 - \sum_{j=1}^{N} a_{i,j} = 0$
  $\frac{\partial L}{\partial a_{i,j}} = \frac{C_T(i,j)}{a_{i,j}} - \lambda = 0 \;\Longrightarrow\; a_{i,j} = \frac{C_T(i,j)}{\lambda}$
- Substituting into the first equation gives $\lambda = \sum_{j'} C_T(i,j')$, so
  $P(y_t = j \mid y_{t-1} = i) = a_{i,j} = \frac{C_T(i,j)}{\sum_{j'} C_T(i,j')}$
- Similarly, for emissions:
  $P(x_t = k \mid y_t = i) = b_{i,k} = \frac{C_E(i,k)}{\sum_{k'} C_E(i,k')}$
- The maximum likelihood solution is just normalized counts of events. It is always like this for generative models if all the labels are visible in training
- I ignore "smoothing" to process rare or unseen word-tag combinations… It is outside the scope of the seminar

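# A minimal Python sketch of the result derived above: the maximum likelihood
# estimates of an HMM are normalized counts. The function name `estimate_hmm`
# and the toy corpus are illustrative assumptions, not part of the slides.
# "$" plays the role of the special start/stop tag; no smoothing is applied.
from collections import Counter, defaultdict

def estimate_hmm(corpus):
    """corpus: list of (words, tags) pairs; returns (a, b) as nested dicts."""
    c_t = defaultdict(Counter)  # c_t[i][j]: number of times tag i is followed by tag j
    c_e = defaultdict(Counter)  # c_e[i][k]: number of times tag i emits word k
    for words, tags in corpus:
        padded = ["$"] + list(tags) + ["$"]
        for prev_tag, next_tag in zip(padded, padded[1:]):
            c_t[prev_tag][next_tag] += 1
        for word, tag in zip(words, tags):
            c_e[tag][word] += 1
    # Normalize every row of counts so that it sums to one.
    a = {i: {j: n / sum(row.values()) for j, n in row.items()} for i, row in c_t.items()}
    b = {i: {k: n / sum(row.values()) for k, n in row.items()} for i, row in c_e.items()}
    return a, b

toy_corpus = [
    (["a", "dog"], ["DT", "NN"]),
    (["the", "dog"], ["DT", "NN"]),
    (["a", "tin", "can"], ["DT", "NN", "NN"]),
]
a, b = estimate_hmm(toy_corpus)
# Here NN is followed by "$" three times and by NN once, so a["NN"]["$"] is 3/4.
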
HMMs as linear models
    John  carried  a  tin  can  .
    ?     ?        ?  ?    ?    .
- Decoding: $\hat{y} = \arg\max_{y'} P(x, y' \mid A, B) = \arg\max_{y'} \log P(x, y' \mid A, B)$
- We will talk about the decoding algorithm slightly later; let us first generalize the Hidden Markov Model:
  $\log P(x, y' \mid A, B) = \sum_{t=0}^{|x|} \log a_{y'_t, y'_{t+1}} + \sum_{t=1}^{|x|} \log b_{y'_t, x_t}$, with $y'_0 = y'_{|x|+1} = \$$
  $= \sum_{i=1}^{N} \sum_{j=1}^{N} C_T(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} C_E(x, y', i, k) \times \log b_{i,k}$
  - $C_T(y', i, j)$: the number of times tag i is followed by tag j in the candidate y'
  - $C_E(x, y', i, k)$: the number of times tag i corresponds to word k in (x, y')
- But this is just a linear model!!

Scoring: example
    (x, y') = John  carried  a   tin  can  .
              NP    VBD      DT  NN   NN   .
- Feature vector $\varphi(x, y')$:
  - Unary features $C_E(x, y', i, k)$: NP:John = 1, NP:Mary = 0, …
  - Edge features $C_T(y', i, j)$: NP-VBD = 1, NN-. = 1, MD-. = 0, …
- Weight vector, with components in the same order:
  $w_{ML} = (\log b_{NP,John}, \log b_{NP,Mary}, \ldots, \log a_{NP,VBD}, \log a_{NN,.}, \log a_{MD,.}, \ldots)$
- Their inner product is exactly $\log P(x, y' \mid A, B)$:
  $w_{ML} \cdot \varphi(x, y') = \sum_{i=1}^{N} \sum_{j=1}^{N} C_T(y', i, j) \times \log a_{i,j} + \sum_{i=1}^{N} \sum_{k=1}^{M} C_E(x, y', i, k) \times \log b_{i,k}$
- But maybe there are other (and better?) ways to estimate w, especially when we know that the HMM is not a faithful model of reality?
- It is not only a theoretical question! (we'll talk about that in a moment)

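# A small check of the claim above, under assumed toy parameters: the HMM
# log-probability equals the inner product of w_ML (log-probabilities) with
# the count features. The helper names `features` and `log_joint` and all
# numbers are hypothetical; "$" is the special start/stop tag.
import math
from collections import Counter

def features(words, tags):
    """Counts ("T", i, j) for tag i followed by tag j and ("E", i, k) for tag i with word k."""
    phi = Counter()
    padded = ["$"] + list(tags) + ["$"]
    for prev_tag, next_tag in zip(padded, padded[1:]):
        phi[("T", prev_tag, next_tag)] += 1
    for word, tag in zip(words, tags):
        phi[("E", tag, word)] += 1
    return phi

def log_joint(words, tags, a, b):
    """log P(x, y | A, B) computed directly from the HMM factorization."""
    padded = ["$"] + list(tags) + ["$"]
    total = sum(math.log(a[p][n]) for p, n in zip(padded, padded[1:]))
    return total + sum(math.log(b[t][w]) for w, t in zip(words, tags))

a = {"$": {"DT": 1.0}, "DT": {"NN": 1.0}, "NN": {"NN": 0.25, "$": 0.75}}
b = {"DT": {"a": 0.5, "the": 0.5}, "NN": {"tin": 0.4, "can": 0.6}}

# w_ML has one weight per feature: the log of the matching probability.
w_ml = {("T", i, j): math.log(p) for i, row in a.items() for j, p in row.items()}
w_ml.update({("E", i, k): math.log(p) for i, row in b.items() for k, p in row.items()})

x, y = ["a", "tin", "can"], ["DT", "NN", "NN"]
linear_score = sum(count * w_ml[f] for f, count in features(x, y).items())
direct_score = log_joint(x, y, a, b)
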
Feature view
Basically, we define features which correspond to edges in the graph:
    y(1) -- y(2) -- y(3) -- y(4) -- …
     |       |       |       |
    x(1)    x(2)    x(3)    x(4)    …
The x(t) are visible (both in training and testing)

Generative modeling
- For a very large dataset (asymptotic analysis):
  - If data is generated from some "true" HMM, then (if the training set is sufficiently large), we are guaranteed to have an optimal tagger
  - Otherwise, (generally) the HMM will not correspond to an optimal linear classifier
    (Real case: the HMM is a coarse approximation of reality)
  - Discriminative methods which minimize the error more directly are guaranteed (under some fairly general conditions) to converge to an optimal linear classifier
- For smaller training sets:
  - Generative classifiers converge faster to their optimal error [Ng & Jordan, NIPS 01]
  [Figure: errors on a regression dataset (predict housing prices in the Boston area) against # train examples, with one curve for a discriminative classifier and one for a generative model]

Perceptron
- For binary classification, $y \in \{+1, -1\}$, the prediction rule is: $y = \mathrm{sign}(w \cdot \varphi(x))$
  (break ties (0) in some deterministic way)
- Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
    w = 0                                     // initialize
    do
      err = 0
      for l = 1 .. L                          // over the training examples
        if ( y(l) (w · φ(x(l))) < 0 )         // if mistake
          w += η y(l) φ(x(l))                 // update, η > 0
          err++                               // # errors
        endif
      endfor
    while ( err > 0 )                         // repeat until no errors
    return w

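# A minimal Python sketch of the perceptron loop above. Assumptions that are
# not on the slide: a score of exactly 0 counts as a mistake (otherwise the
# initial w = 0 would never be updated), training stops after `max_epochs`
# passes even if errors remain, and the toy data set is made up.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron(data, eta=1.0, max_epochs=100):
    """data: list of (phi, y) with phi a list of floats and y in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        errors = 0
        for phi, y in data:
            if y * dot(w, phi) <= 0:  # mistake (ties count as mistakes)
                w = [wi + eta * y * pi for wi, pi in zip(w, phi)]
                errors += 1
        if errors == 0:  # a full pass without mistakes: stop
            break
    return w

# The first component of every feature vector is 1, which plays the role of a bias term.
toy_data = [([1.0, 2.0, 1.0], +1), ([1.0, 1.0, 3.0], -1),
            ([1.0, 3.0, 0.5], +1), ([1.0, 0.5, 2.0], -1)]
w = train_perceptron(toy_data)
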
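Linear classification
- Linearly separable case, "a perfect" classifier, with the decision boundary $w \cdot \varphi(x) + b = 0$
  [Figure from Roth's class at UIUC: a separating hyperplane with normal vector w, axes $\varphi(x)_1$ and $\varphi(x)_2$]
- Linear functions are often written as $y = \mathrm{sign}(w \cdot \varphi(x) + b)$, but we can assume that $\varphi(x)_0 = 1$ for any x
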
Perceptron: geometric interpretation
    if ( y(l) (w · φ(x(l))) < 0 )         // if mistake
      w += η y(l) φ(x(l))                 // update
    endif

Perceptron: algebraic interpretation
    if ( y(l) (w · φ(x(l))) < 0 )         // if mistake
      w += η y(l) φ(x(l))                 // update
    endif
- We want the update to increase $y^{(l)} \left(w \cdot \varphi(x^{(l)})\right)$
- If the increase is large enough, then there will be no misclassification
- Let's see that this is what happens after the update:
  $y^{(l)} \left( (w + \eta y^{(l)} \varphi(x^{(l)})) \cdot \varphi(x^{(l)}) \right) = y^{(l)} \left( w \cdot \varphi(x^{(l)}) \right) + \eta (y^{(l)})^2 \left( \varphi(x^{(l)}) \cdot \varphi(x^{(l)}) \right)$
  where $(y^{(l)})^2 = 1$ and $\varphi(x^{(l)}) \cdot \varphi(x^{(l)})$ is a squared norm > 0
- So, the perceptron update moves the decision hyperplane towards the misclassified $\varphi(x^{(l)})$

Perceptron
- The perceptron algorithm, obviously, can only converge if the training set is linearly separable
- In that case it is guaranteed to converge in a finite number of iterations, dependent on how well the two classes are separated (Novikoff, 1962)

Averaged Perceptron
- A small modification:
    w = 0, w_Σ = 0                            // initialize
    for k = 1 .. K                            // for a number of iterations (do not run until convergence)
      for l = 1 .. L                          // over the training examples
        if ( y(l) (w · φ(x(l))) < 0 )         // if mistake
          w += η y(l) φ(x(l))                 // update, η > 0
        endif
        w_Σ += w                              // sum of w over the course of training (note: it is after endif)
      endfor
    endfor
    return (1 / (K L)) w_Σ
- More stable in training: a vector w which survived more iterations without updates is more similar to the resulting vector $\frac{1}{KL} w_{\Sigma}$, as it was added a larger number of times

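# A Python sketch of the averaged perceptron above. Assumptions that are not
# on the slide: a score of exactly 0 counts as a mistake (so that the initial
# w = 0 gets updated), and the toy data set and default values are made up.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_averaged_perceptron(data, eta=1.0, epochs=10):
    """data: list of (phi, y) with phi a list of floats and y in {+1, -1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim
    for _ in range(epochs):
        for phi, y in data:
            if y * dot(w, phi) <= 0:  # mistake (ties count as mistakes)
                w = [wi + eta * y * pi for wi, pi in zip(w, phi)]
            # Accumulate after the check, for every example, mistake or not.
            w_sum = [si + wi for si, wi in zip(w_sum, w)]
    n = epochs * len(data)
    return [si / n for si in w_sum]

toy_data = [([1.0, 2.0, 1.0], +1), ([1.0, 1.0, 3.0], -1),
            ([1.0, 3.0, 0.5], +1), ([1.0, 0.5, 2.0], -1)]
w_avg = train_averaged_perceptron(toy_data)
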
Structured Perceptron
- Let us start with the structured problem: $y = \arg\max_{y' \in Y(x)} w \cdot \varphi(x, y')$
- Perceptron algorithm, given a training set $\{x^{(l)}, y^{(l)}\}_{l=1}^{L}$:
    w = 0                                               // initialize
    do
      err = 0
      for l = 1 .. L                                    // over the training examples
        ŷ = argmax_{y' ∈ Y(x(l))} w · φ(x(l), y')       // model prediction
        if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )       // if mistake
          w += η (φ(x(l), y(l)) − φ(x(l), ŷ))           // update
          err++                                         // # errors
        endif
      endfor
    while ( err > 0 )                                   // repeat until no errors
    return w
- The update pushes the correct sequence up and the incorrectly predicted one down

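# A toy Python sketch of the structured perceptron above. To stay short,
# decoding enumerates every tag sequence with itertools.product instead of
# using dynamic programming, so it only suits tiny examples. Assumptions that
# are not on the slide: the feature map (tag-bigram and tag-word counts), the
# tag set, the toy data, updating whenever the prediction differs from the
# gold sequence, and the `max_epochs` cap.
from collections import Counter
from itertools import product

TAGS = ["NN", "MD", "VB"]

def phi(words, tags):
    feats = Counter()
    padded = ["$"] + list(tags) + ["$"]
    for prev_tag, next_tag in zip(padded, padded[1:]):
        feats[("T", prev_tag, next_tag)] += 1
    for word, tag in zip(words, tags):
        feats[("E", tag, word)] += 1
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * c for f, c in feats.items())

def decode(w, words):
    # Brute-force argmax over Y(x); max() keeps the first maximum, so ties
    # are broken deterministically.
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(w, phi(words, tags)))

def train(data, eta=1.0, max_epochs=20):
    w = {}
    for _ in range(max_epochs):
        errors = 0
        for words, gold in data:
            pred = decode(w, words)
            if pred != tuple(gold):
                for f, c in phi(words, gold).items():  # push the correct sequence up
                    w[f] = w.get(f, 0.0) + eta * c
                for f, c in phi(words, pred).items():  # push the predicted one down
                    w[f] = w.get(f, 0.0) - eta * c
                errors += 1
        if errors == 0:
            break
    return w

toy_data = [(["tin", "can"], ["NN", "NN"]),
            (["can", "cause"], ["MD", "VB"])]
w = train(toy_data)
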
Str. perceptron: algebraic interpretation
    if ( w · φ(x(l), ŷ) > w · φ(x(l), y(l)) )       // if mistake
      w += η (φ(x(l), y(l)) − φ(x(l), ŷ))           // update
    endif
- We want the update to increase $w \cdot \left( \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \right)$
- If the increase is large enough, then $y^{(l)}$ will be scored above $\hat{y}$
- Clearly, this is achieved, as this product will be increased by $\eta \, \| \varphi(x^{(l)}, y^{(l)}) - \varphi(x^{(l)}, \hat{y}) \|^2$
- There might be other $y' \in Y(x^{(l)})$, but we will deal with them on the next iterations

Structured Perceptron
- Positive:
  - Very easy to implement
  - Often achieves respectable results
  - Like other discriminative techniques, it does not make assumptions about the generative process
  - Additional features can be easily integrated, as long as decoding is tractable
- Drawbacks:
  - "Good" discriminative algorithms should optimize some measure which is closely related to the expected testing results: what the perceptron is doing on non-linearly separable data is not clear
  - However, for the averaged (voted) version there is a generalization bound which characterizes the generalization properties of the perceptron (Freund & Schapire 98)
- Later, we will consider more advanced learning algorithms

Decoding with the Linear model
- Decoding: $y = \arg\max_{y' \in Y(x)} w \cdot \varphi(x, y')$
- Again a linear model with the following edge features (a generalization of an HMM):
    y(1) -- y(2) -- y(3) -- y(4) -- …
     |       |       |       |
    x(1)    x(2)    x(3)    x(4)    …
- In fact, the algorithm does not depend on the features of the input (they do not need to be local):
    y(1) -- y(2) -- y(3) -- y(4) -- …
    (every y(t) is connected to the whole input x)

Decoding with the Linear model
- Decoding: $y = \arg\max_{y' \in Y(x)} w \cdot \varphi(x, y')$
    y(1) -- y(2) -- y(3) -- y(4) -- …
    (every y(t) is connected to the whole input x)
- Let's change notation:
  - Edge scores $f_t(y_{t-1}, y_t, x)$: roughly correspond to $\log a_{y_{t-1}, y_t} + \log b_{y_t, x_t}$
  - Defined for t = 0 too ("start" feature: $y_0 = \$$)
  - Start/Stop symbol information (\$) can be encoded with them too
- Decode: $y = \arg\max_{y' \in Y(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
- Decoding: a dynamic programming algorithm - Viterbi algorithm

Viterbi algorithm
- Decoding: $y = \arg\max_{y' \in Y(x)} \sum_{t=1}^{|x|} f_t(y'_{t-1}, y'_t, x)$
    y(1) -- y(2) -- y(3) -- y(4) -- …
    (every y(t) is connected to the whole input x)
- Loop invariant ($t = 1, \ldots, |x|$):
  - $score_t[y]$ - score of the highest scoring sequence up to position t that ends with tag y
  - $prev_t[y]$ - previous tag on this sequence
- Init: $score_0[\$] = 0$, $score_0[y] = -\infty$ for other y
- Recomputation ($t = 1, \ldots, |x|$):
  $prev_t[y] = \arg\max_{y'} \; score_{t-1}[y'] + f_t(y', y, x)$
  $score_t[y] = score_{t-1}[prev_t[y]] + f_t(prev_t[y], y, x)$
- Return: retrace prev pointers starting from $\arg\max_{y} score_{|x|}[y]$
- Time complexity? $O(N^2 |x|)$

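# A compact Python sketch of the Viterbi recursion above. The edge-score
# function `f(t, prev_tag, tag, words)` stands for f_t(y_{t-1}, y_t, x); its
# signature, the tag set and the toy scores are assumptions made for this
# example. As in the recursion on the slide, no separate stop-symbol score is
# added at the end.
def viterbi(words, tags, f):
    if not words:
        return []
    score = [{"$": 0.0}]  # score[0]: only the start symbol is reachable
    prev = [{}]
    for t in range(1, len(words) + 1):
        score.append({})
        prev.append({})
        for y in tags:
            # Best previous tag for ending in tag y at position t.
            best = max(score[t - 1],
                       key=lambda yp: score[t - 1][yp] + f(t, yp, y, words))
            prev[t][y] = best
            score[t][y] = score[t - 1][best] + f(t, best, y, words)
    # Retrace the prev pointers, starting from the best final tag.
    y = max(score[-1], key=score[-1].get)
    path = [y]
    for t in range(len(words), 1, -1):
        y = prev[t][y]
        path.append(y)
    return path[::-1]

# Hypothetical edge and emission scores; unseen pairs get a penalty of -1.
TRANS = {("$", "NN"): 0.0, ("NN", "MD"): 1.0, ("MD", "VB"): 1.0, ("VB", "NN"): 1.0}
EMIT = {("NN", "tin"): 1.0, ("NN", "can"): 1.0, ("MD", "can"): 1.0,
        ("VB", "cause"): 2.0, ("NN", "poisoning"): 2.0}

def toy_f(t, prev_tag, tag, words):
    return TRANS.get((prev_tag, tag), -1.0) + EMIT.get((tag, words[t - 1]), -1.0)

best = viterbi(["tin", "can", "cause", "poisoning"], ["NN", "MD", "VB"], toy_f)
# The two nested loops over tags at every position give the O(N^2 |x|) running time.
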
Recap: Sequence Labeling
- Hidden Markov Models:
  - How to estimate
- Discriminative models:
  - How to learn with the structured perceptron
- Both learning algorithms result in a linear model
  - How to label with the linear models

Discriminative vs Generative
- Generative models:
  - Cheap to estimate: simply normalized counts
    (not necessarily the case for generative models with latent variables)
  - Hard to integrate complex features: need to come up with a generative story, and this story may be wrong
  - Do not result in an optimal classifier when model assumptions are wrong (i.e., always)
- Discriminative models:
  - More expensive to learn: need to run decoding (here, Viterbi) during training, and usually multiple times per example
  - Easy to integrate features, though some features may make decoding intractable
  - Usually less accurate on small datasets

Reminders
- Speakers: slides about a week before the talk; meetings with me before/after this point will normally be needed
- Reviewers: reviews are accepted only before the day we consider the topic