Sequence Models - Department of Computer and Information

Report
NLP Document and
Sequence Models
Computational models of how natural
languages work
These are sometimes called Language Models
or sometimes Grammars
Three main types (among many others):
1. Document models, or “topic” models
2. Sequence models: Markov models, HMMs,
others
3. Context-free grammar models
Computational models of how natural
languages work
Most of the models I will show you are
- Probabilistic models
- Graphical models
- Generative models
In other words, they are essentially Bayes Nets.
In addition, many (but not all) are
- Latent variable models
This means that some variables in the model are not
observed in data, and must be inferred.
(Like the hidden states in an HMM.)
Topic Models
Topic models are models that
• can predict the likelihood of a document
• using latent topic variables
Let x1, …, xN be the words in a document.
P(x1, …, xN) is the likelihood of a document.
Topic models compute this (conceptually) by computing
(1 , … ,  , 1 , … ,  )
1 ,…,
where each zi is a latent variable that represents the “topic” of word xi.
Three documents with the word “play”
(numbers & colors  topic assignments)
A Play082 is written082 to be performed082 on a stage082 before a live093
audience082 or before motion270 picture004 or television004 cameras004 ( for
later054 viewing004 by large202 audiences082). A Play082 is written082
because playwrights082 have something ...
He was listening077 to music077 coming009 from a passing043 riverboat. The
music077 had already captured006 his heart157 as well as his ear119. It was
jazz077. Bix beiderbecke had already had music077 lessons077. He
wanted268 to play077 the cornet. And he wanted268 to play077 jazz077...
Jim296 plays166 the game166. Jim296 likes081 the game166 for one. The
game166 book254 helps081 jim296. Don180 comes040 into the house038. Don180
and jim296 read254 the game166 book254. The boys020 see a game166 for
two. The two boys020 play166 the game166....
Example: topics from an educational corpus
(TASA)
• 37K docs, 26K words
• 1700 topics, e.g.:
PRINTING
PAPER
PRINT
PRINTED
TYPE
PROCESS
INK
PRESS
IMAGE
PRINTER
PRINTS
PRINTERS
COPY
COPIES
FORM
OFFSET
GRAPHIC
SURFACE
PRODUCED
CHARACTERS
PLAY
PLAYS
STAGE
AUDIENCE
THEATER
ACTORS
DRAMA
SHAKESPEARE
ACTOR
THEATRE
PLAYWRIGHT
PERFORMANCE
DRAMATIC
COSTUMES
COMEDY
TRAGEDY
CHARACTERS
SCENES
OPERA
PERFORMED
TEAM
GAME
BASKETBALL
PLAYERS
PLAYER
PLAY
PLAYING
SOCCER
PLAYED
BALL
TEAMS
BASKET
FOOTBALL
SCORE
COURT
GAMES
TRY
COACH
GYM
SHOT
JUDGE
TRIAL
COURT
CASE
JURY
ACCUSED
GUILTY
DEFENDANT
JUSTICE
EVIDENCE
WITNESSES
CRIME
LAWYER
WITNESS
ATTORNEY
HEARING
INNOCENT
DEFENSE
CHARGE
CRIMINAL
HYPOTHESIS
EXPERIMENT
SCIENTIFIC
OBSERVATIONS
SCIENTISTS
EXPERIMENTS
SCIENTIST
EXPERIMENTAL
TEST
METHOD
HYPOTHESES
TESTED
EVIDENCE
BASED
OBSERVATION
SCIENCE
FACTS
DATA
RESULTS
EXPLANATION
STUDY
TEST
STUDYING
HOMEWORK
NEED
CLASS
MATH
TRY
TEACHER
WRITE
PLAN
ARITHMETIC
ASSIGNMENT
PLACE
STUDIED
CAREFULLY
DECIDE
IMPORTANT
NOTEBOOK
REVIEW
Polysemy
PRINTING
PAPER
PRINT
PRINTED
TYPE
PROCESS
INK
PRESS
IMAGE
PRINTER
PRINTS
PRINTERS
COPY
COPIES
FORM
OFFSET
GRAPHIC
SURFACE
PRODUCED
CHARACTERS
PLAY
PLAYS
STAGE
AUDIENCE
THEATER
ACTORS
DRAMA
SHAKESPEARE
ACTOR
THEATRE
PLAYWRIGHT
PERFORMANCE
DRAMATIC
COSTUMES
COMEDY
TRAGEDY
CHARACTERS
SCENES
OPERA
PERFORMED
TEAM
GAME
BASKETBALL
PLAYERS
PLAYER
PLAY
PLAYING
SOCCER
PLAYED
BALL
TEAMS
BASKET
FOOTBALL
SCORE
COURT
GAMES
TRY
COACH
GYM
SHOT
JUDGE
TRIAL
COURT
CASE
JURY
ACCUSED
GUILTY
DEFENDANT
JUSTICE
EVIDENCE
WITNESSES
CRIME
LAWYER
WITNESS
ATTORNEY
HEARING
INNOCENT
DEFENSE
CHARGE
CRIMINAL
HYPOTHESIS
EXPERIMENT
SCIENTIFIC
OBSERVATIONS
SCIENTISTS
EXPERIMENTS
SCIENTIST
EXPERIMENTAL
TEST
METHOD
HYPOTHESES
TESTED
EVIDENCE
BASED
OBSERVATION
SCIENCE
FACTS
DATA
RESULTS
EXPLANATION
STUDY
TEST
STUDYING
HOMEWORK
NEED
CLASS
MATH
TRY
TEACHER
WRITE
PLAN
ARITHMETIC
ASSIGNMENT
PLACE
STUDIED
CAREFULLY
DECIDE
IMPORTANT
NOTEBOOK
REVIEW
Topic Modeling Techniques
The two most common techniques are:
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation (LDA)
Commonly-used software packages:
• Mallet, a Java toolkit for various NLP related things like
document classification, and includes a widely-used
implementation of LDA.
• Stanford Topic Modeling Toolbox
• A list of implementations for various topic modeling
techniques
The LDA Model




z1
z2
z3
z4
z1
z2
z3
z4
z1
z2
z3
z4
w1
w2
w3
w4
w1
w2
w3
w4
w1
w2
w3
w4
For each document,
b
• Choose ~Dirichlet()
• For each of the N words wn:
– Choose a topic zn» Multinomial()
– Choose a word wn from p(wn|zn,b), a multinomial probability
conditioned on the topic zn.
Applications
• Visualization and exploration of a corpus
• Track news stories as they evolve
• Pre-processing of a corpus for document
classification tasks
(slide from tutorial by David Blei, KDD 2011)
Microsoft’s Twahpic System
http://twahpic.cloudapp.net/S4.aspx
LDA/pLSA for Text Classification
• Topic models are easy to incorporate into text
classification:
1. Train a topic model using a big corpus
2. Decode the topic model (find best topic/cluster
for each word) on a training set
3. Train classifier using the topic/cluster as a
feature
4. On a test document, first decode the topic
model, then make a prediction with the classifier
Why use a topic model for
classification?
• Topic models help handle polysemy and
synonymy
– The count for a topic in a document can be much
more informative than the count of individual words
belonging to that topic.
• Topic models help combat data sparsity
– You can control the number of topics
– At a reasonable choice for this number, you’ll observe
the topics many times in training data
(unlike individual words, which may be very sparse)
Synonymy and Polysemy
(example from Lillian Lee)
Training
auto
engine
bonnet
tyres
lorry
boot
car
emissions
hood
make
model
trunk
make
hidden
Markov
model
emissions
normalize
Synonymy
Polysemy
Appear dissimilar if
you compare words
Appear similar if you
compare words
but are related
but not truly related
Sequence Models
Sequences models are models that
• can predict the likelihood of a sequence of text (eg, a sentence).
• Sometimes using latent state variables
Let x1, …, xN be the words in a sentence.
P(x1, …, xN) is the likelihood of the sentence.
We’ll look at two types of generative sequence models:
N-gram models, which are slightly fancy versions of Markov models
Hidden Markov Models, which you’ve seen before
What’s a sequence model for?
• Speech Recognition:
– often the acoustic model will be confused between several possible words for
a given sound
– Speech recognition systems choose between these possibilities by selecting
the one with the highest probability, according to a sequence model
• Machine translation
– Often, the translation model will be confused between several possible
translations of a given phrase
– The system chooses between these possibilities by selecting the one with the
highest probability, according to a sequence model
Many other applications:
• Handwriting recognition
• Spelling correction
• Optical character recognition
• …
Example Language Model Application
Speech Recognition: convert an acoustic signal
(sound wave recorded by a microphone) to a
sequence of words (text file).
Straightforward model:
P(text | sound)
But this can be hard to train effectively.
Example Language Model Application
Speech Recognition: convert an acoustic signal
(sound wave recorded by a microphone) to a
sequence of words (text file).
Acoustic Model
(easier to train)
Traditional solution: Bayes’ Rule
P( sound | text) P(text)
P(text | sound) 
P( sound)
Language Model
Ignore: doesn’t matter for
picking a good text
Example Language Model Application
Speech Recognition: convert an acoustic signal (sound
wave recorded by a microphone) to a sequence of
words (text file).
Traditional solution: Bayes’ Rule
Acoustic Model
(easier to train)
 1 1 …    (1 , … ,  )
   =
(1 , … ,  )
Sequence Model
Ignore: doesn’t matter for
picking a good text
N-gram Models
P(w1, …, wT)=
P(w1) x P(w2|w1) x P(w3|w2,w1) x P(w4|w3,w2,w1) x … x
P(wT|wT-1,…,w1)
N-gram models make the assumption that the next word
depends only on the previous N-1 words.
For example, a 2-gram model (usually called bigram):
P(w1, …, wT)=
P(w1) x P(w2|w1) x P(w3|w2) x P(w4|w3) x … x P(wT|wT-1)
Notice: this is a Markov model, where the states are words.
N-gram Tools
• IRSTLM, a C++ n-gram toolkit often used for
speech recognition
• Berkeleylm, another n-gram toolkit, in Java
• Moses, a machine translation toolkit
• CMU Sphinx, open-source speech recognition
toolkit
Some HMM Tools
• jHMM, a Java implementation that’s relatively
easy to use
• MPI implementation of HMMs, for training
large HMMs in a distributed environment
(HMM Demo)
Conditional Random Fields
CRFs, like HMMs, are also latent-variable sequence models.
However, CRFs are discriminative instead of generative:
They cannot tell you P(x1, …, xN, z1, …, zN),
and they cannot tell you P(x1, …, xN).
But they are often better than HMMs at predicting
P(z1, …, zN | x1, …, xN).
For these reason, they are often used as sequence labelers.
Example Sequence Labeling Task
Slide from Jenny Rose Finkel (ACL 2005)
Other common
sequence-labeling tasks
•
Tokenization/segmentation
SSNNNNNNNNNNNSSNSNNSSNNNNNSSNNNSS
“Tokenization isn’t always easy.”
Chinese word segmentation: Pi-Chuan Chang, Michel Galley and Chris Manning, 2008:
Other common
sequence-labeling tasks
• Part-of-speech tagging
“ N
V Adv Adv
Adj . “
“ Tokenization is n’t always easy . ”
• Relation extraction
O
O
B-arg I-arg O B-rel
I-rel
Drug giant Pfizer Inc. has reached an
I-rel
I-rel O
O
agreement to buy the private
O
O
B-arg I-arg
biotechnology firm Rinat Neuroscience
I-arg O O
O
O
O
Corp. , the companies announced Thursday
 buy(Pfizer Inc., Rinat Neuroscience Corp.)
ReVerb demo
http://reverb.cs.washington.edu/
Some CRF implementations, tools
• http://teamcohen.github.io/MinorThird/, a
package of a variety of NLP tools, including
CRFs
• http://www.chokkan.org/software/crfsuite/, a
CRF toolkit
• http://nlp.stanford.edu/software/, a variety of
NLP tools, including several CRF models

similar documents