LATENT DIRICHLET ALLOCATION
Outline
• Introduction
• Model Description
• Inference and Parameter Estimation
• Example
• Reference
Introduction
to access what we are looking for.
We need new tools to help us organize, search, and understand
these vast amounts of information.
Introduction
Topic modeling provides methods for automatically organizing,
understanding, searching, and summarizing large electronic archives.
• Uncover the hidden topical patterns that pervade the collection.
• Annotate the documents according to those topics.
• Use the annotations to organize, summarize, and search the texts.
Intuition behind LDA
P(\beta, \theta, z, w)
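The joint distribution can be read through its factorization into the pieces of the generative model. The sketch below assumes K topics, M documents, N_d words in document d, and Dirichlet hyperparameters \eta and \alpha, following the notation and model definition given in the slides after this one; the explicit conditioning on \eta and \alpha is added here for clarity.

```latex
p(\beta, \theta, z, w \mid \eta, \alpha)
  = \prod_{k=1}^{K} p(\beta_k \mid \eta)
    \prod_{d=1}^{M} \Bigl[ p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                      p(w_{d,n} \mid \beta_{z_{d,n}}) \Bigr]
```

Only the words w are observed; the topics \beta, the topic proportions \theta, and the assignments z are latent.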
Notation and Assumption
• We have a set of documents D_1, D_2, ..., D_M, constituting a
corpus.
• Each document is a collection of words or a “bag of
words”. (Exchangeability)
• After removal of stop words, the corpus vocabulary
contains V distinct words w_1, w_2, ..., w_V and involves K topics
with word distributions \beta_1, ..., \beta_K.
• Each document d is composed of N_d “important” or
“effective” words w_{d,1}, w_{d,2}, ..., w_{d,N_d} and has topic
proportions \theta_d.
Index ranges: topic k = 1, ..., K; word position n = 1, ..., N_d; vocabulary index v = 1, ..., V.
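The bag-of-words assumption can be illustrated with a minimal Python sketch. The stop-word list, the toy corpus, and the helper name `bag_of_words` are all hypothetical and chosen only for illustration; real preprocessing is usually more involved.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in"}


def bag_of_words(text):
    """Return word counts for one document, ignoring word order and stop words."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)


# Toy corpus of M = 3 documents (made up for this example).
corpus = [
    "the gene is expressed in the cell",
    "the cell is expressed in the gene",  # same words as above, different order
    "a model of the data",
]

bags = [bag_of_words(doc) for doc in corpus]

# Vocabulary: the V distinct words that remain after stop-word removal.
vocabulary = sorted(set().union(*bags))
M = len(corpus)
V = len(vocabulary)

# Exchangeability: the first two documents differ only in word order,
# so their bag-of-words representations are identical.
same_bag = bags[0] == bags[1]
```

Because only the counts are kept, any reordering of a document's words leaves its representation unchanged, which is what exchangeability means here.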
Model Definition
\beta_k | \eta ~ Dir(\eta)
\theta_d | \alpha ~ Dir(\alpha)
z_{d,n} ~ Multi(\theta_d)
w_{d,n} ~ Multi(\beta_{z_{d,n}})
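The model definition can be read as a recipe for generating a corpus. The following is a minimal sketch using only the Python standard library. It assumes symmetric Dirichlet priors (a single scalar for each of \eta and \alpha) and fixed document lengths, neither of which is stated in the slides; the function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(M, K, V, doc_lengths, eta, alpha, seed=0):
    """Sample topics, topic proportions, assignments, and words from the LDA model.

    Assumes symmetric priors: eta and alpha are scalars (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # beta_k | eta ~ Dir(eta): one distribution over the V words per topic.
    beta = [sample_dirichlet([eta] * V, rng) for _ in range(K)]
    theta, z, w = [], [], []
    for d in range(M):
        # theta_d | alpha ~ Dir(alpha): topic proportions of document d.
        theta_d = sample_dirichlet([alpha] * K, rng)
        # z_{d,n} ~ Multi(theta_d): topic assignment of each word position.
        z_d = [sample_index(theta_d, rng) for _ in range(doc_lengths[d])]
        # w_{d,n} ~ Multi(beta_{z_{d,n}}): word drawn from the assigned topic.
        w_d = [sample_index(beta[k], rng) for k in z_d]
        theta.append(theta_d)
        z.append(z_d)
        w.append(w_d)
    return beta, theta, z, w


# Toy sizes, chosen arbitrarily for illustration.
beta, theta, z, w = generate_corpus(
    M=3, K=2, V=5, doc_lengths=[4, 6, 5], eta=0.5, alpha=0.5
)
```

Words are returned as vocabulary indices in 0..V-1 and topics as indices in 0..K-1. Inference runs this process in reverse: given only `w`, it estimates `beta`, `theta`, and `z`.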
Dirichlet and Multinomial Distribution
• The Dirichlet is a distribution over distributions: each draw is
itself a probability vector, e.g. the parameter of a multinomial.
• Multinomial:
P(X_1 = x_1, ..., X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} p_1^{x_1} \cdots p_K^{x_K}
where x_i \in \{0, ..., n\} and \sum_{i=1}^{K} x_i = n
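• Dirichlet:
P(\theta | \alpha) = \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1}
with \theta_i > 0 and \sum_{i=1}^{K} \theta_i = 1,
where the variable \theta takes values in the (K-1)-simplex.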
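Both formulas can be evaluated directly. The sketch below uses only the Python standard library; the function names are hypothetical, and returning 0 for points outside the open simplex is a convention chosen for this example.

```python
import math


def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) for counts x and outcome probabilities p."""
    n = sum(x)
    # Multinomial coefficient n! / (x_1! ... x_K!), kept as an exact integer.
    coeff = math.factorial(n)
    for xi in x:
        coeff //= math.factorial(xi)
    prob = float(coeff)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob


def dirichlet_pdf(theta, alpha):
    """Density of Dir(alpha) at theta; 0 if theta is outside the open simplex."""
    if any(t <= 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        return 0.0
    # Work in log space: log Gamma(sum alpha_i) - sum log Gamma(alpha_i).
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_norm + log_kernel)


# Two fair coin flips, one head and one tail: 2!/(1! 1!) * 0.5 * 0.5.
p_one_each = multinomial_pmf([1, 1], [0.5, 0.5])

# With all alpha_i = 1 the Dirichlet is uniform on the simplex;
# for K = 3 the constant density is Gamma(3) = 2.
uniform_density = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
```

Working in log space with `math.lgamma` avoids overflow in the Gamma functions when the \alpha_i are large.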
Dirichlet and Multinomial Distribution
Properties
LSA & LDA
Reference
• Latent Dirichlet Allocation, D. M. Blei, A. Y. Ng, M. I. Jordan,
Journal of Machine Learning Research, 2003
• Topic Models Vs. Unstructured Data, G. Anthes,
Communications of the ACM, 2010
• Probabilistic Topic Models, M. Steyvers, T. Griffiths,
Handbook of Latent Semantic Analysis, 2007
• Gibbs Sampling for the Uninitiated, P. Resnik, E.
Hardisty, 2010