The IBP Compound Dirichlet Process
and its Application to Focused Topic
Modeling
Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei
Presented by Eric Wang
9/16/2011
Introduction
• Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous
topic modeling framework.
• Incorporating the hierarchical Dirichlet process (HDP) into LDA
allows for more flexible topic modeling by estimating the
global topic proportions.
• A drawback of HDP-LDA is that a topic that is rare globally will
also have a low expected proportion within each document.
• The authors propose a model that allows a rare topic to still
have large mass within individual documents.
Hierarchical Dirichlet Process
• The hierarchical Dirichlet process (HDP) is a prior for Bayesian
nonparametric mixed membership modeling of data groups.
• Hierarchically, it can be defined as
    G_0 ~ DP(η, H),    G_m | G_0 ~ DP(γ, G_0),
where m indexes the data group.
• In the HDP, the expectation of the mixing weights in G_m is the
corresponding set of mixing weights in G_0. In practice, the mixing
weights in G_0 are the global average of the mixture membership.
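The two-level definition above can be simulated with a short truncated stick-breaking sketch (function and variable names are illustrative, and the truncation level K is an assumption for the sketch, not part of the model): the global weights come from GEM(η), and each group's weights are a finite Dirichlet centered on them, so their average recovers the global weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def hdp_weights(n_groups, K=20, eta=2.0, gamma=5.0):
    """Truncated stick-breaking sketch of HDP mixing weights.

    Global weights beta ~ GEM(eta) (truncated at K); each group's
    weights are pi_m ~ Dirichlet(gamma * beta), so E[pi_m] = beta.
    """
    # Global stick-breaking weights, truncated at K and renormalized.
    v = rng.beta(1.0, eta, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()
    # Group-level weights centered on the global weights.
    pi = rng.dirichlet(gamma * beta, size=n_groups)
    return beta, pi

beta, pi = hdp_weights(n_groups=2000)
# The per-group weights average to the global weights: E[pi_m] = beta,
# which is the "global average of the mixture membership" on the slide.
```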
Indian Buffet Process
• The Indian Buffet Process (IBP) defines a distribution over
binary matrices with an infinite number of columns, and a
finite number of non-zero entries.
• Hierarchically, it is defined as the K → ∞ limit of
    π_k ~ Beta(α/K, 1),    b_mk ~ Bernoulli(π_k),
where m and k denote the rows and columns of the binary
matrix b.
• It can also be represented via a stick-breaking construction,
    ν_j ~ Beta(α, 1),    μ_k = ∏_{j=1..k} ν_j,    b_mk ~ Bernoulli(μ_k).
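The stick-breaking construction is easy to simulate; a minimal sketch (names illustrative; the truncation at K columns is an assumption of the sketch, since the true matrix has infinitely many columns):

```python
import numpy as np

rng = np.random.default_rng(1)

def ibp_stick_breaking(n_rows, alpha=3.0, K=100):
    """Sketch of the IBP's stick-breaking construction, truncated at K.

    Stick lengths mu_k = prod_{j<=k} nu_j with nu_j ~ Beta(alpha, 1)
    decrease in k; each entry b[m, k] ~ Bernoulli(mu_k).
    """
    nu = rng.beta(alpha, 1.0, size=K)
    mu = np.cumprod(nu)               # decreasing column probabilities
    b = rng.random((n_rows, K)) < mu  # Bernoulli draws for every entry
    return mu, b.astype(int)

mu, b = ibp_stick_breaking(n_rows=10)
# mu is decreasing, so later columns are active less and less often,
# giving each row a finite number of ones even though K is large.
assert np.all(np.diff(mu) <= 0)
```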
IBP Compound Dirichlet Process
• Combining the HDP and IBP into a single prior yields an infinite
“spike-and-slab” prior, the IBP compound Dirichlet process (ICD).
• A spike distribution (the IBP) determines which variables are
drawn from the slab (the DP).
• The model assumes the following generative process: stick
lengths μ_k are drawn via the IBP stick-breaking construction,
binary indicators b_mk ~ Bernoulli(μ_k), and atoms φ_k ~ H.
IBP Compound Dirichlet Process
• The atom masses of data group m are Dirichlet distributed as
follows:
    (θ_m | b_m) ~ Dirichlet(γ · b_m),
where b_m = (b_m1, b_m2, …) is the mth row of the binary matrix.
• In this construction, the θ_m are the topic proportions for
document m and b_m is a binary vector indicating usage of the
dictionary elements.
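The spike-and-slab mechanics above amount to a Dirichlet restricted to the topics the IBP row activates; a minimal sketch (the function name and γ value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def masked_dirichlet(b_m, gamma=1.0):
    """Topic proportions for one document under the ICD spike-and-slab.

    b_m is the binary row selecting which topics are active; the
    proportions are a symmetric Dirichlet over the active topics
    only, and exactly zero elsewhere.
    """
    active = np.flatnonzero(b_m)
    theta = np.zeros(len(b_m))
    theta[active] = rng.dirichlet(np.full(len(active), gamma))
    return theta

b_m = np.array([1, 0, 1, 1, 0])   # topics 2 and 5 are switched off
theta = masked_dirichlet(b_m)
# Inactive topics get zero mass; active topics share all the mass.
assert np.all(theta[b_m == 0] == 0)
```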
Focused Topic Models
• The authors use ICD to develop the Focused Topic model
(FTM).
• In this framework, a global distribution over topics is drawn
and shared over all documents as in HDP-LDA.
• Each document draws a subset of topics from the global
menu. The subset is determined by the binary vector b_m.
• Since the binary vector is independent of the global topic
proportions, topics that are rare globally can still make up a
large proportion of individual documents.
Focused Topic Models
• The generative process for the FTM is as follows:
1. For each topic k, draw the stick length μ_k via the IBP
stick-breaking construction, the topic distribution over words
φ_k ~ Dirichlet(η), and the relative topic mass λ_k ~ Gamma(γ, 1).
2. For each document m, draw the binary indicators
b_mk ~ Bernoulli(μ_k) and topic proportions θ_m ~ Dirichlet(b_m · λ).
3. For each word i in document m, draw a topic z_mi ~ Discrete(θ_m)
and the word w_mi ~ Discrete(φ_{z_mi}).
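A truncated end-to-end simulation of the FTM generative process (all sizes and hyperparameter values here are illustrative assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(3)

def ftm_generate(n_docs, n_words, vocab=200, K=50,
                 alpha=3.0, gamma=1.0, eta=0.1):
    """Truncated sketch of the FTM generative process."""
    # Per-topic variables: IBP sticks, word distributions, masses.
    nu = rng.beta(alpha, 1.0, size=K)
    mu = np.cumprod(nu)
    phi = rng.dirichlet(np.full(vocab, eta), size=K)
    lam = rng.gamma(gamma, 1.0, size=K)
    docs = []
    for _ in range(n_docs):
        # Per-document variables: active topics and their proportions.
        b = rng.random(K) < mu
        if not b.any():          # guard: keep at least one topic active
            b[0] = True
        theta = np.zeros(K)
        theta[b] = rng.dirichlet(lam[b])
        # Per-word variables: topic assignment, then the word itself.
        z = rng.choice(K, size=n_words, p=theta)
        words = np.array([rng.choice(vocab, p=phi[k]) for k in z])
        docs.append(words)
    return docs

docs = ftm_generate(n_docs=5, n_words=30)
```

The guard for empty b is a practical detail of the truncated sketch; in the infinite model every document activates at least one topic with probability one.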
Posterior Inference
• To sample the topic indicator for word i in document m,
    p(z_mi = k | rest) ∝ p(w_mi | φ_k) · E[θ_mk | b_m, λ, z_m^(−i)],
where the integral over θ_m has an analytical form and is
proportional to b_mk (n_mk^(−i) + λ_k), with n_mk^(−i) the number of
other words in document m assigned to topic k.
• This is an important point because it suggests a general
framework that can be adapted to other applications.
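With θ_m ~ Dirichlet(b_m · λ) integrated out, the conditional takes the usual Dirichlet–multinomial form. A sketch of one such collapsed draw, under that standard collapsing rather than the paper's exact implementation (all numbers and names are toy values):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_z(w_mi, n_mk, b_m, lam, phi):
    """One collapsed Gibbs draw of a word's topic indicator.

    With theta_m integrated out, the conditional is proportional to
    b_mk * (n_mk + lambda_k) * phi[k, w], where n_mk excludes the
    word currently being resampled.
    """
    p = b_m * (n_mk + lam) * phi[:, w_mi]
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Toy example: K = 3 topics, vocabulary of 4 words.
phi = np.array([[.7, .1, .1, .1],
                [.1, .7, .1, .1],
                [.1, .1, .7, .1]])
b_m = np.array([1, 1, 0])      # the third topic is inactive here
n_mk = np.array([4., 1., 0.])  # counts excluding the current word
lam = np.ones(3)
k = sample_z(w_mi=0, n_mk=n_mk, b_m=b_m, lam=lam, phi=phi)
assert b_m[k] == 1             # inactive topics are never chosen
```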
Posterior Inference
• The joint probability of b_·k, the column of binary indicators for
topic k, and n_·k, the total number of words assigned to topic k in
each document, is log-differentiable with respect to λ_k and μ_k.
• A Hybrid Monte Carlo (HMC) algorithm is therefore used to sample
from their posteriors.
Posterior Inference
• The topic weights λ_k are sampled from their conditional
posteriors.
• The binary topic indicators b_mk are sampled from their Bernoulli
posteriors.
• Notice here that if a topic is used in a document (n_mk > 0), it is
automatically considered “active” there, and additional (unused)
topics can be activated.
Empirical Results
• The authors considered three different text datasets.
• All models were run for 1000 iterations, with the first 500
discarded as burn-in.
Empirical Results
• Model Perplexity
• Topic Correlation
Empirical Results
• Here, the authors compare the number of topics a word
appears in (a). The FTM has more concentrated topics.
• In (b), the authors show the number of documents the topics
appear in. The plot illustrates that HDP has many topics that
appear in only a few documents, while a significant portion of
the FTM topics appear in many documents.
Discussion
• The authors have proposed a novel model, the IBP compound
Dirichlet process (ICD), that decouples the across-data topic
prevalence from the within-data topic proportions.
• The Focused Topic Model (FTM) was developed from the ICD and
addresses several key shortcomings of HDP-LDA.
• In HDP-LDA, the global topic prevalence limits the proportion a
topic can occupy within a document; in the FTM, globally rare
topics can still be highly represented within individual documents.
• The FTM shows improved perplexity relative to HDP-LDA.