The IBP Compound Dirichlet Process
and its Application to Focused Topic Modeling
Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei
Presented by Eric Wang
• Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous
topic modeling framework.
• Incorporating the hierarchical Dirichlet process (HDP) into
LDA allows for more flexible topic modeling by estimating the
global topic proportions.
• A drawback of HDP-LDA is that a topic that is rare globally will
also have a low expected proportion within each document.
• The authors propose a model that allows a rare topic to still
have large mass within individual documents.
Hierarchical Dirichlet Process
• The hierarchical Dirichlet process (HDP) is a prior for Bayesian
nonparametric mixed membership modeling of data groups.
• Hierarchically, it can be defined as
$G_0 \sim \mathrm{DP}(\zeta, H), \qquad G_m \sim \mathrm{DP}(\tau, G_0),$
where m indexes the data group.
• In the HDP, the expectation of the mixing weights in the
group-level measure $G_m$ is the global measure $G_0$. In
practice, the mixing weights in $G_0$ are the global average of
the mixture memberships.
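The coupling between global and per-document weights can be checked numerically. The sketch below assumes a global measure with only four atoms, so that the document-level weights are exactly Dirichlet distributed with parameters `concentration * global_weights`. The weights, the concentration value and the function names are illustrative assumptions, not values from the presentation.

```python
import random

def dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def mean_document_weights(global_weights, concentration, num_docs, rng):
    """Average theta_m over documents, theta_m ~ Dirichlet(concentration * global_weights)."""
    alphas = [concentration * g for g in global_weights]
    sums = [0.0] * len(global_weights)
    for _ in range(num_docs):
        theta = dirichlet(alphas, rng)
        sums = [s + t for s, t in zip(sums, theta)]
    return [s / num_docs for s in sums]

# Hypothetical global topic weights; the last topic is rare globally.
global_weights = [0.6, 0.3, 0.09, 0.01]
means = mean_document_weights(global_weights, concentration=5.0,
                              num_docs=5000, rng=random.Random(0))
print([round(m, 3) for m in means])  # averages land close to global_weights
```

The averaged document weights track the global weights, so the globally rare topic stays rare on average inside documents as well.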
Indian Buffet Process
• The Indian Buffet Process (IBP) defines a distribution over
binary matrices with an infinite number of columns, and a
finite number of non-zero entries.
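• Hierarchically, it is defined as
$\pi_k \sim \mathrm{Beta}(\alpha/K, 1), \qquad b_{mk} \sim \mathrm{Bernoulli}(\pi_k), \qquad K \to \infty,$
where m and k denote the rows and columns of the binary
matrix b. It can be represented via a stick-breaking construction:
$\mu_k \sim \mathrm{Beta}(\alpha, 1), \qquad \pi_k = \prod_{j=1}^{k} \mu_j.$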
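A minimal sketch of the stick-breaking view of the IBP, assuming the standard construction in which each feature probability is a running product of Beta(alpha, 1) draws. The truncation to a fixed number of columns and the name `ibp_stick_breaking` are assumptions made for illustration; the process itself has infinitely many columns.

```python
import random

def ibp_stick_breaking(alpha, num_rows, num_cols, rng):
    """Truncated stick-breaking draw of feature probabilities and a binary matrix."""
    pis, stick = [], 1.0
    for _ in range(num_cols):
        stick *= rng.betavariate(alpha, 1.0)  # mu_k ~ Beta(alpha, 1)
        pis.append(stick)                     # pi_k = mu_1 * ... * mu_k
    # Each row decides independently which columns (features) it switches on.
    matrix = [[1 if rng.random() < p else 0 for p in pis] for _ in range(num_rows)]
    return pis, matrix

pis, b = ibp_stick_breaking(alpha=2.0, num_rows=5, num_cols=8, rng=random.Random(0))
for row in b:
    print(row)
```

Because the feature probabilities decrease with the column index, later columns are switched on less and less often, which is why each row has only finitely many non-zero entries.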
IBP Compound Dirichlet Process
• Combining the HDP and IBP into a single prior yields an infinite
“spike-and-slab” prior, the IBP compound Dirichlet process (ICD).
• A spike distribution (IBP) determines which variables are
drawn from the slab (DP).
• The model assumes the following generative process
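• The atom masses of data group m are Dirichlet distributed as
$\theta_m \sim \mathrm{Dirichlet}(b_m \cdot \phi),$
where $b_m \cdot \phi$ is the element-wise product of the binary
vector $b_m$ and the relative atom masses $\phi$.
• In this construction, the $\theta_m$ are the topic proportions for
document m and $b_m$ is a binary vector indicating usage of the
dictionary elements.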
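The effect of masking the Dirichlet parameters with a binary vector can be sketched as follows. The sketch assumes a finite truncation, a vector `phi` of positive relative masses, and a Dirichlet draw restricted to the entries where the binary vector is 1; the function name and the example values are hypothetical.

```python
import random

def sample_icd_proportions(b_row, phi, rng):
    """theta_m ~ Dirichlet(b_m * phi), restricted to entries where b_mk = 1."""
    # Inactive atoms get exactly zero mass; active ones get a gamma draw.
    draws = [rng.gammavariate(p, 1.0) if b else 0.0 for b, p in zip(b_row, phi)]
    total = sum(draws)
    if total == 0.0:
        raise ValueError("document has no active atoms")
    return [d / total for d in draws]

phi = [5.0, 0.2, 0.2]  # the first atom dominates globally
rng = random.Random(0)
print(sample_icd_proportions([1, 1, 1], phi, rng))  # first atom usually dominates
print(sample_icd_proportions([0, 1, 0], phi, rng))  # [0.0, 1.0, 0.0]
```

In the second call the document switches on only an atom with small relative mass, and that atom receives all of the document's mass. This is the decoupling of global prevalence from within-document proportion that the construction is designed to provide.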
Focused Topic Models
• The authors use the ICD to develop the focused topic model (FTM).
• In this framework, a global distribution over topics is drawn
and shared over all documents as in HDP-LDA.
• Each document selects a subset of topics from the global
menu. The subset is determined by the binary vector $b_m$.
• Since the binary vector is independent of the global topic
proportions, topics that are rare globally can still make up a
large proportion of individual documents.
• The generative process for the FTM is as follows
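One way to read the generative process is the truncated simulation below. It is a sketch under stated assumptions rather than the authors' specification: stick-breaking topic probabilities with Beta(alpha, 1) sticks, Gamma(gamma, 1) relative masses, symmetric Dirichlet(eta) topics, and a document length drawn from a negative binomial with shape equal to the active mass and probability 1/2 (simulated here as a gamma-Poisson mixture). All function and parameter names are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def dirichlet(alphas, rng):
    """Dirichlet draw via normalised gammas; zero parameters give zero weight."""
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(draws)
    if total == 0.0:  # underflow for tiny parameters: fall back to a one-hot draw
        winner = rng.choices(range(len(alphas)), weights=alphas)[0]
        return [1.0 if k == winner else 0.0 for k in range(len(alphas))]
    return [d / total for d in draws]

def sample_ftm_corpus(num_docs, num_topics, vocab_size, alpha, gamma, eta, rng):
    # Global quantities shared by all documents.
    stick, pis = 1.0, []
    for _ in range(num_topics):
        stick *= rng.betavariate(alpha, 1.0)  # pi_k = product of Beta(alpha, 1) sticks
        pis.append(stick)
    phi = [rng.gammavariate(gamma, 1.0) for _ in range(num_topics)]  # relative masses
    topics = [dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]

    docs = []
    for _ in range(num_docs):
        b = [1 if rng.random() < p else 0 for p in pis]  # topics this document may use
        masked = [bk * pk for bk, pk in zip(b, phi)]
        mass = sum(masked)
        if mass == 0.0:  # no active topic: empty document
            docs.append({"b": b, "theta": [0.0] * num_topics, "words": []})
            continue
        # NB(mass, 1/2) length as a Poisson with a Gamma(mass, 1) rate.
        length = sample_poisson(rng.gammavariate(mass, 1.0), rng)
        theta = dirichlet(masked, rng)
        z = rng.choices(range(num_topics), weights=theta, k=length)
        words = [rng.choices(range(vocab_size), weights=topics[k])[0] for k in z]
        docs.append({"b": b, "theta": theta, "words": words})
    return pis, phi, docs

pis, phi, docs = sample_ftm_corpus(num_docs=5, num_topics=10, vocab_size=50,
                                   alpha=3.0, gamma=2.0, eta=0.1, rng=random.Random(0))
for d in docs:
    print(sum(d["b"]), "active topics,", len(d["words"]), "words")
```

The truncation to `num_topics` columns is only for simulation. Which topics a document uses depends on the stick-breaking probabilities, while how much mass each active topic gets depends on the separately drawn relative masses.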
Posterior Inference
• To sample the topic indicator for word i in document m,
where the integral
has an analytical form and
• This is an important point because it suggests a general
framework that can be adapted to other applications.
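For intuition, the topic-indicator update can be written out for a simplified case. The sketch assumes that the binary vector and the relative masses `phi` are held fixed, that the topic proportions have been integrated out of a Dirichlet with the masked parameters, and that topics have a symmetric Dirichlet(eta) prior over the vocabulary. It is the standard collapsed Gibbs conditional under those assumptions and may differ from the expression on the slide, which involves an integral; all names are hypothetical.

```python
def topic_conditional(doc_topic_counts, topic_word_counts, topic_totals,
                      active, phi, word, eta, vocab_size):
    """Normalised p(z = k | rest) for one word whose own count is already removed."""
    weights = []
    for k in range(len(phi)):
        if not active[k]:
            weights.append(0.0)  # topics switched off for this document get no mass
            continue
        # Document side: masked Dirichlet parameter plus current usage in the document.
        doc_term = phi[k] + doc_topic_counts[k]
        # Corpus side: smoothed frequency of this word under topic k.
        word_term = (topic_word_counts[k][word] + eta) / (topic_totals[k] + vocab_size * eta)
        weights.append(doc_term * word_term)
    total = sum(weights)
    return [w / total for w in weights]

# Toy state: three topics, a two-word vocabulary, topic 1 inactive in this document.
probs = topic_conditional(
    doc_topic_counts=[2, 0, 0],
    topic_word_counts=[[3, 1], [0, 0], [1, 3]],
    topic_totals=[4, 0, 4],
    active=[1, 0, 1],
    phi=[1.0, 1.0, 1.0],
    word=0, eta=0.5, vocab_size=2,
)
print(probs)
```

A topic that is inactive in the document can never be chosen in this simplified update, so moving topics in and out of a document is left to the separate update of the binary indicators.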
• The joint probability of $\phi_k$ and the total number of words
assigned to topic k is log-differentiable with respect to the
parameters being sampled.
• A hybrid Monte Carlo algorithm is used to sample from their
posterior.
• The topic weights are sampled as
• And the binary topic indicators are sampled as
• Notice here that if a topic is used, it is automatically
considered “active”, and additional (unused) topics can be
Empirical Results
• The authors considered three different text datasets:
• All models were run for 1000 iterations, with the first 500
iterations discarded as burn-in.
• Model Perplexity
• Topic Correlation
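Perplexity, the first metric listed above, is conventionally the exponential of the negative average per-word log-likelihood on held-out text. The sketch below shows only that standard definition. How the held-out log-likelihoods are estimated for each model is not covered here, and the function name is an assumption.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp of the negative average per-word log-likelihood over all documents."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("need at least one word")
    return math.exp(-sum(doc_log_likelihoods) / total_words)

# A model that spreads mass uniformly over 100 word types has perplexity 100.
lengths = [10, 20]
log_liks = [n * math.log(1 / 100) for n in lengths]
print(perplexity(log_liks, lengths))
```

Lower perplexity means the model assigns higher probability to the held-out words.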
• In (a), the authors compare the number of topics each word
appears in. The FTM has more concentrated topics.
• In (b), the authors show the number of documents each topic
appears in. The plot illustrates that the HDP has many topics that
appear in only a few documents, while a significant portion of
the FTM topics appear in many documents.
• The authors have proposed a novel model called the IBP
compound Dirichlet process (ICD) that decouples the across-data
topic prevalence from the intra-data topic proportions.
• The focused topic model (FTM) was developed from the ICD
and addresses several key shortcomings of HDP-LDA.
• In HDP-LDA, the global topic prevalence limits the proportion
a topic can take within a document, but in the FTM, globally
rare topics can still make up a large proportion of a document.
• FTM shows improved perplexity relative to HDP.
