The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling
Sinead Williamson, Chong Wang, Katherine A. Heller, David M. Blei
Presented by Eric Wang, 9/16/2011

Introduction
• Latent Dirichlet Allocation (LDA) is a powerful and ubiquitous topic modeling framework.
• Incorporating the hierarchical Dirichlet process (HDP) into LDA allows for more flexible topic modeling by estimating the global topic proportions.
• A drawback of HDP-LDA is that a topic that is rare globally will also have a low expected proportion within each document.
• The authors propose a model that allows a rare topic to still have large mass within individual documents.

Hierarchical Dirichlet Process
• The hierarchical Dirichlet process (HDP) is a prior for Bayesian nonparametric mixed membership modeling of data groups.
• Hierarchically, it can be defined as a global random measure drawn from a Dirichlet process, with each group-level measure drawn from a Dirichlet process whose base measure is that global draw; m indexes the data group.
• In the HDP, the expectation of the mixing weights in each group-level measure is given by the mixing weights of the global measure. In practice, the global mixing weights act as the global average of the mixture membership.

Indian Buffet Process
• The Indian Buffet Process (IBP) defines a distribution over binary matrices with an infinite number of columns and a finite number of non-zero entries.
• Hierarchically, it is defined through a Beta-Bernoulli construction: each column k has a Beta-distributed probability, and each entry of that column is a Bernoulli draw with that probability, where m and k denote the rows and columns of the binary matrix b. It can also be represented via a stick-breaking construction.

IBP Compound Dirichlet Process
• Combining the HDP and IBP into a single prior yields an infinite “spike-and-slab” prior, the IBP compound Dirichlet process (ICD).
• A spike distribution (IBP) determines which variables are drawn from the slab (DP).
• The model assumes the following generative process: [equation not preserved]
• The atom masses of data group m are Dirichlet distributed: [equation not preserved]
• In this construction, the atom masses are the topic proportions for document m, and B is a binary vector indicating usage of the dictionary elements.

Focused Topic Models
• The authors use the ICD to develop the Focused Topic Model (FTM).
• In this framework, a global distribution over topics is drawn and shared across all documents, as in HDP-LDA.
• Each document infers a subset of topics from the global menu. The subset is determined by the document's binary vector. Since the binary vector is independent of the global topic proportions, topics that are rare globally can still make up a large proportion of individual documents.
• The generative process for the FTM is as follows: [equation not preserved]

Posterior Inference
• To sample the topic indicator for word i in document m: [equation not preserved], where the integral has an analytical form.
• This is an important point because it suggests a general framework that can be adapted to other applications.
• The joint probability of the model parameters and the total number of words assigned to topic k [equation not preserved] is log-differentiable with respect to those parameters.
• A hybrid Monte Carlo algorithm is used to sample from their posteriors.
• The topic weights are sampled as: [equation not preserved]
• The binary topic indicators are sampled as: [equation not preserved]
• Notice here that if a topic is used, it is automatically considered “active”, and additional (unused) topics can be activated.

Empirical Results
• The authors considered three different text datasets.
• All models were run for 1000 iterations, with the first 500 iterations discarded as burn-in.
• Model perplexity [figure]
• Topic correlation [figure]
• In (a), the authors compare the number of topics a word appears in. The FTM has more concentrated topics.
• In (b), the authors show the number of documents the topics appear in. The plot illustrates that the HDP has many topics that appear in only a few documents, while a significant portion of the FTM topics appear in many documents.

Discussion
• The authors have proposed a novel model called the IBP compound Dirichlet process (ICD) that decouples the across-data topic prevalence and the intra-data topic proportions.
• The Focused Topic Model (FTM), developed from the ICD, addresses several key shortcomings of HDP-LDA.
• In HDP-LDA, the global topic prevalence limits the proportion a topic can take within a document, but in the FTM, globally rare topics can still have a large proportion within a document.
• The FTM shows improved perplexity relative to the HDP.
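
The decoupling described in the Discussion can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the slides' own equations are not preserved here, so the specific forms used (a stick-breaking IBP with Beta(alpha, 1) stick lengths, Gamma(gamma, 1) relative topic masses, and a Dirichlet over each document's active topics) are assumptions, as are the finite truncation level `K`, the hyperparameter values, and the function name `sample_icd_proportions`.

```python
import numpy as np

def sample_icd_proportions(num_docs, K=50, alpha=3.0, gamma=1.0, seed=0):
    """Draw per-document topic proportions from a truncated ICD-style prior.

    All distributional choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Stick-breaking IBP (assumed form): decreasing column probabilities pi_k.
    mu = rng.beta(alpha, 1.0, size=K)
    pi = np.cumprod(mu)
    # Spike: b[m, k] marks which topics document m is allowed to use.
    b = rng.random((num_docs, K)) < pi
    # Slab: global relative topic masses, shared by all documents.
    phi = rng.gamma(gamma, 1.0, size=K)
    theta = np.zeros((num_docs, K))
    for m in range(num_docs):
        active = np.flatnonzero(b[m])
        if active.size == 0:
            continue  # this document selected no topics under the truncation
        # Dirichlet over the active topics only; inactive topics keep zero mass.
        theta[m, active] = rng.dirichlet(phi[active])
    return b, phi, theta

b, phi, theta = sample_icd_proportions(num_docs=20)
```

Because `b` is drawn independently of `phi`, a topic with a small global mass `phi[k]` can still receive a large share of `theta[m]` in a document that activates few other topics, which is the behavior the FTM is designed to allow.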