Integrated Instance- and Class-based Generative Modeling for Text Classification
Antti Puurula, University of Waikato
Sung-Hyon Myaeng, KAIST
5/12/2013
Australasian Document Computing Symposium
Instance vs. Class-based Text Classification
• Class-based learning
• Multinomial Naive Bayes, Logistic Regression, Support Vector Machines, …
• Pros: compact models, efficient inference, accurate with text data
• Cons: document-level information discarded
• Instance-based learning
• K-Nearest Neighbors, Kernel Density Classifiers, …
• Pros: document-level information preserved, efficient learning
• Cons: data sparsity reduces accuracy
Instance vs. Class-based Text Classification 2
• Proposal: Tied Document Mixture
• integrated instance- and class-based model
• retains benefits from both types of modeling
• exact linear time algorithms for estimation and inference
• Main ideas:
• replace Multinomial class-conditional in MNB with a mixture over documents
• smooth document models hierarchically with class and background models
Multinomial Naive Bayes
• Standard generative model for text classification
• Result of simple generative assumptions (minimal sketch after this list):
• Bayes: P(c | W) ∝ P(c) P(W | c)
• Naive: word occurrences are conditionally independent given the class
• Multinomial: P(W | c) ∝ Π_w P(w | c)^W_w, one smoothed Multinomial per class
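A minimal sketch of the resulting classifier in log space, assuming the smoothed parameters have already been estimated; the names mnb_log_scores, log_prior, log_cond and the out-of-vocabulary floor are illustrative, not part of the original toolkit.

import math

OOV_LOG_PROB = math.log(1e-9)  # crude floor for words unseen in training (assumption)

def mnb_log_scores(counts, log_prior, log_cond):
    # counts:    word -> count in the test document W
    # log_prior: class -> log P(c)
    # log_cond:  class -> {word: log P(w|c)}, already smoothed
    scores = {}
    for c, lp in log_prior.items():
        cond = log_cond[c]
        # Bayes + Naive + Multinomial: log P(c) + sum_w W_w * log P(w|c)
        scores[c] = lp + sum(n * cond.get(w, OOV_LOG_PROB) for w, n in counts.items())
    return scores

# Predicted class: max(scores, key=scores.get)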
Multinomial Naive Bayes 2
Tied Document Mixture
• Replace the Multinomial in MNB by a mixture over all documents:
  P(W | c) = 1/|D_c| Σ_{d ∈ D_c} P(W | d)
• where document models are smoothed hierarchically with the class and background models:
  P(w | d) = (1 − α1 − α2) P_ML(w | d) + α1 P(w | c_d) + α2 P(w)
• where class models are estimated by averaging the documents of the class (estimation sketch below)
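A minimal estimation sketch under the reconstruction above: per-document maximum-likelihood models, class models obtained by averaging the class's document models, a collection-level background model, and the hierarchically smoothed document model. The function name estimate_tdm, the default weights alpha1 and alpha2, and the use of document-model averaging for the class models are illustrative assumptions.

from collections import Counter, defaultdict

def estimate_tdm(docs, labels, alpha1=0.3, alpha2=0.1):
    # docs: list of token lists, labels: one class label per document
    doc_ml, class_sum, class_size, collection = [], defaultdict(Counter), Counter(), Counter()
    for tokens, c in zip(docs, labels):
        counts = Counter(tokens)
        total = sum(counts.values())
        ml = {w: n / total for w, n in counts.items()}   # P_ML(w|d)
        doc_ml.append(ml)
        for w, p in ml.items():
            class_sum[c][w] += p
        class_size[c] += 1
        collection.update(counts)
    # Class models: average of the class's (unsmoothed) document models.
    class_model = {c: {w: s / class_size[c] for w, s in sums.items()}
                   for c, sums in class_sum.items()}
    # Background model: collection-level unigram distribution.
    n_total = sum(collection.values())
    background = {w: n / n_total for w, n in collection.items()}

    def p_w_given_d(w, d, c):
        # Hierarchically smoothed document model:
        # P(w|d) = (1 - a1 - a2) P_ML(w|d) + a1 P(w|c_d) + a2 P(w)
        return ((1 - alpha1 - alpha2) * doc_ml[d].get(w, 0.0)
                + alpha1 * class_model[c].get(w, 0.0)
                + alpha2 * background.get(w, 0.0))

    return p_w_given_d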
Tied Document Mixture 2
Tied Document Mixture 3
• Can be described as constraints on a two-level mixture (written out after this list)
• Document-level mixture:
• Number of components = number of training documents
• Components assigned to instances
• Component weights = 1/|D_c| for the documents of class c (0 otherwise)
• Word-level mixture:
• Number of components = 3 (hierarchy depth)
• Components assigned to hierarchy (document, class, background)
• Component weights = 1 − α1 − α2, α1, and α2
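Written out under these constraints, and under the notation reconstructed above (an editorial assumption, not the original slide's), the constrained two-level mixture is:

\[
P(W \mid c) \;=\; \sum_{d \in D_c} \frac{1}{|D_c|}
  \prod_{w} \Bigl[ (1 - \alpha_1 - \alpha_2)\, P_{\mathrm{ML}}(w \mid d)
                   + \alpha_1\, P(w \mid c) + \alpha_2\, P(w) \Bigr]^{W_w}
\]

The outer sum is the document-level mixture and the bracketed term is the three-component word-level mixture.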
Tied Document Mixture 4
• Can be described as a class-smoothed Kernel Density Classifier
• Document mixture equivalent to a Multinomial kernel density
• Hierarchical smoothing corresponds to mean shift or data sharpening with class centroids
Hierarchical Sparse Inference
• Reduces complexity from the dense O(MN) to O(Σ_{n: W_n ≠ 0} (1 + Σ_{m: θ'_m(n) ≠ 0} 1))
• Same complexity as K-Nearest Neighbors based on inverted indices (Yang, 1994)
Hierarchical Sparse Inference 2
• Precompile values: factor each smoothed model into a dense part shared across components and sparse component-specific corrections
• P(W | d) then decomposes into the shared score plus corrections for the observed words only
• Store the precompiled values and the sparse correction terms θ' in inverted indices
Hierarchical Sparse Inference 3
• Compute first the shared smoothing score for the word vector W
• Update by the sparse class-level terms θ'_c(w)
• Update by the sparse document-level terms θ'_d(w) to get P(W | d)
• Compute P(W | c) = 1/|D_c| Σ_{d ∈ D_c} P(W | d)
• Bayes rule: P(c | W) ∝ P(c) P(W | c) (sketch below)
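A minimal sketch of these five steps in log space, assuming the model has already been factored into a shared background term bg_lp plus sparse log-space corrections class_corr and doc_corr (as in the reconstruction of the previous slide); all names are illustrative, and observed words are assumed to be in the background vocabulary.

import math

def logsumexp(xs):
    xs = list(xs)
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def tdm_sparse_scores(counts, bg_lp, class_corr, doc_corr, docs_of_class, log_prior):
    # counts:        word -> count in the test document W
    # bg_lp:         word -> log of the shared smoothing term for w
    # class_corr:    class  -> sparse {word: log-space class-level correction}
    # doc_corr:      doc id -> sparse {word: log-space document-level correction}
    # docs_of_class: class  -> list of training document ids
    # log_prior:     class  -> log P(c)
    # 1) shared score: every document starts from the same smoothing log-likelihood
    base = sum(n * bg_lp[w] for w, n in counts.items())
    scores = {}
    for c, docs in docs_of_class.items():
        # 2) update by the sparse class-level terms (zero where nothing is stored)
        c_score = base + sum(n * class_corr[c].get(w, 0.0) for w, n in counts.items())
        # 3) update by the sparse document-level terms to get log P(W|d),
        # 4) then average over the class's documents: P(W|c) = 1/|D_c| sum_d P(W|d)
        per_doc = [c_score + sum(n * doc_corr[d].get(w, 0.0) for w, n in counts.items())
                   for d in docs]
        log_pwc = logsumexp(per_doc) - math.log(len(docs))
        # 5) Bayes rule: score(c) = log P(c) + log P(W|c)
        scores[c] = log_prior[c] + log_pwc
    return scores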
Experimental Setup
• 14 classification datasets used:
• 3 spam classification
• 3 sentiment analysis
• 5 multi-class classification
• 3 multi-label classification
• Scripts and datasets in LIBSVM format:
• http://sourceforge.net/projects/sgmweka/
Experimental Setup 2
• Classifiers compared:
• Multinomial Naive Bayes (MNB)
• Tied Document Mixture (TDM)
• K-Nearest Neighbors (KNN) (Multinomial distance, distance-weighted vote)
• Kernel Density Classifier (KDC) (smoothed Multinomial kernel)
• Logistic Regression (LR, LR+) (L2-regularized)
• Support Vector Machine (SVM, SVM+) (L2-regularized, L2-loss)
• LR+ and SVM+ weighted feature vectors by TF-IDF
• Smoothing parameters optimized for Micro-Fscore on held-out development sets using Gaussian random searches
Results
• Training times for MNB, TDM, KNN and KDC are linear
• At most 70 s for MNB on OHSU-TREC, 170 s for the others
• SVM and LR require iterative algorithms
• At most 936 s for LR on Amazon12
• Did not scale to the multi-label datasets in practical time
• Classification times are higher for the instance-based classifiers
• At most a mean of 226 ms for TDM on OHSU-TREC, compared to 70 ms for MNB
• (with 290k terms, 196k labels, 197k documents)
Results 2
• TDM significantly improves on MNB, KNN and KDC
• Across comparable datasets, TDM is on par with SVM+
• SVM+ is significantly better on multi-class datasets
• TDM is significantly better on spam classification
Results 3
• TDM reduces classification errors compared to MNB by:
• >65% in spam classification
• >26% in sentiment analysis
• Some correlation between error reduction and the number of instances per class; the task types form clearly separate clusters
Conclusion
• Tied Document Mixture
• Integrated instance- and class-based model for text classification
• Exact linear-time algorithms, with the same complexities as KNN and KDC
• Accuracy substantially improved over MNB, KNN and KDC
• Competitive with optimized SVM, depending on task type
• Many improvements to the basic model possible
• Sparse inference scales to hierarchical mixtures of >340k components
• Toolkit, datasets and scripts available:
• http://sourceforge.net/projects/sgmweka/
Sparse Inference
• Sparse Inference (Puurula, 2012)
• Use inverted indices to reduce the complexity of computing the joint P(W, m) for a given word vector W
• Instead of computing P(W, m) as dense dot products, compute the shared smoothed score first and update it by θ'_m(w) for each θ'_m(w) ≠ 0 found in the inverted index (sketch below)
• Reduces the joint P(W, m) inference time complexity from the dense O(MN) to O(Σ_{n: W_n ≠ 0} (1 + Σ_{m: θ'_m(n) ≠ 0} 1))
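A minimal sketch of the flat (single-level) version, assuming Jelinek-Mercer smoothing against a background model so that the correction θ'_m(w) is nonzero only for words with a nonzero maximum-likelihood count in component m; the function names, the smoothing choice, and the parameter alpha are illustrative assumptions rather than the toolkit's API.

import math
from collections import defaultdict

def build_inverted_index(component_ml, background, alpha):
    # component_ml: m -> {word: P_ML(w|m)}; background: word -> P(w); alpha: smoothing weight
    # theta_m(w) = (1 - alpha) * P_ML(w|m) + alpha * P(w)   (assumed smoothing)
    # theta'_m(w) = log theta_m(w) - log(alpha * P(w)) is nonzero only where P_ML(w|m) > 0,
    # so only those entries are posted to the inverted index.
    index = defaultdict(list)
    for m, ml in component_ml.items():
        for w, p in ml.items():
            corr = math.log((1 - alpha) * p + alpha * background[w]) - math.log(alpha * background[w])
            index[w].append((m, corr))
    return index

def sparse_joint_scores(counts, index, background, alpha, log_prior):
    # Shared base score: what every component would get with no stored entries.
    # (Assumes every observed word is in the background vocabulary.)
    base = sum(n * math.log(alpha * background[w]) for w, n in counts.items())
    scores = {m: base + lp for m, lp in log_prior.items()}
    # Update only the components posted for observed words: total work is the
    # sum over nonzero words of (1 + number of components storing that word).
    for w, n in counts.items():
        for m, corr in index.get(w, ()):
            scores[m] += n * corr
    return scores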
Sparse Inference 2
• Dense representation: the full parameter matrix θ_m(n) is stored and scored for every class and feature (illustration: a fully populated 12 × 9 matrix)
• M = number of classes, N = number of features
• Time complexity: O(MN)
Sparse Inference 3
• Sparse representation: each row is stored as one shared fallback value per class plus sparse deviations θ'_m(n); most deviations are zero (illustration: the same 12 × 9 matrix with the shared values factored out), and only the nonzero deviations are kept in the inverted index (sketch after this slide)
• Time complexity: O(Σ_{n: W_n ≠ 0} (1 + Σ_{m: θ'_m(n) ≠ 0} 1))
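The split behind the two illustrations can be sketched as follows; treating the shared per-row value as the row minimum is an illustrative choice, not necessarily how the toolkit factors its parameters.

def to_sparse(dense):
    # dense: list of rows, each a list of parameter values theta_m(n)
    # Fallback per row: here the row minimum, used as the shared value for
    # entries that carry no component-specific information (illustrative choice).
    fallback = [min(row) for row in dense]
    # Sparse deviations: store only entries that differ from the fallback.
    sparse = [{n: v - fallback[m] for n, v in enumerate(row) if v != fallback[m]}
              for m, row in enumerate(dense)]
    return fallback, sparse

def lookup(fallback, sparse, m, n):
    # Reconstruct theta_m(n) = fallback[m] + deviation (0 if not stored).
    return fallback[m] + sparse[m].get(n, 0.0)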