Report

Parsing with Compositional Vector Grammars Socher, Bauer, Manning, NG 2013 Problem • How can we parse a sentence and create a dense representation of it? – N-grams have obvious problems, most important is sparsity • Can we resolve syntactic ambiguity with context? “They ate udon with forks” vs “They ate udon with chicken” Standard Recursive Neural Net [ Vector((I-like)green)] W Classifier? Main Score [ Vector(I-like)] W Main [ Vector(I)] [ Vector(like)] [ Vector(green)] [ Vector(eggs)] I like green eggs Standard Recursive Neural Net = ( + ) Where ∙ is usually tanh(∙) or logistic() In other words, stack the two word vectors and multiply through a matrix W and you get a vector of the same dimensionality as the children a or b. Syntactically Untied RNN Classifier Score [ Vector(green-eggs)] [ Vector(I-like)] W W N,V adj,N N First, parse lower level with PCFG N [ Vector(I)] V [ Vector(like)] Adj [ Vector(green)] [ Vector(eggs)] I like green eggs Syntactically Untied RNN = (1,2 + ) The weight matrix is determined by the PCFG parse category of a and b. (You have one per parse combination) Examples: Composition Matrixes • Notice that he initializes them with two identity matrixes (in the absence of other information we should average Learning the Weights • Errors are backpropagated through structure (Goller and Kuchler, 1996) input ′ () = ( (1− ))( − ) (for logistic) Weight derivatives are additive across branches! (Not obvious- good proof/explanation in Socher, 2014) Tricks • Our good friend, ada-grad (diagonal variant): = −1 − 2+ −1 (Elementwise) • Initialize matrixes with identity + small random noise • Uses Collobert and Weston (2008) word embeddings to start Learning the Tree • We want the scores of the correct parse trees to be better than all incorrect trees by a margin: ( , , ≥ , , + ∆( , ) (Correct Parse Trees are Given in the Training Set) Finding the Best Tree (inference) • Want to find the parse tree with the max score (which is the sum all the scores of all sub trees) • Too expensive to try every combination • Trick: use non-RNN method to select best 200 trees (CKY algorithm). Then, beam search these trees with RNN. Model Comparisons (WSJ Dataset) (Socher’s Model) F1 for parse labels Analysis of Errors Conclusions: • Not the best model, but fast • No hand engineered features • Huge number of parameters: ∗ + 2 ∗ ∗ + ∗ + • Notice that Socher can’t make the standard RNN perform better than the PCFG: there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN. (SU-RNN, RNTN, RNN+Max Pooling) • The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network. Subsequent work showed this model performed better (in different situations) than the SU-RNN