Parsing with Compositional Vector Grammars

Report
Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng (2013)
Problem
• How can we parse a sentence and create a dense representation of it?
– N-grams have obvious problems, the most important being sparsity
• Can we resolve syntactic ambiguity with context? “They ate udon with forks” vs. “They ate udon with chicken”
Standard Recursive Neural Net
[Figure: binary tree over “I like green eggs”. The word vectors Vector(I), Vector(like), Vector(green), Vector(eggs) are combined pairwise by the shared matrix W into parent vectors such as Vector(I-like) and Vector((I-like)green), and a classifier assigns a score to each node.]
Standard Recursive Neural Net
p = f( W [a; b] + bias )

where f(·) is usually tanh(·) or the logistic function.

In other words, stack the two child word vectors a and b, multiply through the matrix W, and you get a parent vector p of the same dimensionality as each child.
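A minimal NumPy sketch of this composition step (the dimensionality, the tanh choice, and all names here are illustrative, not taken from the paper's code):

import numpy as np

n = 4                                        # toy embedding dimensionality
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n, 2 * n))   # shared composition matrix
bias = np.zeros(n)

def compose(a, b):
    """Standard RNN: stack the two children, multiply by W, squash."""
    return np.tanh(W @ np.concatenate([a, b]) + bias)   # parent has the same size as each child

# e.g. combine the word vectors for "I" and "like" into a phrase vector
v_I, v_like = rng.normal(size=n), rng.normal(size=n)
v_I_like = compose(v_I, v_like)              # shape (4,)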
Syntactically Untied RNN
[Figure: SU-RNN tree over “I like green eggs”. First, the lower level is parsed with a PCFG, which assigns the categories N (I), V (like), Adj (green), N (eggs). Children are then combined with a weight matrix chosen by their categories, e.g. W^(Adj,N) for Vector(green-eggs) and W^(N,V) for Vector(I-like), and a classifier assigns a score to each node.]
Syntactically Untied RNN
p = f( W^(C1,C2) [a; b] + bias )

The weight matrix is determined by the PCFG categories C1 and C2 of the children a and b (you have one matrix per category combination).
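A sketch of the syntactically untied composition, assuming the category-pair matrices are kept in a dictionary (all names and categories are illustrative):

import numpy as np

n = 4
rng = np.random.default_rng(1)

# one composition matrix per (left category, right category) pair
W_by_cats = {
    ("Adj", "N"): rng.normal(scale=0.1, size=(n, 2 * n)),
    ("N", "V"):   rng.normal(scale=0.1, size=(n, 2 * n)),
}

def compose_su(a, cat_a, b, cat_b):
    W = W_by_cats[(cat_a, cat_b)]        # matrix chosen by the children's PCFG categories
    return np.tanh(W @ np.concatenate([a, b]))

v_green, v_eggs = rng.normal(size=n), rng.normal(size=n)
v_green_eggs = compose_su(v_green, "Adj", v_eggs, "N")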
Examples: Composition Matrices
• Notice that he initializes them with two identity matrices (in the absence of other information, the parent should just be the average of its children)
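In NumPy this initialization might look like the following sketch (the 1/2 scaling, which makes an untrained parent exactly the average of its children, and the noise scale are assumptions, not values from the paper):

import numpy as np

n = 4
rng = np.random.default_rng(2)

# two identity blocks, scaled by 1/2 (assumed) so the parent starts as the
# average of its children, plus a little noise to break symmetry
W_init = np.hstack([np.eye(n), np.eye(n)]) / 2 + rng.normal(scale=0.001, size=(n, 2 * n))

a, b = rng.normal(size=n), rng.normal(size=n)
print(np.allclose(W_init @ np.concatenate([a, b]), (a + b) / 2, atol=0.05))   # True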
Learning the Weights
• Errors are backpropagated through structure
(Goller and Kuchler, 1996)
δ = f′(input) ⊙ error
  = ( a ⊙ (1 − a) ) ⊙ ( ŷ − t )   (for logistic)

Weight derivatives are additive across branches!
(Not obvious; a good proof/explanation is in Socher, 2014)
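A toy sketch of backpropagation through structure on a fixed two-node tree ((a b) c), using tanh instead of the logistic above; it shows the shared W accumulating a gradient contribution from every branch (all names here are illustrative):

import numpy as np

n = 4
rng = np.random.default_rng(3)
W = rng.normal(scale=0.1, size=(n, 2 * n))
a, b, c, target = (rng.normal(size=n) for _ in range(4))

# forward pass over the tree ((a b) c)
x1 = np.concatenate([a, b]);  p1 = np.tanh(W @ x1)
x2 = np.concatenate([p1, c]); p2 = np.tanh(W @ x2)
loss = 0.5 * np.sum((p2 - target) ** 2)

# backward pass: the same W gets a gradient contribution from each node
grad_W = np.zeros_like(W)
delta2 = (p2 - target) * (1 - p2 ** 2)            # tanh'(z) = 1 - tanh(z)^2
grad_W += np.outer(delta2, x2)
delta1 = (W[:, :n].T @ delta2) * (1 - p1 ** 2)    # error flows down into the left child
grad_W += np.outer(delta1, x1)                    # derivatives add up across branches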
Tricks
• Our good friend, AdaGrad (diagonal variant); see the sketch after this list:

θ_t = θ_{t−1} − α · g_t / √( Σ g² + ε )   (elementwise)
• Initialize matrices with identity + small random noise
• Uses Collobert and Weston (2008) word embeddings to start
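A sketch of the diagonal AdaGrad update (the learning rate and ε are illustrative values, not the paper's):

import numpy as np

def adagrad_update(theta, grad, hist, lr=0.01, eps=1e-8):
    # diagonal AdaGrad: per-parameter step sizes from the running sum of squared gradients
    # (lr and eps are illustrative hyperparameters)
    hist += grad ** 2
    theta -= lr * grad / np.sqrt(hist + eps)      # elementwise
    return theta, hist

theta, hist = np.zeros(5), np.zeros(5)
for step in range(3):
    grad = np.random.default_rng(step).normal(size=5)   # stand-in for a real gradient
    theta, hist = adagrad_update(theta, grad, hist)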
Learning the Tree
• We want the scores of the correct parse trees to be better than those of all incorrect trees by a margin (see the sketch below):

s(θ, x_i, y_i) ≥ s(θ, x_i, ŷ) + Δ(y_i, ŷ)   for every candidate tree ŷ

(Correct parse trees are given in the training set)
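A sketch of the resulting hinge loss for one sentence, assuming Δ counts the incorrect spans of a candidate tree (the function and variable names are hypothetical):

def margin_loss(score_gold, candidates):
    # candidates: list of (score, delta) pairs for incorrect trees,
    # where delta is the structured margin, e.g. the number of wrong spans (assumed)
    violations = [score + delta - score_gold for score, delta in candidates]
    return max(0.0, max(violations))   # zero only if the gold tree wins by the margin

# gold tree scores 10.0; a rival scores 9.5 with 2 wrong spans -> loss = 1.5
print(margin_loss(10.0, [(9.5, 2.0), (7.0, 3.0)]))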
Finding the Best Tree (inference)
• Want to find the parse tree with the max score (which is the sum of the scores of all subtrees)
• Too expensive to try every combination
• Trick: use a non-RNN method (the base PCFG with the CKY algorithm) to select the best 200 trees, then beam-search these trees with the RNN (see the sketch below)
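A simplified reranking view of that last step (the real system beam-searches bottom-up; here we just rescore a fixed candidate list, and the tree encoding and scorer are illustrative):

def rnn_tree_score(tree, score_node):
    # total RNN score of a binary tree = sum of the scores of all internal nodes;
    # a tree is either a word (leaf) or a (left, right) pair
    if not isinstance(tree, tuple):
        return 0.0
    left, right = tree
    return score_node(tree) + rnn_tree_score(left, score_node) + rnn_tree_score(right, score_node)

def best_tree(candidates, score_node):
    # candidates: e.g. the 200 best trees proposed by the base PCFG / CKY parser
    # (simplified: rescoring fixed candidates rather than a full beam search)
    return max(candidates, key=lambda t: rnn_tree_score(t, score_node))

candidates = [(("I", "like"), ("green", "eggs")),
              ("I", (("like", "green"), "eggs"))]
best = best_tree(candidates, lambda node: 1.0)   # dummy per-node scorer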
Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on the WSJ dataset, with Socher’s model among the parsers compared.]
Analysis of Errors
Conclusions:
• Not the best model, but fast
• No hand engineered features
• Huge number of parameters: one n × 2n composition matrix (plus scoring parameters) per category pair, on top of the word embeddings
• Notice that Socher can’t make the standard RNN perform better than the PCFG: there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling)
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network: subsequent work showed the RNTN performed better (in different situations) than the SU-RNN
