Incremental inference

```Incremental Integer Linear Programming for
Non-projective Dependency Parsing
Authors Sebastian Riedel and James Clarke
Paper review by Anusha Buchireddygari
Structured Prediction
Cutting plane Algorithm
•
•
•
•
Figure above shows a convex region of feasible solutions defined by several
constraints.
The grid indicates where inside the polygon the feasible integer solutions lie
The dot represents the optimal solution (for the linear programming problem)
gained from maximizing x1 + x2
Note that although it is not an integer solution it is the upper bound for an
optimum one
Why this paper?
• Integer Linear Programming (ILP) was applied
for inference of sequential conditional random
fields(Roth and Yin,2004).
• Exponential number of constraints required to
prevent cycles occurring in the dependency
graph.
• Model these constraints using ILP is too large to
solve efficiently.
What does the paper solve?
• Method which extends applicability of ILP to
more complex set of problems.
to solve with a fraction of constraints.
• Solution is examined and additional constraints
• Procedure is repeated until all the constraints
are satisfied.
What’s the problem the authors picked?
• Authors applied dependency parsing approach
to Dutch due to language’s non-projective
nature.
• Took the parser of McDonald et al as starting
point for our model.
• “I will come at twelve and then you will get what
you deserve”.
What’s Dependency Parsing?
• Task of attaching words to arguments.
• Dependency graph “kom” attached to “ik”.
• “kom” is head for “ik” and “ik” is child.
• Dependency tree every token must be the child
of exactly one other node, either another token
or dummy root token.
• It cannot have cycles like “en”-> “kom” -> “ik” >”en”.
How the model looks like?
• X is a sentence, y is a set of labelled
dependencies
• F(I,j,l) multi dimensional feature vector of the
edge from token I to token j with label l
Constraints
• Decoding/Inference in this model is to maximize
• T1 for every non-root token in x there exists
• T2 There are no cycles
• This corresponds to maximum spanning tree
problem.
Linguistic constraints
• A1 Heads are not allowed to have more than
one outgoing edge labelled l for all l in a set of
labels U.
• A1 tells that there can be only one “subject” in
the sentence
• C1 In a symmetric coordination there is exactly
one argument to the right of the conjunction and
at least one argument on the left.
• C1 applies to “and”,”or”,”but”
Some more linguistic constraints
●
●
●
C2 In an asymmetric coordination there are no
arguments to the left of the conjunction and at
least two arguments to the right.
Example is “both” having arguments to its left.
There are other such constraints total of 8
constraints in the paper.
Process
●
●
●
●
The function to be maximized is called objective
function Ox.
Variables Vx, e(I,j,l) is 1 if edge exists else 0
Base constraints
T1
Constraints representation
• A1
• C1
Incremental Constraints
• T2
Algorithm
• For a sentence x, Bx is the set of constraints that
• Ix are the constraints we add in incrementally
• Ox is objective function and Vx is a set of
variables including variable declaration.
What happens in the algorithm?
• Solve(C,O,V) maximizes the objective function O
with respect to the set of constraints C and
variables V.
• Violated(y,I) inspects the proposed solution (y)
and returns all constraints in I which are violated
Interesting result
The number of iterations is most polynomial
with respect to number of variables.
In practice this technique converges quickly
i.e. less than 20 iterations in 99%
approximately 12,000 sentences.
Yielding average solve times of less than 0.5
seconds.
Experiment
• How much do our additional constraints help improve accuracy?
• How fast is our generic inference method in comparison with Chu-LiuEdmonds algorithm?
• Can approximations be used to increase the speed of our method while
remaining accurate?
• Data
• Alpino Treebank 13,300 sentences with a average length of 14.6 tokens
• Environment
• Intel Xeon with 3.8 Ghz and 4 GB RAM. Mixed Integer Programming lib
IP_Solve. Code ran on Java and called JNI-wrapper aroung IP_solve lib
• Feature Sets
• Along with POS tags there are additional attributes like gender,number and
case. Combined attr of head to child
Results
• Accuracy
• Nl is the number of tokens with correct head and
label and Nt is the total number of tokens.
• Unlabelled accuracy
• Nu is the number of tokens with correct head
Accuracy Results table
• bl means baseline without any linguistic
constraints which is compared it to a system with
• There are problems the system suffers from
poor next best solution due to inaccurate local
score distributions.
Runtime Efficiency
• Average solve time (ST) for sentences with
respect to no of tokens in each sentence.
• Approximation
• Total runtime of the system is competitive to
CLE.
Discussion
• Higher order features by using extended set of
variables and a modified function which is likely
to increase runtime.
• Fast for real world applications.
• Large time for first iteration after that algorithm
uses last state to efficiently search for solutions
in the presence od new constraints.
• Exponential blow when the cycle constraints are
Conclusion
• Novel approach to inference of ILP
• Efficiently use ILP for dependency parsing
• Slower than baseline approach but parses large
sentences with more than 50 tokens.
• Parsing time can be significantly reduced using
a simple approximation like from q=10 to q=5
which will only marginally degrades the
performance from 85% to 84% in the table we
have seen in approximation.
THANK YOU
Questions?
```