Report

Search-Based Structured Prediction Hal Daume III, John Langford, Daniel Marcu ABSTRACT This paper proposes SEARN, an algorithm for integrating search and learning to solve complex structured prediction problem. Transform complex structured problems into simple binary classification problem which simple binary classifier could be used. Suitable for any loss function and class of feature SEARN guarantee that: good performance of the derived classification problem implies the good performance of whole structured prediction problem What complex internal structure is Normal classification focus on simple output {-1, 1}; more complex structure could be label sequences. What we focus on are: The space of English sentence (in a machine translation application). The space of short documents (in a automatic documents summarization application). The space of possible assignment of elements in a database What structured algorithm do we want Not limited to models with bounded treewidth Applicable to any loss function Handle arbitrary features Can cope with imperfect data SEARN meet all these requirement by transforming structured prediction into binary prediction which applying the binary classifier. Good binary classification performance implies good structured prediction performance. Structured Prediction Problem Definitions As a simple example, consider a parsing problem under F1 loss. In this case, D is a distribution over (x; c) where x is an input sequence and for all trees y with |x|-many leaves, cy is the F1 loss of y on the “true" output. The goal to create result is to minimizes the loss given in Eq (1), The Searn Algorithm There are several vital ingredients in the application of SEARN: A search space for decomposing the prediction problem. A cost sensitive learning algorithm Labeled structured prediction training data A known loss function for the structured prediction problem A good initial policy (we call the learned classifier a policy) Search space and cost sensitive learning algorithm For search space S, it could be concrete in simple application, like all possible parts of speech tag for each word. But in general, structured prediction has an abstract search space in the algorithm, and a function that turn abstract space into output of correct form SEARN algorithm requires a learning algorithm to return a classifier h(s) given cost sensitive training data. The performance of SEARN is strongly dependent upon how capable the learned classifier is. SEARN at train time: initial policy SEARN operates in an iterative fashion. At each iteration, it uses the current policy to create the new cost-sensitive classification examples, and use them create a new classifier. It keeps iteration until current classifier has really low dependence with initial policy SEARN relies on a good initial policy on the training data. SEARN at train time: Cost-sensitive examples SEARN generates one path per structured training example, and create cost-sensitive example for each path. The cost associated with taking an action that leads to state s is the regret associated with this action, given our current policy. At each state s, we take action a and then execute the policy to gain a full sequence of predictions ^y for which minimize the expect loss cost. the deference in loss between taking action a and taking the action a’; see Eq (2). Viterbi algorithmas best path Stochastic in policy classification The policy may be stochastic when the base classifier learned is deterministic due to the stochastic interpolation with SEARN. The paper suggest using Monte-Carlo sampling as the best solution: one draws many paths according to h beginning at s0 and average over the costs. Complete SEARN algorithm Cost-sensitive Examples are put into cost-sensitive classification algorithm to create new policy Compute initial policy We mention SERAN relies upon the ability to start with a good initial policy π. under standard loss function, compute initial policy is easy: in sequence labeling problem with Hamming loss, just choosing the correct label at position i given the correct label sequence. SEARN is not limited to simple Hamming loss. It could learn under more complex structures and loss function. This is the feature that many other frameworks do not have. Search-base policies The SEARN algorithm do not require the initial policy be optimal, it can train against any policy. At any step of SEARN, including the initial, we need to computer the best next step. So searching the shortest path and taking the first step along this shortest path, we can obtain a good initial policy. It is good for case where a good initial policy is not available. Theoretical analysis Each iteration of Searn degrades the current policy. The main theorem states that the learned policy is not much worse than the starting (optimal) policy plus a term related to the average cost sensitive loss of the learned classifiers and another term related to the maximum cost sensitive loss. where The paper claim the choice of iteration and β is not need to be too small, and each iteration, make sure only one different choice from previous policy at each iteration. Comparison: Independent classifiers Pro: Independent classifier make training and testing classifiers incredibly straightforward. Con: One cannot define complex features over the output space SEARN is more similar to MEMM prediction setting, which MEMM’s prediction is on basis of the k previous predictions. However, MEMM make an assumption that all previous predictions are correct, which may not be the case. Comparison: Perceptron-style algorithms The incremental perceptron is a variant on the structured perceptron that deals with the issue that the arg max may not be analytically available. It replaces the arg max with a beam search algorithm. The key observation is that it is often possible to detect in the process of executing search whether it is possible for the resulting output to ever be correct. The incremental perceptron is essentially a search-based structured prediction technique. it is, however, much more limited. It cannot cope with arbitrary loss functions, and is limited to a beam-search application. Moreover, for search problems with a large number of internal decisions, aborting search at the first error is far from optimal. Experimental results: Sequence Labeling Sequence labeling is the task of assigning a label to each element in an input sequence. Sequence labeling is an attractive test for structured prediction algorithms because it is the simplest non-trivial structure. The paper implements four Sequence labeling: handwriting recognition named entity recognition (in Spanish) Syntactic chunking joint chunking and part-of-speech tagging Sequence Labeling property In handwriting recognition and Spanish named entity recognition, the feature are basic the same, each possible letter/entity has an unique feature that counts how many time that letter shows in the output, and there are group feature counting pair/triple appearance in output. For syntactic chunking, the paper use the same set of features across all models, separated into “base features" and “meta features." The base features apply to words individually, while meta features apply to entire chunks. The initial policy is to label the next word correctly with respect to loss function. Experimental results: Sequence Labeling Experimental results: Sequence Labeling structured techniques consistently outperform their classification counterparts. For all classifiers, adding DEARN consistently improve performance. Large data set have better performance than small data set. And some simple classification techniques when applied to large data sets outperform structured learning algorithm. the only task on which Searn does not outperform the competing techniques is on the raw chunking task. With exception of the handwriting recognition task, Searn using logistic regression as a base learner performs at the top of the pack. SVM-based models tend to be expensive to train compare to perceptron. Experimental results: Automatic Document Summarization Multi-document summarization is the task of creating a summary out of a collection of documents on a focused topic. The choice for automatic documents evaluation is Rouge, which computes n-gram overlap between a system summary and a set of human written summaries. The features are any aspect of currently generated summary, and any part of the document set, including word and term identity, syntactic relation, length of sentence. Experimental results: Automatic Document Summarization the oracle results for the vine-growth method are signicantly higher. That’s because under Searn, the summaries produced by the vine-growth technique are uniformly better than those produced by raw extraction. The Searn-based systems uniformly dominate this result, but this comparison is not fair due to the training data. Discussion and Conclusions about SEARN SEARN is an easy technique for training models for which complex search algorithms must be used. SEARN contrasts with more typical algorithms such as CRFs and M3Ns based on considering how information is shared at test time. It share information at training time, without showing entire output. Limitation for SEARN: although training via Searn is likely preferable to training against only an initial policy, it can still be overly optimistic. there is a slight disparity between what SEARN does at a theoretical level and how SEARN functions in practice. Thank you!