CoNLL Poster

Report
Design Challenges and Misconceptions in Named Entity Recognition
Lev Ratinov and Dan Roth
Named entity recognition problem: identify people,
locations, organizations and other entities in text.
Demo: http://L2R.cs.uiuc.edu/~cogcomp/LbjNer.php
Download: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=FLBJNE
How to represent the NE’s?
Inference algorithm choice:
There are at least 5 representation
schemas that vary in the way they mark
the beginning, the end and single-token
entities. The letters used are B(egin),
E(nd), L(ast), I(nside), U(nit), and
O(ther). The more states we use, the
more expressive the model is, but it may
require more training data to converge.
Viterbi is the most popular inference
algorithm in NER. It allows to find the most
likely label assignment in single-order
HMM/CRF in time quadratic in the number of
states and linear in the input size. When nonlocal features are used, exact dynamic
programming with Viterbi becomes
intractable, and Beam search is usually used.
We show that greedy left-to-right decoding
(beam search of size one) performs
comparably to Viterbi and Beam search,
while being much faster. The reason is that in
NER, short NE-chunks are separated by many
O-labeled tokens, which “break” the
likelihood maximization problem to short
independent chunks, where local features
perform well. The non-local dependencies in
NER have a very long range, which the
second-order state transition features fail to
model.
With the label set: Per/Org/Loc/Misc/O, and
the BILOU encoding, there are 17 states (BPer, I-Per, U-Per, etc.), and with second order
Viterbi, we need to make 173 decisions in
each step, in contrast to only 17 necessary in
greedy decoding.
BIO1 BIO2 IOE1 IOE2 BILOU
retain
O
O
O
O
O
the
O
O
O
O
O
Golan
I-loc
B-loc I-loc
Heights
I-loc
I-loc
Israel
B-loc B-loc I-loc
B-loc B-loc
E-loc E-loc L-loc
E-loc U-loc
captured O
O
O
O
O
from
O
O
O
O
O
Syria
I-loc
B-loc I-loc
Key result:
E-loc U-loc
BILOU performs the best overall on 5
datasets, and better than BIO2, which is
most commonly used for NER.
Main Results:
A typical solution is modeling as a structured problem: HMM/CRF:
…
B-PER
I-PER
B-ORG
…
Reggie
Blinker
Feyenoord
How to inject knowledge?
How to model non-local information?
Common intuition- the same tokens should be assigned the same labels.
We have compared three approaches proposed in the literature, and found that no single
approach consistently outperformed the rest on a collection of 5 datasets. However, the
combination of the features was the most stable and generally performed the best.
Context Aggregation- for all the instances of a token, collect the contexts, then extract
features from the context. For example, all the instances of FIFA the sample text, will be
added context aggregation features extracted from:
suspension lifted by
on Friday and
FIFA
two games after
Slapped a worldwide
Two-stage Prediction Aggregation- the model tags the text in the first round of
inference. Then , for each token, we mark the most frequent label it was assigned, and if
there were named entities found in the document which contained the token, we mark
this information too. We use these statistics as features in a second round of inference.
Extended Prediction History - the previous two techniques treat all the tokens in the
document identically. However, it is often the case that the easiest named entities to
identify occur in the beginning of the document. When using beam search or greedy
decoding, we collect the statistic for each token about how often it was previously
assigned each of the labels.
Conclusions:
We have presented a simple model for NER
that uses expressive features to achieve
new state of the art performance on the
Named Entity recognition task. We
explored four fundamental design decisions:
text chunks representation, inference
algorithm, using non-local features and
external knowledge.
We investigated two knowledge resources: word class models
extracted from unlabeled text and extracting Gazetteers from
Wikipedia. They are independent and additive.
Word class models hierarchically cluster words based on
distribution-similarity-like measure. Two clusters are merged if they
tend to appear in similar contexts. For example, clustering Friday
and Monday together, will allow us to abstract both as day of the
week and allow us to fight the data sparsity problem common in
NLP tasks. Sample binary clustering is given below:
Deciding which level of abstraction to use is challenging, however
similarly to Hoffman encoding, each word can be assigned a path
from the root to the leaf. For example, the encoding of IBM is 011.
By taking substrings of different length, we can consider different
levels of abstraction.
Gazetteers: In this work, we use a collection of 14 high-precision,
low-recall lists extracted from the web that cover common names,
countries, monetary units, temporal expressions, etc. While these
gazetteers have excellent accuracy, they do not provide
sufficient coverage. To further improve the coverage, we have
extracted 16 gazetteers from Wikipedia, which collectively contain
over 1.5M entities.
Wikipedia is an open, collaborative encyclopedia with several
attractive properties. (1) It is kept updated manually by it
collaborators, hence new entities are constantly added to it. (2) It
contains redirection pages, mapping several variations of spelling of
the same name to one canonical entry. For example, Suker is
redirected to an entry about Davor Šuker, the Croatian footballer (3)
The entries in Wikipedia are manually tagged with categories.
For example, the entry about the Microsoft in Wikipedia has the
following categories: Companies listed on NASDAQ; Cloud
computing vendors; etc. We used the category structure in
Wikipedia to group the titles and the corresponding redirects into
Higher-level concepts, such as: locations, organizations, artworks,
etc.

similar documents