Word representations - Boston Data Festival

Vector space word representations
Rani Nelken, PhD
Director of Research, Outbrain
Words = atoms?
That would be crazy for numbers, so why accept it for words?
The distributional hypothesis
What is a word?
Wittgenstein (1953): “The meaning of a word is its use in the language.”
Firth (1957): “You shall know a word by the company it keeps.”
From atomic symbols to vectors
• Map words to dense numerical vectors “representing” their contexts
• Map words with similar contexts to vectors with a small angle between them (see the cosine sketch after this list)
• Hard Clustering: Brown clustering
• Soft clustering: LSA, Random projections, LDA
• Neural nets
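A minimal sketch of the “small angle” idea, using made-up toy vectors rather than real embeddings: cosine similarity is the standard way to measure the angle between two word vectors.

import numpy as np

def cosine(u, v):
    """Cosine of the angle between u and v; 1.0 means identical direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors for illustration only
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.3])
car = np.array([0.0, 0.9, 0.1])

print(cosine(cat, dog))  # high: similar contexts, small angle
print(cosine(cat, car))  # lower: different contexts, larger angle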
Feedforward neural net language model
• Input: one-hot vectors of the context words
• We’re trying to learn a vector for each word
• such that the output is close to the one-hot vector of w(t) (rough sketch below)
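A rough sketch of such a model, in the spirit of the classic feedforward neural language model; the training loop is omitted and the matrix names (C, H, U) are shorthand of my own, not from the talk. Indexing the embedding matrix with a word id is equivalent to multiplying it by that word’s one-hot vector.

import numpy as np

V, d, n_ctx, h = 10, 5, 2, 8        # vocab size, embed dim, context size, hidden size
rng = np.random.default_rng(0)
C = rng.normal(0, 0.1, (V, d))      # embedding matrix: the vectors we learn
H = rng.normal(0, 0.1, (n_ctx * d, h))
U = rng.normal(0, 0.1, (h, V))

def forward(context_ids):
    """P(w(t) | context): embed, concatenate, hidden layer, softmax over vocab."""
    x = C[context_ids].reshape(-1)  # lookup == one-hot @ C, concatenated
    a = np.tanh(x @ H)              # hidden layer
    logits = a @ U
    p = np.exp(logits - logits.max())
    return p / p.sum()              # softmax: should peak at w(t) after training

probs = forward([3, 7])             # two context word ids
print(probs.argmax(), probs.sum())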
Simpler model: Word2Vec
What can we do with these vectors?
• Plug them into your existing classifier (see the sketch after this list)
• Plug them into further neural nets – better!
• Improves accuracy on many NLP tasks:
  – Named entity recognition
  – POS tagging
  – Sentiment analysis
  – Semantic role labeling
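An illustrative sketch (not from the talk) of the first bullet: represent a document by the average of its word vectors and feed that to an off-the-shelf classifier. The embedding table here is a toy stand-in for real pre-trained vectors.

import numpy as np
from sklearn.linear_model import LogisticRegression

dim = 4
embeddings = {  # toy vectors; in practice, load pre-trained embeddings
    "great": np.array([0.9, 0.1, 0.0, 0.2]),
    "awful": np.array([-0.8, 0.2, 0.1, 0.0]),
    "movie": np.array([0.1, 0.7, 0.3, 0.1]),
    "plot":  np.array([0.0, 0.6, 0.4, 0.2]),
}

def doc_vector(tokens):
    """Average the vectors of in-vocabulary tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.array([doc_vector(d.split()) for d in
              ["great movie", "awful plot", "great plot", "awful movie"]])
y = np.array([1, 0, 1, 0])  # 1 = positive sentiment

clf = LogisticRegression().fit(X, y)
print(clf.predict([doc_vector("great plot".split())]))  # expected: [1]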
Back to cheese…
• cos(crumbled, cheese) = 0.042
• cos(crumpled, cheese) = 0.203
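This kind of comparison is easy to reproduce with Gensim and a pre-trained model, assuming you have downloaded the GoogleNews vectors from the word2vec page linked at the end; the exact numbers depend entirely on the training corpus, so don’t expect to match the values above.

from gensim.models import KeyedVectors

# Pre-trained vectors distributed with the word2vec project
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(wv.similarity("crumbled", "cheese"))
print(wv.similarity("crumpled", "cheese"))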
And now for the magic
“Magical” property
• [Paris] - [France] + [Italy] ≈ [Rome]
• [king] - [man] + [woman] ≈ [queen]
• We can use it to solve word analogy problems
Boston : Red_Sox = New_York : ? (see the sketch below)
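A sketch of solving that analogy with Gensim’s vector arithmetic, assuming the same pre-trained GoogleNews vectors as above:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Boston : Red_Sox = New_York : ?  ->  [Red_Sox] - [Boston] + [New_York]
print(wv.most_similar(positive=["Red_Sox", "New_York"],
                      negative=["Boston"], topn=3))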
Why does it work?
[king] - [man] + [woman] ≈ [queen]
• For unit-length vectors, cos(x, [king] - [man] + [woman]) is proportional to cos(x, [king]) - cos(x, [man]) + cos(x, [woman])
• So the word x maximizing the combined cosine also maximizes the sum of the three cosines, and [queen] is a good candidate (numeric check below)
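A quick numeric check of that identity: for unit vectors, cos(x, k - m + w) equals the sum of the three cosines divided by the constant ||k - m + w||, which does not depend on x, so both scores rank candidates identically.

import numpy as np

rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)
k, m, w, x = (unit(rng.normal(size=50)) for _ in range(4))

lhs = np.dot(x, unit(k - m + w))
rhs = (np.dot(x, k) - np.dot(x, m) + np.dot(x, w)) / np.linalg.norm(k - m + w)
print(np.isclose(lhs, rhs))  # True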
It doesn’t always work
• London : England = Baghdad : ?
• We expect Iraq, but get Mosul
• We’re looking for a word that is close to Baghdad and to England, but not to London
Why did it fail?
• London : England = Baghdad : ?
• cos(Mosul, Baghdad) >> cos(Iraq, London): one dominant term swamps the additive score
• Fix: instead of adding the cosines, multiply them (see the sketch below)
• Improves accuracy
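The multiplicative combination is the 3CosMul scheme of Levy & Goldberg (2014), which Gensim exposes as most_similar_cosmul; the same pre-trained vectors are assumed as above.

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# London : England = Baghdad : ?  (multiplicative objective)
print(wv.most_similar_cosmul(positive=["England", "Baghdad"],
                             negative=["London"], topn=3))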
Implementations
• Open-source C implementation from Google
• Comes with pre-learned embeddings
• Gensim: fast Python implementation (training sketch below)
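A minimal training sketch with Gensim. The API shown is Gensim 4.x; older versions call the dimension parameter size instead of vector_size. The two-sentence corpus is a toy stand-in; real training needs far more text.

from gensim.models import Word2Vec

sentences = [["the", "cheese", "crumbled"],
             ["the", "paper", "crumpled"]]   # toy corpus

model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1)          # sg=1 -> skip-gram

print(model.wv.most_similar("cheese", topn=2))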
Active field of research
• Bilingual embeddings
• Joint word and image embeddings
• Embeddings for sentiment
• Phrase and document embeddings
Bigger picture: how can we make NLP less fragile?
• 90’s: Linguistic engineering
• 00’s: Feature engineering
• 10’s: Unsupervised preprocessing
• https://code.google.com/p/word2vec/
• http://www.cs.bgu.ac.il/~yoavg/publications/c
• http://radimrehurek.com/2014/02/word2vectutorial/
We’re hiring for NLP positions
