Raghotham Sripadraj

10 Oct 2015

Word Embedding

Word embedding is a technique for converting words into vectors in a high-dimensional space. In simple terms, each dimension groups words according to a particular aspect, such as gender or colour, and scores the words based on their similarity along that aspect.

For example — “I have a red car, maroon shirt and a grey bicycle”

One of the dimensions can represent colour. Red, maroon and grey are assigned similar scores, while the rest of the words get very different ones. Another dimension can represent the type of object: car and bicycle are assigned similar scores because both are vehicles.

The output of word embedding is simply a matrix with one row per word. Each row is that word's vector, holding its score in each dimension.

W(red) = [0.5, 0.08, 1.2, …]

W(grey) = [0.4, 0.78, 0.01, …]
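
To make this concrete, here is a minimal Python sketch. The vectors and the two dimensions ("colour-ness" and "vehicle-ness") are made up for illustration, not taken from a trained model, and cosine similarity is just one standard way to compare word vectors:

```python
import numpy as np

# Made-up 2-dimensional word vectors, for illustration only.
# Dimension 0 ~ "colour-ness", dimension 1 ~ "vehicle-ness".
W = {
    "red":     np.array([0.90, 0.10]),
    "maroon":  np.array([0.85, 0.15]),
    "car":     np.array([0.10, 0.90]),
    "bicycle": np.array([0.20, 0.80]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point the same way; near 0 means unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(W["red"], W["maroon"]))   # high: both colours
print(cosine_similarity(W["car"], W["bicycle"]))  # high: both vehicles
print(cosine_similarity(W["red"], W["car"]))      # low: different aspects
```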

Why?

By building word vectors, we are building relations between words, not just along one dimension but along hundreds.

With this matrix in hand (see the sketch after this list for how it can be learned), we can perform powerful tasks like:

  1. Grammar check
  2. Sentiment analysis
  3. Next word prediction
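
To see where such a matrix comes from, here is a rough sketch that trains a tiny model with the gensim library's Word2Vec. Gensim is my assumption here, not something this post relies on, and the parameter names below follow gensim 4.x (older versions use `size` instead of `vector_size`). A real model needs far more text than three sentences:

```python
from gensim.models import Word2Vec

# A toy corpus; real training uses millions of sentences.
sentences = [
    ["i", "have", "a", "red", "car"],
    ["i", "have", "a", "maroon", "shirt"],
    ["i", "have", "a", "grey", "bicycle"],
]

# Each word is mapped to a 50-dimensional vector.
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=100)

print(model.wv["red"])               # the learned vector W(red)
print(model.wv.most_similar("red"))  # words whose vectors are closest
```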

How?

For grammar checking, we run a function *G* on a group of words (an n-gram) and predict whether the input is grammatically correct. This means we need to learn both *W* and *G*.

What makes word embedding powerful is that we do not have to train the model on every possible n-gram. Rather, we train it on a sample of occurrences, and the model generalizes what it learns to the entire class of similar words.

For example, we train the model on a large number of sentences, one of which is:

“I have a red car, maroon shirt and a grey bicycle”

Now we ask the model to predict whether the sentence below is grammatically correct:

“I have a blue car, orange shirt and a brown truck”

Though we have changed several words, *G* still predicts this as grammatically correct. This is because the vectors (*W*) for red, blue, maroon, orange, grey and brown are very similar, and the same applies to car, truck and bicycle. When the inputs to *G* have similar values of *W*, the output of *G* will be similar too.
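
Here is a hypothetical sketch of why this works: *G* never sees the words themselves, only their vectors, so swapping a word for one with a nearby vector barely moves *G*'s output. Everything below is made up for illustration; the random weights stand in for a trained *G*:

```python
import numpy as np

dim = 4
rng = np.random.default_rng(42)

# Made-up vectors; "red" and "blue" are deliberately close, as a trained
# W would make them, while "a" and "car" sit in different regions.
W = {
    "a":    np.array([0.05, 0.02, 0.90, 0.10]),
    "red":  np.array([0.90, 0.10, 0.30, 0.40]),
    "blue": np.array([0.88, 0.12, 0.28, 0.41]),
    "car":  np.array([0.10, 0.90, 0.50, 0.20]),
}

# Random weights standing in for the learned parameters of G.
weights = rng.normal(size=3 * dim)

def G(trigram):
    # Concatenate the three word vectors and apply a linear layer plus a
    # sigmoid, yielding a "probability that the 3-gram is grammatical".
    x = np.concatenate([W[w] for w in trigram])
    return 1.0 / (1.0 + np.exp(-weights @ x))

# Nearly identical scores, because W("red") is close to W("blue").
print(G(("a", "red", "car")))
print(G(("a", "blue", "car")))
```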

The same technique can be used for gender classification, sentiment analysis, next-word prediction and even language translation.

All of the above applications show the power of word embeddings.
