Thinking of Words as Numbers

‘Word Embeddings’, the building blocks of ‘NLP’

Vijit Mathur
3 min read · Oct 24, 2020
Words' relationship with numbers

Recently, while using Google Translate, I started to wonder: how do computers understand human languages? When I explored this conundrum, I learnt that computers can't process text directly; the text must first be converted into numbers. Researching further, I came across the term 'NLP', an acronym for 'Natural Language Processing'. NLP is the science of understanding human languages by converting them into numbers and using those numbers for tasks such as language translation. Words are converted into vectors of numbers, known as 'Word Embeddings' (yes, the vectors we studied in school and then forgot about!). Hence, the first task in developing any NLP application, such as language translation, is to convert words into 'word embedding' vectors.

Language Translation

Now, if you think about it, converting words into vectors makes a lot of sense. Just as relationships exist between vectors in a vector space, semantic relationships exist between the words of a language. For example, in a vector space, the dot product of two vectors can tell whether they are related. Similarly, Delhi has a relationship with India, and it is the same as the relationship between Washington and the United States. Hence, after converting words into vectors, such relationships can be established easily and mathematically. Thus, the equation below holds meaning once words are converted into vectors:

King - Man = Queen - Woman
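To see what this equation means in practice, here is a minimal sketch using tiny, hand-made 3-dimensional vectors (invented purely for illustration; real embeddings have hundreds of dimensions learned from text):

```python
import numpy as np

# Toy vectors, invented for illustration only -- real embeddings
# are learned automatically from large amounts of text.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

# The analogy: king - man + woman should land near queen.
result = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Find the vocabulary word closest to the result vector.
closest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(closest)  # queen
```

With real embeddings such as word2vec, the same arithmetic over learned vectors produces the same kind of analogy.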

Word Vectors

The earliest word embeddings, known as 'one-hot encoding', were quite simple: a vector with '0' in every position except one, which holds a '1' corresponding to the word. Now, with far more computing power at hand and far more language data available, researchers are able to develop more sophisticated word embeddings. These embeddings capture more information about a word's meaning, learnt from the contexts in which the word is used.
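One-hot encoding is simple enough to sketch in a few lines (the vocabulary here is an arbitrary example):

```python
# A minimal one-hot encoding over a tiny example vocabulary.
vocab = ["delhi", "india", "washington", "king", "queen"]

def one_hot(word):
    # '1' at the word's position in the vocabulary, '0' everywhere else.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("india"))  # [0, 1, 0, 0, 0]
```

Note the drawback: every vector is equally distant from every other, so one-hot encodings carry no information about meaning, which is exactly what the newer learned embeddings add.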

Converting words into word embeddings is just the first, but the most important, step in developing a complete system. The next step is to develop an algorithm that takes these embeddings as input and produces the relevant output. For example, a system can be developed that takes the word embeddings of an email's text and outputs whether the email is spam or not.
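The spam example above can be sketched end to end. Everything here is a stand-in: the 2-dimensional embeddings are made up, and the hand-picked weight vector plays the role of a classifier that a real system would learn from labelled emails:

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for a toy vocabulary;
# a real system would load pre-trained embeddings instead.
embeddings = {
    "win":     np.array([0.9, 0.1]),
    "free":    np.array([0.8, 0.2]),
    "prize":   np.array([0.9, 0.0]),
    "meeting": np.array([0.1, 0.9]),
    "report":  np.array([0.0, 0.8]),
}

def email_vector(words):
    # Represent the whole email as the average of its word embeddings.
    return np.mean([embeddings[w] for w in words], axis=0)

# A hand-picked weight vector standing in for a trained classifier.
spam_direction = np.array([1.0, -1.0])

def is_spam(words):
    # Score the email vector against the spam direction.
    return float(email_vector(words) @ spam_direction) > 0

print(is_spam(["win", "free", "prize"]))  # True in this toy setup
print(is_spam(["meeting", "report"]))     # False
```

The pipeline shape is the real point: words go in, embeddings are combined into one vector, and a downstream model turns that vector into a decision.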

I am excited to see all the advancements in the field of Natural Language Processing that can play a pivotal role in our society. For example, by translating educational content into regional languages, we can teach students even in the remotest corners of the world.
