Natural Language Processing (NLP) is the branch of Machine Learning concerned with working with human languages. However, you must have noticed that most NLP work is done only in English. So what about all the other languages we have? In this article, I will take you through NLP for other languages with Machine Learning.
Everyone knows India is a very diverse country and a hotbed of many languages, but did you know that around 780 languages are spoken in India? It's time to move beyond English when it comes to NLP. This article is intended for those who know a little about NLP and want to start applying it to other languages.
NLP for Other Languages
Before we get into the task of NLP for other languages, let's take a look at some essential concepts and recent achievements in NLP. NLP helps computers understand human language. Text classification, information extraction, semantic analysis, question answering, text generation, machine translation and chatbots are some applications of NLP.
For computers to understand human language, we must first represent words in a numerical form. These numerical representations can then be used by machine learning models to perform any NLP task. Traditionally, methods like One-Hot Encoding and TF-IDF have been used to represent text as numbers. But these traditional methods produce sparse representations that do not capture the meaning of a word.
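To see why such representations are sparse and meaning-blind, here is a minimal one-hot encoding sketch in plain Python (the toy corpus and helper function are made up for illustration; the same limitation applies to TF-IDF):

```python
# A tiny toy corpus (illustrative only)
corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Build a vocabulary index over all unique words
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector as long as the vocabulary with a single 1."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print("Vocabulary:", vocab)
print("one_hot('cat'):", one_hot("cat"))

# 'cat' and 'dog' get completely unrelated (orthogonal) vectors,
# even though the words are semantically similar.
print("cat . dog =", sum(a * b for a, b in zip(one_hot("cat"), one_hot("dog"))))
```

With a real vocabulary of tens of thousands of words, each vector would be mostly zeros, and no two different words would ever look similar.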
Neural word embeddings then came to the rescue by solving the problems of the traditional methods. Word2Vec and GloVe are the two most commonly used word embedding methods. These methods produce dense representations in which words with similar meanings have similar vectors. A significant weakness of these methods is that each word is assumed to have only one meaning. But we know that a word can have many meanings depending on the context in which it is used.
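The idea of "similar meanings, similar vectors" is usually measured with cosine similarity. Below is a minimal sketch using small made-up 3-dimensional vectors (real Word2Vec or GloVe embeddings typically have 100-300 dimensions and come from a trained model):

```python
import math

# Toy 3-d embeddings, made up for illustration
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words end up closer together than unrelated ones
print("king ~ queen:", round(cosine(embeddings["king"], embeddings["queen"]), 3))
print("king ~ apple:", round(cosine(embeddings["king"], embeddings["apple"]), 3))
```

Note that "king" has exactly one vector here no matter the sentence it appears in, which is precisely the single-meaning weakness described above.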
NLP has leapt forward with the modern family of language models. Word representations are no longer independent of context: the same word can have multiple vector representations depending on the context in which it is used. BERT, ELMo, ULMFiT and GPT-2 are currently popular language models. The latest generation is so good that some people consider it dangerous. Text written by these language models has even been rated by readers as credible as the New York Times.
NLP for Other Languages in Action
I will now get into the task of NLP for other languages by obtaining word embeddings for Indian languages. The numerical representation of words plays a role in any NLP task. We are going to use the iNLTK (Natural Language Toolkit for Indic Languages) library. You can easily install the iNLTK library by using the pip command: pip install inltk.
The languages provided by the iNLTK library include Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Sanskrit, Tamil and Urdu.
Using iNLTK, we can quickly get the embedding vectors for sentences written in Indian languages. Below is an example that shows how to get the embedding vectors for a sentence written in Hindi. The given sentence will be split into tokens, and each token will be represented by a vector. A token can be a word or a subword. Since tokens can be subwords, we can also get meaningful vector representations for rare words.
Let’s see how to use inltk library for NLP for other languages:
from inltk.inltk import setup
from inltk.inltk import tokenize
from inltk.inltk import get_embedding_vectors

setup('hi')

example_sent = "बहुत समय से मिले नहीं"

# Tokenize the sentence
example_sent_tokens = tokenize(example_sent, 'hi')

# Get the embedding vector for each token
example_sent_vectors = get_embedding_vectors(example_sent, 'hi')

print("Tokens:", example_sent_tokens)
print("Number of vectors:", len(example_sent_vectors))
print("Shape of each vector:", len(example_sent_vectors[0]))
Output:

Tokens: ['▁बहुत', '▁समय', '▁से', '▁मिले', '▁नहीं']
Number of vectors: 5
Shape of each vector: 400
We have got the word embeddings in the output above: five tokens, each represented by a 400-dimensional vector. Next is NLP tasks for Indian languages.
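One common way to use these per-token vectors downstream is to average them into a single fixed-size sentence vector, which can then be fed to a classifier. A minimal sketch, using small made-up 3-dimensional vectors in place of the 400-dimensional iNLTK output:

```python
# Stand-ins for the per-token vectors returned by get_embedding_vectors
# (real iNLTK vectors are 400-dimensional; these are made up)
token_vectors = [
    [0.2, 0.4, 0.1],
    [0.6, 0.0, 0.3],
    [0.1, 0.5, 0.2],
]

def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

sentence_vector = average_vectors(token_vectors)
print("Sentence vector:", [round(x, 3) for x in sentence_vector])
# The result has the same dimensionality as each token vector,
# regardless of how many tokens the sentence contains.
```

Averaging is a simple baseline; more sophisticated pooling schemes exist, but this is often enough to get a usable sentence representation for classification.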
Multiple NLP Tasks for Indic Languages
Numerically represented natural language can be used by machine learning models to perform many NLP tasks. Beyond embeddings, we can use iNLTK directly for several NLP tasks.
In the example below, we will use iNLTK to predict the next n words and to get similar sentences. For an input like "It's been a while since we last met" in Tamil, we get a next-word prediction like "And, because of this". The results for the similar-sentence task are also impressive:
from inltk.inltk import setup
from inltk.inltk import predict_next_words
from inltk.inltk import get_similar_sentences

setup('ta')

example_sent = "உங்களைப் பார்த்து நிறைய நாட்கள் ஆகிவிட்டது"

# Predict the next 'n' tokens
n = 5
pred_sent = predict_next_words(example_sent, n, 'ta')

# Get 'n' similar sentences
n = 2
simi_sent = get_similar_sentences(example_sent, n, 'ta')

print("Predicted Words:", pred_sent)
print("Similar Sentences:", simi_sent)
Output:

Predicted Words: உங்களைப் பார்த்து நிறைய நாட்கள் ஆகிவிட்டது. மேலும், இதற்கு காரணமாக
Similar Sentences: ['உங்களைத் பார்த்து நாட்கள் ஆகிவிட்டது ', 'உங்களைப் பார்த்து ஏராளமான நாட்கள் ஆகிவிட்டது ']
It is time to move beyond English and use the real power of NLP to serve everyone, in every language. Much recent research has focused on multilingual NLP.
So this is how we can use NLP for other languages with Machine Learning. I hope you liked this article on NLP for other languages with Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.