In this article, I’ll walk you through how to create a language classification model that assigns text to language categories. Such a model can be applied to text collected from online sources to detect each document’s language and filter for the desired language before running a downstream analysis such as sentiment analysis.
Language classification is the grouping of related languages into the same category. Languages are grouped diachronically into language families; in other words, languages are grouped according to their development and evolution throughout history, with languages that descend from a common ancestor placed in the same family.
Language Classification with Machine Learning Using Python
First, I will import some of the common Python packages and modules used to manage data, compute metrics, and build and evaluate predictive models, as well as modules to visualize our data.

So let’s start the task of language classification with machine learning using Python by importing all the modules and packages needed for this task:
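As a sketch of what that import block might look like, I’ll assume pandas, NumPy, Matplotlib, and scikit-learn, since those cover the data handling, visualization, and modelling steps described in the rest of the article:

```python
# Data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
```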
Now, let’s import and clean our data. You can download the dataset that I am using in this task from here:
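The exact file and column names depend on the dataset you download, so the snippet below uses a tiny inline stand-in with hypothetical `text` and `language` columns; replace the `StringIO` source with `pd.read_csv("<your file>")` once you have the real data. The cleaning steps (dropping missing and empty texts, stripping whitespace) are typical for this kind of task:

```python
import io
import pandas as pd

# Tiny inline stand-in for the downloaded CSV; the "text" and
# "language" column names are assumptions, not the real schema.
raw = io.StringIO(
    "text,language\n"
    "The quick brown fox jumps over the lazy dog,English\n"
    "Die kat sit op die mat en kyk na die voels,Afrikaans\n"
    "De kat zit op de mat en kijkt naar buiten,Dutch\n"
)
df = pd.read_csv(raw)

# Basic cleaning: drop missing texts, strip whitespace, drop empty rows
df = df.dropna(subset=["text"])
df["text"] = df["text"].astype(str).str.strip()
df = df[df["text"] != ""]
```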
Language Classification: Feature Creation
I will now create a new set of features that will be used in the language classification model to classify text into three languages: English, Afrikaans, and Dutch. As mentioned, different features may be more effective in classifying other languages:
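The article builds a larger feature set than shown here; as a minimal sketch, assuming a DataFrame with a `text` column, a few simple surface features (including the `word_count` and `character_count` columns referenced later) could be derived like this:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple surface-level text features (a small subset of
    the full feature set described in the article)."""
    text = df["text"].astype(str)
    df["word_count"] = text.str.split().str.len()
    df["character_count"] = text.str.len()
    df["avg_word_length"] = df["character_count"] / df["word_count"]
    df["vowel_count"] = text.str.lower().str.count(r"[aeiou]")
    # Double vowels are common in Dutch and Afrikaans spelling
    df["double_vowel_count"] = text.str.lower().str.count(r"aa|ee|oo|uu")
    df["capital_count"] = text.str.count(r"[A-Z]")
    return df

sample = pd.DataFrame({"text": ["Die kat sit op die mat", "The cat sat"]})
sample = add_features(sample)
```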
Language Classification: Summarizing Features
After building the above feature set, we can calculate averages of these features by language to check if there are any obvious significant differences. To do this, simply run the command below:
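Assuming the features live in a DataFrame alongside a `language` label column, a pandas `groupby` gives the per-language averages. A small stand-in frame illustrates the pattern:

```python
import pandas as pd

# Stand-in frame: one engineered feature plus the language label
df = pd.DataFrame({
    "language": ["English", "English", "Afrikaans", "Dutch"],
    "word_count": [8, 10, 14, 9],
})

# Average of each numeric feature, grouped by language
feature_means = df.groupby("language").mean(numeric_only=True)
print(feature_means)
```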
Looking at the first feature, word_count, we can see that Afrikaans sentences tend to contain more words than English and Dutch ones.
Language Classification: Correlation
Next, we need to look at the degree of correlation between the features we have created. The idea behind correlation in the context of our task is that if two or more features are strongly correlated with each other, they will likely have very similar explanatory power when classifying languages.

As such, we can keep only one of these features and get the same predictive power from our model. To calculate the correlation matrix, we can run the following command:
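With the features in a DataFrame, pandas computes the pairwise correlation matrix directly (the columns below are stand-ins for the article's feature set):

```python
import pandas as pd

# Stand-in feature frame
df = pd.DataFrame({
    "word_count": [5, 8, 12, 20],
    "character_count": [24, 40, 61, 103],
    "capital_count": [1, 0, 2, 1],
})

# Pairwise Pearson correlation between all numeric features
corr = df.corr()
print(corr)
```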
We can also visualize the pairwise correlation matrix using the following command:
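One way to visualize the matrix is a heatmap; the sketch below uses plain Matplotlib (the article may well use a different plotting library, such as seaborn) and saves the figure to a hypothetical file name:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in feature frame and its correlation matrix
df = pd.DataFrame({
    "word_count": [5, 8, 12, 20],
    "character_count": [24, 40, 61, 103],
    "capital_count": [1, 0, 2, 1],
})
corr = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.tight_layout()
fig.savefig("correlation_matrix.png")
```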
Several of the variables are strongly positively correlated. For example, word_count and character_count have a correlation of around 96%, which means they convey roughly the same information about the length of a text in each language considered.
Language Classification: Splitting The Data
Before going any further in building our linguistic classification model, we need to divide the dataset into training and test sets:
- I divided the dataset into 80% training and 20% testing. Other splits can be used, but 80/20 is a common convention.
- The training set will be used to fit the models and store the parameters, the robustness of which will be tested on the test set.
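Using scikit-learn's `train_test_split`, the 80/20 split described above looks like this (the feature matrix and labels below are stand-ins for the article's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in feature matrix X and language labels y
X = pd.DataFrame({
    "word_count": range(10),
    "character_count": range(0, 50, 5),
})
y = ["English", "Afrikaans", "Dutch", "English", "Afrikaans",
     "Dutch", "English", "Afrikaans", "Dutch", "English"]

# 80% training, 20% testing; fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```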
We should aim to use only the most distinctive features in our classification models, as highly correlated variables add little to the models' predictive power.

One method used in machine learning to reduce the correlation between features is principal component analysis, or PCA:
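A sketch of that step with scikit-learn follows. PCA is scale-sensitive, so the features are standardised first; passing a float to `n_components` tells PCA to keep however many components are needed to retain that fraction of the variance. The random matrix stands in for the article's 20-feature training set:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the training set of 20 engineered features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))

# Standardise features: PCA is sensitive to scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Keep as many components as needed to retain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)
```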
After running the code above, you will notice that PCA reduced the number of features from 20 to 13 by transforming the original features into a new set of components that retain 95% of the variance of the original set.
Using the Decision Tree Algorithm
A decision tree model learns by dividing the training set into subsets based on an attribute value test, and this process is repeated over recursive partitions until the subset at a node has the same value as the target parameter, or until additional splitting no longer improves the predictive capacity of the model.
I will fit the decision tree classifier to the training set and save the model parameters to a pickle file, which can be loaded for future use. We then use the model to classify the texts in the test set into languages.
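A minimal sketch of that fit-save-load-predict cycle, using stand-in data and a hypothetical pickle file name:

```python
import pickle
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Stand-in training and test data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))
y_train = rng.choice(["English", "Afrikaans", "Dutch"], size=80)
X_test = rng.normal(size=(20, 5))

# Fit the decision tree on the training set
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Persist the fitted model to a pickle file for later reuse
with open("decision_tree.pkl", "wb") as f:
    pickle.dump(clf, f)

# Reload the model and classify the test set
with open("decision_tree.pkl", "rb") as f:
    model = pickle.load(f)
y_pred = model.predict(X_test)
```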
Now let’s have a look at the accuracy of our language classification model:
```python
accuracy_score_dt = accuracy_score(y_test, y_pred)
print(accuracy_score_dt)
```
The decision tree algorithm gave an accuracy of almost 90%. Now let’s have a look at the confusion matrix to see how the predictions break down across the three languages:
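One way to produce such a plot is scikit-learn's `ConfusionMatrixDisplay`; the labels below stand in for the actual test-set predictions, and the output file name is hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

labels = ["Afrikaans", "Dutch", "English"]

# Stand-in actual and predicted labels
y_test = np.array(["English", "Dutch", "Afrikaans",
                   "English", "Dutch", "Afrikaans"])
y_pred = np.array(["English", "Dutch", "Afrikaans",
                   "English", "English", "Afrikaans"])

# Rows = actual language, columns = predicted language
cm = confusion_matrix(y_test, y_pred, labels=labels)
disp = ConfusionMatrixDisplay(cm, display_labels=labels)
disp.plot(cmap="Blues")
plt.xlabel("Predicted language")
plt.ylabel("Actual language")
plt.tight_layout()
plt.savefig("confusion_matrix.png")
```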
The graph above shows how many texts were categorized correctly in each of the languages, with the y-axis representing the actual language and the x-axis representing the predicted language. This tells us that the model does well at predicting both English and Afrikaans texts.
I hope you liked this article on classifying languages with machine learning using Python. Please feel free to ask your questions in the comments section below.