In this article, I will take you through how to classify people's nationalities using only their names. You may be wondering how a name alone can reveal a nationality. As it turns out, there is a lot we can do with names.
Classify Nationalities
Let's get started with this machine learning task. I will classify names as Indian or Non-Indian. So, let's import the necessary packages:
from tensorflow import keras
import tensorflow as tf
import pandas as pd
import os
import re
Now, let's import the datasets. The datasets used in this article can be downloaded from here. After loading them, I will prepare two helper functions for data cleaning and data processing:
# File names must be passed as strings
male_data = pd.read_csv("male.csv")
female_data = pd.read_csv("female.csv")
13754
After loading the data and removing the invalid entries, we are left with around 13,000 records.
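The two helper functions are not shown in the article, so here is a minimal sketch of what the cleaning and processing steps could look like. The names `clean_name` and `process_names`, the `min_chars` threshold, and the decision to drop duplicates are all assumptions made for illustration, not the original implementation:

```python
import re

def clean_name(raw):
    """Lowercase a raw name and keep only letters and single spaces.

    Hypothetical helper: non-string values (e.g. missing cells) become "".
    """
    if not isinstance(raw, str):
        return ""
    name = raw.lower()
    name = re.sub(r"[^a-z\s]", " ", name)     # digits, punctuation, symbols -> space
    return re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace

def process_names(raw_names, min_chars=2):
    """Clean every name, then drop too-short entries and duplicates.

    The order of first appearance is preserved. `min_chars` is an assumed
    threshold for what counts as a wrong entry.
    """
    seen, result = set(), []
    for raw in raw_names:
        name = clean_name(raw)
        if len(name) >= min_chars and name not in seen:
            seen.add(name)
            result.append(name)
    return result
```

With pandas, something like `process_names(male_data["name"])` would apply this to a loaded column, assuming the column is called `name`.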
For non-Indian names, there is a nifty package called Faker, which generates names for different regions:
from faker import Faker
fake = Faker('en_US')
fake.name()
'Brian Evans'
We generated approximately as many names as there are in the Indian data set and then removed samples longer than 5 words. The Indian data set contains many entries that consist of a first name only, so the non-Indian names need a similar word-count distribution.
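The generation and filtering steps described above can be sketched as follows. This is only an illustration: the function `build_non_indian_names`, its `first_name_only_ratio` parameter, and the random trimming to a first name are assumptions about one possible way to bring the word-count distribution closer to the Indian data, not the method actually used:

```python
import random

def build_non_indian_names(generate_name, n, max_words=5,
                           first_name_only_ratio=0.0, seed=0):
    """Generate up to n names and return (name, count_words) rows.

    generate_name: any zero-argument callable returning a full name,
                   for example Faker('en_US').name
    first_name_only_ratio: assumed share of samples cut down to the
                   first word, to mimic first-name-only entries
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        words = generate_name().lower().split()
        if not words or len(words) > max_words:
            continue  # drop empty samples and samples longer than max_words
        if rng.random() < first_name_only_ratio:
            words = words[:1]  # keep the first name only
        rows.append((" ".join(words), len(words)))
    return rows
```

Calling `build_non_indian_names(Faker('en_US').name, 14000)` and wrapping the rows in `pd.DataFrame(rows, columns=["name", "count_words"])` would give a frame with the same two columns as the one below.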
non_indian_data.head()
| | name | count_words |
|---|---|---|
| 0 | sara gulbrandsen | 2 |
| 1 | kathryn villarreal | 2 |
| 2 | jennifer mccormick | 2 |
| 3 | james eaton | 2 |
| 4 | melissa bond | 2 |
We end up with about 14,000 non-Indian names and 13,000 Indian names. Now let's build a neural network to classify nationalities using names. Here are its predictions for a few sample names:
| | names | predictions_lstm_char |
|---|---|---|
| 0 | lalitha | indian |
| 1 | tyson | non_indian |
| 2 | shailaja | indian |
| 3 | shyamala | indian |
| 4 | vishwanathan | indian |
| 5 | ramanujam | indian |
| 6 | conan | non_indian |
| 7 | kryslovsky | non_indian |
| 8 | ratnani | indian |
| 9 | diego | non_indian |
| 10 | kakoli | indian |
| 11 | shreyas | indian |
| 12 | brayden | non_indian |
| 13 | shanon | non_indian |
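The column name `predictions_lstm_char` suggests a character-level LSTM. The model code is not part of this article, so the following is only a minimal sketch of such a classifier in Keras. The alphabet, `MAX_LEN`, the layer sizes, and the four toy training names (taken from the table above) are all assumptions chosen for illustration:

```python
import numpy as np
from tensorflow import keras

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved for padding
MAX_LEN = 40  # assumed maximum name length in characters

def encode(names, max_len=MAX_LEN):
    """Map each name to a row of character ids, right-padded with zeros."""
    batch = np.zeros((len(names), max_len), dtype="int32")
    for row, name in enumerate(names):
        ids = [CHAR_TO_ID[ch] for ch in name.lower() if ch in CHAR_TO_ID][:max_len]
        batch[row, :len(ids)] = ids
    return batch

def build_model(vocab_size=len(ALPHABET) + 1):
    """Embedding -> LSTM -> sigmoid; the output is the probability of 'indian'."""
    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, 16, mask_zero=True),  # ignore padding
        keras.layers.LSTM(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Toy data only: 1 = indian, 0 = non_indian. Real training would use the
# roughly 27,000 cleaned names described earlier.
names = ["lalitha", "tyson", "shreyas", "brayden"]
labels = np.array([1, 0, 1, 0])

model = build_model()
model.fit(encode(names), labels, epochs=1, verbose=0)

# One probability per input name; values above 0.5 would be read as 'indian'.
probs = model.predict(encode(["kakoli"]), verbose=0)
```

Working on characters rather than whole words lets the model score names it has never seen, since it learns from letter patterns instead of a fixed vocabulary of names.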
So this is how we can classify nationalities with machine learning. I did not include the full code and exploration here; you can have a look at the full code from here. Feel free to ask your valuable questions in the comments section below.