In this article, I will take you through how we can classify the nationalities of people by using their names. You will be thinking about how we can classify nationalities by using just names. There is a lot about how we can play with names.
Let’s get started with this machine learning task to classify nationalities by importing the necessary packages. I will classify nationalities based on names as Indian or Non-Indian. So, let’s import some packages and get started with the task:
Now, let’s import the datasets. The datasets I am using here in this article can be easily downloaded from here. Now after importing the datasets I will prepare two helper functions for data cleaning and data processing:
male_data = pd.read_csv(male.csv) female_data = pd.read_csv(femaile.csv)
After loading and removing the wrong entries in the data, we got a few records around 13,000.
For non-Indian names, there is a nifty package called Faker. This generates names from different regions:
We have generated approximately the same number of names as we have in the Indian data set. We then removed samples longer than 5 words. The Indian data set contained a lot of names with just first names. So we need to make the overall non-Indian distribution also similar.
non_indian_data.head()Code language: CSS (css)
We end up with about 14,000 non-Indian names and 13,000 Indian names. Now let’s build a neural network to classify nationalities using names:
So this is how we can easily classify nationalities with machine learning. I did not include the full code and exploration here, you can have a look at the full code from here. Feel free to ask your valuable questions in the comments section below.