In this article, I will introduce you to a data science project on income classification with the Python programming language. The objective of this task is to classify whether a person earns more than 50K per year.
Data Science Project on Income Classification with Python
The dataset I’m going to use here is publicly available, so we don’t have to collect or scrape the data; we can just load it into memory. So let’s start the income classification task with Python by importing the necessary Python libraries and the dataset:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('income_evaluation.csv')
df.isnull().any()
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
income            False
dtype: bool
There is a column with a rather obscure name: fnlwgt. On closer inspection, it stands for “final weight”, which represents the total number of people matching that particular row of information. Another thing to note is that each column name has a space in front of it. We need to remove that space, and we can give the columns more readable names at the same time:
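A minimal sketch of that cleanup, assuming the raw headers look like ' fnlwgt' or ' education-num'. The helper name `clean_column_names` and the `RENAMES` mapping are hypothetical; the target names are inferred from the cleaned column index shown in this article.

```python
import pandas as pd

# Hypothetical mapping, inferred from the cleaned column index in this article.
RENAMES = {
    "fnlwgt": "final_weight",
    "hours_per_week": "hrs_per_week",
}


def clean_column_names(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip surrounding spaces, swap hyphens for underscores, apply RENAMES."""
    cleaned = frame.columns.str.strip().str.replace("-", "_", regex=False)
    out = frame.copy()  # leave the caller's frame untouched
    out.columns = cleaned
    return out.rename(columns=RENAMES)


# Tiny stand-in frame with the same kind of raw headers (not the real data).
raw = pd.DataFrame(columns=["age", " fnlwgt", " education-num", " hours-per-week"])
print(list(clean_column_names(raw).columns))
# → ['age', 'final_weight', 'education_num', 'hrs_per_week']
```

On the real data the call would be `df = clean_column_names(df)`.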
Index(['age', 'workclass', 'final_weight', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hrs_per_week', 'native_country', 'income'], dtype='object')
Some of the variables have binary or discrete values, so we could encode them or convert them from string to category. Since “income” is our target variable, we want it to be numeric for ease of calculation. I’m going to create a new variable derived from “income”:
df['income'].unique()
df['income_encoded'] = [1 if value == ' >50K' else 0 for value in df['income'].values]
df['income_encoded'].unique()

# Let's check some descriptive statistics
df.describe()
Observations from the above statistics:
- The mean and median age are similar, which suggests a roughly symmetric, possibly normal, distribution. We will check this later using visualizations.
- The capital gain and capital loss variables look suspect: every observation greater than 0 falls in the 4th quartile, so at least 75% of the rows are 0.
- In the “hrs_per_week” column, the min is 1 and the max is 99, both of which are uncommon in real life. We will have to investigate this later.
- Only about a quarter of the people in the dataset earn more than 50K a year.
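The observations above can be backed with a few numbers. The following is a sketch, not part of the original analysis: the helper `summarize` is hypothetical, the column names follow the cleaned index and the `income_encoded` column from this article, and the toy rows are made up purely to show the call. A skewness close to 0 is consistent with a symmetric distribution but does not prove normality.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> dict:
    """Return a few numbers that back the observations listed above."""
    hours = frame["hrs_per_week"]
    return {
        # Skewness near 0 is consistent with a symmetric shape.
        "age_skew": frame["age"].skew(),
        # If the 75th percentile is 0, non-zero values only occur in the top quartile.
        "capital_gain_q75": frame["capital_gain"].quantile(0.75),
        "capital_gain_nonzero_share": (frame["capital_gain"] > 0).mean(),
        # Rows sitting at the extremes of the weekly-hours range.
        "extreme_hours_rows": int(((hours <= 1) | (hours >= 99)).sum()),
        # The mean of a 0/1 column is the share of ones.
        "high_income_share": frame["income_encoded"].mean(),
    }


# Made-up toy rows, only to demonstrate the call.
toy = pd.DataFrame({
    "age": [25, 30, 38, 46, 51],
    "capital_gain": [0, 0, 0, 0, 5000],
    "hrs_per_week": [40, 1, 99, 45, 40],
    "income_encoded": [0, 0, 0, 0, 1],
})
stats = summarize(toy)
for name, value in stats.items():
    print(name, value)
```

On the real data, `summarize(df)` would return the same keys computed over every row.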
Let’s see how each occupation compares in terms of the number of people earning over 50K. We’ll look at the total number of workers in each occupation and the number of them earning over 50K:
df[df['income'] == ' >50K']['occupation'].value_counts().head(3)
pd.crosstab(df["occupation"], df['income']).plot(kind='barh', stacked=True, figsize=(20, 10))
- The 3 largest occupations by total head count are Prof-specialty (professional specialty), Craft-repair, and Exec-managerial (executive and managerial).
- The top 3 occupations by number of people earning more than 50K are, in order, Exec-managerial, Prof-specialty, and Sales, with Craft-repair close behind.
- Exec-managerial has the highest percentage of people earning more than 50K: 48%.
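The percentage in the last point can be computed directly instead of being read off the chart. A minimal sketch, assuming the raw `occupation` and `income` values carry a leading space as in the code above; the helper name `high_income_share_by_occupation` is hypothetical and the toy rows are made up.

```python
import pandas as pd


def high_income_share_by_occupation(frame: pd.DataFrame) -> pd.Series:
    """Share of rows labeled '>50K' within each occupation, highest first."""
    # 1 for rows labeled '>50K', 0 otherwise (whitespace stripped first).
    is_high = (frame["income"].str.strip() == ">50K").astype(int)
    occupation = frame["occupation"].str.strip()
    # The mean of a 0/1 column within a group is that group's share of ones.
    return is_high.groupby(occupation).mean().sort_values(ascending=False)


# Made-up toy rows, only to demonstrate the call.
toy = pd.DataFrame({
    "occupation": [" Exec-managerial", " Exec-managerial",
                   " Sales", " Sales", " Sales", " Sales"],
    "income": [" >50K", " <=50K", " >50K", " <=50K", " <=50K", " <=50K"],
})
shares = high_income_share_by_occupation(toy)
print(shares)
```

On the toy rows, Exec-managerial comes out at 0.5 and Sales at 0.25. Calling the function on `df` gives the per-occupation shares for the real data.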
I hope you liked this article on income classification with the Python programming language. Feel free to ask your questions in the comments section below.