Income Classification with Python

In this article, I will introduce you to a data science project on income classification with Python programming language. The objective of this task is to classify if a person earns more than 50K per year.

Data Science Project on Income Classification with Python

The dataset I’m going to use here is publicly available data, so we don’t have to collect and scratch the data, just load it into memory. So let’s start the income classification task with Python by importing the necessary Python libraries and the dataset:

Also, Read – 100+ Machine Learning Projects Solved and Explained.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('income_evaluation.csv')
df.isnull().any()
age                False
 workclass         False
 fnlwgt            False
 education         False
 education-num     False
 marital-status    False
 occupation        False
 relationship      False
 race              False
 sex               False
 capital-gain      False
 capital-loss      False
 hours-per-week    False
 native-country    False
 income            False
dtype: bool

There is a column that has a rather obscure name: fnlwgt. Upon closer inspection, this variable is translated as “final weight” which represents the total number of people matching that particular row of information. Another thing to note is that each name has a space in front of it. We need to delete it:

Index(['age', 'workclass', 'final_weight', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hrs_per_week', 'native_country',
       'income'],
      dtype='object')

Some of the variables have binary or discrete values. We can apply to encode or transform some of the variables from string to category. Since ā€œincomeā€ is our target variables, we want it to be numeric for ease of calculation. I’m going to create new variables derived from “income ”:

df['income'].unique()
df['income_encoded'] = [1 if value == ' >50K' else 0 for value in df['income'].values]
df['income_encoded'].unique()
# Let's check some descriptive statistics
df.describe()
agefinal_weighteducation_numcapital_gaincapital_losshrs_per_weekincome_encoded
count32561.0000003.256100e+0432561.00000032561.00000032561.00000032561.00000032561.000000
mean38.5816471.897784e+0510.0806791077.64884487.30383040.4374560.240810
std13.6404331.055500e+052.5727207385.292085402.96021912.3474290.427581
min17.0000001.228500e+041.0000000.0000000.0000001.0000000.000000
25%28.0000001.178270e+059.0000000.0000000.00000040.0000000.000000
50%37.0000001.783560e+0510.0000000.0000000.00000040.0000000.000000
75%48.0000002.370510e+0512.0000000.0000000.00000045.0000000.000000
max90.0000001.484705e+0616.00000099999.0000004356.00000099.0000001.000000

Observations from the above statistics:

  1. In the dataset the mean and median age is similar, I guess it will be a normal distribution, we will check it later using visualizations.
  2. The variables of capital gain and loss are suspect. All observations greater than 0 are in the 4th quartile.
  3. In the “hrs_per_week” columns, the min is 1 and the max is 99, which is not common in real life. We will have to investigate this later.
  4. Only about a quarter of the population can earn more than 50,000 a year.

Income Classification

Let’s see how each profession plays out by comparing the number of people earning over 50K. We’ll look at the total number of workers for each area and the total number of people earning over 50K in each:

df[df['income'] == ' >50K']['occupation'].value_counts().head(3)
pd.crosstab(df["occupation"], df['income']).plot(kind='barh', stacked=True, figsize=(20, 10))
income classification

Observations:

  1. The 3 main occupations in total number are the professional speciality, home repair, executive management.
  2. The top 3 occupations in terms of a total number of people earning more than 50K (in order) are Executive, Occupational Specialties and Handicraft Sales and Repairs (with a close margin).
  3. Senior executives have the highest percentage of people earning more than 50,000 people: 48%.

I hope you liked this article on income classification with Python programming language. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of datašŸ“ˆ.

Articles: 1535

Leave a Reply