# Income Classification with Python

In this article, I will introduce you to a data science project on income classification with the Python programming language. The objective of this task is to classify whether a person earns more than 50K per year.

## Data Science Project on Income Classification with Python

The dataset I’m going to use here is publicly available, so we don’t have to collect or scrape the data; we just load it into memory. Let’s start the income classification task by importing the necessary Python libraries and the dataset:


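```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the dataset is saved locally as "adult.csv" (hypothetical path; adjust to your copy)
df = pd.read_csv("adult.csv")
# Check each column for missing values
df.isnull().any()
```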
```
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
income            False
dtype: bool
```

One column has a rather obscure name: fnlwgt. On closer inspection, it stands for “final weight”, which represents the total number of people matching that particular row of information. Another thing to note is that each column name has a space in front of it, which we need to remove. After stripping the spaces and giving the columns more readable names, the index looks like this:

```
Index(['age', 'workclass', 'final_weight', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hrs_per_week', 'native_country',
       'income'],
      dtype='object')
```
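
The code that produced this cleaned index is not shown. A minimal sketch of one way to get there, assuming the raw names carry a leading space and use hyphens; the `raw` frame and the `clean_columns` helper are hypothetical names introduced only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset: only the column names matter here
raw = pd.DataFrame(columns=[" age", " fnlwgt", " education-num", " hours-per-week", " income"])

def clean_columns(frame):
    """Return a copy with stripped, underscore-separated, more readable column names."""
    cleaned = frame.copy()
    # Remove surrounding whitespace and replace hyphens with underscores
    cleaned.columns = cleaned.columns.str.strip().str.replace("-", "_", regex=False)
    # Give the two obscure names the labels used in the rest of the article
    return cleaned.rename(columns={"fnlwgt": "final_weight", "hours_per_week": "hrs_per_week"})

cleaned = clean_columns(raw)
print(list(cleaned.columns))
```

Returning a copy rather than renaming in place keeps the raw frame available if a later step still needs the original names.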

Some of the variables have binary or discrete values, so we can encode them or convert them from strings to categories. Since “income” is our target variable, we want it to be numeric for ease of calculation. I’m going to create a new variable derived from “income”:

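```python
# Inspect the raw labels (note the leading space in each value)
df['income'].unique()
# Encode the target: 1 if the person earns more than 50K, otherwise 0
df['income_encoded'] = [1 if value == ' >50K' else 0 for value in df['income'].values]
df['income_encoded'].unique()
# Let's check some descriptive statistics
df.describe()
```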

Observations from the above statistics:

1. The mean and median age are similar, which suggests a roughly symmetric, possibly normal, distribution; we will check this later using visualizations.
2. The capital gain and capital loss variables look suspect: every observation greater than 0 falls in the 4th quartile, so at least 75% of the values are 0.
3. In the “hrs_per_week” column, the min is 1 and the max is 99, neither of which is common in real life. We will have to investigate this later.
4. Only about a quarter of the people in the dataset earn more than 50K a year.
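
These observations can be checked directly instead of being read off the `describe()` table. The sketch below runs on a small made-up frame (the values are illustrative and are not taken from the dataset), and `summarize` is a hypothetical helper name:

```python
import pandas as pd

# Toy data for illustration only; these values are NOT from the real dataset
toy = pd.DataFrame({
    "age": [25, 38, 41, 52, 30, 47, 36, 60],
    "capital_gain": [0, 0, 0, 0, 0, 0, 0, 5000],
    "hrs_per_week": [40, 40, 1, 99, 45, 40, 38, 50],
    "income_encoded": [0, 0, 0, 1, 0, 1, 0, 0],
})

def summarize(frame):
    """Collect the quantities behind the four observations in one dictionary."""
    return {
        # A small gap between mean and median hints at a symmetric distribution
        "age_mean_minus_median": frame["age"].mean() - frame["age"].median(),
        # If the 75th percentile is 0, all nonzero values sit in the top quartile
        "capital_gain_q3": frame["capital_gain"].quantile(0.75),
        "share_zero_capital_gain": (frame["capital_gain"] == 0).mean(),
        # Extreme working hours worth investigating
        "hrs_range": (int(frame["hrs_per_week"].min()), int(frame["hrs_per_week"].max())),
        # The mean of a 0/1 column is the share of ones
        "share_over_50k": frame["income_encoded"].mean(),
    }

summary = summarize(toy)
for name, value in summary.items():
    print(name, value)
```

On the real data, the same call would be `summarize(df)`, assuming the column names match the cleaned index shown earlier.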

## Income Classification

Let’s see how each occupation compares in terms of the number of people earning over 50K. We’ll look at the total number of workers in each occupation and the number of them earning over 50K:

```python
df[df['income'] == ' >50K']['occupation'].value_counts().head(3)
pd.crosstab(df["occupation"], df['income']).plot(kind='barh', stacked=True, figsize=(20, 10))
```

## Observations:

1. The 3 largest occupations by total number of people are professional specialty, craft repair, and executive/managerial.
2. The top 3 occupations by number of people earning more than 50K are, in order, executive/managerial, professional specialty, and then sales and craft repair, which are separated by a close margin.
3. Executive/managerial has the highest percentage of people earning more than 50K: 48%.
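
The percentage in the last observation is a per-occupation share, which a stacked count plot does not show directly. One way to compute it is `pd.crosstab` with `normalize='index'`, sketched here on made-up rows; the occupation labels and counts are illustrative only, and `share_over_50k` is a hypothetical helper name:

```python
import pandas as pd

# Toy rows for illustration only; labels and counts are NOT the real dataset
toy = pd.DataFrame({
    "occupation": [" Exec-managerial", " Exec-managerial",
                   " Sales", " Sales", " Sales", " Sales"],
    "income": [" >50K", " <=50K", " >50K", " <=50K", " <=50K", " <=50K"],
})

def share_over_50k(frame):
    """Return the fraction of people earning over 50K in each occupation, highest first."""
    # normalize="index" turns each row of counts into fractions that sum to 1
    shares = pd.crosstab(frame["occupation"], frame["income"], normalize="index")
    return shares[" >50K"].sort_values(ascending=False)

ranking = share_over_50k(toy)
print(ranking)
```

Ranking by share rather than by raw count avoids favoring occupations that are simply large.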