# Income Classification with Python

In this article, I will introduce you to a data science project on income classification with the Python programming language. The objective of this task is to classify whether a person earns more than 50K per year.

## Data Science Project on Income Classification with Python

The dataset I'm going to use here is publicly available, so we don't have to collect or scrape the data; we just load it into memory. Let's start the income classification task by importing the necessary Python libraries and the dataset:

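```
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
# Assumption: the original loading step is not shown; "adult.csv" is a
# placeholder name for a local copy of the dataset.
df = pd.read_csv("adult.csv")
# Check every column for missing values
df.isnull().any()
```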
```
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
income            False
dtype: bool
```

One column has a rather obscure name: fnlwgt. On closer inspection, it stands for "final weight", which represents the total number of people matching that particular row of information. Another thing to note is that each name has a space in front of it, which we need to remove. After cleaning and renaming, the columns are:

```
Index(['age', 'workclass', 'final_weight', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hrs_per_week', 'native_country',
       'income'],
      dtype='object')
```
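
The code that produces the cleaned index above is not shown, so here is a minimal sketch of how such a step could look. It assumes the raw column names carry a leading space and that the target names are the ones in the printed index; the helper `clean_columns` and the tiny `raw` frame are hypothetical, for illustration only.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) whose column names carry a
# leading space, mimicking the raw file described above.
raw = pd.DataFrame({
    " age": [39, 50],
    " fnlwgt": [77516, 83311],
    " education-num": [13, 13],
    " hours-per-week": [40, 13],
    " income": [" <=50K", " >50K"],
})


def clean_columns(frame):
    """Return a copy with stripped, underscore-style column names."""
    cleaned = frame.copy()
    cleaned.columns = (
        cleaned.columns.str.strip()          # drop surrounding whitespace
        .str.replace("-", "_", regex=False)  # education-num -> education_num
    )
    # Explicit renames for the two names that change beyond punctuation.
    return cleaned.rename(
        columns={"fnlwgt": "final_weight", "hours_per_week": "hrs_per_week"}
    )


cleaned = clean_columns(raw)
print(list(cleaned.columns))
```

Only the column labels are touched here; the string values (such as `' >50K'`) keep their leading space, which matches how the later code compares against them.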

Some of the variables have binary or discrete values, so we can encode them or convert them from string to category. Since "income" is our target variable, we want it to be numeric for ease of calculation. I'm going to create a new variable derived from "income":

```
df['income'].unique()
df['income_encoded'] = [1 if value == ' >50K' else 0 for value in df['income'].values]
df['income_encoded'].unique()
# Let's check some descriptive statistics
df.describe()
```

Observations from the above statistics:

1. The mean and median age are similar, so I suspect the distribution is close to normal; we will check this later using visualizations.
2. The capital gain and capital loss variables look suspicious: all observations greater than 0 fall in the 4th quartile.
3. In the "hrs_per_week" column, the minimum is 1 and the maximum is 99, which is uncommon in real life. We will have to investigate this later.
4. Only about a quarter of the population earns more than 50K a year.
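
The observations above can be checked directly rather than read off the `describe()` table. The following is a rough sketch on a small made-up sample (the `sample` frame and its values are hypothetical); on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature sample standing in for the real DataFrame.
sample = pd.DataFrame({
    "age": [25, 38, 41, 52, 29, 46, 33, 60],
    "capital_gain": [0, 0, 0, 0, 0, 0, 0, 15024],
    "income": [" <=50K", " <=50K", " >50K", " <=50K",
               " <=50K", " >50K", " <=50K", " <=50K"],
})
sample["income_encoded"] = (sample["income"] == " >50K").astype(int)

# Observation 1: compare mean and median age; a large gap would hint at skew.
age_mean = sample["age"].mean()
age_median = sample["age"].median()

# Observation 2: if the 75th percentile is 0, every positive value
# necessarily sits in the top quartile.
gain_q75 = sample["capital_gain"].quantile(0.75)
share_positive_gain = (sample["capital_gain"] > 0).mean()

# Observation 4: the mean of a 0/1 column is the share of rows equal to 1.
share_high_income = sample["income_encoded"].mean()

print(age_mean, age_median, gain_q75, share_positive_gain, share_high_income)
```

Note that a similar mean and median only suggests symmetry; it does not by itself establish that a distribution is normal, which is why a histogram is still worth plotting.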

## Income Classification

Let's see how each occupation plays out by comparing the number of people earning over 50K. We'll look at the total number of workers in each occupation and the number of people earning over 50K in each:

```
df[df['income'] == ' >50K']['occupation'].value_counts().head(3)
pd.crosstab(df["occupation"], df['income']).plot(kind='barh', stacked=True, figsize=(20, 10))
```

## Observations:

1. The top 3 occupations by total number of people are Prof-specialty, Craft-repair, and Exec-managerial.
2. The top 3 occupations by number of people earning more than 50K (in order) are Exec-managerial, Prof-specialty, and then Sales and Craft-repair (with a close margin).
3. Exec-managerial has the highest percentage of people earning more than 50K: 48%.
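
A percentage like the one in the last observation is not produced by the count-based code above. One way to obtain it, sketched here on a small hypothetical sample, is to normalize the crosstab by row so that each occupation's counts become shares. The `sample` frame and its labels (including the leading spaces) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical miniature sample; on the real data, use df instead.
sample = pd.DataFrame({
    "occupation": [" Exec-managerial", " Exec-managerial", " Sales",
                   " Sales", " Sales", " Craft-repair"],
    "income": [" >50K", " <=50K", " >50K", " <=50K", " <=50K", " <=50K"],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the ' >50K' column holds the share of high earners per occupation.
shares = pd.crosstab(sample["occupation"], sample["income"], normalize="index")
high_income_share = shares[" >50K"].sort_values(ascending=False)

print(high_income_share)
```

Sorting the resulting column ranks occupations by share rather than by raw count, which is what separates the second observation from the third.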