Data Cleaning using Python

Data cleaning, also known as data cleansing or scrubbing, is a crucial step in data preparation and analysis. It refers to the systematic and meticulous identification, correction, and removal of errors, inconsistencies, and inaccuracies from data. The main objective of data cleaning is to ensure that the data is accurate, reliable, and ready for analysis. If you want to learn how to clean data using Python, this article is for you. In this article, I’ll take you through a guide to Data Cleaning using Python.

What is Data Cleaning?

Data cleaning is the systematic and meticulous identification, correction, and removal of errors, inconsistencies, and inaccuracies from data, and it is a crucial step in preparing data for analysis.

In real-world scenarios, data is gathered from various sources, and it is common for datasets to contain errors and inconsistencies due to factors such as human errors, data entry mistakes, technical glitches, or missing values. These inaccuracies can significantly impact the results and conclusions drawn from the analysis. 

Data cleaning involves a series of steps to detect and handle these issues. It may include removing duplicate records, filling in missing values, correcting data entry errors, resolving inconsistencies, and validating data against predefined rules or constraints. So by performing data cleaning, we can ensure the reliability and quality of the data used for analysis, leading to more accurate and meaningful insights.
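The first of those steps, detecting the issues, is simple in pandas. Here is a quick sketch using a small made-up DataFrame (not the dataset used below) to count missing values and duplicate rows:

```python
import pandas as pd

# A tiny made-up DataFrame with a missing value and a duplicated row
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob"],
    "age": [34, None, None],
})

# Count missing values per column
missing_per_column = df.isnull().sum()

# Count fully duplicated rows (pandas treats equal NaNs as matching here)
duplicate_rows = df.duplicated().sum()

print(missing_per_column)
print(duplicate_rows)
```

Running these two checks before cleaning tells you which of the steps above your dataset actually needs.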

Data Cleaning using Python

The process you follow for data cleaning depends on the type of data you have and the kinds of errors it contains. Still, I’ll try to cover the most useful steps and operations you need to clean your data as a Data Science professional.

First, we need to have a dataset for this task. You can download the dataset from here. Let’s have a look at the dataset:

import pandas as pd
data = pd.read_csv("sample_data.csv")
print(data.head())
    ID               Name   Age Gender             Phone  \
0  177       James Rhodes  52.0      M      504.554.3776   
1  984  Kimberly Crawford  56.0    NaN               NaN   
2  722       Karen Walker  59.0      M               NaN   
3  870       Sandra Smith   NaN      M  479.487.9989x242   
4  664       Alison Walsh  77.0      M  001-879-250-3309   

                     Email                                            Address  \
0   fostergina@example.com  3892 Kathleen Square Apt. 817\nLake Jill, AZ 1...   
1     krussell@example.com     51987 Bush Ranch Suite 663\nScottton, PA 03611   
2  kiarapeters@example.net      29197 Austin Flat\nEast Robertmouth, NE 42823   
3     rodney99@example.net           185 Le Villages\nPort Johnside, WY 42113   
4     denglish@example.com  295 Matthew Inlet Suite 676\nJohnsonland, NC 4...   

  Credit_Score Last_Purchase_Date  
0      unknown         2021-09-19  
1          700         2023-02-25  
2          870         2021-09-17  
3      unknown         2021-11-24  
4      unknown         2022-05-08  

Now let’s have a look at the column insights:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  15 non-null     int64  
 1   Name                15 non-null     object 
 2   Age                 12 non-null     float64
 3   Gender              13 non-null     object 
 4   Phone               13 non-null     object 
 5   Email               15 non-null     object 
 6   Address             14 non-null     object 
 7   Credit_Score        15 non-null     object 
 8   Last_Purchase_Date  14 non-null     object 
dtypes: float64(1), int64(1), object(7)
memory usage: 1.2+ KB

Let’s get started with data cleaning using Python. The Age column contains 12 non-null float values (float64) out of 15 rows, so there are three missing (NaN) values in this column. Below is how we can clean the Age column:

# Convert 'Age' to pandas' nullable integer type, which preserves NaN as <NA>
data['Age'] = data['Age'].astype('Int64')

# Fill missing ages with the rounded median of the known ages
median_age = round(data['Age'].median())
data['Age'] = data['Age'].fillna(median_age)

Here, we handle the missing age values in the ‘Age’ column by converting it to pandas’ nullable integer type (Int64), which can hold missing values. We then calculate the median of the available ages and fill the gaps with it, making sure all the ages are represented as integers. This gives us a cleaner and more complete ‘Age’ column for further analysis.
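To see how the nullable Int64 dtype behaves, here is the same pattern on a toy age series (made-up values, not the article’s dataset):

```python
import pandas as pd

# A toy age series with one missing value
ages = pd.Series([52.0, None, 78.0])

# Nullable Int64 keeps the missing entry as <NA> instead of forcing floats
ages = ages.astype("Int64")

# Fill the gap with the rounded median of the known ages
ages = ages.fillna(round(ages.median()))
print(ages.tolist())
```

The result is a whole-number age column with no missing entries, which is exactly what we want before analysis.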

Now let’s handle missing values in the Gender, Phone, and Address columns:

# Handling missing values in 'Gender', 'Phone', and 'Address'
data['Gender'] = data['Gender'].fillna('Unknown')
data['Phone'] = data['Phone'].fillna('Unknown')
data['Address'] = data['Address'].fillna('Unknown')

Here, we replaced the missing values with the string ‘Unknown’, creating a more complete and consistent dataset. These placeholder values can later be identified and handled accordingly during analysis or data processing tasks. For a categorical column such as Gender, you could instead fill missing values with the mode; but since phone numbers and addresses are unique to each person, the mode makes no sense for those columns.
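If you do prefer the mode for a categorical column like Gender, the pattern looks like this (a sketch on a toy series):

```python
import pandas as pd

# A toy gender column with one missing value
gender = pd.Series(["M", "F", "M", None])

# mode() returns a Series (there can be ties), so take the first value
most_common = gender.mode()[0]
gender = gender.fillna(most_common)
print(gender.tolist())
```

The choice between a placeholder like ‘Unknown’ and the mode depends on whether you want missingness to stay visible in later analysis.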

Now let’s see how to handle the missing value in the Last_Purchase_Date column:

# Handling missing values in 'Last_Purchase_Date'
# Parse the column as datetimes first so the filled value has the same type
data['Last_Purchase_Date'] = pd.to_datetime(data['Last_Purchase_Date'])
most_recent_date = data['Last_Purchase_Date'].max()
data['Last_Purchase_Date'] = data['Last_Purchase_Date'].fillna(most_recent_date)

Here, we calculated the most recent purchase date from the available data and filled the missing entries with it. This ensures that the Last_Purchase_Date column is complete and contains a valid date for every individual in the dataset.
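One caveat: by default, pd.to_datetime raises an exception on malformed entries. When the date strings may be messy, passing errors="coerce" turns bad values into NaT so they can be filled like any other missing value. A sketch with made-up dates:

```python
import pandas as pd

# A toy date column containing one malformed entry
dates = pd.Series(["2021-09-19", "not a date", "2023-02-25"])

# Malformed strings become NaT instead of raising an exception
parsed = pd.to_datetime(dates, errors="coerce")

# Fill NaT with the most recent valid date, as we did above
parsed = parsed.fillna(parsed.max())
print(parsed.tolist())
```

This way, both genuinely missing dates and unparseable ones end up handled by the same fill step.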

Now let’s see how to handle the Credit Score column:

# Convert 'Credit_Score' to numeric type; 'unknown' becomes NaN
data['Credit_Score'] = pd.to_numeric(data['Credit_Score'],
                                     errors='coerce')

# Calculate the mean of the available credit scores
mean_credit_score = data['Credit_Score'].mean()

# Fill missing credit scores with the mean value
data['Credit_Score'] = data['Credit_Score'].fillna(mean_credit_score)

Here, we first converted the Credit_Score column to a numeric data type, replacing any non-numeric values (such as ‘unknown’) with NaN. We then calculated the mean of the available credit scores and filled the missing entries with it, estimating the missing values from the average credit score.
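Note that the mean is sensitive to extreme values; when a numeric column has outliers, the median is often a safer fill value. A sketch of both steps on a toy score column (made-up values):

```python
import pandas as pd

# A toy credit-score column with a non-numeric placeholder
scores = pd.Series(["700", "870", "unknown", "300"])

# Non-numeric entries such as 'unknown' become NaN
scores = pd.to_numeric(scores, errors="coerce")

# The median is less sensitive to extreme scores than the mean
scores = scores.fillna(scores.median())
print(scores.tolist())
```

Whether mean or median is appropriate depends on how skewed the column is, so it’s worth plotting or describing the distribution first.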

And, in the end, here’s how we can drop duplicate rows from the data:

data.drop_duplicates(inplace=True)
print(data.head())
    ID               Name  Age   Gender             Phone  \
0  177       James Rhodes   52        M      504.554.3776   
1  984  Kimberly Crawford   56  Unknown           Unknown   
2  722       Karen Walker   59        M           Unknown   
3  870       Sandra Smith   48        M  479.487.9989x242   
4  664       Alison Walsh   77        M  001-879-250-3309   

                     Email                                            Address  \
0   fostergina@example.com  3892 Kathleen Square Apt. 817\nLake Jill, AZ 1...   
1     krussell@example.com     51987 Bush Ranch Suite 663\nScottton, PA 03611   
2  kiarapeters@example.net      29197 Austin Flat\nEast Robertmouth, NE 42823   
3     rodney99@example.net           185 Le Villages\nPort Johnside, WY 42113   
4     denglish@example.com  295 Matthew Inlet Suite 676\nJohnsonland, NC 4...   

   Credit_Score Last_Purchase_Date  
0    775.916667         2021-09-19  
1    700.000000         2023-02-25  
2    870.000000         2021-09-17  
3    775.916667         2021-11-24  
4    775.916667         2022-05-08  
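By default, drop_duplicates only removes rows that are identical across every column. If duplicates should instead be detected on a key column, such as a customer ID, you can pass subset. A sketch on a made-up frame:

```python
import pandas as pd

# A toy frame where the same ID appears twice with different phone values
df = pd.DataFrame({
    "ID": [177, 177, 984],
    "Name": ["James Rhodes", "James Rhodes", "Kimberly Crawford"],
    "Phone": ["504.554.3776", "Unknown", "Unknown"],
})

# Treat rows with the same ID as duplicates, keeping the first occurrence
deduped = df.drop_duplicates(subset=["ID"], keep="first")
print(len(deduped))
```

keep="last" (or keep=False, which drops all copies) gives you control over which record survives.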

So this is how you can perform data cleaning on your datasets using Python.

Summary

So, this is how you can perform data cleaning and prepare your datasets for analysis using Python. Data cleaning involves a series of steps to detect and handle problems in the data: removing duplicate records, filling in missing values, correcting data entry errors, resolving inconsistencies, and validating data against predefined rules or constraints. I hope you liked this article on Data Cleaning using Python. Feel free to ask valuable questions in the comments section below.

Aman Kharwal