Data cleaning, also known as data cleansing or scrubbing, is a crucial step in data preparation and analysis: the systematic identification, correction, and removal of errors, inconsistencies, and inaccuracies from data. Its main objective is to ensure that the data is accurate, reliable, and ready for analysis. If you want to learn how to clean data using Python, this article is for you. In this article, I’ll take you through a practical guide to Data Cleaning using Python.
What is Data Cleaning?
Data Cleaning is a crucial step in data preparation and analysis. It refers to the systematic and meticulous identification, correction, and removal of errors, inconsistencies, and inaccuracies from data.
In real-world scenarios, data is gathered from various sources, and it is common for datasets to contain errors and inconsistencies due to factors such as human errors, data entry mistakes, technical glitches, or missing values. These inaccuracies can significantly impact the results and conclusions drawn from the analysis.
Data cleaning involves a series of steps to detect and handle these issues. It may include removing duplicate records, filling in missing values, correcting data entry errors, resolving inconsistencies, and validating data against predefined rules or constraints. So by performing data cleaning, we can ensure the reliability and quality of the data used for analysis, leading to more accurate and meaningful insights.
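For instance, validating data against predefined rules can be as simple as flagging values that fall outside an expected range. Here is a minimal sketch (the column name and bounds are hypothetical, just to illustrate the idea):

import pandas as pd

df = pd.DataFrame({"Age": [25, -3, 52, 140]})

# Flag rows that violate a simple rule: ages must lie between 0 and 120
invalid = df[~df["Age"].between(0, 120)]
print(invalid)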
Data Cleaning using Python
The process you follow for data cleaning depends on the type of data you have and the kinds of errors it contains. Still, I’ll try to cover the most useful steps and operations you need to clean your data as a Data Science professional.
First, we need to have a dataset for this task. You can download the dataset from here. Let’s have a look at the dataset:
import pandas as pd

data = pd.read_csv("sample_data.csv")
print(data.head())
    ID               Name   Age Gender             Phone  \
0  177       James Rhodes  52.0      M      504.554.3776
1  984  Kimberly Crawford  56.0    NaN               NaN
2  722       Karen Walker  59.0      M               NaN
3  870       Sandra Smith   NaN      M  479.487.9989x242
4  664       Alison Walsh  77.0      M  001-879-250-3309

                     Email                                            Address  \
0   fostergina@example.com  3892 Kathleen Square Apt. 817\nLake Jill, AZ 1...
1     krussell@example.com     51987 Bush Ranch Suite 663\nScottton, PA 03611
2  kiarapeters@example.net      29197 Austin Flat\nEast Robertmouth, NE 42823
3     rodney99@example.net           185 Le Villages\nPort Johnside, WY 42113
4     denglish@example.com  295 Matthew Inlet Suite 676\nJohnsonland, NC 4...

  Credit_Score Last_Purchase_Date
0      unknown         2021-09-19
1          700         2023-02-25
2          870         2021-09-17
3      unknown         2021-11-24
4      unknown         2022-05-08
Now let’s have a look at the column insights:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  15 non-null     int64
 1   Name                15 non-null     object
 2   Age                 12 non-null     float64
 3   Gender              13 non-null     object
 4   Phone               13 non-null     object
 5   Email               15 non-null     object
 6   Address             14 non-null     object
 7   Credit_Score        15 non-null     object
 8   Last_Purchase_Date  14 non-null     object
dtypes: float64(1), int64(1), object(7)
memory usage: 1.2+ KB
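If you only want the missing-value counts, isna().sum() gives a compact per-column summary of the same information; a quick sketch:

# Count missing (NaN) values in each column
print(data.isna().sum())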
Let’s get started with data cleaning using Python. The Age column contains 12 non-null float values (float64). It seems there are three missing (NaN) values in this column. Below is how we can clean the age column:
# Convert 'Age' to pandas' nullable integer type so it can hold missing values
data['Age'] = data['Age'].astype('Int64')

# Fill missing ages with the rounded median value
median_age = round(data['Age'].median())
data['Age'] = data['Age'].fillna(median_age)
Here we handle the missing age values in the ‘Age’ column by converting it to a nullable integer type (‘Int64’) that can represent missing values. We then calculate the median age from the available data and fill the missing entries with it, making sure all ages are represented as integers. This way, we have a cleaner and more complete ‘Age’ column for further analysis.
Now let’s handle missing values in the Gender, Phone, and Address columns:
# Handle missing values in 'Gender', 'Phone', and 'Address'
data['Gender'] = data['Gender'].fillna('Unknown')
data['Phone'] = data['Phone'].fillna('Unknown')
data['Address'] = data['Address'].fillna('Unknown')
Here, we replaced the missing values with the string ‘Unknown’, creating a more complete and consistent dataset. These placeholder values can later be identified and handled during analysis or data processing. You can also replace missing values in categorical columns, such as Gender, with the mode (the most frequent value). But since phone numbers and addresses are unique to each person, imputing them with the mode makes no sense here.
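If you do prefer mode imputation for a categorical column like Gender, a minimal sketch (as an alternative to the ‘Unknown’ placeholder used above) looks like this:

# Alternative: fill missing genders with the most frequent value (the mode)
mode_gender = data['Gender'].mode()[0]
data['Gender'] = data['Gender'].fillna(mode_gender)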
Now let’s see how to handle the missing value in the Last_Purchase_Date column:
# Handle the missing value in 'Last_Purchase_Date'
data['Last_Purchase_Date'] = pd.to_datetime(data['Last_Purchase_Date'])
most_recent_date = data['Last_Purchase_Date'].max()
data['Last_Purchase_Date'] = data['Last_Purchase_Date'].fillna(most_recent_date)
Here we converted the column to a proper datetime type, calculated the most recent date from the available data, and filled in the missing date with it. This ensures that the Last_Purchase_Date column is complete and contains valid date information for all individuals in the dataset.
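A small optional safeguard, not part of the original steps: record which rows were imputed before filling them, so an estimated date is never mistaken for a real purchase later. The column name Last_Purchase_Date_Imputed is hypothetical:

# Run this before the fillna step above: flag rows whose date was missing
data['Last_Purchase_Date_Imputed'] = data['Last_Purchase_Date'].isna()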
Now let’s see how to handle the Credit Score column:
# Convert 'Credit_Score' to a numeric type; non-numeric values like 'unknown' become NaN
data['Credit_Score'] = pd.to_numeric(data['Credit_Score'], errors='coerce')

# Calculate the mean of the available credit scores
mean_credit_score = data['Credit_Score'].mean()

# Fill missing credit scores with the mean value
data['Credit_Score'] = data['Credit_Score'].fillna(mean_credit_score)
Here we first converted the Credit_Score column to a numeric data type, coercing any non-numeric values (such as ‘unknown’) to NaN. We then calculated the mean of the available credit scores and filled the missing values with it, providing an estimate based on the average credit score.
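Note that the mean is sensitive to extreme values; if the credit scores contained outliers, a median-based fill would be a more robust alternative. A minimal sketch:

# Alternative: fill missing credit scores with the median instead of the mean
median_credit_score = data['Credit_Score'].median()
data['Credit_Score'] = data['Credit_Score'].fillna(median_credit_score)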
Finally, here’s how we can drop duplicate rows from the data:
data.drop_duplicates(inplace=True)
print(data.head())
    ID               Name  Age   Gender             Phone  \
0  177       James Rhodes   52        M      504.554.3776
1  984  Kimberly Crawford   56  Unknown           Unknown
2  722       Karen Walker   59        M           Unknown
3  870       Sandra Smith   48        M  479.487.9989x242
4  664       Alison Walsh   77        M  001-879-250-3309

                     Email                                            Address  \
0   fostergina@example.com  3892 Kathleen Square Apt. 817\nLake Jill, AZ 1...
1     krussell@example.com     51987 Bush Ranch Suite 663\nScottton, PA 03611
2  kiarapeters@example.net      29197 Austin Flat\nEast Robertmouth, NE 42823
3     rodney99@example.net           185 Le Villages\nPort Johnside, WY 42113
4     denglish@example.com  295 Matthew Inlet Suite 676\nJohnsonland, NC 4...

  Credit_Score Last_Purchase_Date
0   775.916667         2021-09-19
1   700.000000         2023-02-25
2   870.000000         2021-09-17
3   775.916667         2021-11-24
4   775.916667         2022-05-08
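By default, drop_duplicates() only removes rows that are identical across all columns. If two records should count as duplicates whenever they share a key field, you can deduplicate on a subset of columns instead; here is a sketch using Email as the (assumed) key:

# Count fully identical rows
print(data.duplicated().sum())

# Treat rows sharing the same Email as duplicates, keeping the first occurrence
data = data.drop_duplicates(subset=['Email'], keep='first')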
As you can see, the missing values are now filled in and every column holds consistent values, so the dataset is clean and ready for analysis.
Summary
So, this is how you can perform data cleaning and prepare your datasets for analysis using Python. Data cleaning is a series of steps to detect and handle errors and inconsistencies: removing duplicate records, filling in missing values, correcting data entry mistakes, and validating data against predefined rules or constraints. I hope you liked this article on Data Cleaning using Python. Feel free to ask valuable questions in the comments section below.