When analyzing and modelling data, a significant amount of time is spent preparing the data: loading, cleansing, transforming, and reorganizing. These tasks are often reported to take 80% or more of an analyst’s time. Sometimes the way data is stored in files or databases is simply not in the right format for a particular task. In this article, I will take you through the techniques of data preparation and data cleaning with Python.
Fortunately, pandas, along with the built-in features of the Python language, provides you with a high-level, flexible, and fast set of tools to manipulate data into the right form. So we only need pandas and a few functions from NumPy for data cleaning with Python.
Data Cleaning: Handling Missing Data
Missing data typically occurs in many data analysis applications. One of the goals of pandas is to make working with missing data as easy as possible. For example, all descriptive statistics on pandas objects exclude missing data by default.
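For example, here is a minimal sketch (with made-up values) showing that summary statistics skip NaN by default:

import pandas as pd
import numpy as np

# mean() and sum() skip the NaN entry by default
s = pd.Series([1.0, np.nan, 3.0])
s.mean()  # 2.0
s.sum()   # 4.0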
The way missing data is represented in pandas objects is somewhat imperfect, but it works well for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing data. Now let’s see how we can handle missing data with Python:
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
When cleaning data for analysis, it is often important to analyze the missing data itself to identify data collection issues or potential biases in the data caused by missing data.
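One quick way to do that, sketched here with the string_data Series from above, is to count missing values with isnull and sum:

# isnull() gives a Boolean mask; summing it counts the missing entries
string_data.isnull().sum()  # 1
# the same pattern on a DataFrame counts missing values per column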
Filtering Out Missing Data
There are several ways to filter out missing data. While you always have the option to do it by hand using pandas.isnull and Boolean indexing (an equivalent sketch follows the output below), dropna can be very useful. On a Series, it returns the Series with only the non-null data and index values:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
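The equivalent Boolean-indexing version, as a short sketch:

# notnull() is the inverse of isnull(); indexing with it keeps only the non-null values
data[data.notnull()]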
With a pandas DataFrame, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs (a sketch of the how='all' option follows the output below). By default, dropna drops any row containing a missing value:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
print(cleaned)
     0    1    2
0  1.0  6.5  3.0
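As mentioned above, dropna can also drop only the rows (or columns) that consist entirely of NAs; here is a small sketch on the same DataFrame:

# how='all' drops only rows in which every value is NA (row 2 here)
data.dropna(how='all')
# axis=1 applies the same logic to columns instead of rows
data.dropna(axis=1, how='all')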
A related way of filtering out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing at least a certain number of non-missing observations. You can indicate this with the thresh argument, as sketched after the code below:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df
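Continuing from the df just created (a short sketch; the NA pattern set above, not the random values, decides which rows survive):

df.dropna()          # drops rows 0-3, each of which has an NA in column 1
df.dropna(thresh=2)  # keeps every row here, since each row has at least 2 non-NA values
df.dropna(thresh=3)  # drops rows 0-3: they have only 2 non-NA values each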

Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “gaps” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value:
df.fillna(0)

By calling fillna with a dict, you can use a different fill value for each column:
df.fillna({1: 0.5, 2: 0})

Data Cleaning: Data Transformation
So far we have focused on handling missing data. Another important class of operations involves filtering, cleaning, and other transformations.
Removing Duplicates
Duplicate rows can be found in a DataFrame for a number of reasons. Here is an example:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4
The duplicated() method in DataFrame returns a Boolean series indicating whether each row is a duplicate or not:
data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
Likewise, drop_duplicates returns a DataFrame where the duplicated array is False:
data.drop_duplicates()
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
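Both methods consider all columns by default; if you only want to de-duplicate on a subset of them, you can pass the column names (a small sketch, using only the 'k1' column):

# rows count as duplicates whenever their 'k1' value has been seen before
data.drop_duplicates(['k1'])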
Data Cleaning: Transforming Data Using a Function or Mapping
For many datasets, you might want to perform a transformation based on the values of an array, series, or column in a DataFrame. Consider the following hypothetical data collected on different types of meat:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0
Suppose you want to add a column indicating the type of animal that each food came from. Let’s write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
The map method on a Series accepts a function or a dict-like object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. So, we need to convert each value to lowercase using the str.lower Series method:
lowercased = data['food'].str.lower()
lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal)
data
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon
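Since map also accepts a function, the lowercasing and the lookup can be combined into a single step (a compact sketch, equivalent to the two lines above):

# lowercase each food name, then look it up in the meat_to_animal dict
data['food'].map(lambda x: meat_to_animal[x.lower()])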
Using the map method is a convenient way to perform element-wise transformations and other data cleaning operations in Python. I hope you liked this article on Data Cleaning with Python. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.