Data Cleaning with Python

When analyzing and modelling data, a significant amount of time is spent preparing the data: loading, cleansing, transforming, and reorganizing. These tasks are often reported to take 80% or more of an analyst’s time. Sometimes the way data is stored in files or databases is not in the right format for a particular task. In this article, I will take you through the techniques of data preparation data cleaning with Python.

Fortunately, pandas, along with the built-in features of the Python language, provide you with a high-level, flexible and fast set of tools to let you manipulate data in the right form. So we only need pandas and some functions of Numpy for data cleaning with Python.

Also, Read – Summarize Text with Machine Learning.

Data Cleaning: Handling Missing Data

Missing data typically occurs in many data analysis applications. One of the goals of pandas is to make working with missing data as easy as possible. For example, all descriptive statistics on pandas objects exclude missing data by default.

The way missing data is represented in pandas objects is somewhat flawed, but it is functional for many users. For numeric data, pandas use the floating point value NaN (Not a Number) to represent missing data. Now let’s see how we can handle missing data with python:

import pandas as pd import numpy as np string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado']) string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool

When cleaning data for analysis, it is often important to analyze the missing data itself to identify data collection issues or potential biases in the data caused by missing data.

Filtering Out Missing Data

There are several ways to filter out missing data. While you still have the option to do this by hand using pandas.isnull and Boolean indexing, dropna can be useful. On a series, it returns the series with only non-zero data and index values:

from numpy import nan as NA data = pd.Series([1, NA, 3.5, NA, 7]) data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64

With Pandas DataFrame, things are a bit more complex. You can delete rows or columns that are all NA values or only those containing NAs.dropna by default removes any row containing a missing value:

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() data
        0	1	2
0	1.0	6.5	3.0
1	1.0	NaN	NaN
2	NaN	NaN	NaN
3	NaN	6.5	3.0
print(cleaned)
    0    1    2
0  1.0  6.5  3.0

A related way of filtering DataFrame rows tends to be for time series data. Suppose you want to keep only the rows that contain a certain number of observations. You can indicate this with the thresh argument:

df = pd.DataFrame(np.random.randn(7, 3)) df.iloc[:4, 1] = NA df
data cleansing

Filling In Missing Data

Rather than filtering out the missing data (and possibly discarding other data with it), there are a few ways you might want to fill in the “gaps”. In most cases, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values ​​with this value:

df.fillna(0)
image for post

By calling fillna with a dict, you can use a different fill value for each column:

df.fillna({1: 0.5, 2: 0})
data cleaning image for post

Data Cleaning: Data Transformation

So far we have been interested in data reorganization. Another important class of operations is filtering, cleansing, and other transformations.

Removing Duplicates

Duplicate rows can be found in a DataFrame for a number of reasons. Here is an example:

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 4]}) data
        k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4
6	two	4

The duplicated() method in DataFrame returns a Boolean series indicating whether each row is a duplicate or not:

data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Likewise, drop_duplicates returns a DataFrame where the duplicated array is False:

data.drop_duplicates()
        k1	k2
0	one	1
1	two	1
2	one	2
3	two	3
4	one	3
5	two	4

Data Cleaning: Transforming Data

For many datasets, you might want to perform a transformation based on the values ​​of an array, series, or column in a DataFrame. Consider the following hypothetical data collected on different types of meat:

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'], 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]}) data
           food	      ounces
0	bacon	        4.0
1	pulled pork	3.0
2	bacon	       12.0
3	Pastrami	6.0
4	corned beef	7.5
5	Bacon	        8.0
6	pastrami	3.0
7	honey ham	5.0
8	nova lox	6.0

Suppose you want to add a column indicating the type of animal each food comes from. Let’s write a map of each distinct type of meat-based on the type of animal:

meat_to_animal = { 'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon' }

The map method on a series accepts a function or an object of type dict containing a mapping, but here we have a little problem in that some meats are uppercase and some are not. So, we need to convert each value to lowercase using the str.lower series method:

lowercased = data['food'].str.lower() lowercased
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object
data['animal'] = lowercased.map(meat_to_animal) data
image for post 4 data cleaning

Also, Read – Scraping YouTube with Python.

Using the map function is a convenient way to perform basic transformations and other data cleaning operations using Python. I hope you liked this article on Data Cleaning using Python. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.

Follow Us:

Leave a Reply