Detect and Remove Outliers using Python

Outliers are data points that deviate significantly from the rest of the data. These data points lie far away from the majority of the data points and can have a substantial impact on statistical analysis and modelling. If you want to learn how to detect and remove outliers from your data, this article is for you. In this article, I’ll take you through how to detect and remove outliers using Python.

What are Outliers?

An outlier is an observation that lies far away from the majority of the data points. Because many statistics and models are sensitive to extreme values, even a handful of outliers can have a substantial impact on statistical analysis and modelling. Outliers can be caused by various factors such as measurement errors, data entry mistakes, or rare occurrences in the underlying phenomenon being studied. Identifying and handling outliers is crucial for maintaining the integrity and accuracy of data analysis.

Let’s have a look at an example of outliers:

[Image: Detect and Remove Outliers]

In the image, the regular data points are drawn with standard markers, while the outliers are marked with red ‘x’ markers. The marked points sit well away from the bulk of the data distribution, which is what makes them outliers.

Outliers can arise due to various reasons, including data entry mistakes, as well as genuine deviations in the data. Let’s explore both scenarios with examples:

  1. A data entry error: Consider a dataset of student exam results in a particular subject. Most student scores range from 60 to 90, but due to a data entry error, a student’s score is recorded as 200. This value is much higher than all other scores and is likely an outlier. Such an outlier could have a significant impact on any analysis or model built using this data.
  2. A genuine deviation: In a dataset of annual employee earnings for a tech company, most salaries fall within a fairly narrow range. However, the CEO’s income is exceptionally high compared to the others. It is a genuine deviation since it reflects the real income gap for a high-level executive. In this case, the outlier represents valuable information rather than an error.
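The effect of the data entry error in the first example can be sketched with a few made-up numbers. The scores below are hypothetical and chosen only to match the 60 to 90 range described above; the point is to compare how the mean and the median react to the single bad value.

```python
import numpy as np

# Hypothetical exam scores between 60 and 90 (illustrative values only)
scores = np.array([62, 68, 71, 75, 78, 80, 83, 85, 88, 90])

# The same scores with one mistyped entry of 200 appended
scores_with_error = np.append(scores, 200)

# The mean is pulled upward by the single bad value
mean_clean = scores.mean()
mean_dirty = scores_with_error.mean()

# The median barely moves, which is why it is called a robust statistic
median_clean = np.median(scores)
median_dirty = np.median(scores_with_error)

print(f"mean:   {mean_clean:.1f} -> {mean_dirty:.1f}")
print(f"median: {median_clean:.1f} -> {median_dirty:.1f}")
```

The mean shifts by more than ten points while the median shifts by one, which is the kind of impact on an analysis that the example describes.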

The approach to dealing with outliers depends on the context and the reason for their occurrence. In some cases, it’s appropriate to remove outliers or apply data cleaning techniques. 

However, if the outliers are genuine and represent important information, it is essential to retain them in the dataset and consider robust methods that are not strongly influenced by the outliers. Thoughtful treatment of outliers ensures that we avoid bias in our analysis while capturing valuable insights from the data.

How to Detect and Remove Outliers?

Detecting outliers involves identifying data points that appear to be unusually distant from the bulk of the data distribution. There are several statistical methods to detect outliers, including:

  1. Z-Score Method: This method measures how many standard deviations a data point is away from the mean. Data points whose absolute Z-score exceeds a predefined threshold (often 2 or 3) are considered outliers.
  2. IQR (Interquartile Range) Method: The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
  3. Visualization: Plotting the data on a box plot or scatter plot can help visually identify outliers that fall far outside the main cluster of data points.
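As a rough illustration of the first two methods, the sketch below flags outliers in a small hypothetical series without removing anything. The values, the threshold of 3, and the variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample: ten ordinary values and one suspicious entry
values = pd.Series([62, 68, 71, 75, 78, 80, 83, 85, 88, 90, 200])

# Z-score method: distance from the mean in units of standard deviation
# (ddof=0 uses the population standard deviation)
z_scores = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z_scores.abs() > 3]

# IQR method: fences at 1.5 * IQR below Q1 and above Q3
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("Flagged by Z-score:", z_outliers.tolist())
print("Flagged by IQR:", iqr_outliers.tolist())
```

Both methods flag the same value here, but they will not always agree, because the mean and standard deviation used by the Z-score are themselves affected by the outliers.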

Once you have detected outliers, you can:

  1. exclude the identified outliers from the dataset;
  2. apply data transformations, such as logarithmic or square root transformations, to mitigate the effect of outliers;
  3. replace outliers with more reasonable values through imputation techniques like using the median or mean of the non-outlying data points.
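The second option, transforming the data, can be sketched as follows. The income figures are hypothetical, and the example assumes all values are non-negative, since a logarithm is not defined for negative numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (for example, incomes) with one extreme entry
incomes = pd.Series([32_000, 41_000, 45_000, 52_000, 58_000, 61_000, 2_500_000])

# log1p computes log(1 + x): it is defined at zero and compresses large values
log_incomes = np.log1p(incomes)

# How far the largest value sits from the median, before and after the transform
ratio_raw = incomes.max() / incomes.median()
ratio_log = log_incomes.max() / log_incomes.median()

print(f"max / median on the raw scale: {ratio_raw:.1f}")
print(f"max / median on the log scale: {ratio_log:.2f}")
```

The extreme value is still present after the transformation, but it no longer dominates the scale, so it has less influence on methods that rely on distances or averages.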

Detect and Remove Outliers using Python

I hope you have understood what outliers are and the techniques you can use to detect and remove them. Now let’s see how to detect and remove outliers using Python.

Let’s start by creating a sample dataset that contains a few outliers:

import numpy as np
import pandas as pd

# Create a sample dataset with outliers
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})

# Add some outliers to the dataset
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250
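Before removing anything, it can help to confirm that the injected values stand out. The sketch below rebuilds the same sample dataset so that it runs on its own and prints a few summary statistics; which rows of the summary to show is a choice made for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset from above so this snippet is self-contained
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

# Summary statistics: compare the maximum with the median and the 75th percentile
summary = data.describe()
print(summary.loc[['mean', '50%', '75%', 'max']])
```

A maximum that sits far above the 75th percentile is a quick hint that a column contains extreme values worth investigating.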

Here’s how to remove outliers with the Z-score method. The function below drops every row in which any column has an absolute Z-score at or above the threshold:

from scipy import stats

# Function to detect and remove outliers using Z-score method
def remove_outliers_zscore(data, threshold=3):
    z_scores = np.abs(stats.zscore(data))
    filtered_data = data[(z_scores < threshold).all(axis=1)]
    return filtered_data

# Detect and remove outliers
filtered_data = remove_outliers_zscore(data)

You can also use the IQR method. The function below drops every row in which any column falls outside the IQR bounds:

# Function to detect and remove outliers using IQR method
def remove_outliers_iqr(data, threshold=1.5):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    filtered_data = data[~((data < (Q1 - threshold * IQR)) | (data > (Q3 + threshold * IQR))).any(axis=1)]
    return filtered_data
  
# Detect and remove outliers
filtered_data = remove_outliers_iqr(data)
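It can be useful to see which rows each method flags before dropping them. The following sketch is an illustration rather than part of the functions above: it rebuilds the sample dataset, computes Z-scores with pandas using the population standard deviation (ddof=0, which is also the default of scipy.stats.zscore), and compares the flagged row labels. Names such as z_mask and iqr_mask are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset so this snippet is self-contained
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

# Z-score mask: True for rows where any column has |z| >= 3
z_scores = (data - data.mean()) / data.std(ddof=0)
z_mask = (z_scores.abs() >= 3).any(axis=1)

# IQR mask: True for rows where any column falls outside the 1.5 * IQR fences
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
iqr_mask = ((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)

# Row labels flagged by each method
z_rows = set(data.index[z_mask])
iqr_rows = set(data.index[iqr_mask])

print("Flagged by Z-score:", sorted(z_rows))
print("Flagged by IQR:", sorted(iqr_rows))
print("Flagged by IQR only:", sorted(iqr_rows - z_rows))
```

The two methods do not have to agree. The IQR fences can also flag ordinary values in the tails of the distribution, so it is worth reviewing the flagged rows rather than deleting them blindly.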

Now let’s say you don’t want to remove outliers. You want to replace them with the median value. Here’s how you can replace the value of outliers with the median value using Python:

# Function to detect and replace outliers with median using IQR method

def replace_outliers_with_median(data, threshold=1.5):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Calculate lower and upper bounds for outliers
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR

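    # Replace outliers with the median of their own column
    # (axis=1 aligns the per-column medians with the DataFrame's columns;
    # axis=0 would align them with the row index and produce NaN instead)
    median_value = data.median()
    data_replaced = data.where(~((data < lower_bound) | (data > upper_bound)), median_value, axis=1)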

    return data_replaced

# Detect and replace outliers  
filtered_data = replace_outliers_with_median(data)
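A quick way to check a replacement like this is to confirm that the injected cells now hold the column medians and that no missing values were introduced. The sketch below is self-contained and uses DataFrame.mask, which replaces values where the condition is True; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Rebuild the sample dataset so this snippet is self-contained
np.random.seed(42)
data = pd.DataFrame({
    'Feature_A': np.random.normal(loc=50, scale=10, size=100),
    'Feature_B': np.random.normal(loc=100, scale=20, size=100),
})
data.iloc[5, 0] = 500
data.iloc[20, 1] = 200
data.iloc[35, 1] = 250

# IQR bounds, computed per column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
is_outlier = (data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))

# Replace flagged cells with the median of their own column
# (axis=1 aligns the medians with the column labels)
medians = data.median()
replaced = data.mask(is_outlier, medians, axis=1)

# Checks: same shape, no missing values, injected cells now equal the medians
print("Shape unchanged:", replaced.shape == data.shape)
print("Missing values:", int(replaced.isna().sum().sum()))
print("Row 5, Feature_A:", replaced.iloc[5, 0], "median:", medians['Feature_A'])
```

Unlike removal, replacement keeps all 100 rows, which matters when the other columns in a flagged row still carry useful information.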

So this is how you can detect and remove outliers from your data using Python.

Detecting and removing outliers is a part of data preprocessing. You can learn to perform all the data preprocessing steps with a data preprocessing pipeline in Python here.

Summary

Outliers are data points that lie far from the bulk of the data and can distort statistical analysis and modelling. You can detect them with the Z-score method, the IQR method, or visualization, and then remove them, transform the data, or replace them with a value such as the median, depending on whether they are errors or genuine observations. I hope you liked this article on how to detect and remove outliers using Python. Feel free to ask valuable questions in the comments section below.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.

