Here's How to Fill in Missing Values in a Dataset

Data preparation is one of the most valuable skills for any data science professional. One of its harder tasks is handling missing values: deciding whether the mean, median, or mode is the right value to fill each gap. In this article, I will show how to choose between these measures and how to fill in missing values in a dataset using pandas.

Here’s How to Choose Between Mean, Median, and Mode to Fill in Missing Values

The choice between mean, median, and mode depends on the type and distribution of the data you are working with. Here are some guidelines to help you decide:

Mean: When a numerical variable is approximately normally distributed, you can use the mean to fill in the missing values.
Median: When a numerical variable is not normally distributed (for example, it is skewed or contains outliers), you can use the median, which is less affected by extreme values.
Mode: When the variable is categorical or discrete, you can use the mode (the most frequent value) to fill in the missing values.
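
To see why the median is the safer choice for skewed data, here is a small sketch. The series below is made up for illustration (it is not part of the dataset used later in this article): it has one extreme value and one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value (1000) and one missing entry
s = pd.Series([30, 32, 35, 38, 40, 1000, np.nan])

mean_value = s.mean()      # NaN is skipped by default; pulled upward by 1000
median_value = s.median()  # middle of the observed values; barely affected

print(mean_value)    # about 195.83
print(median_value)  # 36.5
```

Filling the gap with the mean would insert a value far above every typical observation, while the median stays close to the bulk of the data.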

So the first step is to check whether your data has missing values. If it does, look at each variable that has them. For a numerical variable, check its distribution: use the mean if it is approximately normal, and the median otherwise. For a categorical or discrete variable, use the mode. Because this decision is made per variable, different columns in the same dataset can end up with different measures.
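
The decision steps above can be sketched as a small helper. This is only a rough heuristic, not a standard rule. The function name choose_strategy and the skewness cutoff of 0.5 are assumptions made for this example, and the sample columns are invented. It also treats every numeric column as continuous, so a discrete numeric column would still need a manual decision.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def choose_strategy(series, skew_limit=0.5):
    """Suggest 'mean', 'median', or 'mode' for one column (heuristic only).

    skew_limit is an assumed cutoff for "roughly symmetric", not a fixed rule.
    """
    if not is_numeric_dtype(series):
        return "mode"
    # Series.skew() ignores NaN; values near 0 suggest a symmetric distribution
    if abs(series.skew()) < skew_limit:
        return "mean"
    return "median"

# Hypothetical data: one symmetric column, one skewed column, one categorical
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, np.nan, 6, 7, 8, 9],
    "skewed": [1, 1, 2, 2, 3, np.nan, 4, 5, 100],
    "color": ["red", "blue", np.nan, "green", "blue", "blue", "red", "green", "blue"],
})

# Only columns that actually contain missing values need a strategy
for name in df.columns:
    if df[name].isna().any():
        print(name, "->", choose_strategy(df[name]))
```

In practice, it is worth also looking at a histogram of each column, because a single skewness number can hide things like multiple peaks.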

Now Here’s How to Fill in Missing Values in a Dataset

Now let’s create a sample dataset with missing values so that we can fill them in using the mean, median, and mode:

import pandas as pd
import numpy as np

data = {'A': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, np.nan],
        'B': [2, 4, 6, 8, np.nan, 12, 14, 16, 18, np.nan],
        'C': ['red', 'blue', np.nan, 'green', 'green', 
              'blue', 'red', 'blue', 'green', np.nan]}
df = pd.DataFrame(data)
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  NaN   NaN  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  NaN   NaN    NaN
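
Before filling anything, it helps to confirm which columns have missing values and how many. A minimal sketch, rebuilding the same sample data so that it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, np.nan, 6, 7, 8, 9, np.nan],
    'B': [2, 4, 6, 8, np.nan, 12, 14, 16, 18, np.nan],
    'C': ['red', 'blue', np.nan, 'green', 'green',
          'blue', 'red', 'blue', 'green', np.nan],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_counts = df.isna().sum()
print(missing_counts)
```

Each of the three columns has two missing values here.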

Here’s how to fill in missing values using the mean value:

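# Mean of the observed values in A (NaN is skipped by default)
mean_A = df['A'].mean()
# Assign the result back: newer pandas versions warn about inplace=True on df['A']
df['A'] = df['A'].fillna(mean_A)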
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  5.0   NaN  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0   NaN    NaN

Here’s how to fill in missing values using the median value:

# Median of the observed values in B
median_B = df['B'].median()
# Assign the filled column back instead of using inplace=True
df['B'] = df['B'].fillna(median_B)
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0    NaN
3  4.0   8.0  green
4  5.0  10.0  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0  10.0    NaN

And now, here’s how to fill in missing values using the mode value:

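# mode() returns every most-frequent value, sorted; [0] takes the first one
mode_C = df['C'].mode()[0]
# Assign the filled column back instead of using inplace=True
df['C'] = df['C'].fillna(mode_C)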
print(df)

     A     B      C
0  1.0   2.0    red
1  2.0   4.0   blue
2  3.0   6.0   blue
3  4.0   8.0  green
4  5.0  10.0  green
5  6.0  12.0   blue
6  7.0  14.0    red
7  8.0  16.0   blue
8  9.0  18.0  green
9  5.0  10.0   blue
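
One detail worth noticing in this sample: column C has a tie. Both 'blue' and 'green' appear three times, and mode() returns all tied values in sorted order, so taking the first element picks 'blue' only because it sorts first. A small sketch that makes the tie visible, using the same values as column C:

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', np.nan, 'green', 'green',
                    'blue', 'red', 'blue', 'green', np.nan])

# Frequency of each observed value (NaN is excluded by default)
counts = colors.value_counts()
print(counts)

# mode() returns every value tied for the highest count, in sorted order
modes = colors.mode()
print(modes.tolist())  # ['blue', 'green']
```

When a tie like this matters for your analysis, choose between the tied values deliberately rather than relying on the sort order.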

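So this is how you can fill in missing values in your data: use the mean for normally distributed numerical variables, the median for other numerical variables, and the mode for categorical or discrete variables.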