When working on a data science task, missing values often get in the way of extracting correct information from a dataset. We can simply remove them, but when the dataset is small, removing them costs too much data and it is better to fill them in instead. So if you want to learn how to fill in missing values in a dataset, this article is for you. In this article, I’ll walk you through a tutorial on how to fill in missing values in a dataset using Python.
Why Do We Need to Fill Missing Values in a Dataset?
Sometimes the dataset we use to solve a problem contains a lot of missing values, which can hurt the performance of a machine learning model and lead us to wrong conclusions about the data. If we have missing values in a dataset, here are two strategies we can choose from to deal with them:
- Removing the whole row which contains missing values
- Filling the missing values according to the other known values
The first strategy is to remove every row that contains a missing value. This is not a bad idea, but it is only reasonable when the dataset is large enough that plenty of complete rows remain. If removing rows leaves too little data, the result will not be a good dataset for any data science task. This is where the second strategy comes in, which is to fill in the missing values according to the other known values. This strategy can be considered for a dataset of any size.
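To make the cost of the first strategy concrete, here is a minimal sketch assuming NumPy is installed. The array `data` and the name `complete_rows` are illustrative assumptions, not part of the original tutorial. It drops every row that contains a missing value and shows how few rows can survive in a small dataset.

```python
import numpy as np

# Illustrative 3x3 dataset; np.nan marks the missing entries.
data = np.array([[10, np.nan, 8],
                 [9, 8, np.nan],
                 [7, 10, 9]])

# Keep only the rows in which no value is missing.
complete_rows = data[~np.isnan(data).any(axis=1)]

print(complete_rows)
print(f"{len(complete_rows)} of {len(data)} rows kept")
```

With two of the three rows containing a missing value, only one row is left, which is the kind of data shortage described above.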
So this is why we need to fill the missing values in a dataset. In the section below, I will take you through a tutorial on how to fill in missing values in a dataset using Python.
Fill Missing Values in a Dataset Using Python
The scikit-learn library in Python offers the SimpleImputer class (in the sklearn.impute module), which can fill the missing values of each column based on:
- Mean of the known values
- Median of the known values
- Most frequent value among the known values
So let’s go through these methods one by one. I will first create a very simple dataset with some missing values (shown as nan):
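A minimal sketch of how such a dataset could be created, assuming NumPy is installed. The variable name `data` is an illustrative choice, and the values are taken from the printed array that follows.

```python
import numpy as np

# A small 3x3 dataset; np.nan marks the two missing entries.
data = np.array([[10, np.nan, 8],
                 [9, 8, np.nan],
                 [7, 10, 9]])

print(data)
```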
[[10. nan  8.]
 [ 9.  8. nan]
 [ 7. 10.  9.]]
Here is how you can use the mean of the other known values in the same column to fill the missing values:
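A minimal sketch of this step, assuming scikit-learn and NumPy are installed. The variable names (`data`, `mean_imputer`, `filled_mean`) are illustrative, and the small dataset is rebuilt here so the snippet runs on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same small dataset as before; np.nan marks the missing entries.
data = np.array([[10, np.nan, 8],
                 [9, 8, np.nan],
                 [7, 10, 9]])

# strategy="mean" replaces each NaN with the mean of the known
# values in the same column.
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled_mean = mean_imputer.fit_transform(data)

print(filled_mean)
```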
[[10.   9.   8. ]
 [ 9.   8.   8.5]
 [ 7.  10.   9. ]]
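Here is how you can use the median of the other known values in the same column to fill the missing values. Each incomplete column in this dataset has only two known values, so the median equals the mean and the result is the same as for the mean: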
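A minimal sketch of this step, again assuming scikit-learn and NumPy are installed. The names `data`, `median_imputer` and `filled_median` are illustrative, and the dataset is rebuilt so the snippet is self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same small dataset as before; np.nan marks the missing entries.
data = np.array([[10, np.nan, 8],
                 [9, 8, np.nan],
                 [7, 10, 9]])

# strategy="median" replaces each NaN with the median of the known
# values in the same column.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
filled_median = median_imputer.fit_transform(data)

print(filled_median)
```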
[[10.   9.   8. ]
 [ 9.   8.   8.5]
 [ 7.  10.   9. ]]
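Here is how you can use the most frequent value among the other known values in the same column to fill the missing values. In this dataset every known value in the incomplete columns appears only once, and SimpleImputer breaks such ties by choosing the smallest value, which is 8 in both columns: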
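A minimal sketch of this step, assuming scikit-learn and NumPy are installed. The names `data`, `mode_imputer` and `filled_mode` are illustrative, and the dataset is rebuilt so the snippet is self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same small dataset as before; np.nan marks the missing entries.
data = np.array([[10, np.nan, 8],
                 [9, 8, np.nan],
                 [7, 10, 9]])

# strategy="most_frequent" replaces each NaN with the most common
# known value in the same column; on a tie the smallest value is used.
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled_mode = mode_imputer.fit_transform(data)

print(filled_mode)
```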
[[10.  8.  8.]
 [ 9.  8.  8.]
 [ 7. 10.  9.]]
Summary
So this is how we can fill missing values in a dataset while working on a data science task: with the mean, the median, or the most frequent value of each column. The mean and median apply only to numeric columns, while the most frequent value can also be used for categorical data. Filling the missing values matters because they get in the way of extracting correct information from a dataset. I hope you liked this tutorial on filling the missing values in a dataset using Python. Feel free to ask your valuable questions in the comments section below.