When building a machine learning model, once we have handled null values and encoded categories as numbers, the next step is to deal with outliers and to transform the data for models that require normally distributed features. In this article, I’ll walk you through how to remove outliers in Machine Learning using Python.
What are Outliers in Machine Learning?
Outliers are anomalous observations that diverge from the rest of the data. They can distort both our perception of the data and the model we build on it. Outliers can come from data entry or other human error, damaged or poorly calibrated measuring instruments, data manipulation, dummy values inserted to test detection methods or to add noise, and finally genuine novelties in the data.
Even when you generate random numbers from a distribution, there will be rare values that lie far from the mean of all the other examples. These are the values we want to get rid of in order to train a machine learning model properly.
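To make the point about rare values concrete, here is a minimal sketch that draws samples from a normal distribution and counts how many fall far from the mean. The sample size, the seed, and the cutoff of 3 standard deviations are assumptions chosen for illustration, not values from this article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# z-score: distance of each value from the mean, in standard deviations
z_scores = (samples - samples.mean()) / samples.std()

# Values more than 3 standard deviations from the mean (an assumed cutoff)
rare = samples[np.abs(z_scores) > 3]
print(f"{rare.size} of {samples.size} samples lie beyond 3 standard deviations")
```

Even though every sample comes from the same distribution, a small fraction still lands in the tails.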
There are two types of outliers in machine learning:
- Univariate outliers: These appear when we look at the values of a single feature (for example, looking only at the distribution of the Selling Price column).
- Multivariate outliers: These appear when we look at an n-dimensional space, where each dimension represents a feature. In this case there are too many features to simply plot the data and see how far a point lies from the normal groups, so we use models to do that detection for us.
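The difference between the two types can be sketched on synthetic data. The example below is only an illustration under stated assumptions: the two features are made up, the univariate check uses the common 1.5 × IQR rule, and the multivariate check uses the Mahalanobis distance with an assumed cutoff of 4. None of these choices come from this article.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two strongly correlated synthetic features (hypothetical data)
x = rng.normal(0.0, 1.0, size=500)
y = x + rng.normal(0.0, 0.1, size=500)
data = np.column_stack([x, y])

# Append one point whose coordinates look ordinary one at a time,
# but whose combination breaks the correlation between the features
data = np.vstack([data, [1.5, -1.5]])

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for a single feature."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def mahalanobis_distances(points):
    """Distance of each row from the mean, scaled by the feature covariance."""
    centered = points - points.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(points, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

# Univariate view: check each feature on its own
univariate_flags = iqr_outliers(data[:, 0]) | iqr_outliers(data[:, 1])

# Multivariate view: check both features jointly (cutoff of 4 is an assumption)
distances = mahalanobis_distances(data)
multivariate_flags = distances > 4.0

print("Appended point flagged per feature:", bool(univariate_flags[-1]))
print("Appended point flagged jointly:", bool(multivariate_flags[-1]))
```

The appended point sits inside the normal range of each feature, so a per-column check misses it, while the joint check picks it up.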
Why Do We Need To Remove Outliers?
There are several reasons why someone would consider removing a few examples from their dataset, even when the dataset is small and we need all the information we can get. We remove outliers because they can harm our machine learning model and give us a distorted picture of reality.
We want our model to predict the most likely label and not be swayed by a random value in our dataset. The best approach is to remove as few examples as possible and instead make the model robust, so that it can ignore or dampen the effect of outliers on its predictions.
How To Remove Outliers in Machine Learning?
To remove outliers, we first need to detect them. The best way to detect outliers is the manual method: go through the data and look at its trends. Any point that lies too far away from the rest of the data is a sign of an outlier.
Still, if you want to see how to detect outliers using the Python programming language, you can look at this tutorial. Now let’s see how to remove outliers in Machine Learning. I will first import the dataset and do some data processing to understand the data and prepare it so that I can remove outliers:
Now Let’s See How To Remove Outliers:
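As a rough illustration of one common way to do this, the sketch below drops rows whose floor area falls outside the 1.5 × IQR fences. It is not necessarily the method used for the graph discussed here: the `floor_area` and `selling_price` column names, the sample values, and the helper `remove_iqr_outliers` are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data; the real dataset is not reproduced here
df = pd.DataFrame({
    "floor_area": [45, 52, 60, 61, 68, 70, 75, 80, 88, 95, 400],
    "selling_price": [150_000, 165_000, 180_000, 182_000, 200_000, 205_000,
                      215_000, 230_000, 250_000, 270_000, 275_000],
})

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep only rows whose `column` value lies within the IQR fences."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

cleaned = remove_iqr_outliers(df, "floor_area")
print(f"Rows before: {len(df)}, rows after: {len(cleaned)}")
```

The multiplier `k` controls how aggressive the filter is; a larger value keeps more of the data.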
Notice how the data is now less spread out and concentrated in a smaller range of floor space. If you take a closer look, though, you can still see some very small values in the lower-left corner of the graph, which show properties being sold at abnormally low prices.
You can go ahead and delete them and see what happens to your results. I manually removed the prices below 40,000, and doing so does help the accuracy of our machine learning model.
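A manual cut like this can be written as a simple filter. In the sketch below, the 40,000 threshold comes from the text above, while the `selling_price` and `floor_area` column names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data with two abnormally low prices
df = pd.DataFrame({
    "floor_area": [50, 55, 60, 75, 90],
    "selling_price": [12_000, 35_000, 150_000, 210_000, 320_000],
})

PRICE_FLOOR = 40_000  # threshold mentioned in the text

# Keep only the rows priced at or above the threshold
filtered = df[df["selling_price"] >= PRICE_FLOOR].reset_index(drop=True)
print(filtered)
```

Comparing model accuracy before and after the filter is the simplest way to check whether the cut actually helps.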
I hope you liked this article on how to remove outliers in Machine Learning using Python. Feel free to ask your valuable questions in the comments section below.