SMOTE increases the number of low-incidence (under-represented) examples in a dataset using synthetic minority oversampling. In this article, I’m going to walk you through what SMOTE is in Machine Learning and how you can use it to deal with unbalanced datasets.
What is SMOTE in Machine Learning?
The Synthetic Minority Oversampling Technique (SMOTE) is used to increase the number of under-represented cases in a dataset used for machine learning. It is generally a better way to increase the number of such cases than simply duplicating existing ones.
We need to use SMOTE when we are dealing with an unbalanced dataset. There are many reasons why a dataset can be out of balance, for example:
- The category you are targeting may be very rare in the population.
- The data for that category can simply be difficult to collect.
Simply put, you should use SMOTE when you find that the class you want to analyze is under-represented in the dataset.
For example, suppose you have a dataset of men and women from India in which the “Male” class has far fewer instances than the “Female” class, even though you know that men outnumber women in India. In this situation, SMOTE will return a dataset containing the original samples of the “Male” class, plus an additional number of synthetic minority samples of the “Male” class, depending on the percentage you specify.
How Does SMOTE Work?
SMOTE is a statistical technique for increasing the number of cases in your dataset in a balanced way. It works by generating new instances from the existing minority cases that you supply as input. Note that SMOTE does not change the number of majority cases.
The newly generated instances are not simply copies of existing minority cases. Instead, the algorithm takes each minority case together with its nearest neighbours in feature space and generates new examples that combine the features of the case with those of its neighbours. This gives the minority class a more varied set of examples and makes them more general.
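To make the idea of combining a case with one of its neighbours concrete, here is a minimal sketch of that single step in plain Python. It assumes numeric features and linear interpolation between the two points; the function name `interpolate` and the sample values are made up for illustration.

```python
import random

def interpolate(sample, neighbour, rng):
    """Return a synthetic point on the line segment between two minority points."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)        # fixed seed so the sketch is repeatable
sample = [1.0, 2.0]           # an existing minority case (illustrative values)
neighbour = [3.0, 6.0]        # one of its nearest minority neighbours
synthetic = interpolate(sample, neighbour, rng)
print(synthetic)              # a new point lying between the two originals
```

Because the new point lies somewhere between two real minority cases, it resembles both without being a copy of either.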
In short, SMOTE takes the whole dataset as input, but it only increases the share of the minority class in the data. Let’s consider the same example as above: suppose you have an unbalanced dataset where only 1% of the cases have the target value “Male” and 99% of the cases have the value “Female”. To increase the percentage of minority cases to roughly twice the previous percentage, you would enter 200 for the SMOTE percentage in an implementation that exposes such a setting.
You can learn its practical implementation from below:
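What follows is a minimal sketch in plain Python, not a production implementation. It assumes numeric features and Euclidean distance, and it picks base cases at random rather than looping over every minority case. The names `k_nearest`, `smote` and `n_synthetic`, as well as the toy data, are hypothetical and chosen only for illustration.

```python
import math
import random

def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]

def smote(minority, n_synthetic, k=3, seed=0):
    """Generate `n_synthetic` new minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)                       # pick a minority case
        neighbour = rng.choice(k_nearest(base, minority, k))
        gap = rng.random()                                # position between the two
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy unbalanced dataset (made-up numbers): 4 minority rows, 12 majority rows.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority = [[5.0 + 0.1 * i, 5.0 - 0.1 * i] for i in range(12)]

new_rows = smote(minority, n_synthetic=8)
balanced_minority = minority + new_rows  # majority rows are left untouched

print(len(majority), len(minority), len(balanced_minority))  # prints: 12 4 12
```

The majority class keeps its 12 rows, while the minority class grows from 4 to 12 rows, 8 of which are synthetic. In real projects you would more likely use a library implementation, such as the `SMOTE` class in the imbalanced-learn package, than hand-written code like this.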
I hope you now know what the Synthetic Minority Oversampling Technique (SMOTE) is in Machine Learning. Feel free to ask your questions in the comments section below.