When training a machine learning model, it is important that each class in the dataset has roughly the same number of samples. Such a dataset is said to have balanced classes. Many classifiers tend to favour the majority class when the classes are heavily imbalanced, so if the classes are not balanced, we need to apply a class balancing technique before training. In this article, I will walk you through what class balancing is and how to implement class balancing techniques using Python.
What is Class Balancing?
In machine learning, class balancing means adjusting a dataset whose classes have unequal numbers of samples so that every class is comparably represented. Correcting class imbalance is important before using a machine learning algorithm because our end goal is to train a model that generalizes well for all possible classes. For a binary dataset, the ideal case is an equal number of samples in each class.
So, before using a machine learning algorithm, it is very important to look at the class distribution and spot any imbalance. For example, let’s see how we can spot unbalanced classes by creating an unbalanced dataset using the make_classification function in the Scikit-learn library in Python:
from sklearn.datasets import make_classification

nb_samples = 1000
weights = (0.95, 0.05)

x, y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, weights=weights, random_state=1000)

print(x[y==0].shape)
print(x[y==1].shape)

(946, 2)
(54, 2)
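The same check can be reduced to a single imbalance ratio. The sketch below is only illustrative: rather than regenerating the dataset, it builds a hypothetical label array `labels` with the class counts printed above (946 and 54) and counts each class with NumPy's `np.unique`.

```python
import numpy as np

# Hypothetical labels mirroring the counts printed above: 946 of class 0, 54 of class 1
labels = np.concatenate((np.zeros(946, dtype=int), np.ones(54, dtype=int)))

# Count how many samples each class has
classes, counts = np.unique(labels, return_counts=True)
class_counts = {int(c): int(n) for c, n in zip(classes, counts)}

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(class_counts)
print(round(float(imbalance_ratio), 2))
```

For these counts the ratio comes out at roughly 17.5, that is, about seventeen samples of class 0 for every sample of class 1.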
So, as expected, the first class is dominant. To balance the classes of this kind of dataset, we can use two techniques for avoiding class imbalance in machine learning:
- Resampling with replacement
- SMOTE Resampling
Now let’s go through both these class balancing techniques to see how we can balance the classes before using any machine learning algorithm.
Resampling with Replacement:
In the resampling with replacement method, we repeatedly draw samples from the minority class only, until both classes reach the desired number of samples. Because we sample with replacement, the same data point can be drawn any number of times. The resulting dataset, however, will only contain minority data points taken from the 54 original values (according to our example). Here is how we can use the resampling with replacement technique using Python:
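# Resampling with Replacement
import numpy as np
from sklearn.utils import resample

# Number of samples in the majority class (shape is a tuple, so take its first entry)
n_majority = x[y==0].shape[0]

# Draw from the minority class with replacement until it matches the majority class
x_resampled = resample(x[y==1], n_samples=n_majority, random_state=1000)

# Rebuild the dataset: original majority samples plus the resampled minority samples
x_ = np.concatenate((x[y==0], x_resampled))
y_ = np.concatenate((y[y==0], np.ones(shape=(n_majority,), dtype=np.int32)))

print(x_[y_==0].shape)
print(x_[y_==1].shape)

(946, 2)
(946, 2)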
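One way to see the limitation mentioned above is to count the distinct points after oversampling. The following is a minimal NumPy-only sketch, not the code used above: the array `minority` is made-up random data standing in for the 54 minority samples, and the target size of 946 is taken from our example.

```python
import numpy as np

rng = np.random.default_rng(1000)

# Made-up stand-in for the 54 minority samples with 2 features
minority = rng.normal(size=(54, 2))

# Draw 946 row indices with replacement, then pick those rows
idx = rng.integers(0, len(minority), size=946)
oversampled = minority[idx]

# Distinct rows after oversampling: never more than the 54 originals
n_distinct = len(np.unique(oversampled, axis=0))

print(oversampled.shape)
print(n_distinct <= len(minority))
```

The minority class grows to 946 rows, but the number of distinct rows cannot exceed 54, because every added row is a copy of an existing one.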
SMOTE Resampling:
SMOTE resampling is one of the most robust approaches for avoiding class imbalance. SMOTE stands for Synthetic Minority Over-sampling Technique. It was designed to generate new synthetic samples that are consistent with the minority class. To implement the SMOTE resampling technique for class balancing, we can use the imbalanced-learn library, which has many algorithms for this kind of problem. Here’s how to implement SMOTE resampling for class balancing using Python:
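from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=1000)

# fit_resample replaces the older fit_sample method, which recent imbalanced-learn releases no longer provide
x_, y_ = smote.fit_resample(x, y)

print(x_[y_==0].shape)
print(x_[y_==1].shape)

(946, 2)
(946, 2)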
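To give an intuition for how the synthetic samples are produced, here is a simplified NumPy sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation: the function name `smote_like_samples`, the parameter `k`, and the random `minority` data are all assumptions made for illustration. Each new point is placed at a random position on the segment between a minority sample and one of its `k` nearest minority neighbours.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest neighbours of every sample
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    # Random position between the base sample (0) and the neighbour (1)
    lam = rng.random((n_new, 1))
    return minority[base] + lam * (minority[chosen] - minority[base])

# Made-up minority data: 54 samples with 2 features
minority = np.random.default_rng(1000).normal(size=(54, 2))
# 892 new points would bring 54 minority samples up to 946
synthetic = smote_like_samples(minority, n_new=892)

print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, it stays inside the region already covered by the minority class while still being a new value rather than a copy.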
Both resampling with replacement and SMOTE resampling are very useful techniques for avoiding class imbalance in machine learning. Resampling with replacement increases the number of minority samples, but every added point is a copy of an existing one, so the set of distinct values stays the same. SMOTE resampling, in contrast, reaches the same number of samples by generating new synthetic points, interpolating between a minority sample and its nearest neighbours. I hope you liked this article on avoiding class imbalance in machine learning and the implementation of class balancing techniques using Python. Feel free to ask your valuable questions in the comments section below.