Scaling and Normalization in Machine Learning

In Machine Learning, Scaling and Normalization are techniques used in data preprocessing to transform features or variables in a dataset. These techniques ensure that the data is in an appropriate range and distribution, which facilitates efficient training of Machine Learning models. If you want to know everything about Scaling and Normalization, this article is for you. In this article, I’ll introduce you to Scaling and Normalization in Machine Learning and their implementation using Python.

Introduction to Scaling and Normalization in Machine Learning

Scaling refers to the process of transforming feature values into a specific range. It ensures that all features have comparable scales and prevents some features from dominating the model due to their greater magnitude. Scaling techniques include:

  1. Min-max scaling: This technique scales the data within a specific range, often between 0 and 1. It is suitable when the data distribution is relatively even and preserving the relationship between the original values of the features is not crucial.
  2. Standardization: This technique transforms the data to have zero mean and unit variance. It is appropriate when the data distribution is not necessarily uniform and it is important to preserve relative differences between feature values. Standardization is commonly used with models that are sensitive to feature magnitudes or that assume roughly normally distributed inputs, such as linear regression. (See the sketch of both formulas after this list.)
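
To make both formulas concrete, here is a minimal sketch (on a made-up toy array, not data from this article) that applies each transformation by hand with NumPy:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # toy feature values

# Min-max scaling: (x - min) / (max - min) maps values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]

# Standardization: (x - mean) / std gives zero mean and unit variance
x_standard = (x - x.mean()) / x.std()
print(round(x_standard.mean(), 10), round(x_standard.std(), 10))  # ~0.0 and 1.0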

Normalization refers to adjusting feature values to follow a specific distribution or pattern. It can be useful when the data has varying scales and distributions, and we want to bring it to a common standard. Normalization techniques include:

  1. Z-score Normalization: This technique transforms the data to have a mean of zero and a standard deviation of one; mathematically, it is the same transformation as the standardization described above. It is useful when the data has a normal or near-normal distribution, and it makes comparisons between features more meaningful by eliminating differences in scale.
  2. Log Transformation: Log transformation is used to normalize data with skewed distributions. Taking the logarithm of the values compresses the data, reducing the impact of outliers and making the distribution more symmetric. (See the example after this list.)
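
As an example of the log transformation, here is a small sketch on synthetic right-skewed data (generated here purely for illustration, not part of the dataset used later in this article):

import numpy as np

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=1000)  # right-skewed toy data

# log1p computes log(1 + x), which is safe when values can be zero
log_transformed = np.log1p(skewed)

# For right-skewed data the mean sits above the median;
# after the transform the two are much closer together
print(f"Before: mean={skewed.mean():.3f}, median={np.median(skewed):.3f}")
print(f"After:  mean={log_transformed.mean():.3f}, median={np.median(log_transformed):.3f}")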

So when data features have different scales or units, scaling techniques such as Min-Max scaling or standardization help ensure that every feature contributes fairly to the model. And when the data distribution is skewed or non-uniform, normalization techniques such as Z-Score Normalization or log transformation can help align the data with a desired pattern.

Implementation of Scaling and Normalization using Python

To show the implementation of scaling and normalization, I’ll start by creating highly imbalanced data:

from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

# Create a highly imbalanced dataset with varying feature scales
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    weights=[0.9, 0.1],  # imbalanced class distribution
    random_state=42
)
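
Before transforming anything, we can confirm the class imbalance (a quick check I am adding here for illustration; it is not required for scaling itself):

# Count samples per class; expect roughly 900 vs 100
print("Class counts:", np.bincount(y))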

Now let’s have a look at the first few samples of the data:

print("Original Data:")
print(X[:5])
Original Data:
[[-0.43964281  0.54254734 -0.82241993  0.40136622 -0.85484   ]
 [-1.32026785 -0.45165631 -1.14769139  0.21799084  2.51556893]
 [-0.90241409 -0.30179019 -2.08411294  0.15228215  1.70250872]
 [-1.65818269  1.18308467  1.11268837  1.10425263 -1.11576542]
 [-1.59871055  0.16926783 -0.92669831  0.60376341  1.29684502]]

Now here’s how to use Scaling and Normalization to transform the data:

# Perform Scaling
scaler = MinMaxScaler()  # Min-Max Scaling
X_scaled_minmax = scaler.fit_transform(X)

# Perform Normalization
normalizer = StandardScaler()  # Z-score Normalization
X_normalized_zscore = normalizer.fit_transform(X)
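
As a quick sanity check (an addition for illustration), we can verify that the Min-Max output lies in [0, 1] and that the standardized output has roughly zero mean and unit standard deviation per feature:

# Min-Max output should span [0, 1]
print("MinMax  -> min:", X_scaled_minmax.min(), "max:", X_scaled_minmax.max())

# Z-score output should have ~0 mean and ~1 std per feature
print("Z-score -> mean:", X_normalized_zscore.mean(axis=0).round(6))
print("Z-score -> std: ", X_normalized_zscore.std(axis=0).round(6))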

Now let’s have a look at the transformed data using both techniques:

print("Scaled (MinMax):")
print(X_scaled_minmax[:5])

print("Normalized (Z-Score):")
print(X_normalized_zscore[:5])
Scaled (MinMax):
[[0.49437724 0.6214084  0.31632182 0.56885557 0.3145778 ]
 [0.38828992 0.45983898 0.26949153 0.53613957 0.79094871]
 [0.43862802 0.48419394 0.13467217 0.52441649 0.67603143]
 [0.34758193 0.72550302 0.59492504 0.69425754 0.27769881]
 [0.35474643 0.56074623 0.30130855 0.60496526 0.61869524]]
Normalized (Z-Score):
[[ 0.36932732  0.22503412 -0.84205748 -0.07829223 -0.67051123]
 [-0.46515818 -1.14689718 -1.17158646 -0.36406708  2.10148778]
 [-0.06919748 -0.94009244 -2.12026495 -0.46646846  1.43278491]
 [-0.78536828  1.10893071  1.11837971  1.01709614 -0.88510982]
 [-0.72901212 -0.29006542 -0.94770077  0.23712649  1.09914606]]

We can also compare the effect of Scaling and Normalization on the data by visualizing the data distribution:

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=3, cols=5, 
                    subplot_titles=[f'Feature {i+1}' for i in range(X.shape[1])])

# Original Data Histograms
for i in range(X.shape[1]):
    hist_trace = go.Histogram(x=X[:, i], 
                              nbinsx=20, marker=dict(color='rgb(219, 209, 240)'))
    fig.add_trace(hist_trace, row=1, col=i+1)

# Scaled Data Histograms
for i in range(X_scaled_minmax.shape[1]):
    hist_trace = go.Histogram(x=X_scaled_minmax[:, i], 
                              nbinsx=20, marker=dict(color='rgb(196, 234, 222)'))
    fig.add_trace(hist_trace, row=2, col=i+1)

# Normalized Data Histograms
for i in range(X_normalized_zscore.shape[1]):
    hist_trace = go.Histogram(x=X_normalized_zscore[:, i], 
                              nbinsx=20, marker=dict(color='rgb(251, 205, 231)'))
    fig.add_trace(hist_trace, row=3, col=i+1)

fig.update_layout(title='Feature Distribution Comparison',
                  showlegend=False, height=800, width=1000)

fig.show()
Figure: Feature Distribution Comparison (Scaling and Normalization)

The first row in purple represents the distribution of the original data. The second row in green represents the distribution of the data scaled using the Min-Max Scaling technique. And the third row in pink represents the distribution of data normalized using the Z-score Normalization technique.

When Not to Use Scaling and Normalization

In some situations, the data may not require Scaling or Normalization, and the raw values can be used directly as input for the model. For example:

  • When features in the dataset already have similar scales and distributions.
  • When the model used is not sensitive to feature scales, such as decision trees or random forests (see the sketch after this list).
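
To illustrate the second point, here is a small sketch (reusing X, y, and X_scaled_minmax from above) showing that a decision tree performs essentially the same on raw and Min-Max-scaled features, since trees split on thresholds and are invariant to monotonic rescaling:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split raw and scaled versions of the data identically
X_tr, X_te, Xs_tr, Xs_te, y_tr, y_te = train_test_split(
    X, X_scaled_minmax, y, random_state=42
)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(Xs_tr, y_tr)

print("Accuracy on raw features:   ", accuracy_score(y_te, tree_raw.predict(X_te)))
print("Accuracy on scaled features:", accuracy_score(y_te, tree_scaled.predict(Xs_te)))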

Summary

So Scaling refers to transforming feature values into a specific range. It ensures that all features have comparable scales and prevents some features from dominating the model due to their greater magnitude. And Normalization refers to adjusting feature values to follow a specific distribution or pattern. It can be useful when the data has varying scales and distributions, and we want to bring it to a common standard. I hope you liked this article on Scaling and Normalization in Machine Learning. Feel free to ask valuable questions in the comments section below.
