In this article, I'll walk you through scaling and normalization in machine learning and explain the difference between the two. I'll cover both practically and conceptually, so let's start by importing all the libraries we need:
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# set seed for reproducibility
np.random.seed(0)
Scaling and Normalization: What’s the difference?
One of the reasons it’s easy to confuse scaling and normalization is that the terms are sometimes used interchangeably and, to make matters even more confusing, they are very similar! In both cases, you transform the values of numeric variables so that the transformed data points have specific useful properties. The difference is that:
- when scaling, you change the range of your data, while
- in normalization, you change the shape of the distribution of your data.
Let’s talk a bit more about each of these options.
Scaling means that you transform your data to fit within a specific scale, like 0-100 or 0-1. You want to scale the data when you use methods based on measurements of the distance between data points, such as support vector machines (SVM) and k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.
For example, you could look at the prices of certain products in both yen and US dollars. One US dollar is worth about 100 yen, but if you don't scale your prices, methods like SVM or KNN will consider a price difference of 1 yen to be as important as a difference of 1 US dollar! This clearly doesn't match our intuitions about the world. With currency, you can convert between currencies. But what if you look at something like height and weight? It's not entirely clear how many pounds should equal an inch.
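Here is a small sketch of the yen and dollar problem. The prices are made up for illustration, `euclidean` is a hypothetical helper, and the exchange rate is the rough figure of 100 yen per dollar mentioned above. It computes the Euclidean distance that a distance-based method would see, first on the raw numbers and then with both columns expressed in the same unit.

```python
import numpy as np

# Made-up prices: column 0 is in yen, column 1 is in US dollars.
prices = np.array([
    [1000.0, 10.0],  # product A
    [1001.0, 10.0],  # product B: 1 yen more than A
    [1000.0, 11.0],  # product C: 1 dollar more than A
])


def euclidean(a, b):
    """Euclidean distance between two feature vectors (hypothetical helper)."""
    return float(np.linalg.norm(a - b))


# On the raw numbers, a 1 yen gap and a 1 dollar gap look identical.
raw_ab = euclidean(prices[0], prices[1])
raw_ac = euclidean(prices[0], prices[2])

# Put both columns in dollars, assuming roughly 100 yen per dollar.
YEN_PER_USD = 100.0
in_dollars = prices / np.array([YEN_PER_USD, 1.0])
usd_ab = euclidean(in_dollars[0], in_dollars[1])
usd_ac = euclidean(in_dollars[0], in_dollars[2])

print("raw distances:      A-B =", raw_ab, " A-C =", raw_ac)
print("distances in dollars: A-B =", usd_ab, " A-C =", usd_ac)
```

Once both columns share a unit, the 1 yen gap becomes about a hundredth of the 1 dollar gap, which is what we would expect.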
By scaling your variables, you can compare different variables on an equal footing. To help solidify what scaling looks like, let's look at a made-up example:
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
# (histplot with kde=True replaces the deprecated sns.distplot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(scaled_data.ravel(), kde=True, ax=ax[1])
ax[1].set_title("Scaled data")
Note that the shape of the data doesn't change, but instead of ranging from 0 to about 8, it now ranges from 0 to 1.
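Min-max scaling is usually written as x' = (x - min) / (max - min). The sketch below implements that formula with NumPy alone, so you can see why the range changes while the shape stays the same. `min_max_scale` is a hypothetical helper written for this illustration and is not part of mlxtend.

```python
import numpy as np


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        raise ValueError("cannot scale constant data: max equals min")
    unit = (values - lo) / (hi - lo)  # now in [0, 1]
    return new_min + unit * (new_max - new_min)


rng = np.random.default_rng(0)
sample = rng.exponential(size=1000)
scaled = min_max_scale(sample)

# The range changes ...
print("range before:", sample.min(), sample.max())
print("range after: ", scaled.min(), scaled.max())

# ... but the scaled values are a linear function of the originals,
# so their correlation with the original data is 1 (up to rounding).
correlation = np.corrcoef(sample, scaled)[0, 1]
print("correlation with original:", correlation)
```

Because the transformation is just a shift and a stretch, every point keeps its relative position, which is why the two histograms have the same outline.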
Scaling only changes the range of your data. Normalization is a more radical transformation. The idea behind normalization is to change our observations in a way that they can be described as a normal distribution.
The normal distribution, also known as the bell curve, is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and most observations lie close to the mean.
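If you want to check these properties yourself, here is a minimal sketch that draws samples from a normal distribution and measures them. The mean of 50 and standard deviation of 10 are arbitrary values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters: mean 50, standard deviation 10.
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean = sample.mean()
median = np.median(sample)
std = sample.std()

# Fraction of observations above the mean (should be close to one half).
share_above_mean = np.mean(sample > mean)

# Fraction of observations within one standard deviation of the mean
# (about 68% for a normal distribution).
share_within_1_std = np.mean(np.abs(sample - mean) <= std)

print("mean:", mean, " median:", median)
print("share above mean:", share_above_mean)
print("share within 1 std of mean:", share_within_1_std)
```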
In general, you will normalize your data if you are going to use a machine learning or statistics technique that assumes that your data is normally distributed. Some examples of these include linear discriminant analysis and Gaussian Naive Bayes.
The method I’m using to normalize the data here is called the Box-Cox transformation. Let’s take a quick look at what normalizing some data looks like:
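# normalize the exponential data with boxcox
# stats.boxcox returns a tuple: (transformed data, fitted lambda)
normalized_data = stats.boxcox(original_data)

# plot both together to compare
# (histplot with kde=True replaces the deprecated sns.distplot)
fig, ax = plt.subplots(1, 2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, ax=ax[1])
ax[1].set_title("Normalized data")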
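A plot is one way to see the effect of Box-Cox. As a rough numeric check, you can also compare the skewness of the data before and after the transformation. The sketch below assumes the standard Box-Cox formula, y = (x^λ - 1) / λ for λ ≠ 0 (and y = log x for λ = 0). It uses its own freshly generated sample rather than the `original_data` above, and skewness is my choice of summary statistic here, not something the transformation itself reports.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Box-Cox requires strictly positive data; exponential samples qualify.
sample = rng.exponential(size=1000)

# With lmbda left unset, boxcox returns the transformed data
# together with the lambda it fitted by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(sample)

# Recompute the transform by hand from the formula (valid for lambda != 0).
by_hand = (sample ** fitted_lambda - 1.0) / fitted_lambda

# Skewness is 0 for a perfectly symmetric distribution.
skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)

print("fitted lambda:", fitted_lambda)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
print("matches formula:", np.allclose(transformed, by_hand))
```

A strongly right-skewed sample should end up with a skewness much closer to zero after the transformation.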
Note that the shape of our data has changed. Before normalizing it was almost L-shaped; after the Box-Cox transformation it looks much more like a bell curve.
I hope you liked this article on scaling and normalization in machine learning. Feel free to ask questions in the comments section below. You can also follow me on Medium for more machine learning topics.