StandardScaler in Machine Learning

In Machine Learning, StandardScaler is used to rescale the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. In this article, I will walk you through how to use StandardScaler in Machine Learning.

StandardScaler is an important technique that is mainly performed as a preprocessing step before many machine learning models, in order to standardize the range of the input dataset's features.

Also, Read – Why Python is the best language for Machine Learning.

Some machine learning practitioners tend to standardize their data blindly before every machine learning model, without taking the trouble to understand why it should be used, or even whether it is needed at all. So you need to understand when you should use StandardScaler to scale your data.

When and How To Use StandardScaler?

StandardScaler comes into play when the features of the input dataset differ greatly in their ranges, or simply when they are measured in different units.

StandardScaler removes the mean and scales the data to unit variance. However, outliers influence the empirical mean and standard deviation that are computed, which can squeeze the remaining feature values into a narrow range after scaling.
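As a small illustration (with a made-up one-feature array), a single outlier inflates the standard deviation, so after scaling the remaining values end up squeezed into a narrow band:

import numpy as np
from sklearn.preprocessing import StandardScaler

# five ordinary values plus one large outlier (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

scaled = StandardScaler().fit_transform(x)
print(scaled.round(2).ravel())
# the outlier inflates the standard deviation, so the first five scaled
# values are packed together in a narrow band just below zero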

These differences in the initial feature scales can cause problems for many machine learning models. For example, in models based on distance calculations, if one of the features has a much wider range of values than the others, the distance will be governed by that particular feature.
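To make this concrete, here is a small sketch with made-up numbers: two samples where one feature is on the scale of thousands and the other lies between 0 and 1. The Euclidean distance between them is governed almost entirely by the large-scale feature:

import numpy as np

# two hypothetical samples: feature 1 on a scale of thousands, feature 2 in [0, 1]
a = np.array([50000.0, 0.2])
b = np.array([52000.0, 0.9])

print(np.linalg.norm(a - b))  # ~2000.0 -- the second feature barely matters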

The idea behind StandardScaler is that variables measured at different scales do not contribute equally to the model's fit and the function it learns, and could end up creating a bias.

So, to deal with this potential problem, we standardize the data (μ = 0, σ = 1), which is typically done before feeding it into the machine learning model.
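Under the hood, standardization is just z = (x − μ) / σ applied to each feature. As a quick sanity-check sketch, the same result can be computed by hand with NumPy (whose std defaults to the population standard deviation, the same convention StandardScaler uses):

import numpy as np

# same toy data as in the example below
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

# z = (x - mean) / std, computed column by column
z = (X - X.mean(axis=0)) / X.std(axis=0)
print(z)  # matches the StandardScaler output shown below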

Now, let's see how to use StandardScaler with Scikit-learn:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
# the scaler object (model)
scaler = StandardScaler()
# fit and transform the data
scaled_data = scaler.fit_transform(X)
print(X)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]
print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

To verify that the mean of each feature is 0:

scaled_data.mean(axis=0)
array([0., 0.])
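
You can check the standard deviation the same way: scaled_data.std(axis=0) returns array([1., 1.]). One practical note: on a real project you would usually fit the scaler on the training data only and reuse the learned statistics to transform the test data. A minimal sketch, assuming a simple random split of some made-up data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# made-up data just to illustrate the fit/transform pattern
X = np.random.rand(100, 2)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on the training set
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set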

I hope you liked this article on the StandardScaler in Machine Learning. Feel free to ask your valuable questions in the comments section below.

Also, Read – How to predict IPL winner with Machine Learning.
