One Hot Encoding in Machine Learning

In machine learning, one hot encoding is a method of quantifying categorical data. Briefly, this method represents each category as a vector whose length equals the number of categories in the dataset, with a single 1 marking the category and 0s everywhere else. In this article, I will introduce you to the One Hot Encoding Algorithm in Machine Learning.
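
As a minimal sketch of that idea, assuming a toy list of three color categories chosen only for illustration (the helper name one_hot is hypothetical, not a library function), each category can be turned into such a vector with NumPy:

```python
import numpy as np

# Hypothetical categories, used only for illustration
categories = ['blue', 'green', 'red']
index = {label: i for i, label in enumerate(categories)}

def one_hot(label):
    """Return a vector with a 1 at the position of `label` and 0 elsewhere."""
    vec = np.zeros(len(categories), dtype=int)
    vec[index[label]] = 1
    return vec

green_vec = one_hot('green')
print(green_vec)  # [0 1 0]
```

The vector length is 3 because there are three categories; adding a fourth category would add a fourth position.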

To understand one hot encoding, we first need to look at how class labels are encoded in general. It is one of the most important preprocessing concepts in machine learning, so let's get started.

Encoding Class Labels

Many machine learning libraries require class labels to be encoded as integer values. Although most classification estimators in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical problems.

When encoding class labels, remember that they are not ordinal, so it does not matter which integer we assign to a particular string label. We can simply enumerate the class labels, starting from 0:

import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                       ['red', 'L', 13.5, 'class2'],
                       ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
import numpy as np
class_mapping = {label:idx for idx,label in 
                 enumerate(np.unique(df['classlabel']))}
class_mapping
{'class1': 0, 'class2': 1}

Then we can use the mapping dictionary to turn the class labels into integers:

df['classlabel'] = df['classlabel'].map(class_mapping)
df

We can reverse the key-value pairs in the mapping dictionary to map the converted class labels back to their original string representation:

inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Alternatively, there is a handy LabelEncoder class directly implemented in scikit-learn to achieve this:

from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
array([0, 1, 0])

The fit_transform method is just a shortcut for calling fit and transform separately. We can use the inverse_transform method to turn the integer class labels back into their original string representation:

class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
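
To make the shortcut explicit, here is a small sketch that calls fit and transform as two separate steps. It assumes a standalone label array (the names labels and le are illustrative) rather than the DataFrame used above:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class1', 'class2', 'class1'])

le = LabelEncoder()
le.fit(labels)                   # learn the sorted set of unique labels
encoded = le.transform(labels)   # map each label to its integer index

print(le.classes_)   # ['class1' 'class2']
print(encoded)       # [0 1 0]

# inverse_transform maps the integers back to the original strings
restored = le.inverse_transform(encoded)
```

The classes_ attribute stores the learned labels in sorted order, and the position of each label in that array is the integer it is encoded as.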

Performing One Hot Encoding

In the section above, we used a simple dictionary mapping approach to convert the class labels to integers. Since scikit-learn estimators for classification treat class labels as categorical data that does not imply any order (nominal), we then used the convenient LabelEncoder to encode the string labels as integers. We could use a similar approach to transform the nominal color column of our data set, like so:

X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X
array([[1, 'M', 10.1],
       [2, 'L', 13.5],
       [0, 'XL', 15.3]], dtype=object)

After running the previous code, the first column of the NumPy array X holds the new color values, which are encoded as follows:

  • blue = 0 
  • green = 1 
  • red = 2

If we stop here and feed this array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values do not come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal.
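
The artificial ordering can be made visible with a short sketch. It assumes a standalone array of the three colors (the names colors, code_of, d_blue_green and d_blue_red are illustrative) and compares the numeric gaps that the integer codes create:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(['green', 'red', 'blue'])
codes = LabelEncoder().fit_transform(colors)   # blue=0, green=1, red=2
code_of = dict(zip(colors.tolist(), codes.tolist()))

# Numeric gaps a model would "see" between the colors
d_blue_green = abs(code_of['blue'] - code_of['green'])
d_blue_red = abs(code_of['blue'] - code_of['red'])
print(d_blue_green, d_blue_red)  # 1 2
```

Blue now appears twice as far from red as from green, purely because the labels were sorted alphabetically before being numbered.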

A common workaround for this problem is a technique called one hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here we would convert the color feature into three new features: blue, green, and red.

Binary values can then be used to indicate the particular color of each sample; for example, a blue sample can be encoded as blue = 1, green = 0, red = 0. To perform this transformation, we can use the OneHotEncoder class implemented in the sklearn.preprocessing module:

from sklearn.preprocessing import OneHotEncoder
# The categorical_features parameter was removed in scikit-learn 0.22,
# so we pass only the color column to the encoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['color']].values).toarray()
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
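
In recent scikit-learn releases, one way to one hot encode a single column while keeping the other columns is ColumnTransformer. The sketch below rests on a few assumptions: it rebuilds the example data under the hypothetical name df_demo, and it introduces a size_mapping dictionary (M = 1, L = 2, XL = 3) that this article does not define, so that the columns passed through unchanged are numeric.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': ['M', 'L', 'XL'],
                        'price': [10.1, 13.5, 15.3]})

# Assumed ordinal mapping for size (not defined in the article)
size_mapping = {'M': 1, 'L': 2, 'XL': 3}
df_demo['size'] = df_demo['size'].map(size_mapping)

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color'])],
    remainder='passthrough',   # keep size and price unchanged
    sparse_threshold=0)        # always return a dense array

encoded = ct.fit_transform(df_demo)
print(encoded)
```

The first three columns of the result are the blue, green and red dummy features, followed by the size and price columns in their original order.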

An even more convenient way to create dummy features via one hot encoding is the get_dummies function implemented in pandas. Applied to a DataFrame, it converts only the string columns and leaves all other columns unchanged:

pd.get_dummies(df[['price', 'color', 'size']])

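Dummy features created this way are redundant: if a sample is neither green nor red, it must be blue. With get_dummies, we can drop the first dummy column of each encoded feature by passing True to the drop_first parameter, as shown below: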

pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
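
Since the rendered tables are not reproduced here, the following sketch lists the column names that both calls produce. It assumes the same three-row example, rebuilt under the hypothetical name df_demo:

```python
import pandas as pd

df_demo = pd.DataFrame({'price': [10.1, 13.5, 15.3],
                        'color': ['green', 'red', 'blue'],
                        'size': ['M', 'L', 'XL']})

full = pd.get_dummies(df_demo)
reduced = pd.get_dummies(df_demo, drop_first=True)

print(list(full.columns))
# ['price', 'color_blue', 'color_green', 'color_red', 'size_L', 'size_M', 'size_XL']
print(list(reduced.columns))
# ['price', 'color_green', 'color_red', 'size_M', 'size_XL']
```

With drop_first=True, the alphabetically first level of each feature (color_blue and size_L here) is removed, and a row with zeros in all remaining dummy columns of a feature stands for that dropped level.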

I hope you liked this article on One Hot Encoding algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn every topic of Machine Learning.

Aman Kharwal
I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.
