In machine learning, one hot encoding is a method of quantifying categorical data. Briefly, this method produces a vector with a length equal to the number of categories in the dataset: every position holds a 0 except for a single 1 that marks the sample’s category. In this article, I will introduce you to the One Hot Encoding algorithm in Machine Learning.
To understand One Hot Encoding, we first need to go through what encoding is in general. And yes, it is one of the most important concepts in Machine Learning. So let’s get started with this task.
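As a quick illustration, here is a minimal sketch of the idea in plain Python, using three made-up color categories: each category gets a vector of zeros with a single 1 marking its own position.
categories = ['blue', 'green', 'red']

def one_hot(value, categories):
    # a vector of zeros, with a single 1 at the category's position
    vector = [0] * len(categories)
    vector[categories.index(value)] = 1
    return vector

one_hot('green', categories)
[0, 1, 0]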
Encoding Class Labels
Many machine learning libraries require class labels to be encoded as integer values. Although most classification estimators in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical problems.
To encode class labels, we must remember that class labels are not ordinal, so it does not matter which integer we assign to a particular string label. We can therefore simply enumerate the class labels, starting from 0:
import pandas as pd
import numpy as np

# an example dataset with a nominal (color), an ordinal (size),
# and a numerical (price) feature, plus a class label column
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']

# map each unique class label to an integer, starting from 0
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
class_mapping
{'class1': 0, 'class2': 1}
Then we can use the mapping dictionary to turn the class labels into integers:
df['classlabel'] = df['classlabel'].map(class_mapping)
df
   color size  price  classlabel
0  green    M   10.1           0
1    red    L   13.5           1
2   blue   XL   15.3           0
We can now reverse the key-value pairs in the mapping dictionary to map the converted class labels back to their original string representation:
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1
Alternatively, there is a handy LabelEncoder class implemented directly in scikit-learn that achieves the same thing:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
array([0, 1, 0])
The fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to map the integer class labels back to their original string representation:
class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
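This works because the fitted encoder stores the sorted unique class labels in its classes_ attribute; the integer codes are simply indices into that array, which is exactly what inverse_transform uses to recover the original strings:
class_le.classes_
array(['class1', 'class2'], dtype=object)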
Performing One Hot Encoding
In the section above, we used a simple dictionary mapping approach to convert class labels to integers. Since scikit-learn’s classification estimators treat class labels as categorical data that does not imply any order (nominal data), we also used the convenient LabelEncoder class to encode the string labels as integers. We could use a similar approach to transform the nominal color column of our dataset, like so:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
# encode the color strings in the first column as integers
X[:, 0] = color_le.fit_transform(X[:, 0])
X
array([[1, 'M', 10.1],
       [2, 'L', 13.5],
       [0, 'XL', 15.3]], dtype=object)
After running the previous code, the first column of the NumPy array X now contains the new color values, which are encoded as follows:
- blue = 0
- green = 1
- red = 2
If we stop here and feed this array to our classification model, we will make one of the most common mistakes in handling categorical data. Can you spot the problem? Although the color values do not come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Even though this assumption is incorrect, the algorithm could still produce useful results; however, those results would not be optimal.
A common workaround for this problem is a technique called one hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red.
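Conceptually, this amounts to a small lookup table like the following sketch, written out by hand for our three colors:
# each color owns one position in a three-element binary vector
onehot_map = {
    'blue':  [1, 0, 0],
    'green': [0, 1, 0],
    'red':   [0, 0, 1],
}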
These binary values can then be used to indicate the particular color of each sample; for example, a blue sample is encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder class from scikit-learn’s preprocessing module. Since we only want to encode the color column, we wrap the encoder in a ColumnTransformer, which applies it to the first column and passes the remaining columns through unchanged:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one hot encode the first column (color) and keep size and price as they are
ohe = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                        remainder='passthrough')
ohe.fit_transform(X)
array([[0.0, 1.0, 0.0, 'M', 10.1],
       [0.0, 0.0, 1.0, 'L', 13.5],
       [1.0, 0.0, 0.0, 'XL', 15.3]], dtype=object)
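One practical detail worth knowing: by default, a fitted OneHotEncoder raises an error if it encounters a category at transform time that it did not see during fitting. In recent scikit-learn versions, we can pass handle_unknown='ignore' so that unseen categories are encoded as all zeros. Here is a small sketch; the 'yellow' sample is made up for illustration:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit([['blue'], ['green'], ['red']])
# 'yellow' was never seen during fitting, so its row becomes all zeros
ohe.transform([['green'], ['yellow']]).toarray()
array([[0., 1., 0.],
       [0., 0., 0.]])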
A more convenient way to create dummy features via one hot encoding is to use the get_dummies function implemented in pandas. Applied to a DataFrame, get_dummies only converts the string columns and leaves all other columns unchanged:
pd.get_dummies(df[['price', 'color', 'size']])
   price  color_blue  color_green  color_red  size_L  size_M  size_XL
0   10.1           0            1          0       0       1        0
1   13.5           0            0          1       1       0        0
2   15.3           1            0          0       0       0        1
Note that because the size column still contains strings in our DataFrame, get_dummies encodes it as well.
When we use the get_dummies function, we can drop the first level of each categorical column by passing True to the drop_first parameter, as shown below. Dropping a column does not lose any information: for example, if color_green and color_red are both 0, the sample must be blue. Removing this redundancy also helps avoid multicollinearity problems in models such as linear regression.
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
   price  color_green  color_red  size_M  size_XL
0   10.1            1          0       1        0
1   13.5            0          1       0        0
2   15.3            0          0       0        1
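For completeness, scikit-learn’s OneHotEncoder offers the same behavior through its drop parameter (available in recent versions), which removes the first category of each feature just like drop_first=True does in pandas:
from sklearn.preprocessing import OneHotEncoder

# drop='first' removes the first category of each column,
# mirroring pandas' drop_first=True
ohe = OneHotEncoder(drop='first')
ohe.fit_transform(df[['color', 'size']]).toarray()
array([[1., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]])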
I hope you liked this article on the One Hot Encoding algorithm in Machine Learning. Feel free to ask your valuable questions in the comments section below. You can also follow me on Medium to learn about every topic of Machine Learning.