Handling Categorical Data in Machine Learning

When solving classification problems with machine learning, we often find datasets made up of categorical labels that not all machine learning algorithms can process. So, if you want to learn how to handle categorical data in machine learning, this article is for you. In this article, I will introduce you to techniques for handling categorical data in machine learning and their implementation using Python.

Handling Categorical Data in Machine Learning

Not all machine learning algorithms can handle categorical data, so it is very important to convert the categorical features of a dataset into numeric values. The scikit-learn library in Python provides many methods for handling categorical data. Two of the most useful techniques are:

  1. LabelEncoder
  2. LabelBinarizer

To use these two methods to handle categorical data, we first need to have a dataset with categorical features. So let’s create one:
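
The original code is not shown here, so below is a minimal sketch of how such a dataset could be built with NumPy; the variable names data and labels are only illustrative, and since the values are generated randomly they will not exactly match the output printed below:

import numpy as np

# 10 samples, each with two numeric features
data = np.random.uniform(size=(10, 2))

# one categorical label (Male or Female) per sample
labels = np.random.choice(["Male", "Female"], size=10)

# inspect the first sample and its label
print(data[0])
print(labels[0])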

[0.03345401 0.48645195]
Female

So, as you can see, I created a very small dataset of 10 samples, each labelled as either Male or Female. In the section below, I'll show you how to handle these categorical features in machine learning by using LabelEncoder and LabelBinarizer.

LabelEncoder:

The LabelEncoder class of the scikit-learn library in Python takes a dictionary-oriented approach to associate each categorical value with a progressive integer value. Below is how to use LabelEncoder for handling categorical data in machine learning:
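
Here is a minimal sketch of that step, assuming the labels array created above; the variable names le and encoded_labels are illustrative, since the original code is not shown:

from sklearn.preprocessing import LabelEncoder

# fit the encoder on the string labels and transform them to integer codes
le = LabelEncoder()
encoded_labels = le.fit_transform(labels)
print(encoded_labels)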

[1 0 0 1 1 1 1 1 1 1]

This is how we can use LabelEncoder to handle categorical features. You can also decode the transformed values back to the original categorical labels, as shown below:
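
A small sketch of the decoding step, assuming the fitted encoder from above and that the result is converted to a plain list for printing (both assumptions on my part):

# map the integer codes back to the original string labels
decoded_labels = list(le.inverse_transform(encoded_labels))
print(decoded_labels)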

['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male']

LabelBinarizer:

The LabelEncoder method works in many cases when transforming categorical data into numeric values. But it has the disadvantage that all the labels are transformed into sequential integers, which a model may wrongly interpret as an ordered relationship between the categories. For this reason, it is often better to use one-hot encoding, which binarizes categorical data. So here's how to use the LabelBinarizer class in scikit-learn to handle categorical data:
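
A minimal sketch of that step follows; lb and binarized_labels are illustrative names, and with only two classes LabelBinarizer returns a single binary column, as you can see in the output below:

from sklearn.preprocessing import LabelBinarizer

# fit the binarizer on the string labels and transform them to a binary matrix
lb = LabelBinarizer()
binarized_labels = lb.fit_transform(labels)
print(binarized_labels)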

[[0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]]

Here is how you can decode these transformed values back to the original categorical labels:
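
A sketch of the decoding step, assuming the fitted binarizer from the previous snippet:

# convert the binary column back to the original string labels
print(lb.inverse_transform(binarized_labels))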

['Female' 'Male' 'Female' 'Female' 'Female' 'Male' 'Male' 'Male' 'Female'
 'Female']

Summary

When solving classification problems with machine learning, we often find datasets made up of categorical labels that not all machine learning algorithms can process, which is why we need to convert the categorical features into numerical values. I hope you liked this article on how to handle categorical data in machine learning. Feel free to ask your valuable questions in the comments section below.
