Handling Categorical Data in Machine Learning

When solving classification problems with machine learning, we often encounter datasets containing categorical labels that not all machine learning algorithms can process. So, if you want to learn how to handle categorical data in machine learning, this article is for you. In this article, I will introduce you to techniques for handling categorical data in machine learning and their implementation using Python.

Handling Categorical Data in Machine Learning

Not all machine learning algorithms can handle categorical data, so it is very important to convert the categorical features of a dataset into numeric values. The scikit-learn library in Python provides many methods for handling categorical data. Some of the best techniques for handling categorical data are:

  1. LabelEncoder
  2. LabelBinarizer

To use these two methods to handle categorical data, we first need to have a dataset with categorical features. So let’s create one:
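The original code cell did not survive here; a minimal sketch that builds a comparable toy dataset (assuming NumPy with a fixed seed — the exact random values will differ from the output shown below) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

X = rng.uniform(size=(10, 2))            # 10 samples, 2 numeric features
y = rng.choice(['Male', 'Female'], 10)   # 10 categorical labels

print(X[0])  # one feature pair, e.g. two values in [0, 1)
print(y[0])  # one categorical label, 'Male' or 'Female'
```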

[0.03345401 0.48645195]
Female

So, as you can see, I created a very small dataset consisting of 10 samples with the categorical labels Male and Female. In the section below, I'll show you how to handle these categorical features in machine learning using LabelEncoder and LabelBinarizer.

LabelEncoder:

The LabelEncoder class of the scikit-learn library in Python maps each categorical value to a sequential integer, much like a dictionary lookup (the classes are numbered in sorted order). Below is how to use LabelEncoder for handling categorical data in machine learning:
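The code cell is missing here; a sketch consistent with the printed output (assuming this particular label sequence — LabelEncoder sorts the classes alphabetically, so Female becomes 0 and Male becomes 1):

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Male', 'Female', 'Female', 'Male', 'Male',
          'Male', 'Male', 'Male', 'Male', 'Male']

le = LabelEncoder()
encoded = le.fit_transform(labels)  # classes sorted: Female -> 0, Male -> 1
print(encoded)
```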

[1 0 0 1 1 1 1 1 1 1]

This is how we can use LabelEncoder to handle categorical features. You can also decode these transformed values back to the original categorical labels, as shown below:
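The decoding cell is also missing; a self-contained sketch (assuming the label sequence matching the printed result) uses inverse_transform to map the integers back to the original strings:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Male', 'Female', 'Male', 'Female', 'Male',
          'Female', 'Female', 'Male', 'Male', 'Male']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# inverse_transform reverses the encoding, recovering the original labels
decoded = list(le.inverse_transform(encoded))
print(decoded)
```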

['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Male', 'Male']

LabelBinarizer:

The LabelEncoder method works in many cases when transforming categorical data into numeric values. But it has the disadvantage that the labels become sequential integers, which many models will misinterpret as an ordering. For this reason, it is often better to use one-hot encoding, which binarizes categorical data. So here's how to use the LabelBinarizer class in scikit-learn to handle categorical data:
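The code cell didn't survive; a sketch consistent with the printed output (assuming this label sequence). Note that with exactly two classes, LabelBinarizer returns a single 0/1 column rather than two one-hot columns, which is why the output below has shape (10, 1):

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['Female', 'Male', 'Female', 'Female', 'Female',
          'Male', 'Male', 'Male', 'Female', 'Female']

lb = LabelBinarizer()
binarized = lb.fit_transform(labels)  # two classes: one 0/1 column
print(binarized)
```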

[[0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]]

Here is how you can decode these transformed values back to the original categorical labels:
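Again the code cell is missing; a self-contained sketch (assuming the same label sequence as above) decodes the binarized matrix with inverse_transform:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['Female', 'Male', 'Female', 'Female', 'Female',
          'Male', 'Male', 'Male', 'Female', 'Female']

lb = LabelBinarizer()
binarized = lb.fit_transform(labels)

# inverse_transform maps the 0/1 column back to the original string labels
decoded = lb.inverse_transform(binarized)
print(decoded)
```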

['Female' 'Male' 'Female' 'Female' 'Female' 'Male' 'Male' 'Male' 'Female'
 'Female']

Summary

When solving classification problems with machine learning, we often encounter datasets containing categorical labels that not all machine learning algorithms can process. This is why we need to convert categorical features into numerical values. I hope you liked this article on how to handle categorical data in machine learning. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
Coder with the ♥️ of a Writer || Data Scientist | Solopreneur | Founder