The most common type of non-numeric features is categorical features. In this article, I’ll walk you through what categorical features are in Machine Learning, and how to convert categorical features to numerical values with Python.
What are Categorical Features?
A feature is categorical if its values can be placed in buckets and the order of the values is not important. In some cases, this type of feature is easy to identify, for example when it takes only a few string values, such as spam and ham.
In other cases, whether a feature is numeric (integer) or categorical is not so obvious. Sometimes either can be a valid representation, and the choice can affect the performance of the model.
An example is a feature representing the day of the week, which could validly be encoded either as numeric (number of days since Sunday) or as categorical (the names Monday, Tuesday, etc.).
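To make the day-of-the-week example concrete, here is a minimal sketch of both encodings in plain Python. The names `DAYS`, `day_as_numeric`, and `day_as_categorical` are illustrative assumptions, not part of any library.

```python
# Days listed in order starting from Sunday, so the list index is
# "number of days since Sunday".
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]


def day_as_numeric(day):
    """Numeric encoding: number of days since Sunday (0 to 6)."""
    return DAYS.index(day)


def day_as_categorical(day):
    """Categorical encoding: keep the name, rejecting unknown values."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day}")
    return day


observations = ["Monday", "Saturday", "Sunday"]
numeric = [day_as_numeric(d) for d in observations]
categorical = [day_as_categorical(d) for d in observations]
```

In the numeric form, Saturday (6) and Sunday (0) look far apart even though they are adjacent on the calendar. The categorical form makes no claim about distance or order, which is one reason the choice can matter for a model.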
How to Identify Categorical Features?
The image below describes how to identify categorical features in Machine Learning:
At the top is the Single Person dataset, which has a Marital Status categorical feature. At the bottom is a dataset with information about the passengers of the Titanic.
The features identified as categorical here are Survival (whether the passenger survived or not), Pclass (in which class the passenger was travelling), Gender (male or female), and Onboard (from which city the passenger boarded).
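A rough way to flag candidate categorical features programmatically is to look for columns that hold strings or take only a few distinct values. The sketch below assumes the data is a list of dictionaries; the function name `candidate_categorical`, the `max_unique` threshold, and the toy rows are all made up for illustration. The heuristic only suggests candidates, so the final decision still needs knowledge of what each column means.

```python
def candidate_categorical(rows, max_unique=3):
    """Return the column names that look categorical.

    Heuristic (an assumption, not a rule): a column is a candidate if it
    holds strings, or if it has at most `max_unique` distinct values.
    """
    candidates = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        is_text = any(isinstance(v, str) for v in values)
        few_values = len(set(values)) <= max_unique
        if is_text or few_values:
            candidates.append(column)
    return candidates


# Made-up toy rows, loosely shaped like the passenger table described above.
rows = [
    {"Survival": 0, "Pclass": 3, "Gender": "male", "Age": 30.0, "Fare": 10.0},
    {"Survival": 1, "Pclass": 1, "Gender": "female", "Age": 41.0, "Fare": 80.0},
    {"Survival": 1, "Pclass": 3, "Gender": "female", "Age": 25.0, "Fare": 12.5},
    {"Survival": 0, "Pclass": 2, "Gender": "male", "Age": 52.0, "Fare": 20.0},
]

found = candidate_categorical(rows)
```

Note that Survival and Pclass are stored as integers here, yet they are flagged because they take so few distinct values. This is the kind of column where the numeric-versus-categorical question comes up.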
Convert Categorical Features to Numeric Values
Some machine learning algorithms handle categorical features natively, but most require data in numerical form. You can encode categorical features as numbers (one number per category), but you cannot use this encoded data as a true categorical feature, because you have then introduced an (arbitrary) order of categories.
Remember that one of the properties of categorical features is that they are not ordered. Instead, you can convert each of the categories to a separate binary feature that has a value of 1 for the instances where the category appears and a value of 0 where it does not.
Therefore, each categorical feature is converted into a set of binary features, one for each category. Features constructed in this way are sometimes called dummy variables, and the technique is also known as one-hot encoding.
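The problem with one number per category is easy to see in a small sketch. The helper `integer_encode` and the marital status values below are assumptions chosen for illustration.

```python
def integer_encode(values):
    """Assign one integer per category, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return [codes[value] for value in values], codes


status = ["single", "married", "divorced", "married"]
encoded, codes = integer_encode(status)

# The codes depend only on which category happened to appear first,
# yet as numbers they claim that "divorced" > "married" > "single".
implied_order = sorted(codes, key=codes.get)
```

Reordering the input would produce a different but equally arbitrary ranking. A model that treats these codes as magnitudes would learn from an order that does not exist in the data.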
Python Code to Convert Categorical Features to Numerical Values
The pseudocode for converting categorical features to numeric values is as follows:
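A minimal sketch of this conversion in plain Python might look like the following. The function name `cat_to_num` and the choice to return a dictionary of 0/1 lists are assumptions made for this example, not a fixed API.

```python
def cat_to_num(values):
    """Convert one categorical column into binary (dummy) features.

    Returns a dict mapping each category to a list of 0/1 flags, one flag
    per input value: 1 where the value equals that category, 0 otherwise.
    """
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


labels = ["spam", "ham", "ham", "spam"]
dummies = cat_to_num(labels)

# Each original value switches on exactly one of the binary features.
flags_per_row = [sum(column[i] for column in dummies.values())
                 for i in range(len(labels))]
```

In practice you would usually reach for a library routine instead, such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder`, which perform the same kind of conversion on whole tables.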
The above Python code for converting categorical features to numerical values works for most machine learning algorithms. But a few algorithms, such as certain types of decision trees and related algorithms such as random forests, can use categorical features natively.
I hope you liked this article on how to convert categorical features to numerical values. Please feel free to ask your questions in the comments section below.