Feature Engineering in Machine Learning

In the real world, data rarely comes in perfect form. With this in mind, one of the more critical steps in using machine learning in practice is Feature Engineering, that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

What is Feature Engineering for Machine Learning?

Feature Engineering is the process of using domain knowledge about the data to create features that can be used to train a Machine Learning algorithm. When it is done well, feature engineering can improve the accuracy of the trained model's predictions.

In this article, I will cover a few common examples of feature engineering tasks: features for representing categorical data, features for representing text, and pipelines that chain several feature transformations together.

Categorical Features

One common type of non-numerical data is categorical data. For example, imagine you are exploring some data on housing prices, and along with numerical features like “price” and “rooms,” you also have “neighborhood” information.

Your data might look something like this:

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

You might be tempted to encode this data with a straightforward numerical mapping:

{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3}

It turns out that this is not generally a useful approach in Scikit-Learn: the package’s models make the fundamental assumption that numerical features reflect algebraic quantities.

Thus such a mapping would imply, for example, that Queen Anne < Fremont < Wallingford, or even that Wallingford – Queen Anne = Fremont, which (niche demographic jokes aside) does not make much sense.

In this case, one proven technique is to use one-hot encoding, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. When your data comes as a list of dictionaries, Scikit-Learn’s DictVectorizer will do this for you:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
array([[     0,      1,      0, 850000,      4],
       [     1,      0,      0, 700000,      3],
       [     0,      0,      1, 650000,      3],
       [     1,      0,      0, 600000,      2]], dtype=int64)

Notice that the ‘neighborhood’ column has been expanded into three separate columns, one for each of the three neighborhood labels, and that each row has a 1 in the column associated with its neighborhood.
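
To make the mechanics concrete, here is a minimal pure-Python sketch of what one-hot encoding does to the ‘neighborhood’ field. The helper name one_hot_rows is hypothetical, and this is not how DictVectorizer is implemented; it only mirrors the column layout shown above (categories sorted alphabetically, followed by the numeric fields).

```python
data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

def one_hot_rows(records, categorical_key):
    """Expand one categorical field into 0/1 indicator columns (illustrative helper)."""
    # One indicator column per distinct category, in sorted order
    categories = sorted({rec[categorical_key] for rec in records})
    # Remaining fields are assumed to be numeric and are copied through unchanged
    numeric_keys = sorted(k for k in records[0] if k != categorical_key)
    columns = [f"{categorical_key}={c}" for c in categories] + numeric_keys
    rows = []
    for rec in records:
        indicators = [1 if rec[categorical_key] == c else 0 for c in categories]
        rows.append(indicators + [rec[k] for k in numeric_keys])
    return columns, rows

columns, rows = one_hot_rows(data, 'neighborhood')
print(columns)
for row in rows:
    print(row)
```

Each row ends up with exactly one 1 among the indicator columns, so no ordering or arithmetic relationship between neighborhoods is implied.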

With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model. To see the meaning of each column, you can inspect the feature names:

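vec.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
array(['neighborhood=Fremont', 'neighborhood=Queen Anne',
       'neighborhood=Wallingford', 'price', 'rooms'], dtype=object)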

There is one clear disadvantage of this approach: if your category has many possible values, this can significantly increase the size of your dataset. However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)
<4x5 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. sklearn.preprocessing.OneHotEncoder and sklearn.feature_extraction.FeatureHasher are two additional tools that Scikit-Learn includes to support this type of encoding.
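
As a rough sketch of the first of these tools, OneHotEncoder can encode a single categorical column directly. The example reuses the neighborhood values from above; the variable names are illustrative. The encoder returns a sparse matrix by default, so it is converted with toarray() only for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n_samples, 1) because the encoder expects 2-D input
neighborhoods = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])

encoder = OneHotEncoder()                 # returns a sparse matrix by default
encoded = encoder.fit_transform(neighborhoods)

print(encoder.categories_)                # categories learned for each input column
print(encoded.toarray())                  # dense view, for inspection only
```

Unlike DictVectorizer, this works on array-like columns rather than a list of dictionaries, which is often more convenient when the data already lives in a NumPy array or DataFrame.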

Text Features

Another common need in feature engineering is to convert text to a set of representative numerical values. For example, most automatic mining of social media data relies on some form of encoding the text as numbers.

One of the simplest methods of encoding data is by word counts: you take each snippet of text, count each word’s occurrences, and put the results in a table.

For example, consider the following set of three phrases:

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

For a vectorization of this data based on word count, we could construct a column representing the word “problem,” the word “evil,” the word “horizon,” and so on.
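
A minimal sketch of that by-hand construction, using only the standard library, might look like the following. It splits on whitespace only, which is a simplifying assumption; Scikit-Learn's CountVectorizer also lowercases the text and, by default, ignores single-character tokens.

```python
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

# Vocabulary: every distinct word, sorted so the columns have a stable order
vocabulary = sorted({word for phrase in sample for word in phrase.split()})

# One row per phrase, one column per vocabulary word
counts = []
for phrase in sample:
    word_counts = Counter(phrase.split())
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```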

While doing this by hand would be possible, the monotony can be avoided by using Scikit-Learn’s CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X
<3x5 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a DataFrame with labeled columns:

import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
   evil  horizon  of  problem  queen
0     1        0   1        1      0
1     1        0   0        0      1
2     0        1   0        1      0

There are some issues with this approach, however: the raw word counts lead to features that put too much weight on words that appear very frequently, which can be sub-optimal in some classification algorithms.

One approach to fix this is term frequency-inverse document frequency (TF–IDF), which weights each word count by a measure of how often the word appears across the documents, so that words found in many documents count for less. The syntax for computing these features is similar to the previous example:

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
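
To show where such weights come from, here is a pure-Python sketch of the calculation. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is my understanding of TfidfVectorizer's default settings; treat it as an illustration rather than a replacement for the library.

```python
import math
from collections import Counter

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

docs = [phrase.split() for phrase in sample]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many phrases does each word occur?
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency (assumed to match the library defaults)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

tfidf = []
for doc in docs:
    tf = Counter(doc)                                   # raw term counts
    weights = [tf[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf.append([w / norm for w in weights])           # scale each row to unit length

print(vocabulary)
for row in tfidf:
    print([round(w, 3) for w in row])
```

In the second phrase, “queen” receives a larger weight than “evil” because “evil” also appears in another phrase and is therefore less distinctive.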

Feature Pipelines with Feature Engineering

With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps.

For example, we might want a processing pipeline that looks something like this:

  1. Impute missing values using the mean
  2. Transform features to quadratic
  3. Fit a linear regression

To streamline this type of processing pipeline, Scikit-Learn provides a Pipeline object, which can be used as follows:

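from sklearn.impute import SimpleImputer  # replaces the removed Imputer class
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())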

This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data.

model.fit(X, y)  # X: feature matrix with missing values; y: target values
print(y)
print(model.predict(X))
[14 16 -1  8 -5]
[ 14.  16.  -1.   8.  -5.]

All the steps of the model are applied automatically. Notice that I have applied the model to the same data it was trained on; this is why it was able to reproduce the targets exactly.
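
Since the X and y used above are not defined in this article, here is a self-contained sketch of the three steps. The values in X are illustrative assumptions chosen to include missing entries, while y matches the targets printed above. SimpleImputer is used because the older Imputer class has been removed from current Scikit-Learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): five samples, three features, two missing values
X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

model = make_pipeline(SimpleImputer(strategy='mean'),   # 1. fill NaNs with column means
                      PolynomialFeatures(degree=2),     # 2. add quadratic terms
                      LinearRegression())               # 3. fit a linear regression

model.fit(X, y)
predictions = model.predict(X)
print(y)
print(predictions.round(6))
```

With only five samples and ten polynomial features, the regression has enough freedom to reproduce the training targets, so this check says nothing about how the model would perform on new data.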

I hope you liked this article on Feature Engineering in Machine Learning. Feel free to ask questions on feature engineering or any other topic in the comments section below.

Aman Kharwal