Feature Engineering using Python

Feature Engineering is a critical step in solving a Data Science problem that involves creating or selecting the most relevant and informative features (input variables) from raw data. These features are then used to train machine learning models, so the quality of the features directly impacts a model’s performance. So, if you want to learn how to perform feature engineering, this article is for you. In this article, I’ll take you through a practical guide to Feature Engineering using Python.

Feature Engineering using Python

In a typical process of solving a data science problem, feature engineering is performed after data exploration and preprocessing, because to create or select the most important features, you must first understand your data. Let’s go through the steps you should follow when performing feature engineering using Python.

I will use a dataset based on Dynamic Pricing to show feature engineering concepts practically using Python. You can download the dataset from here.

Now, let’s import the necessary Python libraries and the dataset:

import pandas as pd
data = pd.read_csv("dynamic_pricing.csv")

print(data.head())
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban   
1                58                 39          Suburban   
2                42                 31             Rural   
3                89                 28             Rural   
4                78                 22             Rural   

  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                  Silver                    13             4.47   
1                  Silver                    72             4.06   
2                  Silver                     0             3.99   
3                 Regular                    67             4.31   
4                 Regular                    74             3.77   

  Time_of_Booking Vehicle_Type  Expected_Ride_Duration  \
0           Night      Premium                      90   
1         Evening      Economy                      43   
2       Afternoon      Premium                      76   
3       Afternoon      Premium                     134   
4       Afternoon      Economy                     149   

   Historical_Cost_of_Ride  
0               284.257273  
1               173.874753  
2               329.795469  
3               470.201232  
4               579.681422  

Step 1: Feature Selection

Once you have explored your data, identify the most important features to solve your problem. Use techniques like correlation analysis, feature importance from tree-based models, or domain knowledge to select a subset of features.

Below is an example of using correlation analysis for feature selection:

# Calculate the correlation matrix (numeric columns only)
correlation_matrix = data.corr(numeric_only=True)

# Set a correlation threshold (example: 0.7)
threshold = 0.7

# Identify highly correlated feature pairs
correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

# Drop one feature from each highly correlated pair
reduced_data = data.drop(list(correlated_features), axis=1)

print(reduced_data.head())
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban   
1                58                 39          Suburban   
2                42                 31             Rural   
3                89                 28             Rural   
4                78                 22             Rural   

  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                  Silver                    13             4.47   
1                  Silver                    72             4.06   
2                  Silver                     0             3.99   
3                 Regular                    67             4.31   
4                 Regular                    74             3.77   

  Time_of_Booking Vehicle_Type  Expected_Ride_Duration  
0           Night      Premium                      90  
1         Evening      Economy                      43  
2       Afternoon      Premium                      76  
3       Afternoon      Premium                     134  
4       Afternoon      Economy                     149  

Here, we perform feature selection to identify and remove highly correlated features from the dataset. We start by calculating the correlation matrix, which measures the linear relationship between each pair of numerical features. We then set a correlation threshold of 0.7: any pair whose absolute correlation coefficient exceeds this value is considered redundant. Since the matrix is symmetric, the loop only needs to scan one triangle of it (the cells where j < i) to find such pairs, and one feature from each pair is stored in the correlated_features set. Finally, we drop those features from the dataset to reduce multicollinearity, which can improve the performance and interpretability of machine learning models.
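Feature importance from a tree-based model, mentioned above, is another option. Below is a minimal sketch using scikit-learn's RandomForestRegressor, where we treat Historical_Cost_of_Ride as the target and rank the remaining numeric columns (the model choice and target here are illustrative, not part of the original workflow):

from sklearn.ensemble import RandomForestRegressor

# Treat Historical_Cost_of_Ride as the target; use the numeric columns as inputs
numeric_data = data.select_dtypes(include='number')
X = numeric_data.drop('Historical_Cost_of_Ride', axis=1)
y = numeric_data['Historical_Cost_of_Ride']

# Fit a random forest and rank the features by importance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))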

We can also select features based on domain knowledge, as shown below:

# Choose relevant columns based on domain knowledge
selected_features = ['Number_of_Riders', 'Number_of_Drivers', 
                     'Location_Category', 'Number_of_Past_Rides', 
                     'Average_Ratings', 'Vehicle_Type', 
                     'Expected_Ride_Duration', 'Historical_Cost_of_Ride']

# Create a new DataFrame with selected features
domain_based_features = data[selected_features]

print(domain_based_features.head())
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban   
1                58                 39          Suburban   
2                42                 31             Rural   
3                89                 28             Rural   
4                78                 22             Rural   

   Number_of_Past_Rides  Average_Ratings Vehicle_Type  Expected_Ride_Duration  \
0                    13             4.47      Premium                      90   
1                    72             4.06      Economy                      43   
2                     0             3.99      Premium                      76   
3                    67             4.31      Premium                     134   
4                    74             3.77      Economy                     149   

   Historical_Cost_of_Ride  
0               284.257273  
1               173.874753  
2               329.795469  
3               470.201232  
4               579.681422  

Step 2: Feature Creation

Create new features based on existing ones. This may involve mathematical transformations (e.g., logarithmic, square root), combining multiple features, or generating interaction terms.

Here are examples of feature creation based on the data we are working with:

import numpy as np

data = domain_based_features.copy()

# Create a feature for the ratio of riders to drivers
data['Riders_to_Drivers_Ratio'] = data['Number_of_Riders'] / data['Number_of_Drivers']

# Create a feature for the cost per past ride; replace 0 past rides with
# NaN first so the division doesn't produce infinities
past_rides = data['Number_of_Past_Rides'].replace(0, np.nan)
data['Cost_Per_Past_Ride'] = data['Historical_Cost_of_Ride'] / past_rides

Here, we are creating two new features in the dataset. The first feature, Riders_to_Drivers_Ratio, calculates the ratio of the number of riders to the number of drivers, which can provide insights into the balance between supply and demand in the context of dynamic pricing.

The second feature, Cost_Per_Past_Ride, computes the cost per past ride by dividing the historical cost of rides by the number of past rides (customers with zero past rides get a missing value rather than an infinite one). This feature helps in understanding the average cost incurred per ride, which is valuable information for pricing strategies and analysis.
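A mathematical transformation, mentioned at the start of this step, is another way to create a feature. Ride costs are often right-skewed, so a log transform can give a model a better-behaved input. Here is a quick sketch using NumPy (log1p is used because it handles values of zero safely):

# Log-transform the historical cost; log1p(x) = log(1 + x) is safe at zero
data['Log_Historical_Cost'] = np.log1p(data['Historical_Cost_of_Ride'])

print(data[['Historical_Cost_of_Ride', 'Log_Historical_Cost']].head())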

Step 3: Handle Text and Categorical Data

If your data contains text or categorical columns, convert them into numerical representations: use techniques like TF-IDF or word embeddings for text, and one-hot encoding or ordinal encoding for categorical variables.
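Our dataset has no text column, but to illustrate the idea, here is a minimal TF-IDF sketch using scikit-learn on a couple of made-up review strings (the reviews list is purely hypothetical):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical free-text data, just for illustration
reviews = ["quick ride and friendly driver", "long wait and an expensive ride"]

# Convert the text into a TF-IDF weighted term matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray())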

We have categorical data in the Location_Category and Vehicle_Type columns. Here’s how to convert them into a numerical format that machine learning models can understand, using one-hot encoding:

# Perform one-hot encoding for categorical columns
# (dtype=int gives 0/1 columns; newer pandas versions default to booleans)
data = pd.get_dummies(data, columns=['Location_Category', 'Vehicle_Type'], dtype=int)

This step converts categorical variables into binary (0 or 1) columns, making them suitable for Machine Learning models.
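One-hot encoding treats categories as unordered. For a column with a natural order, ordinal encoding fits better. Customer_Loyalty_Status (which we dropped during the domain-based selection) is a candidate; here is a sketch on the original dataset, assuming the tiers rank as Regular < Silver < Gold (the Gold tier and this ordering are assumptions about the data):

# Map loyalty tiers to ordered integers (assumed order: Regular < Silver < Gold)
loyalty_order = {'Regular': 0, 'Silver': 1, 'Gold': 2}

original_data = pd.read_csv("dynamic_pricing.csv")
original_data['Loyalty_Level'] = original_data['Customer_Loyalty_Status'].map(loyalty_order)

print(original_data[['Customer_Loyalty_Status', 'Loyalty_Level']].head())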

So, these were the feature engineering steps that could be performed on the dataset we are working with. Other feature engineering steps include:

  • Handling Time Series Data: If your data involves time series, extract meaningful time-based features such as day of the week, month, or seasonality (see the sketch after this list).
  • Dimensionality Reduction (if necessary): If you have a high-dimensional dataset, consider dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the number of features while preserving important information (also sketched below).
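Here is a brief sketch of both ideas. The Booking_Timestamp column in the commented part is hypothetical, since our dataset has no timestamp, while the PCA part runs on the numeric columns we already have:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Time-based features (sketch only; assumes a hypothetical timestamp column):
# timestamps = pd.to_datetime(data['Booking_Timestamp'])
# data['Day_of_Week'] = timestamps.dt.dayofweek
# data['Month'] = timestamps.dt.month

# PCA is scale-sensitive, so standardize the numeric columns first
numeric_data = data.select_dtypes(include='number').dropna()
scaled = StandardScaler().fit_transform(numeric_data)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)
print(components.shape)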

Summary

So, this is how you can perform feature engineering using Python step by step. Feature engineering is a critical step in solving a Data Science problem that involves creating or selecting the most relevant and informative features (input variables) from raw data. I hope you liked this article on a practical guide to Feature Engineering using Python. Feel free to ask valuable questions in the comments section below.
