Feature Engineering is a critical step in solving a Data Science problem. It involves creating or selecting the most relevant and informative features (input variables) from raw data. These features are then used to train machine learning models, so their quality directly impacts a model’s performance. If you want to learn how to perform feature engineering, this article is for you. In this article, I’ll take you through a practical guide to Feature Engineering using Python.
Feature Engineering using Python
In a typical process of solving a data science problem, feature engineering is performed after data exploration and preprocessing, because to create or select the most important features, you must first understand them. Let’s go through the steps you should follow when performing feature engineering using Python.
I will use a dataset based on Dynamic Pricing to show feature engineering concepts practically using Python. You can download the dataset from here.
Now, let’s import the necessary Python libraries and the dataset:
```python
import pandas as pd

data = pd.read_csv("dynamic_pricing.csv")
print(data.head())
```
```
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban
1                58                 39          Suburban
2                42                 31             Rural
3                89                 28             Rural
4                78                 22             Rural

  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                  Silver                    13             4.47
1                  Silver                    72             4.06
2                  Silver                     0             3.99
3                 Regular                    67             4.31
4                 Regular                    74             3.77

  Time_of_Booking Vehicle_Type  Expected_Ride_Duration  \
0           Night      Premium                      90
1         Evening      Economy                      43
2       Afternoon      Premium                      76
3       Afternoon      Premium                     134
4       Afternoon      Economy                     149

   Historical_Cost_of_Ride
0               284.257273
1               173.874753
2               329.795469
3               470.201232
4               579.681422
```
Step 1: Feature Selection
Once you have explored your data, identify the most important features to solve your problem. Use techniques like correlation analysis, feature importance from tree-based models, or domain knowledge to select a subset of features.
Below is an example of using correlation analysis for feature selection:
```python
# Calculate the correlation matrix for the numerical columns
# (numeric_only=True is required in recent pandas versions,
# since the dataset also contains categorical columns)
correlation_matrix = data.corr(numeric_only=True)

# Set a correlation threshold (example: 0.7)
threshold = 0.7

# Identify highly correlated feature pairs
correlated_features = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

# Drop one feature from each highly correlated pair
correlated_data = data.drop(correlated_features, axis=1)
print(correlated_data.head())
```
```
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban
1                58                 39          Suburban
2                42                 31             Rural
3                89                 28             Rural
4                78                 22             Rural

  Customer_Loyalty_Status  Number_of_Past_Rides  Average_Ratings  \
0                  Silver                    13             4.47
1                  Silver                    72             4.06
2                  Silver                     0             3.99
3                 Regular                    67             4.31
4                 Regular                    74             3.77

  Time_of_Booking Vehicle_Type  Expected_Ride_Duration
0           Night      Premium                      90
1         Evening      Economy                      43
2       Afternoon      Premium                      76
3       Afternoon      Premium                     134
4       Afternoon      Economy                     149
```
Here, we are performing feature selection to identify and remove highly correlated features from the dataset. We start by calculating the correlation matrix, which measures the linear relationship between each pair of numerical features. We then set a correlation threshold of 0.7, the maximum correlation coefficient we allow between two features. Since the correlation matrix is symmetric, we iterate through one triangle of it so that each pair is checked only once, and we collect features whose absolute correlation with an earlier feature exceeds the threshold in the correlated_features set. Finally, we drop these features from the dataset to reduce multicollinearity, which can improve the performance and interpretability of machine learning models. As the output shows, Historical_Cost_of_Ride is the feature removed from our dataset.
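Another technique mentioned above is feature importance from tree-based models. Below is a minimal sketch using scikit-learn’s RandomForestRegressor; treating Historical_Cost_of_Ride as the prediction target is my assumption here, since dynamic pricing ultimately aims to model ride cost:

```python
from sklearn.ensemble import RandomForestRegressor

# Use the numeric columns as inputs and the ride cost as the target
# (the choice of target is an assumption for illustration)
X = data.select_dtypes(include='number').drop(columns=['Historical_Cost_of_Ride'])
y = data['Historical_Cost_of_Ride']

# Fit a random forest and rank the features by importance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Features with very low importance scores are candidates for removal, though importances from a single model should be read as a rough guide rather than ground truth.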
We can also select features based on domain knowledge, as shown below:
```python
# Choose relevant columns based on domain knowledge
selected_features = ['Number_of_Riders', 'Number_of_Drivers', 'Location_Category',
                     'Number_of_Past_Rides', 'Average_Ratings', 'Vehicle_Type',
                     'Expected_Ride_Duration', 'Historical_Cost_of_Ride']

# Create a new DataFrame with the selected features
domain_based_features = data[selected_features]
print(domain_based_features.head())
```
```
   Number_of_Riders  Number_of_Drivers Location_Category  \
0                90                 45             Urban
1                58                 39          Suburban
2                42                 31             Rural
3                89                 28             Rural
4                78                 22             Rural

   Number_of_Past_Rides  Average_Ratings Vehicle_Type  Expected_Ride_Duration  \
0                    13             4.47      Premium                      90
1                    72             4.06      Economy                      43
2                     0             3.99      Premium                      76
3                    67             4.31      Premium                     134
4                    74             3.77      Economy                     149

   Historical_Cost_of_Ride
0               284.257273
1               173.874753
2               329.795469
3               470.201232
4               579.681422
```
Step 2: Feature Creation
Create new features based on existing ones. This may involve mathematical transformations (e.g., logarithmic, square root), combining multiple features, or generating interaction terms.
Here are examples of feature creation based on the data we are working with:
```python
import numpy as np

data = domain_based_features.copy()

# Create a feature for the ratio of riders to drivers
data['Riders_to_Drivers_Ratio'] = data['Number_of_Riders'] / data['Number_of_Drivers']

# Create a feature for the cost per past ride; customers with zero past
# rides would yield infinite values, so we replace those with NaN
data['Cost_Per_Past_Ride'] = (data['Historical_Cost_of_Ride'] /
                              data['Number_of_Past_Rides']).replace(np.inf, np.nan)
```
Here, we are creating two new features in the dataset. The first feature, Riders_to_Drivers_Ratio, calculates the ratio of the number of riders to the number of drivers, which can provide insights into the balance between supply and demand in the context of dynamic pricing.
The second feature, Cost_Per_Past_Ride, computes the cost per past ride by dividing the historical cost of rides by the number of past rides. Since some customers have zero past rides, this division would produce infinite values, so we replace them with NaN. This new feature helps in understanding the average cost incurred for each ride, which can be valuable information for pricing strategies and analysis.
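Step 2 also mentions mathematical transformations. As a minimal sketch, here is how you could apply logarithmic and square-root transformations to numeric features; whether these particular transformations actually help is an assumption you should validate against the distribution of each feature:

```python
import numpy as np

# Log-transform the ride cost; log1p(x) = log(1 + x) handles zeros gracefully
data['Log_Historical_Cost'] = np.log1p(data['Historical_Cost_of_Ride'])

# Square-root transform the expected duration as a milder alternative
data['Sqrt_Expected_Duration'] = np.sqrt(data['Expected_Ride_Duration'])
```

Transformations like these are most useful when a feature is strongly skewed, since they compress large values and can make relationships easier for a model to learn.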
Step 3: Handle Text and Categorical Data
If your data contains text or categorical variables, convert them into numerical representations: use techniques like TF-IDF or word embeddings for text, and one-hot encoding or ordinal encoding for categorical variables.
We have categorical data in the Location_Category and Vehicle_Type columns. Here’s how we can convert them into a numerical format that machine learning models can understand, using one-hot encoding:
```python
# Perform one-hot encoding for the categorical columns
data = pd.get_dummies(data, columns=['Location_Category', 'Vehicle_Type'])
```
This step converts categorical variables into binary (0 or 1) columns, making them suitable for Machine Learning models.
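One-hot encoding treats categories as unordered. For a column with a natural order, ordinal encoding is often the better fit. The sketch below encodes Customer_Loyalty_Status from the raw file (we dropped it during domain-based selection above); the tier order Regular < Silver < Gold is an assumption about this dataset:

```python
# Re-read the raw data, since Customer_Loyalty_Status was dropped earlier
raw = pd.read_csv("dynamic_pricing.csv")

# Map each loyalty tier to its assumed rank (Regular < Silver < Gold);
# tiers missing from the mapping would become NaN
loyalty_order = {'Regular': 0, 'Silver': 1, 'Gold': 2}
raw['Loyalty_Encoded'] = raw['Customer_Loyalty_Status'].map(loyalty_order)
print(raw[['Customer_Loyalty_Status', 'Loyalty_Encoded']].head())
```

If the data contained free text (ride reviews, for example), text techniques such as TF-IDF via scikit-learn’s TfidfVectorizer would apply instead.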
So, these were the feature engineering steps that could be performed on the dataset we are working with. Other feature engineering steps include:
- Handling Time Series Data: If your data involves time series, extract meaningful time-based features such as day of the week, month, or seasonality (see the first sketch after this list).
- Dimensionality Reduction (if necessary): If you have a high-dimensional dataset, consider dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the number of features while preserving important information (see the second sketch after this list).
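Our dataset has no timestamp column, so the first sketch below uses a hypothetical ride_time column purely to illustrate time-based feature extraction:

```python
# Hypothetical example: 'ride_time' is an assumed column for illustration
rides = pd.DataFrame({'ride_time': pd.to_datetime(['2023-07-01 08:30',
                                                   '2023-07-02 19:45'])})
rides['day_of_week'] = rides['ride_time'].dt.dayofweek  # Monday = 0
rides['month'] = rides['ride_time'].dt.month
rides['hour'] = rides['ride_time'].dt.hour
```

And here is a minimal PCA sketch on the numeric features of our dataset; keeping 95% of the variance is an arbitrary choice for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the numeric features first, since PCA is scale-sensitive;
# drop rows with NaN (e.g., from the cost-per-ride feature created earlier)
numeric = data.select_dtypes(include='number').dropna()
scaled = StandardScaler().fit_transform(numeric)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)
print(components.shape)
```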
Summary
So, this is how you can perform feature engineering using Python step by step. Feature engineering is a critical step in solving a Data Science problem: it involves creating or selecting the most relevant and informative features (input variables) from raw data. I hope you liked this article on a practical guide to Feature Engineering using Python. Feel free to ask valuable questions in the comments section below.