By Aman Kharwal

Pipelines in Machine Learning

A Machine Learning Pipeline runs a complete workflow as an ordered sequence of the steps involved in a Machine Learning task. In most Machine Learning tasks, the data you work with is rarely in a format that lets the model train at its best performance. Training a model usually involves several preprocessing steps, such as encoding categorical variables, feature scaling, and normalization. The preprocessing package of Scikit-Learn provides all of these as transformations that are easy to apply.

But in a typical workflow of a Machine Learning task, you need to apply all of these transformations at least twice: first when you train the model, and again when you use the trained model on new data. You could write a function that applies all the transformations and reuse it on the new data, but you would still need to run that function first and then call the model separately. Machine Learning Pipelines simplify this process by bundling the transformations and the model into a single object. The most essential benefits that Machine Learning Pipelines provide are:

  • Pipelines make the workflow of your task much easier to read and understand.
  • Pipelines enforce a robust implementation, because every step runs in the same order each time the pipeline is used.
  • Pipelines make your work more reproducible.
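
To make the "apply everything twice" problem concrete, here is a minimal sketch on made-up data. The toy arrays and the choice of LogisticRegression are assumptions for illustration only and are not part of the loan dataset used later in this article. It compares the manual approach with a pipeline and checks that both give the same predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature, two well-separated classes
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[1.5], [11.5]])

# Manual approach: the scaler has to be applied by hand, once for
# training and once again for every new batch of data
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_pred = model.predict(scaler.transform(X_new))

# Pipeline approach: scaling and the model live in one object, so
# fit() and predict() run the scaling step automatically
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('model', LogisticRegression())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_new)

print(manual_pred, pipe_pred)
```

With the pipeline there is no way to forget the scaling step or to apply it in the wrong order when new data arrives.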

In this article, I will take you through the implementation of Machine Learning Pipelines in a Machine Learning project. First, I will prepare the dataset according to our needs; then I will move on to implementing the pipelines.

Data Preparation (Transformation)

First I will load and prepare the data using the pandas package in Python. The data that I have used in this article can be easily downloaded from here.

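import pandas as pd
# Load the training and test files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Drop the Loan_ID column, which is not used as a feature
train = train.drop('Loan_ID', axis=1)
# Check which columns are numeric and which are categorical
train.dtypes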
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

Before building a Machine Learning Pipeline, I will split the training data into train and test sets to validate the performance of our model.

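from sklearn.model_selection import train_test_split
# Separate the features from the Loan_Status target column
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']
# Hold out 20% of the rows to validate the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)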
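
The split above is random, so the rows that end up in each part change from run to run. As a hedged aside, the sketch below shows two optional arguments of train_test_split, random_state and stratify, on made-up labels. The toy arrays are assumptions for illustration, and these arguments were not used to produce the results in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels (hypothetical): 80 rows of class 'Y', 20 of class 'N'
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(['Y'] * 80 + ['N'] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    stratify=y_toy,     # keep the Y/N ratio the same in both parts
    random_state=42)    # fixed seed so the split is repeatable

# Count how many minority-class rows landed in the held-out part
n_test_N = int((y_te == 'N').sum())
print(len(y_te), n_test_N)
```

Stratifying can be useful when one class is much rarer than the other, because a purely random split may leave too few examples of the rare class in the held-out part.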

Building Machine Learning Pipelines

The first step in building a pipeline is to define a transformer for each type of variable. In simple words, this means creating one transformer for the numeric columns and another for the categorical columns.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
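
To see what these two transformers actually do, here is a small sketch that runs them on a made-up dataframe. The column names and values in toy are hypothetical; the transformers are defined the same way as above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy dataframe (hypothetical) with one missing value in each column
toy = pd.DataFrame({
    'income': [1000.0, np.nan, 3000.0, 2000.0],
    'area': pd.Series(['Urban', 'Rural', np.nan, 'Urban'], dtype=object),
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Numeric: the missing income is replaced by the median (2000), then scaled
num_out = numeric_transformer.fit_transform(toy[['income']])

# Categorical: the missing area becomes its own 'missing' category,
# then every category gets one indicator column
cat_out = categorical_transformer.fit_transform(toy[['area']]).toarray()
categories = list(categorical_transformer.named_steps['onehot'].categories_[0])

print(num_out.ravel())
print(categories)
print(cat_out)
```

Because the imputed income equals the column median, it sits exactly at zero after scaling, and the categorical output has three columns: one per observed category plus one for the filled-in 'missing' value.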

Now I will use a ColumnTransformer to apply each transformer to its respective columns in the dataframe.

from sklearn.compose import ColumnTransformer
# Select the feature columns by dtype; X no longer contains the Loan_Status target
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
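
As an alternative to listing the columns by hand, Scikit-Learn also provides make_column_selector, which picks columns by dtype at fit time. The sketch below is a hedged variation on the code above, not the approach used in the rest of this article, and the toy dataframe is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Toy dataframe (hypothetical): two numeric columns and one categorical column
toy = pd.DataFrame({
    'income': [1000.0, 2500.0, np.nan, 4000.0],
    'term': [360.0, 180.0, 360.0, np.nan],
    'area': pd.Series(['Urban', 'Rural', 'Urban', np.nan], dtype=object),
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# The selectors choose the columns by dtype when fit() is called
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, make_column_selector(dtype_include=np.number)),
    ('cat', categorical_transformer, make_column_selector(dtype_include=object)),
])

out = preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 3 one-hot columns (Rural, Urban, missing)
print(out.shape)
```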

Fitting Classifiers in Machine Learning Pipelines

The next step is to build a pipeline that combines the preprocessor created above with a classifier. For this task I will choose a Random Forest classifier.

from sklearn.ensemble import RandomForestClassifier
rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

Now you can call the fit() method directly on the raw data, and all the preprocessing steps will be applied automatically:

rf.fit(X_train, y_train)
Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=nan,
                                                                                 strategy='median',
                                                                                 verbose=0)),
                                                                  ('scaler',
                                                                   StandardScaler(copy=True,
                                                                                  with_mean...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False)

Predicting on new data is just as straightforward. You only need to call the predict() method, and all the preprocessing steps will be applied to the new data before the classifier sees it:

y_pred = rf.predict(X_test)
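
Once you have predictions, you can compare them with the true labels. The sketch below uses synthetic data from make_classification as a stand-in for the loan data (an assumption for illustration) and shows that accuracy_score on the predictions gives the same number as the pipeline's own score() method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the loan dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=0))])
pipe.fit(X_train, y_train)

# predict() scales X_test with the statistics learned from X_train
y_pred = pipe.predict(X_test)

# For a classifier pipeline, score() is the mean accuracy
acc = accuracy_score(y_test, y_pred)
print(acc, pipe.score(X_test, y_test))
```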

Model Selection with Machine Learning Pipelines

Pipelines can also be used for model selection. Below I will loop through a number of classification models provided by Scikit-Learn, building a pipeline for each one that applies the same transformations and then trains the model.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
# Candidate models to compare
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]
for classifier in classifiers:
    # Reuse the same preprocessing steps for every candidate model
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print(classifier)
    # score() reports the mean accuracy on the held-out data
    print("model score: %.3f" % pipe.score(X_test, y_test))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
model score: 0.780
SVC(C=0.025, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
model score: 0.659
NuSVC(break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
      max_iter=-1, nu=0.5, probability=True, random_state=None, shrinking=True,
      tol=0.001, verbose=False)
model score: 0.797
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
model score: 0.724
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
model score: 0.780
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)
model score: 0.805
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
model score: 0.789
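
The scores above come from a single train/test split, so they can shift from run to run. As a hedged sketch of one way to get a steadier comparison, the same kind of loop can be combined with cross_val_score, which refits the whole pipeline on each fold. The synthetic data and the two candidate models below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the loan dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    'knn': KNeighborsClassifier(3),
    'tree': DecisionTreeClassifier(random_state=0),
}

mean_scores = {}
for name, clf in candidates.items():
    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('classifier', clf)])
    # The scaler is refit on the training part of each fold, so no
    # information from the validation part leaks into preprocessing
    scores = cross_val_score(pipe, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print("%s mean accuracy: %.3f" % (name, mean_scores[name]))
```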

Pipelines can also be used to find the best-performing hyperparameters with the grid search algorithm. If you don’t know how grid search works, you can learn it from here. Now I will combine the pipeline with grid search:

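from sklearn.model_selection import GridSearchCV
# Keys use the '<step name>__<parameter>' form to reach the classifier inside the pipeline
param_grid = {
    'classifier__n_estimators': [200, 500],
    # Note: 'auto' was deprecated in scikit-learn 1.1 and removed in later releases;
    # for classifiers it behaved like 'sqrt'
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
    'classifier__criterion': ['gini', 'entropy']}
CV = GridSearchCV(rf, param_grid, n_jobs=1)
CV.fit(X_train, y_train)
print(CV.best_params_)
print(CV.best_score_)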

{'classifier__criterion': 'gini', 'classifier__max_depth': 4, 'classifier__max_features': 'auto', 'classifier__n_estimators': 200}
0.8124922696351268
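
After the search finishes, the fitted search object keeps the best pipeline in its best_estimator_ attribute, already refit on all the training data. The sketch below shows how it could be used on held-out data. It runs on synthetic data with a deliberately tiny grid so that it finishes quickly; both are assumptions for illustration and not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the loan dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=0))])

# Small illustrative grid; the '<step>__<parameter>' keys target the classifier step
param_grid = {
    'classifier__n_estimators': [10, 20],
    'classifier__max_depth': [2, 3],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is a complete pipeline: preprocessing plus tuned classifier
best_pipe = search.best_estimator_
test_score = best_pipe.score(X_test, y_test)
print(search.best_params_, test_score)
```

Note that best_score_ is the mean cross-validated score on the training data, so checking the best pipeline on data it has never seen gives a more honest estimate.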

I work on a lot of Machine Learning projects. Early in my career I used to ignore pipelines in my tasks, but since I started using them in my models, I find it much easier to work whenever I come across the same kind of dataset. I hope you liked this article on Machine Learning Pipelines. Feel free to ask your valuable questions in the comments section below.
