A Machine Learning Pipeline chains the steps involved in a Machine Learning task into a single, ordered workflow. In most Machine Learning tasks, the data that you work with is rarely in a format that lets the model train at its best performance.
Training a machine learning model therefore involves several preprocessing steps, like encoding categorical variables, feature scaling, and normalization. The preprocessing package of Scikit-Learn provides all these functions, and they can be easily used as transformations.
But in a typical workflow of a Machine Learning task, you need to apply all of these transformations at least two times: first when you train the model, and again when you use the trained model on new data.
You could write a function that applies all the transformations and reuse it by calling the function, but you would still need to run this function first and call the model separately. To tackle this, we have Machine Learning Pipelines, which bundle the transformations and the model into a single object. The most essential benefits that Machine Learning Pipelines provide are:
- Pipelines make the workflow of your task much easier to read and understand.
- Pipelines enforce a robust implementation of the steps involved in your task, because the same transformations are applied in the same order every time.
- Pipelines make your work more reproducible.
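
To make the "apply everything twice" problem concrete, here is a minimal sketch on made-up data. The toy arrays, the LogisticRegression model, and all variable names are assumptions chosen for illustration; they are not part of the loan dataset used later in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two numeric features and a binary target.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(size=(40, 2))
y_train_demo = (X_train_demo[:, 0] + X_train_demo[:, 1] > 0).astype(int)
X_new_demo = rng.normal(size=(5, 2))

# Manual approach: the scaler has to be applied by hand,
# once at training time and once again at prediction time.
scaler = StandardScaler().fit(X_train_demo)
manual_model = LogisticRegression().fit(scaler.transform(X_train_demo), y_train_demo)
manual_pred = manual_model.predict(scaler.transform(X_new_demo))

# Pipeline approach: one object runs both steps inside fit() and predict().
pipe_demo = Pipeline(steps=[('scaler', StandardScaler()),
                            ('model', LogisticRegression())])
pipe_demo.fit(X_train_demo, y_train_demo)
pipe_pred = pipe_demo.predict(X_new_demo)

# Both approaches perform the same computation, so the predictions agree.
same_predictions = bool((manual_pred == pipe_pred).all())
print(same_predictions)
```

The pipeline version leaves no room to forget the scaling step when predicting on new data.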

In this article, I will take you through the implementation of Machine Learning Pipelines in a Machine Learning project. First, I will prepare the dataset according to our needs; then, I will move on to building the pipelines.
Data Preparation (Transformation)
First I will transform the data by using the pandas package in Python. The data that I have used in this article can be easily downloaded from here.
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train = train.drop('Loan_ID', axis=1)
train.dtypes
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
Before building a Machine Learning Pipeline, I will split the training data into train and test sets to validate the performance of our model.
X = train.drop('Loan_Status', axis=1)
y = train['Loan_Status']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
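
The split above is random, so the scores will vary from run to run. One optional variation, sketched here on a made-up frame, is to fix random_state and to stratify on the target so that both parts keep the same class balance. The toy data and the value 42 are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the loan data: 14 'Y' and 6 'N'.
toy = pd.DataFrame({
    'ApplicantIncome': range(100, 120),
    'Loan_Status': ['Y'] * 14 + ['N'] * 6,
})
X_toy = toy.drop('Loan_Status', axis=1)
y_toy = toy['Loan_Status']

# random_state makes the split repeatable;
# stratify keeps the Y/N proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```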
Building Machine Learning Pipelines
The first step in building a pipeline is to define a transformer for each type of variable. In simple words, the numeric columns and the categorical columns each get their own sequence of preprocessing steps.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
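# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
# Categorical columns: fill missing values with the label 'missing', then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])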
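
To see what these two transformers actually do, here is a small sketch on made-up columns. The toy values and the names ending in _demo are assumptions for illustration; only the transformer definitions mirror the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_demo = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Hypothetical toy columns, each with one missing value.
incomes = pd.DataFrame({'ApplicantIncome': [1000.0, 3000.0, np.nan, 5000.0]})
areas = pd.DataFrame({'Property_Area': ['Urban', 'Rural', np.nan, 'Urban']},
                     dtype=object)

# The missing income becomes the median (3000), which is also the mean here,
# so it ends up as 0 after standardization.
scaled = numeric_demo.fit_transform(incomes)

# The missing area becomes its own category called 'missing'.
encoded = categorical_demo.fit_transform(areas)
categories = list(categorical_demo.named_steps['onehot'].categories_[0])

# A category never seen during fit is encoded as a row of zeros
# instead of raising an error, because of handle_unknown='ignore'.
unseen = categorical_demo.transform(
    pd.DataFrame({'Property_Area': ['Semiurban']}, dtype=object))

print(scaled.ravel())
print(categories)
print(unseen.toarray())
```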
Now I will use a ColumnTransformer to apply each transformer to its respective columns in the dataframe.
numeric_features = train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = train.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
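
As a rough illustration of how a ColumnTransformer routes columns, the sketch below runs one on a made-up frame. The toy data and names are assumptions; the column lists are written out by hand here rather than selected by dtype, which is just one possible way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with two numeric columns and one categorical column.
toy = pd.DataFrame({
    'ApplicantIncome': [1000.0, 3000.0, np.nan, 5000.0],
    'LoanAmount': [100.0, np.nan, 120.0, 140.0],
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban'],
})
toy_numeric = ['ApplicantIncome', 'LoanAmount']
toy_categorical = ['Property_Area']

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())]), toy_numeric),
        ('cat', OneHotEncoder(handle_unknown='ignore'), toy_categorical)])

# Output columns: 2 scaled numeric columns + 3 one-hot columns
# (Rural, Semiurban, Urban), stacked side by side.
toy_result = toy_preprocessor.fit_transform(toy)
print(toy_result.shape)
```

Columns that are not listed in any transformer are dropped by default (remainder='drop').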
Fitting Classifiers in Machine Learning Pipelines
The next step is to build a pipeline that can easily combine the transformations created above with a Classifier. In this task I will choose a Random Forest Classifier.
from sklearn.ensemble import RandomForestClassifier
rf = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', RandomForestClassifier())])
Now you can call the fit() method directly on the raw data, and all the preprocessing steps will be applied before the classifier is trained:
rf.fit(X_train, y_train)
Pipeline(memory=None, steps=[('preprocessor', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3, transformer_weights=None, transformers=[('num', Pipeline(memory=None, steps=[('imputer', SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan, strategy='median', verbose=0)), ('scaler', StandardScaler(copy=True, with_mean... RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False))], verbose=False)
Predicting on new data is just as straightforward. You just need to call the predict() method, and the same preprocessing steps will be applied to the new data before the classifier makes its predictions:
y_pred = rf.predict(X_test)
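
Predictions such as y_pred can then be compared with the held-out labels. A minimal sketch of that check, using made-up label lists rather than the real loan predictions, might look like this:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_true_demo = ['Y', 'N', 'Y', 'Y', 'N', 'Y']
y_pred_demo = ['Y', 'N', 'N', 'Y', 'Y', 'Y']

# Fraction of predictions that match the true labels (4 of 6 here).
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true labels, columns are predicted labels, in the order N, Y.
matrix = confusion_matrix(y_true_demo, y_pred_demo, labels=['N', 'Y'])

print("accuracy: %.3f" % accuracy)
print(matrix)
```

With the real data, the same two calls would take y_test and y_pred as arguments.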
Model Selection with Machine Learning Pipelines
Pipelines can also be used for Model Selection. Below I will loop through a number of classification models provided by Scikit-Learn, applying the same transformations and training each model in turn.
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
]
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=3, p=2, weights='uniform')
model score: 0.780
SVC(C=0.025, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
model score: 0.659
NuSVC(break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, nu=0.5, probability=True, random_state=None, shrinking=True, tol=0.001, verbose=False)
model score: 0.797
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort='deprecated', random_state=None, splitter='best')
model score: 0.724
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
model score: 0.780
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0, n_estimators=50, random_state=None)
model score: 0.805
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None, learning_rate=0.1, loss='deviance', max_depth=3, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_iter_no_change=None, presort='deprecated', random_state=None, subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0, warm_start=False)
model score: 0.789
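
The scores above come from a single train/test split, so they can shift noticeably from run to run. A different technique, cross-validation, averages the score over several splits. The sketch below is not the method used in this article; it only shows how a pipeline can be passed to cross_val_score, on synthetic data from make_classification and with two arbitrarily chosen models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical), standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    'knn': KNeighborsClassifier(3),
    'tree': DecisionTreeClassifier(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    pipe_cv = Pipeline(steps=[('scaler', StandardScaler()),
                              ('classifier', model)])
    # The whole pipeline is refitted on each of the 5 training folds,
    # so the scaler never sees the fold it is scored on.
    scores = cross_val_score(pipe_cv, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print("%s mean accuracy: %.3f" % (name, cv_means[name]))
```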
Pipelines can also be used to find the best-performing parameters with the grid search algorithm. If you don’t know how grid search works, you can learn it from here. Now I will apply the pipeline with the grid search algorithm. Each parameter name is prefixed with the name of the pipeline step it belongs to, followed by a double underscore:
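# Note: 'auto' matches the older scikit-learn version that produced the output
# below; newer releases no longer accept it for random forests ('sqrt' is the
# equivalent setting for classifiers).
param_grid = {
    'classifier__n_estimators': [200, 500],
    'classifier__max_features': ['auto', 'sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
    'classifier__criterion': ['gini', 'entropy']}
from sklearn.model_selection import GridSearchCV
CV = GridSearchCV(rf, param_grid, n_jobs=1)
CV.fit(X_train, y_train)
print(CV.best_params_)
print(CV.best_score_)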
{'classifier__criterion': 'gini', 'classifier__max_depth': 4, 'classifier__max_features': 'auto', 'classifier__n_estimators': 200}
0.8124922696351268
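
The parameter names in the grid follow the pattern step name, double underscore, parameter name. The sketch below illustrates that naming and how the refitted best pipeline can be reused after the search. It runs on synthetic data with a deliberately tiny grid; the data, the step names, and the grid values are assumptions for illustration, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical), standing in for a real dataset.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('classifier', RandomForestClassifier(random_state=0))])

# get_params() lists every tunable name, e.g. 'classifier__max_depth'.
param_names = sorted(demo_pipe.get_params().keys())

demo_grid = {
    'classifier__n_estimators': [10, 20],
    'classifier__max_depth': [2, 4],
}
search = GridSearchCV(demo_pipe, demo_grid, cv=3)
search.fit(X_demo, y_demo)

# best_estimator_ is the whole pipeline, refitted with the best parameters,
# so it can predict on raw data directly.
best_pipe = search.best_estimator_
demo_pred = best_pipe.predict(X_demo[:5])

print(search.best_params_)
print(demo_pred)
```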
I work on a lot of Machine Learning projects. Early in my career, I used to ignore pipelines in my tasks. But since I started using pipelines in my models, I find it much easier to work whenever I come across the same kind of dataset. I hope you liked this article on Machine Learning Pipelines. Feel free to ask your valuable questions in the comments section below.