Choose the Best Algorithm for a Machine Learning Task

Have you ever been confused when trying to choose the best algorithm for a machine learning task? Suppose you are planning to buy a new car. It is never as simple as walking into a showroom and driving home in a car. You will probably do a lot of planning before you even look at your first option.

Say you have shortlisted five cars and cannot decide between them, since you will use the car for years. The best way to pick one is to take each of them for a test drive. We do the same thing when choosing an algorithm for a machine learning task: we try out a few candidate algorithms and pick the one that makes the most accurate predictions.

How to Choose the Best Algorithm for Machine Learning?

As a machine learning practitioner, you have probably worked on some classification tasks already. If so, you know that no single algorithm gives the best accuracy on every machine learning task.

So you cannot simply master one favourite algorithm and stick with it. You need to be prepared to choose the best algorithm for each machine learning task.

Nested CV Method To Choose the Best Algorithm for Machine Learning

Nested CV (cross-validation) is a method for choosing the best algorithm from a set of candidates. It uses two loops: an inner CV tunes each algorithm's hyperparameters, and an outer CV estimates how well the tuned algorithm generalizes. That estimate is the average of the accuracy scores on the test folds of the outer CV. Now let’s go through how to perform this method in code.
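
Before moving to the real dataset, the two loops can be written out by hand. This is only a minimal sketch under stated assumptions: the data is synthetic (generated with make_classification, not the clickbait dataset), a single Logistic Regression model stands in for the candidate algorithms, and the grid of C values is chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (an assumption; not the clickbait dataset).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune C using only the training part of this outer fold.
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={'C': [0.01, 0.1, 1, 10]},
                          scoring='accuracy',
                          cv=inner_cv)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the tuned model on the fold it has never seen.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

outer_scores = np.array(outer_scores)
print('outer ACC %.2f%% +/- %.2f' %
      (outer_scores.mean() * 100, outer_scores.std() * 100))
```

Passing a search object to cross_val_score, as done later in this article, runs the same two loops without the explicit for statement.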

Choosing the Best Algorithm

To choose the best algorithm for our machine learning task, I will use this dataset. It consists of 30,000 article titles labelled as clickbait or non-clickbait, and it is aimed at detecting and preventing clickbait in online news media. Now let’s start by importing all the libraries and algorithms that we need for this task:

from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords

To prepare the text in our dataset, I will convert it to lowercase and remove punctuation, stopwords and other special characters:

WORD = re.compile(r'\w+')
# English stopwords, keeping the negations 'not' and 'no'; a set gives fast membership tests
STOPWORDS = set(stopwords.words('english')) - {'not', 'no'}
def prepare_text(text: str):
    """
    Text preparation: lowercase, removal of punctuation, special characters and stopwords.
    Args:
        text: input sample from data set
        
    Returns: 
        modified initial text
    """
    text = text.lower()
    text = re.sub(r'[/(){}\[\]\|@,;]', r' ', text)
    text = re.sub(r'[^0-9a-z #+_]', r'', text)
    text = re.sub(' +', ' ', text) # remove extra spaces
    text = ' '.join(word for word in WORD.findall(text) if word not in STOPWORDS) # remove stopwords
    return text
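
To see what these steps do to a single title, here is a small self-contained sketch. It repeats the same regular expressions but, as an assumption made so that it runs without downloading the NLTK corpus, it uses a tiny hypothetical stopword set (DEMO_STOPWORDS) and a made-up sample title.

```python
import re

WORD = re.compile(r'\w+')
# Hypothetical mini stopword list, used only for this illustration.
DEMO_STOPWORDS = {'the', 'a', 'is', 'you', 'to', 'will', 'this'}

def prepare_text_demo(text, stopwords=DEMO_STOPWORDS):
    """Same steps as prepare_text, with an explicit stopword set."""
    text = text.lower()
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)   # separators become spaces
    text = re.sub(r'[^0-9a-z #+_]', '', text)       # drop other special characters
    text = re.sub(' +', ' ', text)                  # collapse repeated spaces
    return ' '.join(w for w in WORD.findall(text) if w not in stopwords)

sample = "You Won't Believe This: 10 Tricks (No. 7 Is The Best!)"  # made-up title
cleaned = prepare_text_demo(sample)
print(cleaned)  # wont believe 10 tricks no 7 best
```

Note that the apostrophe is deleted rather than replaced by a space, so "won't" becomes "wont", and that 'no' survives because it is not in the stopword set.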

Now I will apply this function to the titles in the dataset and split the data into training and test sets:

# `data` is assumed to be a pandas DataFrame with `title` and `label` columns
data['prep_text'] = data.title.map(prepare_text)
X = data.prep_text.tolist()
lb = LabelBinarizer()
y = lb.fit_transform(data.label).ravel()  # flatten the (n, 1) column into a 1-D label array
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.2,
                                                    random_state=42,
                                                    stratify=y)
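
Two details of this step are easy to check on toy data: LabelBinarizer returns a column of shape (n, 1) for a two-class problem, and stratify=y keeps the class proportions the same in both splits. The sketch below is only an illustration; the label names and the 30/70 class balance are assumptions, not properties of the clickbait dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels: 30 'clickbait' and 70 'news' titles.
labels = np.array(['clickbait'] * 30 + ['news'] * 70)
titles = ['title %d' % i for i in range(len(labels))]

lb = LabelBinarizer()
y_col = lb.fit_transform(labels)
print(y_col.shape)  # (100, 1)
y = y_col.ravel()   # 1-D array of 0s and 1s

X_train, X_test, y_train, y_test = train_test_split(titles, y,
                                                    test_size=.2,
                                                    random_state=42,
                                                    stratify=y)
# Classes are sorted alphabetically, so 'clickbait' is encoded as 0.
clickbait_share_train = np.mean(y_train == 0)
clickbait_share_test = np.mean(y_test == 0)
print(clickbait_share_train, clickbait_share_test)
```

Both shares come out at 30 per cent, matching the full toy label set.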

Now I will build one pipeline for each of our three algorithms, along with a hyperparameter grid for each, so that the same setup can be reused for similar tasks in future:

# Pipelines
pipe1 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('SVC', SVC(random_state=42))])
pipe2 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('RF', RandomForestClassifier(random_state=42))])
pipe3 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('LR', LogisticRegression(penalty='l2', random_state=42))])
# Setting parameter grids
param_grid1 = {'SVC__kernel': ['rbf'], 'SVC__gamma': [1e-3, 1e-4], 'SVC__C': [1, 10, 100, 1000]}
# Integer depths 1..32: recent scikit-learn versions reject a float max_depth
param_grid2 = {'RF__n_estimators': list(range(200, 1200, 200)), 'RF__max_depth': list(range(1, 33))}
param_grid3 = {'LR__C': [0.001,0.01,0.1,1,10,100]}
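
The keys in these grids follow scikit-learn's step-name, double underscore, parameter-name convention, which is how a search reaches a parameter inside a pipeline step. The sketch below shows the convention on one pipeline and counts how many combinations each grid contains; the grids are copied from above, with the 32 tree depths written as integers.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('LR', LogisticRegression())])

# 'LR__C' means: parameter C of the step named 'LR'.
pipe.set_params(LR__C=10)
print(pipe.named_steps['LR'].C)  # 10

# Number of hyperparameter combinations in each grid.
grid_sizes = {
    'SVC': len(ParameterGrid({'SVC__kernel': ['rbf'],
                              'SVC__gamma': [1e-3, 1e-4],
                              'SVC__C': [1, 10, 100, 1000]})),
    'RF': len(ParameterGrid({'RF__n_estimators': list(range(200, 1200, 200)),
                             'RF__max_depth': list(range(1, 33))})),
    'LR': len(ParameterGrid({'LR__C': [0.001, 0.01, 0.1, 1, 10, 100]})),
}
print(grid_sizes)  # {'SVC': 8, 'RF': 160, 'LR': 6}
```

With n_iter=6, a randomized search therefore tries every Logistic Regression setting but only a small sample of the 160 Random Forest settings.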

Now I will implement the nested CV method to choose the best algorithm for our machine learning task:

# INNER
rcvs = {}
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for pgrid, pipe, algo in zip((param_grid1, param_grid2, param_grid3),
                            (pipe1, pipe2, pipe3),
                            ('SVC', 'RF', 'LR')):
    rcv = RandomizedSearchCV(estimator=pipe,
                             param_distributions=pgrid,
                             scoring='accuracy',
                             n_iter=6,
                             n_jobs=3,
                             cv=inner_cv,
                             verbose=0,
                             refit=True,
                             random_state=42)  # fix the sampled candidates for reproducibility
    rcvs[algo] = rcv
# OUTER
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for algo, rcv in sorted(rcvs.items()):
    score = cross_val_score(rcv, 
                            X=X_train, 
                            y=y_train, 
                            cv=outer_cv,
                            n_jobs=1)
    print('%s | outer ACC %.2f%% +/- %.2f' % 
          (algo, score.mean() * 100, score.std() * 100))

All three algorithms performed well, each scoring above 85 per cent accuracy. The best of the three is Logistic Regression, as it gave the highest accuracy.
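
The code above sets aside X_test but never scores on it. One possible way to finish, sketched here as an assumption rather than as part of the original workflow, is to refit the winning search on the whole training set and measure accuracy once on the held-out test set. To keep the example runnable on its own, it uses a toy corpus of made-up titles instead of the clickbait dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline

# Toy corpus (an assumption): hypothetical titles, not the article's data.
clickbait = ['you will not believe trick number %d' % i for i in range(30)]
news = ['government publishes budget report number %d' % i for i in range(30)]
X = clickbait + news
y = [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.2,
                                                    random_state=42,
                                                    stratify=y)

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('LR', LogisticRegression(random_state=42))])
search = RandomizedSearchCV(estimator=pipe,
                            param_distributions={'LR__C': [0.001, 0.01, 0.1, 1, 10, 100]},
                            scoring='accuracy',
                            n_iter=6,
                            cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                            refit=True,
                            random_state=42)

# Tune on the full training set, then score once on the untouched test set.
search.fit(X_train, y_train)
test_acc = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, 'held-out ACC %.2f%%' % (test_acc * 100))
```

Because the test set played no part in tuning or in choosing the algorithm, this number is a fair final check on the selected model.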

I hope you found this article helpful for choosing the best algorithm for your machine learning tasks. Feel free to ask your questions in the comments section below. You can also follow me on Medium to read more articles.

Aman Kharwal
I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.