Choose the Best Algorithm for a Machine Learning Task

Have you ever been confused when trying to choose the best algorithm for a machine learning task? Suppose you are planning to buy a new car. The process is never as short as walking into a showroom and driving home in a car. You will probably do a lot of planning before you even look at your first option.

Say you have shortlisted 5 cars and cannot decide which one to go for, since you will be using it for years. The best way to pick one is to take each for a test drive. We do the same thing when choosing an algorithm for a machine learning task: we try out several candidate algorithms and pick the one that predicts with the best accuracy.

How to Choose the Best Algorithm for Machine Learning?

As a machine learning practitioner, you have probably already worked on some classification tasks. If so, you will know that no single algorithm gives the best accuracy on every machine learning task.

So you cannot simply master one favourite algorithm and stick with it. Instead, you need a way to choose the best algorithm for each task.

Nested CV Method To Choose the Best Algorithm for Machine Learning

The nested CV (cross-validation) method lets us choose the best algorithm from a set of candidates. An inner CV loop tunes each algorithm's hyperparameters, and an outer CV loop estimates the generalization error of the tuned algorithm. That estimate is the average of the accuracy scores on the test folds of the outer CV. Now let's go through how to perform this method in practice using code.
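
To make the two loops concrete, here is a minimal sketch that writes the outer loop out by hand. It runs on synthetic data from `make_classification` and tunes a single `LogisticRegression`; both are assumptions chosen for illustration and are not the dataset or setup used in this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption: not the clickbait dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune C using only the outer training fold
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={'C': [0.01, 0.1, 1, 10]},
        scoring='accuracy',
        cv=inner_cv,
    )
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the tuned model on the untouched outer test fold
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

outer_scores = np.array(outer_scores)
print('outer ACC %.2f%% +/- %.2f' % (outer_scores.mean() * 100, outer_scores.std() * 100))
```

Because the hyperparameters are chosen without ever seeing the outer test fold, the averaged score describes the whole "tune, then fit" procedure rather than one lucky parameter setting.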

Choosing the Best Algorithm

To choose the best algorithm for our machine learning task, I will use this dataset. It consists of 30,000 article titles labelled as clickbait or non-clickbait, and it is about the detection and prevention of clickbait in online news media. Let's start by importing all the libraries and algorithms we need for this task:

from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV, train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords

To prepare the text in our dataset, I will convert it to lowercase and remove punctuation, stopwords and other special characters:

WORD = re.compile(r'\w+')
# English stopwords, keeping the negations 'not' and 'no'
STOPWORDS = set(stopwords.words('english')) - {'not', 'no'}

def prepare_text(text: str) -> str:
    """
    Text preparation: lowercase, removal of punctuation, special characters and stopwords.

    Args:
        text: input sample from data set

    Returns:
        modified initial text
    """
    text = text.lower()
    text = re.sub(r'[/(){}\[\]\|@,;]', r' ', text)  # replace these symbols with a space
    text = re.sub(r'[^0-9a-z #+_]', r'', text)  # drop any other special characters
    text = re.sub(' +', ' ', text)  # remove extra spaces
    text = ' '.join(word for word in WORD.findall(text) if word not in STOPWORDS)  # remove stopwords
    return text
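
As a quick sanity check of these cleaning steps, the sketch below applies the same regular expressions to one made-up title. To keep it runnable without downloading NLTK data, it swaps in a tiny hand-written stopword set; that set and the sample title are assumptions for illustration only.

```python
import re

WORD = re.compile(r'\w+')
# Hypothetical mini stopword list, standing in for NLTK's English list
STOPWORDS = {'you', 'will', 'the', 'a', 'to', 'what', 'this', 'is'}

def prepare_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r'[/(){}\[\]\|@,;]', ' ', text)   # brackets and separators become spaces
    text = re.sub(r'[^0-9a-z #+_]', '', text)       # drop remaining special characters
    text = re.sub(' +', ' ', text)                  # collapse repeated spaces
    return ' '.join(w for w in WORD.findall(text) if w not in STOPWORDS)

sample = "You Will NOT Believe What This Cat Did (Video)!"  # made-up title
cleaned = prepare_text(sample)
print(cleaned)  # not believe cat did video
```

Note that "not" survives the cleaning, which is why the negations are kept out of the stopword list: they can change the meaning of a title.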

Now I will apply this function to the titles in the dataset and split the data into training and test sets:

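# `data` is assumed to be a pandas DataFrame, already loaded, with `title` and `label` columns
data['prep_text'] = data.title.map(prepare_text)
X = data.prep_text.tolist()

lb = LabelBinarizer()
# For two classes fit_transform returns a column of shape (n, 1); flatten it to 1-D
y = lb.fit_transform(data.label).ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify=y)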
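
One detail worth knowing about `LabelBinarizer`: for a two-class problem it returns a column array of shape `(n, 1)`, while scikit-learn classifiers generally expect a 1-D target. A small sketch with hypothetical label values (the real label strings in the dataset may differ) shows the shapes involved:

```python
from sklearn.preprocessing import LabelBinarizer

labels = ['clickbait', 'news', 'clickbait', 'news']  # hypothetical label values

lb = LabelBinarizer()
y_2d = lb.fit_transform(labels)   # column vector, one row per sample
y_1d = y_2d.ravel()               # flattened target for the classifiers

print(lb.classes_)   # classes are sorted alphabetically
print(y_2d.shape)    # (4, 1)
print(y_1d)          # [0 1 0 1]
```

Flattening with `ravel()` avoids the shape warning that scikit-learn raises when a column vector is passed as `y`.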

Now I will prepare three pipelines, one for each of our 3 algorithms, along with a hyperparameter grid for each, so that the same setup can be reused for similar tasks in future:

# Pipelines
pipe1 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('SVC', SVC(random_state=42))])
pipe2 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('RF', RandomForestClassifier(random_state=42))])
pipe3 = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('LR', LogisticRegression(penalty='l2', random_state=42))])

# Setting parameter grids (keys follow the '<step name>__<parameter>' convention)
param_grid1 = {'SVC__kernel': ['rbf'],
               'SVC__gamma': [1e-3, 1e-4],
               'SVC__C': [1, 10, 100, 1000]}
# max_depth must be an integer, so use integer depths 1..32 rather than np.linspace floats
param_grid2 = {'RF__n_estimators': list(range(200, 1200, 200)),
               'RF__max_depth': list(range(1, 33))}
param_grid3 = {'LR__C': [0.001, 0.01, 0.1, 1, 10, 100]}
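
It can help to know how large each grid is before running a randomized search over it. The sketch below counts the combinations with `ParameterGrid`; the grids are copied from above, with `max_depth` written as the integers 1 to 32 (an assumption of this sketch).

```python
from sklearn.model_selection import ParameterGrid

param_grid1 = {'SVC__kernel': ['rbf'],
               'SVC__gamma': [1e-3, 1e-4],
               'SVC__C': [1, 10, 100, 1000]}
param_grid2 = {'RF__n_estimators': list(range(200, 1200, 200)),
               'RF__max_depth': list(range(1, 33))}
param_grid3 = {'LR__C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Number of distinct hyperparameter combinations per algorithm
sizes = {name: len(ParameterGrid(grid))
         for name, grid in (('SVC', param_grid1), ('RF', param_grid2), ('LR', param_grid3))}
print(sizes)  # {'SVC': 8, 'RF': 160, 'LR': 6}
```

With `n_iter=6`, as used in the search further down, Logistic Regression has all 6 of its settings tried, SVC has 6 of 8, and Random Forest only 6 of 160, so the comparison is somewhat less thorough for the Random Forest.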

Now I will implement the nested CV method to choose the best algorithm for our machine learning task:

# INNER: one randomized hyperparameter search per algorithm
rcvs = {}
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

for pgrid, pipe, algo in zip((param_grid1, param_grid2, param_grid3),
                             (pipe1, pipe2, pipe3),
                             ('SVC', 'RF', 'LR')):
    rcv = RandomizedSearchCV(estimator=pipe,
                             param_distributions=pgrid,
                             scoring='accuracy',
                             n_iter=6,
                             n_jobs=3,
                             cv=inner_cv,
                             verbose=0,
                             refit=True,
                             random_state=42)  # fixed seed so the sampled settings are reproducible
    rcvs[algo] = rcv

# OUTER: estimate the accuracy of each tuned algorithm on held-out folds
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for algo, rcv in sorted(rcvs.items()):
    score = cross_val_score(rcv, X=X_train, y=y_train, cv=outer_cv, n_jobs=1)
    print('%s | outer ACC %.2f%% +/- %.2f' % (algo, score.mean() * 100, score.std() * 100))

All three algorithms gave a good accuracy rate, with every model scoring above 85 per cent. The best of the three is Logistic Regression, as it gives the highest accuracy.
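
Nested CV scores an algorithm together with its tuning procedure, so it does not hand back one final model. A common follow-up, which is not part of the steps above, is to rerun the search for the winning algorithm on the whole training set and check it once against the held-out test split. The sketch below shows that idea on made-up toy titles (an assumption; substitute the real `X_train`, `X_test`, `y_train` and `y_test`).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Toy titles standing in for the real data (made up for this sketch)
clickbait = ['you will not believe trick number %d' % i for i in range(30)]
news = ['government publishes report number %d' % i for i in range(30)]
X = clickbait + news
y = [1] * 30 + [0] * 30

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.2, random_state=42, stratify=y)

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('LR', LogisticRegression(random_state=42, max_iter=1000))])

# Same style of search as the inner loop, now fitted on the full training set
search = RandomizedSearchCV(estimator=pipe,
                            param_distributions={'LR__C': [0.001, 0.01, 0.1, 1, 10, 100]},
                            scoring='accuracy',
                            n_iter=6,
                            cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=42),
                            refit=True,
                            random_state=42)
search.fit(X_train, y_train)

# refit=True means `search` now predicts with the best pipeline found
test_acc = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_, 'test ACC %.2f%%' % (test_acc * 100))
```

The test split is used only once, at the very end, so its accuracy is an honest check of the final model.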

I hope you found this article useful for choosing the best algorithm for machine learning tasks. Feel free to ask your questions in the comments section below. You can also follow me on Medium to read more articles.

Aman Kharwal

I am a programmer from India, and I am here to guide you with Data Science, Machine Learning, Python, and C++ for free. I hope you will learn a lot in your journey towards Coding, Machine Learning and Artificial Intelligence with me.