Machine Learning Tutorial on K Nearest Neighbors with Python

Imagine that you’re trying to predict how I’m going to vote in the next parliamentary election. If you know nothing else about me (and if you have the data), one sensible approach is to look at how my neighbors are planning to vote.

Since I live in New Delhi, my neighbors are invariably planning to vote for the Democratic candidate, which suggests that the Democratic candidate is a good guess for me as well.

Now imagine you know more about me than just geography—perhaps you know my age, my income, how many kids I have, and so on.

To the extent my behavior is influenced (or characterized) by those things, looking just at the neighbors who are close to me along all those dimensions seems likely to be an even better predictor than looking at all my neighbors.

This is the idea behind nearest neighbor classification.
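The core of the algorithm is simple enough to write out directly. Here is a minimal from-scratch sketch of the idea (my own illustration, not part of the walkthrough below), assuming numeric features and Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny toy example: two clusters of 'voters'
toy_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
toy_y = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(toy_X, toy_y, np.array([1.1, 0.9])))  # -> 'A'

In practice we will let scikit-learn handle all of this, but every step of the model below maps onto these few lines.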

In this data science tutorial, I will build a simple K Nearest Neighbors model with Python to give an example of this prediction method.

K Nearest Neighbor

Let’s start with importing the libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Download the data set and read it into a dataframe:

df = pd.read_csv('KNN_Project_Data.csv')
df.head()

EDA

Since this data is artificial, we’ll just do a large pairplot with seaborn.

Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.

sns.pairplot(df, hue = 'TARGET CLASS')

Standardize the Variables

Time to standardize the variables. KNN classifies points by distance, so features on larger numeric scales would otherwise dominate the distance calculation. Import StandardScaler from scikit-learn:

from sklearn.preprocessing import StandardScaler
myscaler = StandardScaler()

Fit the scaler to the features (every column except the target), then transform them:

features = df.drop('TARGET CLASS', axis = 1)
myscaler.fit(features)
X = myscaler.transform(features)

Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.

tdf = pd.DataFrame(X, columns=df.columns[:-1])
tdf.head()
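As an extra sanity check (my addition, not in the original walkthrough), each standardized column should have a mean of roughly 0 and a standard deviation of roughly 1:

# Each standardized feature should have mean ~0 and std ~1
tdf.describe().loc[['mean', 'std']].round(2)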

Train Test Split

Use train_test_split to split your data into a training set and a testing set.

from sklearn.model_selection import train_test_split
y = df['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)

Using KNN

Import KNeighborsClassifier and start with a model that uses K = 1:

from sklearn.neighbors import KNeighborsClassifier
myKNN = KNeighborsClassifier(n_neighbors = 1)
myKNN.fit(X_train, y_train)
#Output
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

Predictions and Evaluations

Use the predict method to predict values with your KNN model on X_test, then evaluate with a confusion matrix and a classification report:

y_predict = myKNN.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test,y_predict))
#Output
[[109  43]
 [ 41 107]]
print(classification_report(y_test,y_predict))
#Output
              precision    recall  f1-score   support

           0       0.73      0.72      0.72       152
           1       0.71      0.72      0.72       148

    accuracy                           0.72       300
   macro avg       0.72      0.72      0.72       300
weighted avg       0.72      0.72      0.72       300

Choosing a K Value

Let’s go ahead and use the elbow method to pick a good K Value!

Create a for loop that trains a K Nearest Neighbors model for each K from 1 to 39, and keep track of the error rate for each of these models in a list. (An optional cross-validated variant is sketched right after the loop.)

err_rates = []
for idx in range(1,40):
    knn = KNeighborsClassifier(n_neighbors = idx)
    knn.fit(X_train, y_train)
    pred_idx = knn.predict(X_test)
    err_rates.append(np.mean(y_test != pred_idx))
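A single train/test split can make this error curve noisy. As an optional, more robust variant (assuming scikit-learn's cross_val_score; not part of the original walkthrough), you can average the error over several folds for each K:

from sklearn.model_selection import cross_val_score

cv_err_rates = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(knn, X, y, cv = 5)  # accuracy on each of 5 folds
    cv_err_rates.append(1 - scores.mean())       # mean error rate across folds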

Now create the following plot using the information from your for loop.

plt.style.use('ggplot')
plt.subplots(figsize = (10,6))
plt.plot(range(1,40), err_rates, linestyle = 'dashed', color = 'blue', marker = 'o', markerfacecolor = 'red')
plt.xlabel('K-value')
plt.ylabel('Error Rate')
plt.title('Error Rate vs K-value')
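Instead of eyeballing the plot, you can also pick the best K programmatically (a small addition of mine):

# Index of the lowest error rate; +1 because our K values started at 1
best_k = int(np.argmin(err_rates)) + 1
print(best_k)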

Retrain with new K Value

Retrain your model with the K value that gives the lowest error rate on the plot (the exact choice is up to you; here we use K = 31) and re-create the confusion matrix and classification report.

myKNN = KNeighborsClassifier(n_neighbors = 31)
myKNN.fit(X_train,y_train)
y_predict = myKNN.predict(X_test)

print('WITH K=31')
print('')
print(confusion_matrix(y_test,y_predict))
print('')
print(classification_report(y_test,y_predict))
#Output
WITH K=31

[[123  29]
 [ 19 129]]

              precision    recall  f1-score   support

           0       0.87      0.81      0.84       152
           1       0.82      0.87      0.84       148

    accuracy                           0.84       300
   macro avg       0.84      0.84      0.84       300
weighted avg       0.84      0.84      0.84       300
