Imagine that you’re trying to predict how I’m going to vote in the next parliamentary election. If you know nothing else about me (and if you have the data), one sensible approach is to look at how my neighbors are planning to vote.
Living in New Delhi, as I do, my neighbors are invariably planning to vote for the Democratic candidate, which suggests that the Democratic candidate is a good guess for me as well.
Now imagine you know more about me than just geography—perhaps you know my age, my income, how many kids I have, and so on.
To the extent that my behavior is influenced (or characterized) by those things, looking only at the neighbors who are close to me along all those dimensions is likely to be an even better predictor than looking at all my neighbors.
This is the idea behind nearest neighbor classification.
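To make the idea concrete, here is a minimal from-scratch sketch of nearest neighbor voting with NumPy. The function name `knn_predict` and the toy (age, income) data are made up for illustration; the rest of this tutorial uses scikit-learn instead.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: columns are (age, income in thousands)
X_toy = np.array([[25, 40], [30, 45], [28, 42], [55, 90], [60, 95], [58, 88]], dtype=float)
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

prediction = knn_predict(X_toy, y_toy, np.array([27.0, 43.0]), k=3)
print(prediction)  # the three closest points are all labelled "A"
```

The new point sits among the younger, lower-income voters, so all three of its nearest neighbors vote "A" and that becomes the prediction.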
In this data science tutorial, I will build a simple K Nearest Neighbors model in Python as an example of this kind of prediction model.
K Nearest Neighbor
Let’s start with importing the libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Download the data set
df = pd.read_csv('KNN_Project_Data.csv')
df.head()
Since this data is artificial, we’ll just do a large pairplot with seaborn.
Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.
sns.pairplot(df, hue = 'TARGET CLASS')
Standardize the Variables
Time to standardize the variables. KNN relies on distances between points, so features on larger scales would otherwise dominate. Import StandardScaler from scikit-learn.
from sklearn.preprocessing import StandardScaler
myscaler = StandardScaler()
Fit the scaler to the features, then use it to transform them.
myscaler.fit(X = df.drop('TARGET CLASS', axis = 1))
X = myscaler.transform(X = df.drop('TARGET CLASS', axis = 1))
Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.
tdf = pd.DataFrame(X, columns=df.columns[:-1])
tdf.head()
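As a quick sanity check on what the scaler is doing, the sketch below compares StandardScaler with the same calculation done by hand. The small `features` array is a hypothetical example, not the tutorial's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (e.g. age and income)
features = np.array([[25.0, 40000.0],
                     [35.0, 60000.0],
                     [45.0, 80000.0],
                     [55.0, 100000.0]])

scaled = StandardScaler().fit_transform(features)

# Same result by hand: subtract each column's mean, then divide by its
# standard deviation (population form, ddof=0, which StandardScaler uses)
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.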
Train Test Split
Use train_test_split to split your data into a training set and a testing set.
from sklearn.model_selection import train_test_split
y = df['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
from sklearn.neighbors import KNeighborsClassifier
# Create a KNN model with a single neighbor and fit it to the training data
myKNN = KNeighborsClassifier(n_neighbors = 1)
myKNN.fit(X_train, y_train)
#Output
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=1, p=2, weights='uniform')
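One detail worth noting about the steps above: the scaler was fitted on every row before the split, so the test rows also contribute to the means and standard deviations. Here is a minimal sketch of one way to avoid that, using scikit-learn's `make_pipeline` so the scaler only ever sees the training rows. The data comes from `make_classification` as a synthetic stand-in (an assumption, since the CSV is not bundled here), and the names `X_demo`, `pipe`, and so on are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tutorial's CSV (an assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=101)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# The pipeline fits the scaler on the training rows only, then applies
# the same transformation to whatever is passed to predict or score
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(X_tr, y_tr)

accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {accuracy:.2f}")
```

On a data set like this one the difference is usually small, but the pipeline keeps the test set completely unseen until evaluation.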
Predictions and Evaluations
Use the predict method to predict values with your KNN model and X_test, then print a confusion matrix and a classification report.
y_predict = myKNN.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_predict))
#Output
[[109  43]
 [ 41 107]]
print(classification_report(y_test, y_predict))
#Output
              precision    recall  f1-score   support
           0       0.73      0.72      0.72       152
           1       0.71      0.72      0.72       148
    accuracy                           0.72       300
   macro avg       0.72      0.72      0.72       300
weighted avg       0.72      0.72      0.72       300
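The numbers in the classification report follow directly from the confusion matrix. As a worked check, this sketch recomputes precision, recall, F1 and accuracy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions
cm = np.array([[109, 43],
               [41, 107]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = true_positives / cm.sum(axis=1)     # divide by row totals (actual counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(precision.round(2), recall.round(2), f1.round(2), round(accuracy, 2))
```

For class 0, for example, precision is 109 / (109 + 41) ≈ 0.73 and recall is 109 / (109 + 43) ≈ 0.72, matching the report.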
Choosing a K Value
Let’s go ahead and use the elbow method to pick a good K Value!
Create a for loop that trains various K Nearest Neighbors models with different k values, then keep track of the error_rate for each of these models with a list.
err_rates = []
for idx in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors = idx)
    knn.fit(X_train, y_train)
    pred_idx = knn.predict(X_test)
    # Fraction of test points that were misclassified
    err_rates.append(np.mean(y_test != pred_idx))
Now create the following plot using the information from your for loop.
plt.style.use('ggplot')
plt.subplots(figsize = (10,6))
plt.plot(range(1,40), err_rates, linestyle = 'dashed', color = 'blue', marker = 'o', markerfacecolor = 'red')
plt.xlabel('K-value')
plt.ylabel('Error Rate')
plt.title('Error Rate vs K-value')
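The loop above picks K by looking at the test-set error, which can make the final test score look a little better than it would on fresh data. As a sketch of a common alternative (not what this tutorial does), the code below chooses K by 5-fold cross-validation on the training set only and reads off the best value with `np.argmin`. The data is a synthetic stand-in from `make_classification`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the tutorial's data (an assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

k_values = list(range(1, 40))
cv_errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training set only
    scores = cross_val_score(knn, X_tr, y_tr, cv=5)
    cv_errors.append(1 - scores.mean())

# K with the lowest cross-validated error (the first one if there are ties)
best_k = k_values[int(np.argmin(cv_errors))]
print(f"Best k by cross-validation: {best_k}")

# The held-out test set is used once, for the final check
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_error = 1 - final_model.score(X_te, y_te)
print(f"Test error with k={best_k}: {test_error:.2f}")
```

The `cv_errors` list can be plotted in exactly the same way as `err_rates` to look for the elbow.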
Retrain with new K Value
Retrain your model with the best K value (up to you to decide what you want) and re-do the classification report and the confusion matrix.
myKNN = KNeighborsClassifier(n_neighbors = 31)
myKNN.fit(X_train,y_train)
y_predict = myKNN.predict(X_test)
print('WITH K=31')
print('')
print(confusion_matrix(y_test,y_predict))
print('')
print(classification_report(y_test,y_predict))
#Output
WITH K=31
[[123  29]
 [ 19 129]]
              precision    recall  f1-score   support
           0       0.87      0.81      0.84       152
           1       0.82      0.87      0.84       148
    accuracy                           0.84       300
   macro avg       0.84      0.84      0.84       300
weighted avg       0.84      0.84      0.84       300