Predict Diabetes with Machine Learning

According to the report of Centers of Disease Control and Prevention about one in seven adults in the United States have Diabetes. But by next few years this rate can move higher. With this in mind today, In this article, I will show you how you can use machine learning to Predict Diabetes using Python.

Let’s straightaway get into the data, you can download the data I have used in this article to Predict Diabetes below.

Now let’s import the data and gets started:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)Code language: Python (python)
Index(['Pregnancies', 'Glucose', 'BloodPressure', 
'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 
'Age', 'Outcome'], dtype='object')
diabetes.head()Code language: Python (python)
Image for post

The diabetes data set consists of 768 data points, with 9 features each:

print("dimension of diabetes data: {}".format(diabetes.shape))Code language: Python (python)

dimension of diabetes data: (768, 9)

“Outcome” is the feature we are going to predict, 0 means No diabetes, 1 means diabetes. Of these 768 data points, 500 are labeled as 0 and 268 as 1:

print(diabetes.groupby('Outcome').size())Code language: Python (python)
Image for post
import seaborn as sns
sns.countplot(diabetes['Outcome'],label="Count")Code language: Python (python)
Image for post language: Python (python)
Image for post

K-Nearest Neighbors to Predict Diabetes

The k-Nearest Neighbors algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To make a prediction for a new point in the dataset, the algorithm finds the closest data points in the training data set — its “nearest neighbors.”

First, Let’s investigate whether we can confirm the connection between model complexity and accuracy:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.loc[:, diabetes.columns != 'Outcome'], diabetes['Outcome'], stratify=diabetes['Outcome'], random_state=66)
from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors), y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.legend()Code language: Python (python)
predict diabetes

Let’s check the accuracy score of the k-nearest neighbors algorithm to predict diabetes.

knn = KNeighborsClassifier(n_neighbors=9), y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))Code language: Python (python)

Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78

Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0), y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))Code language: Python (python)

Accuracy on training set: 1.000
Accuracy on test set: 0.714

The accuracy on the training set with Decision Tree Classifier is 100%, while the test set accuracy is much worse. This is an indicative that the tree is overfitting and not generalizing well to new data. Therefore, we need to apply pre-pruning to the tree.

Now, I will do this again by doing set max_depth=3, limiting the depth of the tree decreases overfitting. This leads to a lower accuracy on the training set, but an improvement on the test set.

tree = DecisionTreeClassifier(max_depth=3, random_state=0), y_train)
print("Accuracy on training set: {:.3f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(tree.score(X_test, y_test)))Code language: Python (python)

Accuracy on training set: 0.773
Accuracy on test set: 0.740

Feature Importance in Decision Trees

Feature importance shows how important each feature is for the decision a decision tree classifier makes. It is a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means “perfectly predicts the target”. The feature importance always sum to 1:

print("Feature importances:\n{}".format(tree.feature_importances_))Code language: Python (python)

Feature importances: [ 0.04554275 0.6830362 0. 0. 0. 0.27142106 0. 0. ]

Now lets visualize the feature importance of decision tree to predict diabetes.

def plot_feature_importances_diabetes(model):
    n_features = 8
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), diabetes_features)
    plt.xlabel("Feature importance")
    plt.ylim(-1, n_features)
plot_feature_importances_diabetes(tree)Code language: Python (python)
predict diabetes

So the Glucose feature is used the most to predict diabetes.

Deep Learning to Predict Diabetes

Lets train a deep learning model to predict diabetes:

from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(random_state=42), y_train)
print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))Code language: Python (python)

Accuracy on training set: 0.71
Accuracy on test set: 0.67

The accuracy of the Multilayer perceptrons (MLP) is not as good as the other models at all, this is likely due to scaling of the data. Deep learning algorithms also expect all input features to vary in a similar way, and ideally to have a mean of 0, and a variance of 1. Now I will re-scale our data so that it fulfills these requirements to predict diabetes with a good accuracy.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)
mlp = MLPClassifier(random_state=0), y_train)
print("Accuracy on training set: {:.3f}".format(
    mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))Code language: Python (python)

Accuracy on training set: 0.823
Accuracy on test set: 0.802

Now let’s increase the number of iterations, alpha parameter and add stronger parameters to the weights of the model:

mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=0), y_train)
print("Accuracy on training set: {:.3f}".format(
    mlp.score(X_train_scaled, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_scaled, y_test)))Code language: Python (python)

Accuracy on training set: 0.795
Accuracy on test set: 0.792

The result is good, but we are not able to increase the test accuracy further. Therefore, our best model so far is default deep learning model after scaling. Now I will plot a heat map of the first layer weights in a neural network learned on the to predict diabetes using the data set.

plt.figure(figsize=(20, 5))
plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
plt.yticks(range(8), diabetes_features)
plt.xlabel("Columns in weight matrix")
plt.ylabel("Input feature")
plt.colorbar()Code language: Python (python)
predict diabetes

Also, Read – K-Means in Machine Learning.

I hope you liked this article to predict diabetes with Machine Learning. Feel free to ask your valuable questions in the comments section. Don’t forget to subscribe for the daily newsletters below to receive daily posts notifications if you like my work.

Follow Us:

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1535


  1. I love this article because many of my family members have struggled with diabetes. How can I take this a step further and create a form for people to fill out with the needed information of these 8 features and then the output states the chances of them getting diabetes along with the accuracy percentage?

Leave a Reply