
Heart disease describes a range of conditions that affect your heart. Diseases under the heart disease umbrella include blood vessel diseases, such as coronary artery disease, heart rhythm problems (arrhythmia) and heart defects you’re born with (congenital heart defects), among others.
Heart disease is one of the leading causes of morbidity and mortality worldwide. Predicting cardiovascular disease is regarded as one of the most important problems in clinical data science, and the healthcare industry generates an enormous amount of data to work with.
In this Data Science project I will apply Machine Learning techniques to classify whether or not a person is suffering from Heart Disease.
You can download the data set we need for this project from here:
Importing the required modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Here we will experiment with KNeighborsClassifier:
from sklearn.neighbors import KNeighborsClassifier
Now let’s dive into the data:
df = pd.read_csv('dataset.csv')
print(df.head())

print(df.info())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
print(df.describe())
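It’s also worth confirming that there are no missing values before moving on; a quick optional check (my addition, consistent with the 303 non-null entries shown by info() above):
# Count missing values per column; every column should report 0
print(df.isnull().sum())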

Feature Selection
To get the correlation of each feature in the data set, plot a correlation heat map:
import seaborn as sns

corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(16, 16))
# plot heat map
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()
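If you only want to see which features correlate most strongly with the target, sorting the target column of the correlation matrix gives the same information without the plot; a small optional addition:
# Correlation of every feature with the target, strongest first
print(corrmat['target'].sort_values(ascending=False))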

It’s always good practice to work with a data set where the target classes are of approximately equal size, so let’s check the class balance:
sns.set_style('whitegrid')
sns.countplot(x='target', data=df, palette='RdBu_r')
plt.show()
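For exact class counts rather than a plot, value_counts() gives the same information numerically (optional check):
# Number of samples in each target class
print(df['target'].value_counts())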

Data Processing
After exploring the data set, I observed that I need to convert some categorical variables into dummy variables and scale all the values before training the Machine Learning models.
First, I’ll use the get_dummies method to create dummy columns for the categorical variables.
dataset = pd.get_dummies(df, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
dataset.head()
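As a quick sanity check (my addition, not part of the original code), you can confirm that get_dummies expanded the categorical columns and that the scaled columns are now standardized:
# The original 14 columns expand once each categorical level gets its own dummy column
print(dataset.shape)
# Scaled columns should now have mean ~0 and standard deviation ~1
print(dataset[columns_to_scale].describe().loc[['mean', 'std']])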

y = dataset['target']
X = dataset.drop(['target'], axis=1)
from sklearn.model_selection import cross_val_score

knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn_classifier, X, y, cv=10)
    knn_scores.append(score.mean())
plt.plot([k for k in range(1, 21)], knn_scores, color='red')
for i in range(1, 21):
    plt.text(i, knn_scores[i - 1], (i, knn_scores[i - 1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')
plt.show()
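Rather than reading the best K off the plot, you can also pick it programmatically; a minimal sketch using the knn_scores list from the loop above:
# Index of the highest mean cross-validation score; K values start at 1
best_k = int(np.argmax(knn_scores)) + 1
print(best_k, knn_scores[best_k - 1])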

From the plot, K = 12 gives one of the best scores, so let’s evaluate it with 10-fold cross-validation:
knn_classifier = KNeighborsClassifier(n_neighbors=12)
score = cross_val_score(knn_classifier, X, y, cv=10)
score.mean()
#Output- 0.8448387096774195
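Note that train_test_split was imported earlier but never used. If you also want a held-out test-set score alongside cross-validation, here is a minimal sketch (the test size and random_state are my assumptions, not from the original walkthrough):
# Hedged sketch: test_size=0.33 and random_state=0 are assumptions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
knn_classifier = KNeighborsClassifier(n_neighbors=12)
knn_classifier.fit(X_train, y_train)
print(knn_classifier.score(X_test, y_test))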
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

randomforest_classifier = RandomForestClassifier(n_estimators=10)
score = cross_val_score(randomforest_classifier, X, y, cv=10)
score.mean()
#Output- 0.8113978494623655
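The same cross-validation loop used to tune K above can be reused to try different forest sizes; a sketch under the assumption that you want to tune n_estimators the same way (the candidate values are my own choice):
# Hedged sketch: the candidate n_estimators values are assumptions
rf_scores = []
estimators = [10, 100, 200, 500, 1000]
for n in estimators:
    rf_classifier = RandomForestClassifier(n_estimators=n)
    score = cross_val_score(rf_classifier, X, y, cv=10)
    rf_scores.append(score.mean())
print(dict(zip(estimators, rf_scores)))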