Health Insurance Premium Prediction with Machine Learning

Health Insurance is a type of insurance that covers medical expenses. A person who has taken a health insurance policy gets health insurance cover by paying a particular premium amount. There are a lot of factors that determine the premium of health insurance. So if you want to learn how we can use machine learning for predicting the premium of health insurance, then this article is for you. In this article, I will take you through the task of health insurance premium prediction with machine learning using Python.

Health Insurance Premium Prediction

The amount of the premium for a health insurance policy depends from person to person, as many factors affect the amount of the premium for a health insurance policy. Let’s say age, a young person is very less likely to have major health problems compared to an older person. Thus, treating an older person will be expensive compared to a young one. That is why an older person is required to pay a high premium compared to a young person.

Just like age, many other factors affect the premium for a health insurance policy. Hope you now have understood what health insurance is and how the premium for a health insurance policy is determined. In the section below, I will take you through the task of health insurance premium prediction with machine learning using Python.

Health Insurance Premium Prediction using Python

The dataset that I am using for the task of health insurance premium prediction is collected from Kaggle. It contains data about:

  1. the age of the person
  2. gender of the person
  3. Body Mass Index of the person
  4. how many children the person is having
  5. whether the person smokes or not
  6. the region where the person lives
  7. and the charges of the insurance premium

So let’s import the dataset and the necessary Python libraries that we need for this task:

import numpy as np
import pandas as pd
data = pd.read_csv("Health_insurance.csv")
data.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Before moving forward, let’s have a look at whether this dataset contains any null values or not:

data.isnull().sum()
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

The dataset is therefore ready to be used. After getting the first impressions of this data, I noticed theĀ “smoker”Ā column, which indicates whether the person smokes or not. This is an important feature of this dataset because a person who smokes is more likely to have major health problems compared to a person who does not smoke. So let’s look at the distribution of people who smoke and who do not:

import plotly.express as px
data = data
figure = px.histogram(data, x = "sex", color = "smoker", title= "Number of Smokers")
figure.show()
Health Insurance Premium Prediction

According to the above visualisation, 547 females, 517 males don’t smoke, and 115 females, 159 males do smoke. It is important to use this feature while training a machine learning model, so now I will replace the values of theĀ “sex”Ā andĀ “smoker”Ā columns with 0 and 1 as both these columns contain string values:

data["sex"] = data["sex"].map({"female": 0, "male": 1})
data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})
print(data.head())
   age  sex     bmi  children  smoker     region      charges
0   19    0  27.900         0       1  southwest  16884.92400
1   18    1  33.770         1       0  southeast   1725.55230
2   28    1  33.000         3       0  southeast   4449.46200
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1  28.880         0       0  northwest   3866.85520

Now let’s have a look at the distribution of the regions where people are living according to the dataset:

import plotly.express as px
pie = data["region"].value_counts()
regions = pie.index
population = pie.values
fig = px.pie(data, values=population, names=regions)
fig.show()
Image for post: health insurance data

Now let’s have a look at the correlation between the features of this dataset:

print(data.corr())
               age       sex       bmi  children    smoker   charges
age       1.000000 -0.020856  0.109272  0.042469 -0.025019  0.299008
sex      -0.020856  1.000000  0.046371  0.017163  0.076185  0.057292
bmi       0.109272  0.046371  1.000000  0.012759  0.003750  0.198341
children  0.042469  0.017163  0.012759  1.000000  0.007673  0.067998
smoker   -0.025019  0.076185  0.003750  0.007673  1.000000  0.787251
charges   0.299008  0.057292  0.198341  0.067998  0.787251  1.000000

Health Insurance Premium Prediction Model

Now let’s move on to training a machine learning model for the task of predicting health insurance premiums. First, I’ll split the data into training and test sets:

x = np.array(data[["age", "sex", "bmi", "smoker"]])
y = np.array(data["charges"])

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

After using different machine learning algorithms, I found the random forest algorithm as the best performing algorithm for this task. So here I will train the model by using the random forest regression algorithm:

from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
forest.fit(xtrain, ytrain)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

Now let’s have a look at the predicted values of the model:

ypred = forest.predict(xtest)
data = pd.DataFrame(data={"Predicted Premium Amount": ypred})
print(data.head())
   Predicted Premium Amount
0              11331.111753
1               5366.132261
2              28257.205036
3               9793.356425
4              34720.204296

So this is how you can train a machine learning model for the task of health insurance premium prediction using Python.

Summary

The premium amount of a health insurance policy depends on person to person as many factors affect the premium amount of a health insurance policy. I hope you liked this article on health insurance premium prediction with machine learning using Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of datašŸ“ˆ.

Articles: 1501

Leave a Reply