Health insurance is a type of insurance that covers medical expenses. A person who takes out a health insurance policy gets coverage by paying a premium. Many factors determine how much that premium is. If you want to learn how to use machine learning to predict health insurance premiums, this article is for you. In this article, I will take you through the task of health insurance premium prediction with machine learning using Python.
Health Insurance Premium Prediction
The premium for a health insurance policy varies from person to person, because many factors affect it. Take age: a young person is much less likely to have major health problems than an older person, so treating an older person is expected to cost more. That is why an older person is required to pay a higher premium than a young person.
Just like age, many other factors affect the premium for a health insurance policy. I hope you now understand what health insurance is and how the premium is determined. In the section below, I will take you through the task of health insurance premium prediction with machine learning using Python.
Health Insurance Premium Prediction using Python
The dataset that I am using for the task of health insurance premium prediction is collected from Kaggle. It contains data about:
- the age of the person
- gender of the person
- Body Mass Index of the person
- the number of children the person has
- whether the person smokes or not
- the region where the person lives
- and the charges of the insurance premium
So let’s import the dataset and the necessary Python libraries that we need for this task:
import numpy as np
import pandas as pd

data = pd.read_csv("Health_insurance.csv")
data.head()

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
Before moving forward, let’s have a look at whether this dataset contains any null values or not:
data.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
The dataset contains no null values, so it is ready to be used. After getting a first impression of this data, I noticed the “smoker” column, which indicates whether the person smokes or not. This is an important feature of this dataset because a person who smokes is more likely to have major health problems than a person who does not. So let’s look at the distribution of people who smoke and who do not:
import plotly.express as px

figure = px.histogram(data, x="sex", color="smoker", title="Number of Smokers")
figure.show()

According to the above visualisation, 547 females and 517 males don’t smoke, while 115 females and 159 males do. It is important to use this feature while training a machine learning model, so I will now replace the values of the “sex” and “smoker” columns with 0 and 1, as both columns contain string values:
data["sex"] = data["sex"].map({"female": 0, "male": 1})
data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})
print(data.head())

   age  sex     bmi  children  smoker     region      charges
0   19    0  27.900         0       1  southwest  16884.92400
1   18    1  33.770         1       0  southeast   1725.55230
2   28    1  33.000         3       0  southeast   4449.46200
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1  28.880         0       0  northwest   3866.85520
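The smoker counts quoted above were read off the chart, but they can also be computed as a table. Here is a minimal sketch using only the five encoded sample rows printed above, so the numbers are not the full-dataset counts. The name `sample` is hypothetical and stands in for the full `data` DataFrame:

```python
import pandas as pd

# Five encoded sample rows copied from the output above (sex: 0 = female, 1 = male;
# smoker: 0 = no, 1 = yes). On the real dataset, use the full DataFrame instead.
sample = pd.DataFrame({
    "sex": [0, 1, 1, 1, 1],
    "smoker": [1, 0, 0, 0, 0],
})

# Cross-tabulate sex against smoker to count every combination.
counts = pd.crosstab(sample["sex"], sample["smoker"])
print(counts)
```

On the full dataset, the same call with `data["sex"]` and `data["smoker"]` should give the four counts mentioned in the text.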
Now let’s have a look at the distribution of the regions where people are living according to the dataset:
import plotly.express as px

pie = data["region"].value_counts()
regions = pie.index
population = pie.values
fig = px.pie(data, values=population, names=regions)
fig.show()

Now let’s have a look at the correlation between the features of this dataset:
# numeric_only=True leaves out the string column "region" (needed on pandas 2.0 and later)
print(data.corr(numeric_only=True))

               age       sex       bmi  children    smoker   charges
age       1.000000 -0.020856  0.109272  0.042469 -0.025019  0.299008
sex      -0.020856  1.000000  0.046371  0.017163  0.076185  0.057292
bmi       0.109272  0.046371  1.000000  0.012759  0.003750  0.198341
children  0.042469  0.017163  0.012759  1.000000  0.007673  0.067998
smoker   -0.025019  0.076185  0.003750  0.007673  1.000000  0.787251
charges   0.299008  0.057292  0.198341  0.067998  0.787251  1.000000
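The matrix is easier to read if the “charges” column is pulled out and sorted. Below is a small sketch that uses the values from the output above, typed in by hand so that it runs without the CSV file. The names `corr_with_charges` and `ranked` are hypothetical, not part of the original code:

```python
import pandas as pd

# Correlation of each feature with "charges", copied from the matrix above.
corr_with_charges = pd.Series({
    "age": 0.299008,
    "sex": 0.057292,
    "bmi": 0.198341,
    "children": 0.067998,
    "smoker": 0.787251,
})

# Sort from the strongest to the weakest linear relationship.
ranked = corr_with_charges.sort_values(ascending=False)
print(ranked)
```

The “smoker” column comes out on top by a wide margin, followed by “age” and “bmi”, which matches the earlier observation that smoking is an important feature. Correlation only captures linear relationships, so it is a rough guide rather than a final ranking of feature importance.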
Health Insurance Premium Prediction Model
Now let’s move on to training a machine learning model for the task of predicting health insurance premiums. First, I’ll split the data into training and test sets:
from sklearn.model_selection import train_test_split

x = np.array(data[["age", "sex", "bmi", "smoker"]])
y = np.array(data["charges"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
After trying different machine learning algorithms, I found random forest to be the best-performing algorithm for this task. So here I will train the model using the random forest regression algorithm:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()
forest.fit(xtrain, ytrain)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
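The article does not show how the different algorithms were compared. One common approach is k-fold cross-validation, and the sketch below is an assumption about how such a comparison could look, not the procedure actually used here. It runs on synthetic data so that it works without the CSV file. The `make_synthetic` helper and its coefficients are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def make_synthetic(n=300, seed=0):
    """Hypothetical stand-in for the dataset: age, sex, bmi, smoker -> charges."""
    rng = np.random.default_rng(seed)
    age = rng.integers(18, 65, n)
    sex = rng.integers(0, 2, n)
    bmi = rng.normal(30, 6, n)
    smoker = rng.integers(0, 2, n)
    # Made-up relationship, chosen only so the models have something to learn.
    charges = 250 * age + 300 * bmi + 20000 * smoker + rng.normal(0, 2000, n)
    x = np.column_stack([age, sex, bmi, smoker])
    return x, charges


x, y = make_synthetic()

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R^2 for each candidate model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, x, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

The ranking on this synthetic data says nothing about the real dataset. To compare models on the actual data, pass the article’s `x` and `y` arrays to the same loop.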
Now let’s have a look at the predicted values of the model:
ypred = forest.predict(xtest)
# Store the predictions in a new variable so the original dataset in `data` is not overwritten
predictions = pd.DataFrame(data={"Predicted Premium Amount": ypred})
print(predictions.head())

   Predicted Premium Amount
0              11331.111753
1               5366.132261
2              28257.205036
3               9793.356425
4              34720.204296
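Predicted values on their own do not show how accurate the model is. Here is a minimal sketch of scoring predictions against held-out values with scikit-learn’s `r2_score` and `mean_absolute_error`. The two arrays are hypothetical toy numbers standing in for `ytest` and `ypred`, since the article does not print the actual test values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted premiums (not taken from the dataset).
y_true = np.array([10000.0, 5000.0, 30000.0, 9000.0, 35000.0])
y_pred = np.array([11000.0, 5500.0, 28000.0, 10000.0, 34000.0])

# R^2: share of the variance in the actual values explained by the predictions.
r2 = r2_score(y_true, y_pred)
# MAE: average absolute difference, in the same units as the premium.
mae = mean_absolute_error(y_true, y_pred)

print(f"R^2: {r2:.3f}")
print(f"MAE: {mae:.2f}")
```

With the variables from this article, the calls would be `r2_score(ytest, ypred)` and `mean_absolute_error(ytest, ypred)`.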
So this is how you can train a machine learning model for the task of health insurance premium prediction using Python.
Summary
The premium for a health insurance policy varies from person to person, as many factors affect it. I hope you liked this article on health insurance premium prediction with machine learning using Python. Feel free to ask your questions in the comments section below.