# Health Insurance Premium Prediction with Machine Learning

The premium for a health insurance policy varies from person to person because many factors affect it. Take age as an example: a young person is much less likely to have major health problems than an older person, so treating an older person tends to be more expensive. That is why an older person is usually required to pay a higher premium than a younger one.

Just like age, many other factors affect the premium for a health insurance policy. I hope you now understand what health insurance is and how the premium for a policy is determined. In the section below, I will take you through the task of health insurance premium prediction with machine learning using Python.

## Health Insurance Premium Prediction using Python

The dataset that I am using for the task of health insurance premium prediction is collected from Kaggle. It contains data about:

1. the age of the person
2. the gender of the person
3. the Body Mass Index (BMI) of the person
4. the number of children the person has
5. whether the person smokes or not
6. the region where the person lives
7. the charges of the insurance premium

So let’s import the dataset and the necessary Python libraries that we need for this task:

```python
import numpy as np
import pandas as pd
data = pd.read_csv("Health_insurance.csv")  # file name is an assumption; point it at your copy of the Kaggle CSV
print(data.head())
```

```
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
```

Before moving forward, let’s check whether this dataset contains any null values:

`data.isnull().sum()`
```
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
```

The dataset has no missing values, so it is ready to be used. After getting a first impression of this data, I noticed the “smoker” column, which indicates whether the person smokes or not. This is an important feature of this dataset because a person who smokes is more likely to have major health problems than a person who does not. So let’s look at the distribution of smokers and non-smokers:

```python
import plotly.express as px

# Count people by sex, coloured by smoking status
figure = px.histogram(data, x="sex", color="smoker", title="Number of Smokers")
figure.show()
```
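The counts behind a chart like this can also be tabulated directly. The following is a minimal sketch using `pd.crosstab` on a few made-up rows (the `sample` frame is hypothetical and only stands in for the real dataset):

```python
import pandas as pd

# Toy rows standing in for the real dataset (hypothetical values).
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
})

# Rows are sex, columns are smoking status, cells are counts.
counts = pd.crosstab(sample["sex"], sample["smoker"])
print(counts)
```

Running the same call on the full `data` frame should give the numbers that the histogram displays.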

According to the above visualisation, 547 females and 517 males don’t smoke, while 115 females and 159 males do. It is important to use this feature while training a machine learning model, so I will now replace the values of the “sex” and “smoker” columns with 0 and 1, as both columns contain string values:

```python
data["sex"] = data["sex"].map({"female": 0, "male": 1})
data["smoker"] = data["smoker"].map({"no": 0, "yes": 1})
print(data.head())
```

```
   age  sex     bmi  children  smoker     region      charges
0   19    0  27.900         0       1  southwest  16884.92400
1   18    1  33.770         1       0  southeast   1725.55230
2   28    1  33.000         3       0  southeast   4449.46200
3   33    1  22.705         0       0  northwest  21984.47061
4   32    1  28.880         0       0  northwest   3866.85520
```
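One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below illustrates this on a hypothetical three-row frame in which one entry is deliberately misspelled as `"Male"`; it is an illustration, not part of the original workflow:

```python
import pandas as pd

# Hypothetical rows; "Male" (capitalised) is intentionally not in the mapping.
sample = pd.DataFrame({
    "sex": ["female", "male", "Male"],
    "smoker": ["yes", "no", "no"],
})

sample["sex"] = sample["sex"].map({"female": 0, "male": 1})
sample["smoker"] = sample["smoker"].map({"no": 0, "yes": 1})

# Unmapped values turn into NaN, so count missing values per column.
print(sample.isnull().sum())
```

Checking `isnull().sum()` again after the mapping is a cheap way to confirm that every category was covered.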

Now let’s have a look at the distribution of the regions where people live according to the dataset:

```python
import plotly.express as px

# Count how many people live in each region
pie = data["region"].value_counts()
regions = pie.index
population = pie.values
fig = px.pie(data, values=population, names=regions)
fig.show()
```
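The shares that a pie chart draws can also be read as plain numbers. As a small sketch, `value_counts(normalize=True)` returns the fraction of rows per category; the `regions_demo` series below is made up for illustration:

```python
import pandas as pd

# Hypothetical region labels, not the real dataset.
regions_demo = pd.Series(
    ["southwest", "southeast", "southeast", "northwest",
     "northwest", "northeast", "southeast", "southwest"],
    name="region",
)

# Fraction of rows in each region; the fractions sum to 1.
shares = regions_demo.value_counts(normalize=True)
print(shares)
```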

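Now let’s have a look at the correlation between the features of this dataset. The “region” column still contains strings, so `numeric_only=True` restricts the calculation to the numeric columns (recent versions of pandas raise an error otherwise):

`print(data.corr(numeric_only=True))`

```
               age       sex       bmi  children    smoker   charges
age       1.000000 -0.020856  0.109272  0.042469 -0.025019  0.299008
sex      -0.020856  1.000000  0.046371  0.017163  0.076185  0.057292
bmi       0.109272  0.046371  1.000000  0.012759  0.003750  0.198341
children  0.042469  0.017163  0.012759  1.000000  0.007673  0.067998
smoker   -0.025019  0.076185  0.003750  0.007673  1.000000  0.787251
charges   0.299008  0.057292  0.198341  0.067998  0.787251  1.000000
```

The “smoker” column has by far the strongest correlation with “charges” (about 0.79), followed by age (about 0.30) and BMI (about 0.20).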
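To read a correlation matrix with the target in mind, it can help to pull out only the target column and sort it. The following is a minimal sketch on a hypothetical six-row frame that reuses the article's column names; the values are illustrative only:

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names as the article.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 60],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.8],
    "smoker": [1, 0, 0, 0, 0, 1],
    "charges": [16884.9, 1725.6, 4449.5, 21984.5, 3866.9, 28923.1],
})

# Correlation of every feature with the target, strongest positive first.
target_corr = (
    sample.corr()["charges"]
    .drop("charges")
    .sort_values(ascending=False)
)
print(target_corr)
```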

## Health Insurance Premium Prediction Model

Now let’s move on to training a machine learning model for the task of predicting health insurance premiums. First, I’ll split the data into training and test sets:

```python
x = np.array(data[["age", "sex", "bmi", "smoker"]])
y = np.array(data["charges"])

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)
```
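To see what `test_size=0.2` does, here is a small sketch on synthetic arrays (`x_demo` and `y_demo` are placeholders, not the real data). With 10 rows, 8 go to training and 2 to testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 rows, 4 feature columns (not the real data).
x_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
print(x_tr.shape, x_te.shape)  # (8, 4) (2, 4)
```

Fixing `random_state` makes the split reproducible, so the same rows land in the test set on every run.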

After trying different machine learning algorithms, I found that random forest performed best for this task. So here I will train the model using the random forest regression algorithm:

```python
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
forest.fit(xtrain, ytrain)
```

```
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
```
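The comparison between algorithms is not shown in this article. One possible way to make such a comparison is cross-validation, sketched below on synthetic data. The features, the target formula, and the choice of linear regression as the second candidate are all assumptions made for illustration, so the scores say nothing about the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic features loosely shaped like [age, sex, bmi, smoker]; not the real dataset.
x_demo = np.column_stack([
    rng.integers(18, 65, 200),
    rng.integers(0, 2, 200),
    rng.uniform(18, 40, 200),
    rng.integers(0, 2, 200),
])
# Made-up target with a smoker/BMI interaction plus noise.
y_demo = (
    250 * x_demo[:, 0]
    + 800 * x_demo[:, 3] * x_demo[:, 2]
    + rng.normal(0, 1000, 200)
)

models = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean 5-fold cross-validated R^2; higher is better.
    scores[name] = cross_val_score(model, x_demo, y_demo, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

Swapping in the real `x` and `y` arrays would give scores that are comparable across candidate models.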

Now let’s have a look at the predicted values of the model:

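```python
ypred = forest.predict(xtest)
# Use a new name so the original `data` frame is not overwritten
predictions = pd.DataFrame(data={"Predicted Premium Amount": ypred})
print(predictions.head())
```

```
   Predicted Premium Amount
0              11331.111753
1               5366.132261
2              28257.205036
3               9793.356425
4              34720.204296
```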
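Predicted values on their own do not say how close the model is to the true charges. A common check is to compare predictions with the held-out targets using the mean absolute error and R². The sketch below does this on synthetic data; the feature ranges and the target formula are assumptions for illustration, not the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300

# Synthetic columns in the article's feature order: [age, sex, bmi, smoker].
age = rng.integers(18, 65, n)
sex = rng.integers(0, 2, n)
bmi = rng.uniform(18, 40, n)
smoker = rng.integers(0, 2, n)
x_demo = np.column_stack([age, sex, bmi, smoker])

# Made-up target: charges rise with age and jump for smokers, plus noise.
y_demo = 250 * age + 20000 * smoker + rng.normal(0, 2000, n)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)
model = RandomForestRegressor(random_state=42).fit(x_tr, y_tr)
pred = model.predict(x_te)

mae = mean_absolute_error(y_te, pred)  # average absolute error, in charge units
r2 = r2_score(y_te, pred)              # 1.0 would be a perfect fit
print(f"MAE: {mae:.2f}, R^2: {r2:.3f}")
```

With the article's variables, the equivalent calls would be `mean_absolute_error(ytest, ypred)` and `r2_score(ytest, ypred)`.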

So this is how you can train a machine learning model for the task of health insurance premium prediction using Python.
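To predict the premium for one new person, the input has to be a 2D array with the columns in the same order used for training. The sketch below trains on a tiny made-up set just to show the call shape; the rows, targets, and the `new_person` values are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny made-up training set in the article's feature order: [age, sex, bmi, smoker].
x_demo = np.array([
    [19, 0, 27.9, 1],
    [18, 1, 33.8, 0],
    [28, 1, 33.0, 0],
    [33, 1, 22.7, 0],
    [32, 1, 28.9, 0],
    [45, 0, 30.1, 1],
])
y_demo = np.array([16884.9, 1725.6, 4449.5, 21984.5, 3866.9, 25000.0])

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(x_demo, y_demo)

# One new person: age 40, male (1), BMI 29.5, non-smoker (0).
new_person = np.array([[40, 1, 29.5, 0]])
predicted = model.predict(new_person)
print(predicted[0])
```

The same pattern applies to the trained `forest` model: `forest.predict(new_person)` returns an array with one predicted premium.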

### Summary

The premium for a health insurance policy varies from person to person, as many factors affect it. I hope you liked this article on health insurance premium prediction with machine learning using Python. Feel free to ask your questions in the comments section below.

##### Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data.
