Many studies have been undertaken in the past on the factors affecting a country’s life expectancy, taking into account demographic variables, income composition and death rates. However, those studies did not take the effect of vaccination or the human development index into account. In this article, I will introduce you to a data science project on life expectancy analysis with Python.
Data Science Project on Life Expectancy Analysis
Life expectancy refers to the number of years a person is expected to live based on the statistical average. It depends on the geographical context of the area. Before the modernization of the world, life expectancy was around 30 years in all parts of the world. It began to increase at the beginning of the 19th century, but only in some countries, while it remained low in the rest of the world.
This shows that health standards were not the same all over the world. In the 20th century, this global inequality was reduced, and life expectancy is now approaching 70 to 75 years. No country in the world today has a lower life expectancy than the countries with the highest life expectancy in 1800.
In the section below, I will take you through a Data Science Project on Life Expectancy Analysis with Python. The good thing about this task is that you will learn about the factors that WHO uses to calculate the life expectancy of a country, as the data is provided by WHO.
Life Expectancy Analysis with Python
Now let’s get started with the task of Life Expectancy Analysis with Python. I will start this task by importing the necessary Python libraries and the dataset:
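A minimal sketch of this step, assuming the data comes as a CSV file. The file name in the comment is an assumption, and a two-row made-up sample stands in for the file so the snippet runs on its own:

```python
import io

import pandas as pd

# The plotting code later in the article also expects these aliases:
#   import matplotlib.pyplot as plt
#   import seaborn as sns

# Two made-up rows standing in for the WHO CSV (the real file has 22 columns).
sample_csv = io.StringIO(
    "Country,Year,Status,Life expectancy ,Adult Mortality\n"
    "Country A,2015,Developing,65.0,263\n"
    "Country A,2014,Developing,59.9,271\n"
)

# With the real data, pass the file path instead (the name is an assumption):
# life_expectancy = pd.read_csv("Life Expectancy Data.csv")
life_expectancy = pd.read_csv(sample_csv)

print(life_expectancy.shape)
print(life_expectancy.head())
```

Note that read_csv keeps the header text exactly as written, including stray spaces such as the one at the end of 'Life expectancy '.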

Now let’s have a look at some statistics from the data by using the describe function of Pandas:
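As a minimal sketch of that call, using a small made-up frame in place of the real data:

```python
import pandas as pd

# Made-up values; with the real data, call describe() on life_expectancy.
df = pd.DataFrame({
    "Life expectancy ": [65.0, 59.9, 81.1, None],
    "Year": [2015, 2014, 2015, 2014],
})

# describe() summarizes numeric columns only and skips missing values.
summary = df.describe()
print(summary)
```

The count row is a quick first hint about missing data, because it reports only the non-null entries of each column.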

life_expectancy.columns
Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B', 'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure', 'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years', ' thinness 5-9 years', 'Income composition of resources', 'Schooling'], dtype='object')
So there are only two categorical variables in the data which are country and status. Now let’s change the names of all the columns to make them look uniform:
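One way to do the renaming is sketched below. The old names are copied from the columns output above and the new names from the info() output shown later in the article; assigning them by position is my assumption about how the rename was done, and the empty frame is only a stand-in for the real data:

```python
import pandas as pd

# Headers as printed by life_expectancy.columns (stray spaces, mixed case).
old_columns = [
    'Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
    'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
    'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
    'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population', ' thinness 1-19 years',
    ' thinness 5-9 years', 'Income composition of resources', 'Schooling',
]

# Uniform names, in the same order as the headers above.
new_columns = [
    'Country', 'Year', 'Status', 'Life_expectancy', 'Adult_mortality',
    'Infant_deaths', 'Alcohol', 'Percentage_expenditure', 'HepatitisB',
    'Measles', 'BMI', 'Under_five_deaths', 'Polio', 'Total_expenditure',
    'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'Thinness_1-19_years',
    'Thinness_5-9_years', 'Income_composition_of_resources', 'Schooling',
]

# Empty stand-in frame; with the real data this is the loaded dataset.
life_expectancy = pd.DataFrame(columns=old_columns)

# Assign by position so stray whitespace in the old headers cannot cause a miss.
assert len(new_columns) == len(life_expectancy.columns)
life_expectancy.columns = new_columns
print(list(life_expectancy.columns))
```

Assigning by position avoids the silent misses that a rename dictionary can produce when an old header differs by a single space.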
Data Cleaning:
Now let’s move further on the task of Life Expectancy analysis by looking at the null values in the dataset:
life_expectancy.info()
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   Country                          2938 non-null   object
 1   Year                             2938 non-null   int64
 2   Status                           2938 non-null   object
 3   Life_expectancy                  2928 non-null   float64
 4   Adult_mortality                  2928 non-null   float64
 5   Infant_deaths                    2938 non-null   int64
 6   Alcohol                          2744 non-null   float64
 7   Percentage_expenditure           2938 non-null   float64
 8   HepatitisB                       2385 non-null   float64
 9   Measles                          2938 non-null   int64
 10  BMI                              2904 non-null   float64
 11  Under_five_deaths                2938 non-null   int64
 12  Polio                            2919 non-null   float64
 13  Total_expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  Thinness_1-19_years              2904 non-null   float64
 19  Thinness_5-9_years               2904 non-null   float64
 20  Income_composition_of_resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB
The columns that we found with null values are:
- Life_expectancy
- Adult_mortality
- Alcohol
- HepatitisB
- BMI
- Polio
- Total_expenditure
- Diphtheria
- GDP
- Population
- Thinness_1-19_years
- Thinness_5-9_years
- Income_composition_of_resources
- Schooling
So many of the columns contain null values. Now let’s have a look at how many null values each of these columns has:
print(life_expectancy.isnull().sum())
Country                              0
Year                                 0
Status                               0
Life_expectancy                     10
Adult_mortality                     10
Infant_deaths                        0
Alcohol                            194
Percentage_expenditure               0
HepatitisB                         553
Measles                              0
BMI                                 34
Under_five_deaths                    0
Polio                               19
Total_expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
Thinness_1-19_years                 34
Thinness_5-9_years                  34
Income_composition_of_resources    167
Schooling                          163
dtype: int64
There are many columns with null values, but the number of missing values is not large enough to remove the columns. So imputing missing values would be a good idea. We also know that all columns with missing values are numeric continuous variables.
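To put those counts in proportion, the share of missing values per column can be worked out from the figures printed above and the 2,938 rows reported by info(). This is a small sketch that re-enters the counts by hand; even the worst column, Population, is missing only about 22% of its values:

```python
import pandas as pd

TOTAL_ROWS = 2938  # row count from the info() output

# Null counts copied from the isnull().sum() output (non-zero columns only).
null_counts = pd.Series({
    "Life_expectancy": 10, "Adult_mortality": 10, "Alcohol": 194,
    "HepatitisB": 553, "BMI": 34, "Polio": 19, "Total_expenditure": 226,
    "Diphtheria": 19, "GDP": 448, "Population": 652,
    "Thinness_1-19_years": 34, "Thinness_5-9_years": 34,
    "Income_composition_of_resources": 167, "Schooling": 163,
})

# Percentage of rows that are missing, largest first.
null_share = (null_counts / TOTAL_ROWS * 100).round(1).sort_values(ascending=False)
print(null_share)

# On the real frame the same figure is: life_expectancy.isnull().mean() * 100
```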
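Filling in the missing values with the mean would not be a good idea, because the mean is distorted by the outliers. Instead, we can interpolate within each country and then fill whatever is still missing with the median for that year:
life_expectancy.reset_index(inplace=True)
# After reset_index the first four columns are index, Country, Year and Status.
value_cols = list(life_expectancy.columns)[4:]
# Linear interpolation inside each country; the result has to be assigned back.
life_expectancy[value_cols] = (
    life_expectancy.groupby('Country')[value_cols]
    .transform(lambda s: s.interpolate(method='linear'))
)
# Fill the remaining gaps with the median of the same year.
imputed_data = []
for year in life_expectancy.Year.unique():
    year_data = life_expectancy[life_expectancy.Year == year].copy()
    for col in value_cols:
        year_data[col] = year_data[col].fillna(year_data[col].median())
    imputed_data.append(year_data)
life_expectancy = pd.concat(imputed_data).copy()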
Removing Outliers:
The next step in the task of Life Expectancy analysis is to deal with outliers. Let’s first have a look at the outliers, and then we will see how we can deal with them:
col_dict = {
    'Life_expectancy': 1, 'Adult_mortality': 2, 'Infant_deaths': 3,
    'Alcohol': 4, 'Percentage_expenditure': 5, 'HepatitisB': 6,
    'Measles': 7, 'BMI': 8, 'Under_five_deaths': 9, 'Polio': 10,
    'Total_expenditure': 11, 'Diphtheria': 12, 'HIV/AIDS': 13, 'GDP': 14,
    'Population': 15, 'Thinness_1-19_years': 16, 'Thinness_5-9_years': 17,
    'Income_composition_of_resources': 18, 'Schooling': 19,
}

# Detect outliers in each variable using box plots.
fig = plt.figure(figsize=(20, 30))
for variable, i in col_dict.items():
    plt.subplot(5, 4, i)
    plt.boxplot(life_expectancy[variable])
    plt.title(variable)
    plt.grid(True)
plt.show()

Infant_deaths represents the number of infant deaths per 1,000 population, so any value above 1,000 is unrealistic. We will therefore remove such values as outliers. The same is true for Measles and Under_five_deaths, as both are also numbers per 1,000 population.
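The removal can be sketched as a simple filter. Dropping whole rows is an assumption here (setting the offending values to missing would be another option), and the numbers are made up for illustration:

```python
import pandas as pd

# Made-up values; the second and fourth rows break the per-1,000 limit.
df = pd.DataFrame({
    "Infant_deaths": [62, 1800, 0, 27],
    "Measles": [154, 492, 0, 2787],
    "Under_five_deaths": [83, 2500, 0, 31],
})

per_thousand_cols = ["Infant_deaths", "Measles", "Under_five_deaths"]

# Keep only the rows where every per-1,000 figure is at most 1,000.
mask = (df[per_thousand_cols] <= 1000).all(axis=1)
cleaned = df[mask].reset_index(drop=True)
print(cleaned)
```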
As we can see, the data report health spending of up to 20,000% of GDP for some countries, while most countries are below 2,500%. Since the values in the Percentage_expenditure, GDP, and Population columns are very large, it is better to take their logarithm, or to use winsorization if necessary.
The BMI values are very unrealistic: a BMI above 40 is considered extreme obesity, yet the median is over 40 and some countries have an average of around 60, which is not possible. We can delete this whole column.
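A minimal sketch of the logarithmic transformation follows. The log_ prefix matches the column names used later in the article; using log1p rather than a plain logarithm is my assumption, chosen so that zero values do not turn into negative infinity. The numbers are made up:

```python
import numpy as np
import pandas as pd

# Made-up values, including a zero to show why log1p is used.
df = pd.DataFrame({
    "Percentage_expenditure": [0.0, 71.3, 19479.9],
    "GDP": [584.3, 612.7, 119172.7],
    "Population": [33736494.0, 327582.0, 34.0],
})

for col in ["Percentage_expenditure", "GDP", "Population"]:
    # log1p(x) = log(1 + x), so a value of 0 maps to 0 instead of -inf.
    df["log_" + col] = np.log1p(df[col])

print(df.filter(like="log_"))
```

After the transformation the largest GDP value is below 12, so a handful of huge values no longer dominates the scale.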
As almost all other columns have outliers, we can use winsorization:
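Winsorization caps extreme values instead of deleting the rows. The sketch below does this by clipping each column at chosen quantiles, which is one common way to winsorize and not necessarily the exact routine or limits used for this dataset. The helper name, the 10% limits, and the sample values are all assumptions; the winz_ prefix matches the column names used later in the article:

```python
import pandas as pd


def winsorize_series(values, lower=0.05, upper=0.05):
    """Clip values below the `lower` quantile and above the (1 - upper) quantile."""
    low = values.quantile(lower)
    high = values.quantile(1 - upper)
    return values.clip(lower=low, upper=high)


# Made-up values with one very small and one very large observation.
df = pd.DataFrame(
    {"Adult_mortality": [1, 120, 130, 140, 150, 160, 170, 180, 190, 700]}
)

df["winz_Adult_mortality"] = winsorize_series(
    df["Adult_mortality"], lower=0.1, upper=0.1
)
print(df)
```

In practice the limits would be tuned per column by checking the box plots again after the transformation, since each variable has a different share of extreme values.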

Life Expectancy Analysis
Now we have finished cleaning the data and have dealt with the outliers in the dataset. Let’s move forward with the task of Life Expectancy Analysis, starting by exploring the distributions in the data and then looking at the correlation:
fig = plt.figure(figsize=(20, 20))
for variable, i in col_dict_winz.items():
    plt.subplot(5, 6, i)
    plt.hist(life_expectancy[variable])
    plt.title(variable)
    plt.ylabel('')
    plt.grid(True)
plt.show()

life_exp = life_expectancy[[
    'Year', 'Country', 'Status', 'winz_Life_expectancy', 'winz_Adult_mortality',
    'Infant_deaths', 'winz_Alcohol', 'log_Percentage_expenditure',
    'winz_HepatitisB', 'Measles', 'Under_five_deaths', 'winz_Polio',
    'winz_Total_expenditure', 'winz_Diphtheria', 'winz_HIV/AIDS', 'log_GDP',
    'log_Population', 'winz_Thinness_1-19_years', 'winz_Thinness_5-9_years',
    'winz_Income_composition_of_resources', 'winz_Schooling',
]]

plt.figure(figsize=(15, 10))
# Country and Status are text columns, so correlate the numeric columns only.
sns.heatmap(life_exp.select_dtypes(include='number').corr(), annot=True, linewidths=4)

Observations from the above correlation:
- Adult_mortality has a negative relationship with Schooling and Income_composition_of_resources, and a positive relationship with HIV/AIDS.
- Infant_deaths and Under_five_deaths have a strong positive relationship.
- Schooling and Alcohol have a positive relationship.
- Percentage_expenditure has a positive relationship with Schooling, Income_composition_of_resources, GDP, and Life_expectancy.
- HepatitisB has a strong positive relationship with Polio and Diphtheria.
- Polio also has a strong positive relationship with Diphtheria, HepatitisB, and Life_expectancy.
- Diphtheria has a strong positive relationship with Polio and Life_expectancy.
As we can see from the heat map, Life_expectancy has a positive relationship with Schooling, Income_composition_of_resources, GDP, Diphtheria, Polio, and Percentage_expenditure. Life_expectancy has a negative relationship with Adult_mortality, Thinness_1-19_years, Thinness_5-9_years, HIV/AIDS, Under_five_deaths, and Infant_deaths. Let’s explore them in detail to conclude the task of life expectancy analysis:
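The comparison between developed and developing countries can also be made numerically. This is a minimal sketch that groups by the Status column; the values are made up for illustration and are not taken from the dataset:

```python
import pandas as pd

# Made-up values standing in for the cleaned dataset.
df = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developing", "Developing"],
    "winz_Life_expectancy": [80.0, 82.0, 60.0, 70.0],
})

# Average life expectancy for each development status.
mean_by_status = df.groupby("Status")["winz_Life_expectancy"].mean()
print(mean_by_status)
```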


We can see from the two graphs above that developed countries have a higher life expectancy than developing countries. I hope you liked this article on a data science project on Life Expectancy analysis with Python. Feel free to ask your questions in the comments section below.