Life Expectancy Analysis with Python

Many past studies have examined the factors affecting a country’s life expectancy, taking into account demographic variables, income composition, and mortality rates. However, the effects of immunization and the human development index were largely not taken into account. In this article, I will introduce you to a data science project on life expectancy analysis with Python.

Data Science Project on Life Expectancy Analysis

Life expectancy refers to the number of years a person is expected to live, based on a statistical average. It depends on the geographical context of the area. Before the modernization of the world, life expectancy was around 30 years in all parts of the world. It began to increase at the beginning of the 19th century, but only in some countries, while it remained low in the rest of the world.

This shows that health standards were not the same all over the world. During the 20th century this global inequality narrowed, and life expectancy is now approaching 70 to 75 years. No country in the world today has a lower life expectancy than the countries with the highest life expectancy in 1800.

In the section below, I will take you through a Data Science Project on Life Expectancy Analysis with Python. The good thing about this task is that here you will learn about the factors that WHO uses to calculate the Life Expectancy of a country as the data is provided by WHO.

Life Expectancy Analysis with Python

Now let’s get started with the task of Life Expectancy Analysis with Python. I will start this task by importing the necessary Python libraries and the dataset:
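The import and loading step is not shown in the text, so here is a minimal sketch of what it might look like. The helper name `load_life_expectancy` and the tiny in-memory sample are assumptions made for illustration; in practice you would pass the path to the WHO CSV file you downloaded.

```python
import io

import pandas as pd


def load_life_expectancy(source):
    """Read the life expectancy CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up two-row sample standing in for the real WHO file (values are illustrative only)
sample_csv = io.StringIO(
    "Country,Year,Status,Life expectancy \n"
    "Country A,2015,Developing,65.0\n"
    "Country B,2015,Developed,80.0\n"
)

life_expectancy = load_life_expectancy(sample_csv)
print(life_expectancy.shape)
print(life_expectancy.head())
```

With the real file, `life_expectancy.shape` should report the 22 columns mentioned in the article.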

The dataset contains 22 columns.

Now let’s have a look at some statistics from the data by using the describe function of Pandas, and then at the column names:
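As a rough sketch of this step, the snippet below calls `describe()`, prints the column index, and picks out the non-numeric columns. The miniature frame `df` is a made-up stand-in for the real dataset, which has 22 columns.

```python
import pandas as pd

# Hypothetical miniature frame; the values are illustrative only
df = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
    "Life expectancy ": [60.0, 61.0, 80.0, 81.0],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)

# The raw column names, including any stray leading or trailing spaces
print(df.columns)

# Columns that are not numeric are the categorical candidates
categorical = df.select_dtypes(exclude="number").columns.tolist()
print(categorical)
```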

Index(['Country', 'Year', 'Status', 'Life expectancy ', 'Adult Mortality',
       'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
       'Measles ', ' BMI ', 'under-five deaths ', 'Polio', 'Total expenditure',
       'Diphtheria ', ' HIV/AIDS', 'GDP', 'Population',
       ' thinness  1-19 years', ' thinness 5-9 years',
       'Income composition of resources', 'Schooling'],
      dtype='object')

So there are only two categorical variables in the data, which are Country and Status. Several column names also carry stray spaces and inconsistent capitalization, so let’s rename all the columns to make them uniform:
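The renaming code is not shown, so the following is one possible way to do it. The mapping is inferred from the raw names listed above and the cleaned names that appear in the later output; the variable names `rename_map`, `raw` and `clean` are my own.

```python
import pandas as pd

# Raw headers (note the stray spaces) mapped to the cleaned names used later on
rename_map = {
    'Life expectancy ': 'Life_expectancy',
    'Adult Mortality': 'Adult_mortality',
    'infant deaths': 'Infant_deaths',
    'percentage expenditure': 'Percentage_expenditure',
    'Hepatitis B': 'HepatitisB',
    'Measles ': 'Measles',
    ' BMI ': 'BMI',
    'under-five deaths ': 'Under_five_deaths',
    'Total expenditure': 'Total_expenditure',
    'Diphtheria ': 'Diphtheria',
    ' HIV/AIDS': 'HIV/AIDS',
    ' thinness  1-19 years': 'Thinness_1-19_years',
    ' thinness 5-9 years': 'Thinness_5-9_years',
    'Income composition of resources': 'Income_composition_of_resources',
}

# Empty frame carrying a few of the raw headers, just to demonstrate the rename
raw = pd.DataFrame(columns=['Country', 'Year', 'Status', 'Life expectancy ',
                            ' BMI ', ' thinness  1-19 years', 'Schooling'])

# Columns missing from the mapping (Country, Year, ...) are left unchanged
clean = raw.rename(columns=rename_map)
print(list(clean.columns))
```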

Data Cleaning:

Now let’s move further on the task of Life Expectancy analysis by looking at the null values in the dataset:
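A summary like the one that follows can be produced with `DataFrame.info()`. Here is a small sketch on a made-up frame; writing the report to a buffer is optional and only done so the text can be inspected.

```python
import io

import pandas as pd

# Hypothetical sample with one missing value in the Alcohol column
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Alcohol": [1.2, None, 0.5],
})

# info() lists each column with its non-null count and dtype
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

A column whose non-null count is lower than the number of rows contains missing values.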

 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Country                          2938 non-null   object 
 1   Year                             2938 non-null   int64  
 2   Status                           2938 non-null   object 
 3   Life_expectancy                  2928 non-null   float64
 4   Adult_mortality                  2928 non-null   float64
 5   Infant_deaths                    2938 non-null   int64  
 6   Alcohol                          2744 non-null   float64
 7   Percentage_expenditure           2938 non-null   float64
 8   HepatitisB                       2385 non-null   float64
 9   Measles                          2938 non-null   int64  
 10  BMI                              2904 non-null   float64
 11  Under_five_deaths                2938 non-null   int64  
 12  Polio                            2919 non-null   float64
 13  Total_expenditure                2712 non-null   float64
 14  Diphtheria                       2919 non-null   float64
 15  HIV/AIDS                         2938 non-null   float64
 16  GDP                              2490 non-null   float64
 17  Population                       2286 non-null   float64
 18  Thinness_1-19_years              2904 non-null   float64
 19  Thinness_5-9_years               2904 non-null   float64
 20  Income_composition_of_resources  2771 non-null   float64
 21  Schooling                        2775 non-null   float64
dtypes: float64(16), int64(4), object(2)
memory usage: 505.1+ KB

The columns that we found with null values are:

  1. Life_expectancy
  2. Adult_mortality
  3. Alcohol
  4. HepatitisB
  5. BMI
  6. Polio
  7. Total_expenditure
  8. Diphtheria
  9. GDP
  10. Population
  11. Thinness_1-19_years
  12. Thinness_5-9_years
  13. Income_composition_of_resources
  14. Schooling

So quite a few columns contain null values. Now let’s have a look at how many null values each of these columns has:
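Counting nulls per column is usually done with `isnull().sum()`. The sketch below also derives the share of missing rows, which helps when deciding between dropping a column and imputing it. The sample frame and the names `null_counts`, `cols_with_nulls` and `null_share` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with gaps in two columns
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP": [1.0, None, 3.0, None],
    "Polio": [90.0, 85.0, None, 70.0],
})

# Number of missing values in every column
null_counts = df.isnull().sum()
print(null_counts)

# Keep only the columns that actually have missing values
cols_with_nulls = null_counts[null_counts > 0]

# Missing values as a percentage of all rows
null_share = (cols_with_nulls / len(df) * 100).round(1)
print(null_share)
```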

Country                              0
Year                                 0
Status                               0
Life_expectancy                     10
Adult_mortality                     10
Infant_deaths                        0
Alcohol                            194
Percentage_expenditure               0
HepatitisB                         553
Measles                              0
BMI                                 34
Under_five_deaths                    0
Polio                               19
Total_expenditure                  226
Diphtheria                          19
HIV/AIDS                             0
GDP                                448
Population                         652
Thinness_1-19_years                 34
Thinness_5-9_years                  34
Income_composition_of_resources    167
Schooling                          163
dtype: int64

There are many columns with null values, but the number of missing values is not large enough to remove the columns. So imputing missing values would be a good idea. We also know that all columns with missing values are numeric continuous variables.

Filling in the missing values with the mean would not be a good idea because of the outliers. Instead, we can interpolate within each country and then fill the remaining gaps with each year’s median:

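# columns[4:] skips Country, Year, Status and Life_expectancy
num_cols = list(life_expectancy.columns)[4:]
# Interpolate gaps within each country's series and assign the result back
life_expectancy[num_cols] = life_expectancy.groupby('Country')[num_cols].transform(
    lambda col: col.interpolate(method='linear'))

# Fill whatever is still missing with that year's median for the column
imputed_data = []
for year in life_expectancy.Year.unique():
    year_data = life_expectancy[life_expectancy.Year == year].copy()
    for col in num_cols:
        year_data[col] = year_data[col].fillna(year_data[col].median())
    imputed_data.append(year_data)
life_expectancy = pd.concat(imputed_data)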

Removing Outliers:

The next step in the task of Life Expectancy analysis is to deal with outliers. Let’s first have a look at the outliers, and then we will see how to deal with them:

col_dict = {'Life_expectancy':1,'Adult_mortality':2,'Infant_deaths':3,'Alcohol':4,'Percentage_expenditure':5,'HepatitisB':6,'Measles':7,'BMI':8,'Under_five_deaths':9,'Polio':10,'Total_expenditure':11,'Diphtheria':12,'HIV/AIDS':13,'GDP':14,'Population':15,'Thinness_1-19_years':16,'Thinness_5-9_years':17,'Income_composition_of_resources':18,'Schooling':19}

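# Detect outliers in each variable using box plots.
fig = plt.figure(figsize=(20,30))

for variable,i in col_dict.items():
    # Loop body reconstructed as an assumption: one box plot per variable on a 5x4 grid
    plt.subplot(5,4,i)
    plt.boxplot(life_expectancy[variable])
    plt.title(variable)

[Figure: life expectancy analysis: outliers]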

Infant_deaths represents the number of infant deaths per 1,000 population, so a value above 1,000 is unrealistic. We will therefore remove such values as outliers. The same is true for Measles and Under_five_deaths, as both are also numbers per 1,000 population.
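One way to drop these unrealistic rows is a boolean mask over the three per-1,000 columns. This is only a sketch on made-up numbers; the article does not show how the rows were actually removed, and the names `per_thousand_cols`, `valid` and `cleaned` are my own.

```python
import pandas as pd

# Hypothetical sample; values above 1000 are impossible for a per-1,000 rate
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Infant_deaths": [12, 1800, 40],
    "Measles": [0, 300, 2500],
    "Under_five_deaths": [15, 2500, 55],
})

per_thousand_cols = ["Infant_deaths", "Measles", "Under_five_deaths"]

# A row is valid only if every per-1,000 column is at most 1000
valid = (df[per_thousand_cols] <= 1000).all(axis=1)
cleaned = df[valid].reset_index(drop=True)
print(cleaned)
```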

As we can see, some countries spend up to 20,000% of their GDP on health. Most countries spend less than 2,500% of their GDP on health. Since the values in the Percentage_expenditure, GDP, and Population columns are very large and spread over a wide range, it is better to take their logarithm or use winsorization if necessary.
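A logarithmic transform compresses such wide ranges. The sketch below uses `np.log1p`, which computes log(1 + x) and is therefore defined for zero values. The `log_` prefix and the sample numbers are assumptions, not part of the original project.

```python
import numpy as np
import pandas as pd

# Hypothetical sample spanning several orders of magnitude
df = pd.DataFrame({
    "GDP": [0.0, 99.0, 9999.0],
    "Population": [1e3, 1e6, 1e9],
})

for col in ["GDP", "Population"]:
    # log(1 + x) keeps zeros at zero and shrinks very large values
    df["log_" + col] = np.log1p(df[col])

print(df)
```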

The BMI values are very unrealistic, because a value above 40 is considered extreme obesity. The median is over 40 and some countries have an average of around 60, which is not possible. We can delete this whole column.
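Checking how many values exceed 40 and then dropping the column could look like the sketch below. The sample values are made up, and `share_above_40` is a name I introduced for the check.

```python
import pandas as pd

# Hypothetical sample with one implausible country-level BMI
df = pd.DataFrame({
    "Country": ["A", "B"],
    "BMI": [19.1, 58.0],
    "Schooling": [10.1, 12.4],
})

# Fraction of rows with a BMI above the extreme-obesity threshold of 40
share_above_40 = (df["BMI"] > 40).mean()
print(share_above_40)

# Remove the unreliable column entirely
df = df.drop(columns="BMI")
print(list(df.columns))
```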

As almost all other columns have outliers, we can use winsorization:
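The winsorization code is not shown in the text. As a stand-in, the sketch below clips each tail to a boundary quantile with pandas, which is a simple percentile-based form of winsorization (SciPy also offers `scipy.stats.mstats.winsorize`). The function name `winsorize_series`, the chosen limits and the sample data are assumptions; only the `winz_` prefix comes from the column names used later in the article.

```python
import pandas as pd


def winsorize_series(values, lower=0.05, upper=0.05):
    """Clip the lowest `lower` and highest `upper` fractions to the boundary quantiles."""
    low = values.quantile(lower)
    high = values.quantile(1 - upper)
    return values.clip(lower=low, upper=high)


# Hypothetical sample with one extreme value at the top
df = pd.DataFrame({
    "Alcohol": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
})

# Cap only the upper 10% here; the limits would be tuned per column in practice
df["winz_Alcohol"] = winsorize_series(df["Alcohol"], lower=0.0, upper=0.1)
print(df)
```

Unlike deleting rows, this keeps every observation and only pulls the extreme values back to the chosen boundary.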

[Figure: data after removing outliers]

Life Expectancy Analysis

Now we have done all the data cleaning and removed the outliers from the dataset. Let’s move forward with the task of Life Expectancy Analysis, starting by exploring the data and looking at the correlation:

fig = plt.figure(figsize=(20,20))
for variable,i in col_dict_winz.items():

[Figure: life expectancy analysis]

life_exp = life_expectancy[['Year', 'Country', 'Status','winz_Life_expectancy','winz_Adult_mortality','Infant_deaths','winz_Alcohol',
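# numeric_only=True skips the text columns (Country, Status), which recent pandas versions refuse to correlate
sns.heatmap(life_exp.corr(numeric_only=True), annot=True, linewidths=4)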

Observations from the above correlation:

  • Adult_mortality has a negative relationship with Schooling and Income_composition_of_resources, and a positive relationship with HIV/AIDS.
  • Infant_deaths and Under_five_deaths have a strong positive relationship.
  • Schooling and Alcohol have a positive relationship.
  • Percentage_expenditure has a positive relationship with Schooling, Income_composition_of_resources, GDP, and Life_expectancy.
  • HepatitisB has a strong positive relationship with Polio and Diphtheria.
  • Polio also has a strong positive relationship with Diphtheria, HepatitisB, and Life_expectancy.
  • Diphtheria has a strong positive relationship with Polio and Life_expectancy.
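Relationships like the ones listed above can also be read off numerically instead of from the heat map. The sketch below ranks the correlations of every numeric column with the target. The sample frame is made up (and deliberately perfectly linear), and the name `target_corr` is my own.

```python
import pandas as pd

# Hypothetical, perfectly linear sample so the correlations are easy to verify
df = pd.DataFrame({
    "Life_expectancy": [50.0, 60.0, 70.0, 80.0],
    "Schooling": [5.0, 8.0, 11.0, 14.0],
    "Adult_mortality": [400.0, 300.0, 200.0, 100.0],
    "Status": ["Developing", "Developing", "Developed", "Developed"],
})

# Correlate numeric columns only; text columns such as Status are excluded
corr = df.select_dtypes("number").corr()

# Correlation of each feature with the target, most negative first
target_corr = corr["Life_expectancy"].drop("Life_expectancy").sort_values()
print(target_corr)
```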

As we can see from the heat map, Life_expectancy has a positive relationship with Schooling, Income_composition_of_resources, GDP, Diphtheria, Polio, and Percentage_expenditure. Life_expectancy has a negative relationship with Adult_mortality, Thinness_1-19_years, Thinness_5-9_years, HIV/AIDS, Under_five_deaths, and Infant_deaths. Let’s explore them in detail to conclude the task of life expectancy analysis:

[Figure: life expectancy according to status]

[Figure: life expectancy: developed vs developing]
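The comparison shown in these graphs can be summarised with a simple group-by on Status. This is a sketch on made-up values; the column name `winz_Life_expectancy` follows the winsorized naming used earlier, and `by_status` is my own name.

```python
import pandas as pd

# Hypothetical sample; the numbers are illustrative only
df = pd.DataFrame({
    "Status": ["Developed", "Developed", "Developing", "Developing"],
    "winz_Life_expectancy": [80.0, 82.0, 60.0, 66.0],
})

# Mean and median life expectancy for each development status
by_status = df.groupby("Status")["winz_Life_expectancy"].agg(["mean", "median"])
print(by_status)
```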

We can see from the two graphs above that developed countries have a higher life expectancy than developing countries. I hope you liked this article on a data science project on Life Expectancy analysis with Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal