
Estimating Covid-19 Death Rate
In this data science tutorial, we will analyze the death rate of the Covid-19 pandemic using Python.
You can download the data set we need for this task from here:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
Now load the main data table and display it:
worldometer_df = pd.read_csv('worldometer_snapshots_April18_to_May18.csv')
worldometer_df

To display a sub-table for a specific country:
country_name = 'USA'
country_df = worldometer_df.loc[worldometer_df['Country'] == country_name, :].reset_index(drop=True)
country_df

To display a sub-table for a specific date:
selected_date = datetime.strptime('18/05/2020', '%d/%m/%Y')
selected_date_df = worldometer_df.loc[worldometer_df['Date'] == selected_date.strftime('%Y-%m-%d'), :].reset_index(drop=True)
selected_date_df
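The country and date filters above use the same boolean-mask pattern, so they can be combined. The following is a minimal sketch on a made-up miniature table (the values and the names `toy_df` and `usa_on_date_df` are hypothetical; only the `Date` and `Country` column names come from the tutorial). It also parses the date strings with `pd.to_datetime`, which is one possible alternative to comparing formatted strings.

```python
import pandas as pd

# Hypothetical miniature table with the same 'Date' and 'Country' columns.
toy_df = pd.DataFrame({
    'Date': ['2020-05-17', '2020-05-17', '2020-05-18', '2020-05-18'],
    'Country': ['USA', 'Italy', 'USA', 'Italy'],
    'Total Cases': [100, 50, 110, 52],  # made-up numbers
})

# Parse the date strings once so comparisons are done on real timestamps.
toy_df['Date'] = pd.to_datetime(toy_df['Date'], format='%Y-%m-%d')

# Combine two boolean masks with & to select one country on one date.
mask = (toy_df['Country'] == 'USA') & (toy_df['Date'] == pd.Timestamp('2020-05-18'))
usa_on_date_df = toy_df.loc[mask, :].reset_index(drop=True)
print(usa_on_date_df)
```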

Now let's take the last date in the data set and continue our analysis:
last_date = datetime.strptime('18/05/2020', '%d/%m/%Y')
last_date_df = worldometer_df.loc[worldometer_df['Date'] == last_date.strftime('%Y-%m-%d'), :].reset_index(drop=True)
last_date_df

Now calculate the naive death rate (the case fatality ratio) for each country and show a histogram:
last_date_df['Case Fatality Ratio'] = last_date_df['Total Deaths'] / last_date_df['Total Cases']
plt.figure(figsize=(12,8))
plt.hist(100 * np.array(last_date_df['Case Fatality Ratio']), bins=np.arange(35))
plt.xlabel('Death Rate (%)', fontsize=16)
plt.ylabel('Number of Countries', fontsize=16)
plt.title('Histogram of Death Rates for various Countries', fontsize=18)
plt.show()
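To make the naive calculation concrete, here is a small worked sketch on made-up counts for three hypothetical countries. The numbers are illustrative assumptions, not values from the data set; only the column names match the tutorial.

```python
import pandas as pd

# Made-up counts for three hypothetical countries (not real data).
toy_df = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Total Cases': [2000, 500, 10000],
    'Total Deaths': [100, 5, 150],
})

# Naive case fatality ratio: reported deaths divided by reported cases.
toy_df['Case Fatality Ratio'] = toy_df['Total Deaths'] / toy_df['Total Cases']

# Express the ratio as a percentage, rounded for display.
cfr_percent = (100 * toy_df['Case Fatality Ratio']).round(2).tolist()
print(cfr_percent)
```

The ratio depends only on reported counts, so anything that changes how many cases get reported changes the result.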

We see a large spread of death rates between countries.
We would not normally expect this, since the disease is likely to affect people similarly in different regions of the world. The question arises: what can explain this spread?
Filter out countries with a small number of cases:
min_number_of_cases = 1000
greatly_affected_df = last_date_df.loc[last_date_df['Total Cases'] > min_number_of_cases, :]
plt.figure(figsize=(12,8))
plt.hist(100 * np.array(greatly_affected_df['Case Fatality Ratio']), bins=np.arange(35))
plt.xlabel('Death Rate (%)', fontsize=16)
plt.ylabel('Number of Countries', fontsize=16)
plt.title('Histogram of Death Rates for various Countries', fontsize=18)
plt.show()

We can see that some of the outliers are removed, but the spread is still large and still needs to be accounted for.
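One reason for filtering out countries with few cases is that a ratio computed from a small count is statistically noisy. The sketch below uses the binomial standard error to illustrate this. The rate of 2% is an assumed value chosen only for illustration, and `cfr_standard_error` is a hypothetical helper, not part of the tutorial's code.

```python
import numpy as np

# Assumed true death rate, used only for illustration (not an estimate).
assumed_rate = 0.02

def cfr_standard_error(rate, num_cases):
    """Binomial standard error of a death-rate estimate based on num_cases cases."""
    return np.sqrt(rate * (1 - rate) / num_cases)

# The uncertainty shrinks as the number of cases grows.
for num_cases in [50, 1000, 100000]:
    se_percent = 100 * cfr_standard_error(assumed_rate, num_cases)
    print('cases=%d  standard error=%.2f%%' % (num_cases, se_percent))
```

Sampling noise of this kind shrinks quickly with the case count, so it is unlikely to explain the spread that remains among countries with more than 1000 cases.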
Plot scatter of death rate as a function of testing quality
We know that some countries were more thorough in their testing strategy and some were less so. Let's plot the death rate as a function of testing quality, measured as the number of tests per positive case:
last_date_df['Num Tests per Positive Case'] = last_date_df['Total Tests'] / last_date_df['Total Cases']
min_number_of_cases = 1000
greatly_affected_df = last_date_df.loc[last_date_df['Total Cases'] > min_number_of_cases, :]
x_axis_limit = 80
death_rate_percent = 100 * np.array(greatly_affected_df['Case Fatality Ratio'])
num_test_per_positive = np.array(greatly_affected_df['Num Tests per Positive Case'])
# cap very large values so they stay inside the plotted range
num_test_per_positive[num_test_per_positive > x_axis_limit] = x_axis_limit
total_num_deaths = np.array(greatly_affected_df['Total Deaths'])
population = np.array(greatly_affected_df['Population'])
plt.figure(figsize=(16,12))
# marker size grows with population, color with the (log) number of deaths
plt.scatter(x=num_test_per_positive, y=death_rate_percent, s=0.5*np.power(np.log(1+population),2), c=np.log10(1+total_num_deaths))
plt.colorbar()
plt.ylabel('Death Rate (%)', fontsize=16)
plt.xlabel('Number of Tests per Positive Case', fontsize=16)
plt.title('Death Rate as function of Testing Quality', fontsize=18)
plt.xlim(-1, x_axis_limit + 12)
plt.ylim(-0.2,17)
# plot the names of selected countries on top of the figure
#countries_to_display = greatly_affected_df['Country'].unique().tolist()
countries_to_display = ['USA', 'Russia', 'Spain', 'Brazil', 'UK', 'Italy', 'France', 'Germany', 'India', 'Canada',
                        'Belgium', 'Mexico', 'Netherlands', 'Sweden', 'Portugal', 'UAE', 'Poland', 'Indonesia', 'Romania',
                        'Israel', 'Thailand', 'Kyrgyzstan', 'El Salvador', 'S. Korea', 'Denmark', 'Serbia', 'Norway',
                        'Algeria', 'Bahrain', 'Slovenia', 'Greece', 'Cuba', 'Hong Kong', 'Lithuania', 'Australia',
                        'Morocco', 'Malaysia', 'Nigeria', 'Moldova', 'Ghana', 'Armenia', 'Bolivia', 'Iraq', 'Hungary',
                        'Cameroon', 'Azerbaijan']
country_names = np.array(greatly_affected_df['Country'])
for country_name in countries_to_display:
    # use positions in the filtered arrays, not the DataFrame's index labels
    for i in np.flatnonzero(country_names == country_name):
        plt.text(x=num_test_per_positive[i] + 0.5, y=death_rate_percent[i] + 0.2, s=country_name, fontsize=10)
plt.show()
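When annotating points like this, note that a filtered DataFrame keeps its original index labels, which may not match positions in NumPy arrays built from it. The following is a minimal sketch on a made-up table (all names and values here are hypothetical) that looks up the position with `np.flatnonzero` on a boolean mask:

```python
import numpy as np
import pandas as pd

# Made-up table; filtering keeps the original index labels (here 1 and 3).
toy_df = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Total Cases': [10, 5000, 20, 8000],
})
filtered_df = toy_df.loc[toy_df['Total Cases'] > 1000, :]
cases = np.array(filtered_df['Total Cases'])  # two elements, positions 0 and 1

# Index label of country 'D' in the filtered frame vs. its position in the array.
label = filtered_df.index[filtered_df['Country'] == 'D'][0]
position = np.flatnonzero((filtered_df['Country'] == 'D').to_numpy())[0]
print(label, position, cases[position])
```

In this toy case, `cases[label]` would raise an `IndexError` because the array has only two elements, while `cases[position]` returns the intended value. Calling `reset_index(drop=True)` after filtering is another way to keep labels and positions aligned.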

We can clearly see that the better the testing, the lower the variability of the death rate between countries.
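One possible explanation is that limited testing misses many mild cases, which shrinks the denominator of the naive death rate. The toy model below sketches this effect. Every number in it is an assumption made for illustration, and `naive_death_rate` is a hypothetical helper; it also assumes that all deaths are recorded regardless of testing.

```python
def naive_death_rate(true_rate, detected_share):
    """Naive death rate when only detected_share of infections are reported as
    cases, assuming every death is still recorded (a simplifying assumption)."""
    return true_rate / detected_share

# Same assumed true rate everywhere; only the share of detected infections differs.
assumed_true_rate = 0.01
for detected_share in [1.0, 0.5, 0.1]:
    rate_percent = 100 * naive_death_rate(assumed_true_rate, detected_share)
    print('detected share=%.1f  naive death rate=%.1f%%' % (detected_share, rate_percent))
```

Under these assumptions, countries with identical underlying rates can report very different naive rates, and the differences shrink as the detected share approaches 1.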
Now let's look at data from the best-testing countries:
Let's decide that the cutoff for a good-testing country is 50 tests per positive case.
good_testing_threshold = 50
good_testing_df = greatly_affected_df.loc[greatly_affected_df['Num Tests per Positive Case'] > good_testing_threshold, :]
good_testing_df

Let's calculate the death rate for these countries:
estimated_death_rate_percent = 100 * good_testing_df['Total Deaths'].sum() / good_testing_df['Total Cases'].sum()
print('Death Rate only for "good testing countries" is %.2f%%' % estimated_death_rate_percent)
# Output: Death Rate only for "good testing countries" is 1.36%
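Note that dividing summed deaths by summed cases gives a pooled rate, which is the same as a case-weighted average of the per-country rates and generally differs from their plain average. Here is a minimal sketch on made-up counts (the arrays below are hypothetical, not taken from the data set):

```python
import numpy as np

# Made-up counts for three hypothetical countries (not real data).
total_cases = np.array([1000, 20000, 5000])
total_deaths = np.array([30, 200, 100])

per_country_rate = total_deaths / total_cases

# Pooled rate: summed deaths over summed cases, as in the calculation above.
pooled_rate = total_deaths.sum() / total_cases.sum()
# The same quantity written as an average of per-country rates weighted by cases.
weighted_mean_rate = np.average(per_country_rate, weights=total_cases)
# The plain average treats every country equally, whatever its case count.
unweighted_mean_rate = per_country_rate.mean()

print('pooled: %.2f%%' % (100 * pooled_rate))
print('case-weighted mean: %.2f%%' % (100 * weighted_mean_rate))
print('unweighted mean: %.2f%%' % (100 * unweighted_mean_rate))
```

The pooled figure is therefore dominated by the countries with the most cases, which is worth keeping in mind when interpreting the estimate.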