Analyze Healthcare Data with Python

In this article, I will take you through how to analyze healthcare data with Python. The process of data analysis remains almost the same in most cases, but some domains rely heavily on categorical data. Healthcare is one such domain, so here you will learn how to analyze healthcare data with Python.

The data I will be using in this article is from India. It comes from NTR Vaidya Seva (or Arogya Seva), the flagship health care program of the government of Andhra Pradesh, India, under which lower-middle-class and low-income citizens of the state can get free health care for many major illnesses and ailments. A similar program also exists in the neighbouring state of Telangana. You can easily download this dataset from here.

Analyze Healthcare Data

Now, let’s import all the libraries that we need to analyze the healthcare data with Python:

# import requisite libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic: render plots inline (only works inside a notebook)
%matplotlib inline

Now let’s read the data and have a quick look at some initial rows from the data:

data = pd.read_csv("ntrarogyaseva.csv")
data.head()

To have a quick look at the summary statistics, we just need to use the describe function:

# print summary statistics
data.describe()
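
Note that describe only summarizes the numeric columns by default. As a minimal sketch, here is how the text columns and the missing values could be checked as well. The toy dataframe below is hypothetical and only mimics a few columns of the dataset:

```python
import pandas as pd

# Hypothetical toy rows that mimic a few columns of the dataset
toy = pd.DataFrame({
    "AGE": [56, 37, 0, 61],
    "SEX": ["Male", "Female", "Male(Child)", None],
    "CLAIM_AMOUNT": [11000.0, None, 15000.0, 30000.0],
})

# describe() covers only the numeric columns unless include="all" is passed
numeric_summary = toy.describe()
full_summary = toy.describe(include="all")

# count the missing values in each column
missing = toy.isna().sum()
print(missing)
```

On the real data, the same calls would be made on data instead of toy.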

Now, to analyze this healthcare data in a better way, we first need to see which columns the data is organized into. So let’s have a quick look at the columns of the dataset:

# display all the column names in the data
data.columns

Index(['   ', 'AGE', 'SEX', 'CASTE_NAME', 'CATEGORY_CODE', 'CATEGORY_NAME',
       'SURGERY_CODE', 'SURGERY', 'VILLAGE', 'MANDAL_NAME', 'DISTRICT_NAME',
       'PREAUTH_DATE', 'PREAUTH_AMT', 'CLAIM_DATE', 'CLAIM_AMOUNT',
       'HOSP_NAME', 'HOSP_TYPE', 'HOSP_LOCATION', 'HOSP_DISTRICT',
       'SURGERY_DATE', 'DISCHARGE_DATE', 'Mortality Y / N', 'MORTALITY_DATE',
       'SRC_REGISTRATION'],
      dtype='object')
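
The first column has a blank name and one column name contains spaces and slashes, which makes them awkward to select. A possible way to handle this is to rename them. This is only a sketch: the new names ROW_ID and MORTALITY and the toy values are my own assumptions, not part of the dataset:

```python
import pandas as pd

# Hypothetical toy frame that reuses two awkward column names from the dataset
toy = pd.DataFrame(
    [[1, 56, "NO"], [2, 37, "YES"]],
    columns=["   ", "AGE", "Mortality Y / N"],
)

# ROW_ID and MORTALITY are illustrative names chosen for this sketch
renamed = toy.rename(columns={"   ": "ROW_ID", "Mortality Y / N": "MORTALITY"})
print(list(renamed.columns))
```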

Data Exploration

value_counts() is a Pandas function that counts how often each value occurs in the specified column. Let’s start by checking the gender statistics of the data:

# Display the counts of each value in the SEX column
data['SEX'].value_counts()

Male             260718
Female           178947
Male(Child)       25068
Female(Child)     14925
FEMALE               21
MALE                  9
Name: SEX, dtype: int64

It appears that there are duplicate values in this column: Male and MALE are not two different sexes. We can replace the inconsistent values in the column to resolve this issue. I will also rename Male(Child) to Boy and Female(Child) to Girl for convenience:

# mappings to standardize and clean the values
mappings = {'MALE' : 'Male', 'FEMALE' : 'Female', 'Male(Child)' : 'Boy', 'Female(Child)' : 'Girl'}
# replace values using the defined mappings
data['SEX'] = data['SEX'].replace(mappings)
data['SEX'].value_counts()

Male      260727
Female    178968
Boy        25068
Girl       14925
Name: SEX, dtype: int64
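
An explicit mapping only fixes the spellings we list by hand. As an alternative sketch, the case can be normalized first with str.title(), so that any mix of upper and lower case collapses into one label. The small series below is hypothetical and only reuses the labels seen above:

```python
import pandas as pd

# Hypothetical sample that reuses the labels found in the SEX column
sex = pd.Series(["Male", "MALE", "Female", "FEMALE", "Male(Child)", "Female(Child)"])

# title-case every label, so 'MALE' and 'Male' become the same value
normalized = sex.str.title()

# then rename the child labels as before
cleaned = normalized.replace({"Male(Child)": "Boy", "Female(Child)": "Girl"})
counts = cleaned.value_counts()
print(counts)
```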

We can easily visualize the above distribution using Pandas’ built-in plot feature:

# plot the value counts of sex 
data['SEX'].value_counts().plot.bar()
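
Raw counts are sometimes easier to compare as percentages. Here is a small sketch that converts the cleaned counts printed above into shares. On the real dataframe, data['SEX'].value_counts(normalize=True) should give the same proportions directly:

```python
import pandas as pd

# counts copied from the cleaned SEX column output above
counts = pd.Series({"Male": 260727, "Female": 178968, "Boy": 25068, "Girl": 14925})

# share of each group as a percentage of all records
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```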

Now let’s have a look at the age distribution by using the mean, median and mode:

# print the mean, median and mode of the age distribution
print("Mean: {}".format(data['AGE'].mean()))
print("Median: {}".format(data['AGE'].median()))
print("Mode: {}".format(data['AGE'].mode()))

Mean: 44.91226380480646
Median: 47.0
Mode: 0    0
dtype: int64
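
The mode is printed as a Series rather than a single number because a column can have several equally common values. A minimal sketch, using hypothetical ages, of how to pull out a single value:

```python
import pandas as pd

# hypothetical ages; 0 is the most frequent value, as in the output above
ages = pd.Series([0, 0, 0, 45, 50, 50, 60])

modes = ages.mode()              # a Series, because ties are possible
most_common_age = modes.iloc[0]  # first (smallest) mode as a scalar
print("Mode: {}".format(most_common_age))

# with a tie, mode() returns every tied value in sorted order
tied = pd.Series([1, 1, 2, 2]).mode().tolist()
```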

Here are the 10 most frequent ages in the data. Feel free to replace 10 with any number of your choice:

# print the 10 most frequent ages
data['AGE'].value_counts().head(10)

0     17513
50    16191
55    15184
45    15052
60    13732
46    12858
56    12590
51    12470
40    11962
65    11878
Name: AGE, dtype: int64

Boxplots are commonly used to visualize a distribution when bar charts or scatter plots are too difficult to read:

# better looking boxplot (using seaborn) for the age variable
sns.boxplot(x=data['AGE'])
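
To make the plot easier to read, here is a rough sketch of the numbers behind a boxplot: the quartiles, the interquartile range (IQR), and the 1.5 × IQR limits that seaborn uses by default for the whiskers. The ages below are hypothetical:

```python
import pandas as pd

# hypothetical ages, only for illustration
ages = pd.Series([0, 5, 30, 40, 45, 50, 55, 60, 65, 70, 95])

q1 = ages.quantile(0.25)      # lower edge of the box
median = ages.quantile(0.5)   # line inside the box
q3 = ages.quantile(0.75)      # upper edge of the box
iqr = q3 - q1

# points beyond these limits are drawn as individual outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(q1, median, q3, len(outliers))
```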

Analyze Healthcare Data Deeply

What if I wanted to analyze only the records relating to Krishna district? I need to select a subset of the data to continue. Fortunately, Pandas can help us do this too, in two steps:

1. Write the condition to be satisfied: data['DISTRICT_NAME'] == 'Krishna'
2. Pass the condition to the dataframe: data[data['DISTRICT_NAME'] == 'Krishna']

# subset involving only records of Krishna district
data[data['DISTRICT_NAME']=='Krishna'].head()
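
The same idea extends to more than one condition. Below is a small sketch on hypothetical rows: the condition is stored as a boolean mask and then combined with a second one using &. Each condition needs its own parentheses:

```python
import pandas as pd

# hypothetical rows with two of the dataset's column names
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Guntur", "Krishna", "Nellore"],
    "AGE": [62, 45, 30, 71],
})

# step 1: the condition gives one True/False value per row
is_krishna = toy["DISTRICT_NAME"] == "Krishna"

# step 2: the mask selects the matching rows
krishna = toy[is_krishna]

# two conditions combined with & (logical and)
krishna_over_60 = toy[is_krishna & (toy["AGE"] > 60)]
print(krishna_over_60)
```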

Now, if we want the most common surgery at the district level, we can loop over all the district names and select the data subset for each district:

# Most common surgery by district
for i in data['DISTRICT_NAME'].unique():
    print("District: {}\nDisease and Count: {}".format(i,data[data['DISTRICT_NAME']==i]['SURGERY'].value_counts().head(1)))

District: Srikakulam
Disease and Count: Maintenance Hemodialysis For Crf    3970
Name: SURGERY, dtype: int64
District: Kurnool
Disease and Count: Surgical Correction Of Longbone Fracture    2943
Name: SURGERY, dtype: int64
District: Vizianagaram
Disease and Count: Surgical Correction Of Longbone Fracture    2754
Name: SURGERY, dtype: int64
District: Guntur
Disease and Count: Surgical Correction Of Longbone Fracture    5259
Name: SURGERY, dtype: int64
District: Vishakhapatnam
Disease and Count: Maintenance Hemodialysis For Crf    5270
Name: SURGERY, dtype: int64
District: West Godavari
Disease and Count: Maintenance Hemodialysis For Crf    5478
Name: SURGERY, dtype: int64
District: Krishna
Disease and Count: Maintenance Hemodialysis For Crf    6026
Name: SURGERY, dtype: int64
District: East Godavari
Disease and Count: Surgical Correction Of Longbone Fracture    6998
Name: SURGERY, dtype: int64
District: Prakasam
Disease and Count: Maintenance Hemodialysis For Crf    6215
Name: SURGERY, dtype: int64
District: Nellore
Disease and Count: Maintenance Hemodialysis For Crf    10824
Name: SURGERY, dtype: int64
District: YSR Kadapa
Disease and Count: Surgical Correction Of Longbone Fracture    4532
Name: SURGERY, dtype: int64
District: Chittoor
Disease and Count: Maintenance Hemodialysis For Crf    5221
Name: SURGERY, dtype: int64
District: Anantapur
Disease and Count: Surgical Correction Of Longbone Fracture    5265
Name: SURGERY, dtype: int64

We note that only two surgeries dominate all the districts: dialysis (7 districts) and long bone fracture (6 districts).
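
The loop can also be written without iterating over the districts by hand. This is a sketch of one possible alternative using groupby. The rows below are hypothetical, and only the surgery names are taken from the output above:

```python
import pandas as pd

# hypothetical rows; the surgery names come from the output above
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Krishna", "Krishna", "Guntur", "Guntur"],
    "SURGERY": [
        "Maintenance Hemodialysis For Crf",
        "Maintenance Hemodialysis For Crf",
        "Surgical Correction Of Longbone Fracture",
        "Surgical Correction Of Longbone Fracture",
        "Surgical Correction Of Longbone Fracture",
    ],
})

# for each district, keep the surgery with the highest count
top_surgery = (
    toy.groupby("DISTRICT_NAME")["SURGERY"]
       .agg(lambda s: s.value_counts().idxmax())
)

# number of districts led by each surgery
districts_per_surgery = top_surgery.value_counts()
print(top_surgery)
```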

Now, let’s have a look at the average claim amount by district:

# Average claim amount for surgery by district
for i in data['DISTRICT_NAME'].unique():
    print("District: {}\nAverage Claim Amount: ₹{}".format(i,data[data['DISTRICT_NAME']==i]['CLAIM_AMOUNT'].mean()))

District: Srikakulam
Average Claim Amount: ₹25593.712618634367
District: Kurnool
Average Claim Amount: ₹28598.91853309593
District: Vizianagaram
Average Claim Amount: ₹25097.78006899492
District: Guntur
Average Claim Amount: ₹31048.73950729927
District: Vishakhapatnam
Average Claim Amount: ₹25977.94638304871
District: West Godavari
Average Claim Amount: ₹27936.70608610806
District: Krishna
Average Claim Amount: ₹31015.383233247547
District: East Godavari
Average Claim Amount: ₹26166.136719737173
District: Prakasam
Average Claim Amount: ₹28655.81036215859
District: Nellore
Average Claim Amount: ₹26105.122376744654
District: YSR Kadapa
Average Claim Amount: ₹27945.216899192998
District: Chittoor
Average Claim Amount: ₹25708.102690948628
District: Anantapur
Average Claim Amount: ₹27664.166978581827
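
The same averages can be computed in one step with groupby, which also makes it easy to sort the districts and round the amounts. This is only a sketch on hypothetical claim amounts:

```python
import pandas as pd

# hypothetical claim amounts, only for illustration
toy = pd.DataFrame({
    "DISTRICT_NAME": ["Krishna", "Krishna", "Guntur", "Guntur", "Nellore"],
    "CLAIM_AMOUNT": [30000, 32000, 31000, 31500, 26000],
})

# mean claim per district, highest first
avg_claim = (
    toy.groupby("DISTRICT_NAME")["CLAIM_AMOUNT"]
       .mean()
       .sort_values(ascending=False)
)

for district, amount in avg_claim.items():
    print("District: {}\nAverage Claim Amount (INR): {:.2f}".format(district, amount))
```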

Now let’s look at the surgery statistics to analyze this healthcare data. I will use the Pandas GroupBy concept to collect statistics by grouping the data by category of surgery. The Pandas groupby works similarly to the GROUP BY clause in SQL:

# group by surgery category to get mean statistics
# numeric_only=True skips the text columns, which recent Pandas versions cannot average
data.groupby('CATEGORY_NAME').mean(numeric_only=True)

Cochlear implant surgery appears to be the most expensive category, costing an average of ₹520,000. Prostheses are the cheapest, at an average of ₹1,200. Cochlear implant surgery also has the youngest patients, with an average age of 1.58 years, while neurology has an average age of 56 years.
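
Instead of reading these extremes off the table by eye, they can be looked up with idxmax and idxmin. Here is a minimal sketch. The rows and the exact spelling of the category names are hypothetical:

```python
import pandas as pd

# hypothetical rows; category spellings are assumed for illustration
toy = pd.DataFrame({
    "CATEGORY_NAME": ["Cochlear Implant Surgery", "Cochlear Implant Surgery",
                      "Prostheses", "Neurology"],
    "AGE": [1, 2, 40, 56],
    "CLAIM_AMOUNT": [520000, 520000, 1200, 30000],
})

# mean age and claim amount for each surgery category
stats = toy.groupby("CATEGORY_NAME")[["AGE", "CLAIM_AMOUNT"]].mean()

most_expensive = stats["CLAIM_AMOUNT"].idxmax()
cheapest = stats["CLAIM_AMOUNT"].idxmin()
youngest = stats["AGE"].idxmin()
print(most_expensive, cheapest, youngest)
```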

So this is how you can analyze healthcare data with Python. Feel free to experiment by changing the parameters that I have used. I hope you liked this article. Feel free to ask your questions in the comments section below. You can also follow me on Medium to learn every topic of Python and Machine Learning.