
A 2018 report by the Health Effects Institute on air pollution in India estimates that air pollution was responsible for 1.1 million deaths in India in 2015.
As a data scientist, I decided to analyze the air quality data of my own country to look for underlying patterns that might reveal how severe the problem is, and the results turned out to be worth sharing.
In this data science project, we will analyze the air quality of India.
Let’s start by importing the libraries:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Download the dataset and load it into a pandas DataFrame:
df = pd.read_csv('dataset.csv')
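Depending on how the CSV was exported, the default UTF-8 read may fail on a few characters. If that happens, passing an explicit encoding is a common workaround; the file name and the encoding below are assumptions, not part of the original walkthrough:

# assumed workaround if the default read raises a decoding error
df = pd.read_csv('dataset.csv', encoding='unicode_escape')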
Let us get some insights into the data: the number of entries in each column, the data type of each column, and so on.
df.head()

df.info()
#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435742 entries, 0 to 435741
Data columns (total 13 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   stn_code                     291665 non-null  object
 1   sampling_date                435739 non-null  object
 2   state                        435742 non-null  object
 3   location                     435739 non-null  object
 4   agency                       286261 non-null  object
 5   type                         430349 non-null  object
 6   so2                          401096 non-null  float64
 7   no2                          419509 non-null  float64
 8   rspm                         395520 non-null  float64
 9   spm                          198355 non-null  float64
 10  location_monitoring_station  408251 non-null  object
 11  pm2_5                        9314 non-null    float64
 12  date                         435735 non-null  object
dtypes: float64(5), object(8)
memory usage: 43.2+ MB
Now, let us check the null values.
df.isnull().sum()
#Output
stn_code                       144077
sampling_date                       3
state                               0
location                            3
agency                         149481
type                             5393
so2                             34646
no2                             16233
rspm                            40222
spm                            237387
location_monitoring_station     27491
pm2_5                          426428
date                                7
dtype: int64
It seems that we have a lot of null values in some columns.
Looking at the output above, we see that pm2_5 has very few non-null values, so it may not be able to contribute much.
The stn_code, agency, and spm columns also contain a large number of null values.
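To put numbers on this, the share of missing values per column can be computed directly. This is a small sketch added here for illustration; it is not part of the original output:

# percentage of missing values in each column, largest first
missing_pct = df.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.round(1))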
Now let us consider the type feature.
It represents the type of area where the data was recorded, such as industrial or residential.
Let us see how many area types were recorded:
df['type'].value_counts()
#Output
Residential, Rural and other Areas    179014
Industrial Area                        96091
Residential and others                 86791
Industrial Areas                       51747
Sensitive Area                          8980
Sensitive Areas                         5536
RIRUO                                   1304
Sensitive                                495
Industrial                               233
Residential                              158
Name: type, dtype: int64
Next, we drop the rows that are missing type, location, or so2 values and check the remaining null counts:

df = df.dropna(axis=0, subset=['type'])
df = df.dropna(axis=0, subset=['location'])
df = df.dropna(axis=0, subset=['so2'])
df.isnull().sum()
#Output
stn_code                       119813
sampling_date                       0
state                               0
location                            0
agency                         125169
type                                0
so2                                 0
no2                              1981
rspm                            29643
spm                            228178
location_monitoring_station     20567
pm2_5                          386966
date                                4
dtype: int64
The columns that are mostly empty, or that we will not use further, can be dropped entirely:

del df['agency']
del df['location_monitoring_station']
del df['stn_code']
del df['sampling_date']
As the value counts showed, the same area type appears under several slightly different labels, so we consolidate them into three categories: Residential, Industrial, and Other.

a = list(df['type'])
for i in range(0, len(df)):
    if str(a[i][0]) == 'R' and a[i][1] == 'e':
        a[i] = 'Residential'
    elif str(a[i][0]) == 'I':
        a[i] = 'Industrial'
    else:
        a[i] = 'Other'
df['type'] = a
df['type'].value_counts()
#Output
Residential    244017
Industrial     137420
Other           14724
Name: type, dtype: int64
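As a side note, the same consolidation can be written without an explicit index loop. The sketch below uses a hypothetical helper simplify_type and simply mirrors the logic of the loop above:

def simplify_type(t):
    # map the many raw labels onto the same three buckets as the loop above
    if t.startswith('Re'):
        return 'Residential'
    if t.startswith('I'):
        return 'Industrial'
    return 'Other'

df['type'] = df['type'].apply(simplify_type)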
Let’s visualize the data by plotting the median level of each pollutant per state:
df[['so2', 'state']].groupby(['state']).median().sort_values("so2", ascending = False).plot.bar()

df[['no2', 'state']].groupby(['state']).median().sort_values("no2", ascending = False).plot.bar(color = 'r')

df[['rspm', 'state']].groupby(['state']).median().sort_values("rspm", ascending = False).plot.bar(color = 'r')

df[['spm', 'state']].groupby(['state']).median().sort_values("spm", ascending = False).plot.bar(color = 'r')

df[['pm2_5', 'state']].groupby(['state']).median().sort_values("pm2_5", ascending = False).plot.bar(color = 'r')

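The five bar charts above repeat the same groupby-and-plot pattern. A single loop, shown here as a sketch rather than part of the original code, avoids the duplication:

for pollutant in ['so2', 'no2', 'rspm', 'spm', 'pm2_5']:
    # median level per state, sorted from highest to lowest
    (df[[pollutant, 'state']]
       .groupby('state')
       .median()
       .sort_values(pollutant, ascending=False)
       .plot.bar(color='r', title=pollutant))
    plt.show()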
Statistical Analysis
Now let us do some statistical analysis of the dataset and check whether these features are related.
We will start by plotting pairwise scatter plots of the features:
sns.set()
cols = ['so2', 'no2', 'rspm', 'spm', 'pm2_5']
sns.pairplot(df[cols], height=2.5)  # 'size' was renamed to 'height' in newer seaborn releases
plt.show()

corrmat = df.corr(numeric_only=True)  # restrict to numeric columns (required by newer pandas)
f, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(corrmat, vmax=1, square=True, annot=True)

Next, we extract the year from the date column and plot the median so2 level by state and year as a heatmap:

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['year'] = df['date'].dt.year  # extract the year
df['year'] = df['year'].fillna(0.0).astype(int)
df = df[(df['year'] > 0)]

f, ax = plt.subplots(figsize=(10, 10))
ax.set_title('{} by state and year'.format('so2'))
sns.heatmap(df.pivot_table('so2', index='state', columns=['year'],
                           aggfunc='median', margins=True),
            annot=True, cmap='YlGnBu', linewidths=1, ax=ax,
            cbar_kws={'label': 'Annual Median'})

Similarly, rspm and spm analysis using heatmaps:
f, ax = plt.subplots(figsize=(10, 10))
ax.set_title('{} by state and year'.format('rspm'))
sns.heatmap(df.pivot_table('rspm', index='state', columns=['year'],
                           aggfunc='median', margins=True),
            annot=True, cmap='YlGnBu', linewidths=1, ax=ax,
            cbar_kws={'label': 'Annual Median'})

f, ax = plt.subplots(figsize=(10, 10))
ax.set_title('{} by state and year'.format('spm'))
sns.heatmap(df.pivot_table('spm', index='state', columns=['year'],
                           aggfunc='median', margins=True),
            cmap='YlGnBu', linewidths=0.5, ax=ax,
            cbar_kws={'label': 'Annual Median'})

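The so2, rspm, and spm heatmaps all follow the same pivot-and-plot pattern, so the step can be wrapped in a small helper. plot_state_year_heatmap below is a hypothetical function added as a sketch, not part of the original code:

def plot_state_year_heatmap(df, pollutant):
    # median pollutant level per state and year, shown as a heatmap
    pivot = df.pivot_table(pollutant, index='state', columns=['year'],
                           aggfunc='median', margins=True)
    f, ax = plt.subplots(figsize=(10, 10))
    ax.set_title('{} by state and year'.format(pollutant))
    sns.heatmap(pivot, annot=True, cmap='YlGnBu', linewidths=1, ax=ax,
                cbar_kws={'label': 'Annual Median'})

for pollutant in ['so2', 'rspm', 'spm']:
    plot_state_year_heatmap(df, pollutant)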