
Any cricket fan should try analysing this data set, because it makes learning fun. I have tried to keep this article as simple as possible so that even a beginner can follow it easily.
At the same time, I have tried to analyse the data set from several different angles.
So let’s start our exploratory data analysis of the IPL.
We begin by importing the required libraries:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as mlt
import seaborn as sns

mlt.style.use('fivethirtyeight')
You can download the required data sets from below:
matches = pd.read_csv('matches.csv')
delivery = pd.read_csv('deliveries.csv')
Some Cleaning And Transformation
matches.drop(['umpire3'], axis=1, inplace=True)  # since all the values are NaN
delivery.fillna(0, inplace=True)  # filling all the NaN values with 0
matches['team1'].unique()

#Output
array(['Sunrisers Hyderabad', 'Mumbai Indians', 'Gujarat Lions',
       'Rising Pune Supergiant', 'Royal Challengers Bangalore',
       'Kolkata Knight Riders', 'Delhi Daredevils', 'Kings XI Punjab',
       'Chennai Super Kings', 'Rajasthan Royals', 'Deccan Chargers',
       'Kochi Tuskers Kerala', 'Pune Warriors', 'Rising Pune Supergiants'],
      dtype=object)
Replacing the Team Names with their abbreviations
# full team names and their abbreviations, in matching order
team_names = ['Mumbai Indians', 'Kolkata Knight Riders', 'Royal Challengers Bangalore',
              'Deccan Chargers', 'Chennai Super Kings', 'Rajasthan Royals',
              'Delhi Daredevils', 'Gujarat Lions', 'Kings XI Punjab',
              'Sunrisers Hyderabad', 'Rising Pune Supergiants', 'Kochi Tuskers Kerala',
              'Pune Warriors', 'Rising Pune Supergiant']
team_abbr = ['MI', 'KKR', 'RCB', 'DC', 'CSK', 'RR', 'DD', 'GL', 'KXIP',
             'SRH', 'RPS', 'KTK', 'PW', 'RPS']
matches.replace(team_names, team_abbr, inplace=True)
delivery.replace(team_names, team_abbr, inplace=True)
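The same replacement can also be written with a dictionary, which keeps each full name next to its abbreviation and makes it easier to see that both spellings of the Pune franchise map to RPS. This is only a sketch on a made-up two-row frame. The names `TEAM_ABBR` and `abbreviate_teams` are hypothetical and only a few teams are listed.

```python
import pandas as pd

# Hypothetical mapping (subset of teams only); both Pune spellings map to 'RPS'
TEAM_ABBR = {
    'Mumbai Indians': 'MI',
    'Kolkata Knight Riders': 'KKR',
    'Rising Pune Supergiant': 'RPS',
    'Rising Pune Supergiants': 'RPS',
}

def abbreviate_teams(df):
    """Return a copy of df with full team names replaced by abbreviations."""
    return df.replace(TEAM_ABBR)

# Made-up rows standing in for matches.csv
toy = pd.DataFrame({
    'team1': ['Mumbai Indians', 'Rising Pune Supergiant'],
    'team2': ['Rising Pune Supergiants', 'Kolkata Knight Riders'],
})
toy_abbr = abbreviate_teams(toy)
print(toy_abbr)
```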
Some Basic Analysis
print('Total Matches Played:', matches.shape[0])
print(' \n Venues Played At:', matches['city'].unique())
print(' \n Teams :', matches['team1'].unique())

#Output
Total Matches Played: 636

 Venues Played At: ['Hyderabad' 'Pune' 'Rajkot' 'Indore' 'Bangalore' 'Mumbai' 'Kolkata'
 'Delhi' 'Chandigarh' 'Kanpur' 'Jaipur' 'Chennai' 'Cape Town'
 'Port Elizabeth' 'Durban' 'Centurion' 'East London' 'Johannesburg'
 'Kimberley' 'Bloemfontein' 'Ahmedabad' 'Cuttack' 'Nagpur' 'Dharamsala'
 'Kochi' 'Visakhapatnam' 'Raipur' 'Ranchi' 'Abu Dhabi' 'Sharjah' nan]

 Teams : ['SRH' 'MI' 'GL' 'RPS' 'RCB' 'KKR' 'DD' 'KXIP' 'CSK' 'RR' 'DC' 'KTK' 'PW']
print('Total venues played at:', matches['city'].nunique())
print('\nTotal umpires ', matches['umpire1'].nunique())

#Output
Total venues played at: 30

Total umpires  44
print(matches['player_of_match'].value_counts().idxmax(), ' : has most man of the match awards')
print(matches['winner'].value_counts().idxmax(), ': has the highest number of match wins')

#Output
CH Gayle  : has most man of the match awards
MI : has the highest number of match wins

# the match won by the largest margin of runs
df = matches.iloc[[matches['win_by_runs'].idxmax()]]
df[['season', 'team1', 'team2', 'winner', 'win_by_runs']]
Toss Decisions across Seasons
mlt.subplots(figsize=(10, 6))
sns.countplot(x='season', hue='toss_decision', data=matches)
mlt.show()

The choice between batting and fielding varies widely across the seasons. In some seasons toss winners are likely to opt for batting, while in other seasons they are not. In 2016, though, the majority of toss winners opted for batting.
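To put numbers on what the count plot shows, the share of each toss decision per season can be tabulated. The sketch below uses a tiny made-up frame in place of `matches`, and the helper name `toss_decision_share` is an assumption introduced for illustration.

```python
import pandas as pd

def toss_decision_share(matches):
    """Fraction of toss winners choosing each option, per season."""
    # normalize='index' makes every season's row sum to 1
    return pd.crosstab(matches['season'], matches['toss_decision'], normalize='index')

# Made-up values, not taken from the IPL data set
toy = pd.DataFrame({
    'season': [2015, 2015, 2015, 2015, 2016, 2016],
    'toss_decision': ['bat', 'bat', 'field', 'field', 'field', 'field'],
})
share = toss_decision_share(toy)
print(share)
```

Calling `toss_decision_share(matches)` on the real data frame gives one row per season, which makes it easy to read off the seasons in which batting first was the more common choice.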
Maximum Toss Winners
mlt.subplots(figsize=(10, 6))
ax = matches['toss_winner'].value_counts().plot.bar(width=0.9, color=sns.color_palette('RdYlGn', 20))
# write the count above each bar
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 1))
mlt.show()

Mumbai Indians seem to be very lucky, having won the highest number of tosses, followed by Kolkata Knight Riders. Pune Supergiants have the fewest toss wins, but they have also played the fewest matches.
These counts do not show which team has the better chance of winning the toss, because the number of matches played by each team is uneven.
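One way to account for the uneven number of matches is to divide each team's toss wins by the matches it played. The following is a minimal sketch on made-up rows. The function name `toss_win_rate` is hypothetical, and it assumes the `team1`, `team2` and `toss_winner` columns used above.

```python
import pandas as pd

def toss_win_rate(matches):
    """Toss wins divided by matches played, per team."""
    # a team's matches = appearances as team1 plus appearances as team2
    played = pd.concat([matches['team1'], matches['team2']]).value_counts()
    tosses = matches['toss_winner'].value_counts()
    # teams that never won a toss get 0 instead of NaN
    rate = tosses.reindex(played.index, fill_value=0) / played
    return rate.sort_values(ascending=False)

# Made-up fixtures for illustration only
toy = pd.DataFrame({
    'team1': ['MI', 'MI', 'KKR', 'RPS'],
    'team2': ['KKR', 'RPS', 'RPS', 'MI'],
    'toss_winner': ['MI', 'RPS', 'KKR', 'MI'],
})
rate = toss_win_rate(toy)
print(rate)
```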
Is Toss Winner Also the Match Winner?
df = matches[matches['toss_winner'] == matches['winner']]
# use the actual number of matches instead of a hard-coded total
slices = [len(df), (len(matches) - len(df))]
labels = ['yes', 'no']
mlt.pie(slices, labels=labels, startangle=90, shadow=True, explode=(0, 0.05),
        autopct='%1.1f%%', colors=['r', 'g'])
fig = mlt.gcf()
fig.set_size_inches(6, 6)
mlt.show()

Thus the toss winner is not necessarily the match winner. The toss-winning team goes on to win the match only about half the time.
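The same share can be computed directly as a number. This sketch assumes that a match without a result has a missing `winner` value and leaves such matches out. The helper name `toss_match_win_share` and the toy rows are made up for illustration.

```python
import pandas as pd

def toss_match_win_share(matches):
    """Share of decided matches in which the toss winner also won the match."""
    # assumption: a missing winner marks a match with no result
    decided = matches.dropna(subset=['winner'])
    return (decided['toss_winner'] == decided['winner']).mean()

# Made-up rows; the last match has no result
toy = pd.DataFrame({
    'toss_winner': ['A', 'B', 'A', 'B', 'A'],
    'winner': ['A', 'A', 'A', 'B', None],
})
share_won = toss_match_win_share(toy)
print(share_won)
```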
Matches played across each season
mlt.subplots(figsize=(10, 6))
# countplot automatically counts the frequency of an item
sns.countplot(x='season', data=matches, palette=sns.color_palette('winter'))
mlt.show()

Runs Across the Seasons
# merging the matches and delivery dataframes by referencing the id and match_id columns respectively
batsmen = matches[['id', 'season']].merge(delivery, left_on='id', right_on='match_id', how='left').drop('id', axis=1)
season = batsmen.groupby(['season'])['total_runs'].sum().reset_index()
season.set_index('season').plot(marker='o')
mlt.gcf().set_size_inches(10, 6)
mlt.title('Total Runs Across the Seasons')
mlt.show()

Total runs declined from 2008 to 2009. After that, runs increased substantially every season until 2013, and then slumped from the following season.
However, the number of matches is not the same in every season, so we should check the average runs per match in each season:
# number of matches in each season
avgruns_each_season = matches.groupby(['season']).count().id.reset_index()
avgruns_each_season.rename(columns={'id': 'matches'}, inplace=True)
# both frames are ordered by season, so the rows line up
avgruns_each_season['total_runs'] = season['total_runs']
avgruns_each_season['average_runs_per_match'] = avgruns_each_season['total_runs'] / avgruns_each_season['matches']
avgruns_each_season.set_index('season')['average_runs_per_match'].plot(marker='o')
mlt.gcf().set_size_inches(10, 6)
mlt.title('Average Runs per match across Seasons')
mlt.show()

Sixes and Fours Across the Season
# count the sixes and the fours hit in each season
Season_boundaries = batsmen.groupby("season")["batsman_runs"].agg(lambda x: (x == 6).sum()).reset_index()
a = batsmen.groupby("season")["batsman_runs"].agg(lambda x: (x == 4).sum()).reset_index()
Season_boundaries = Season_boundaries.merge(a, left_on='season', right_on='season', how='left')
Season_boundaries = Season_boundaries.rename(columns={'batsman_runs_x': '6"s', 'batsman_runs_y': '4"s'})
Season_boundaries.set_index('season')[['6"s', '4"s']].plot(marker='o')
fig = mlt.gcf()
fig.set_size_inches(10, 6)
mlt.show()

Runs Per Over By Teams Across Seasons
# matches played by each team (as team1 or team2), used to pick the teams to plot
matches_played_byteams = pd.concat([matches['team1'], matches['team2']]).value_counts().to_frame('Total Matches')
runs_per_over = delivery.pivot_table(index=['over'], columns='batting_team', values='total_runs', aggfunc='sum')
# plotting graphs for teams that have played more than 50 matches
runs_per_over[matches_played_byteams[matches_played_byteams['Total Matches'] > 50].index].plot(
    color=["b", "r", "#Ffb6b2", "g", 'brown', 'y', '#6666ff', 'black', '#FFA500'])
x = list(range(1, 21))  # overs 1 to 20
mlt.xticks(x)
mlt.ylabel('total runs scored')
fig = mlt.gcf()
fig.set_size_inches(16, 10)
mlt.show()

The most runs are scored in the last 5 overs of the innings. MI and RCB show an increasing trend in runs scored throughout the innings.
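The weight of the last five overs can be expressed as a fraction of each team's total runs. The sketch below is one possible way to do it on a made-up frame. The function name `death_overs_share` and the parameter `first_death_over` are assumptions introduced here, with overs 16 to 20 taken as the last five.

```python
import pandas as pd

def death_overs_share(delivery, first_death_over=16):
    """Fraction of each batting team's runs scored from first_death_over onwards."""
    total = delivery.groupby('batting_team')['total_runs'].sum()
    late = (delivery[delivery['over'] >= first_death_over]
            .groupby('batting_team')['total_runs'].sum())
    # teams with no runs in the late overs get 0 instead of NaN
    return late.reindex(total.index, fill_value=0) / total

# Made-up deliveries for illustration only
toy = pd.DataFrame({
    'batting_team': ['MI', 'MI', 'MI', 'RCB', 'RCB'],
    'over': [1, 16, 20, 5, 18],
    'total_runs': [4, 6, 6, 6, 2],
})
late_share = death_overs_share(toy)
print(late_share)
```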
Favorite Grounds
mlt.subplots(figsize=(10, 15))
ax = matches['venue'].value_counts().sort_values(ascending=True).plot.barh(width=.9, color=sns.color_palette('inferno', 40))
# horizontal bars: counts run along the x-axis, grounds along the y-axis
ax.set_xlabel('count')
ax.set_ylabel('Grounds')
mlt.show()

Maximum Man Of Matches
mlt.subplots(figsize=(10, 6))
# the code used is very basic but gets the job done easily:
# count the awards for each player, keep the top 10 and plot a bar graph
ax = matches['player_of_match'].value_counts().head(10).plot.bar(width=.8, color=sns.color_palette('inferno', 10))
ax.set_xlabel('player_of_match')
ax.set_ylabel('count')
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.15, p.get_height() + 0.25))
mlt.show()

Top 10 Batsmen
mlt.subplots(figsize=(10, 6))
# total runs scored by each batsman
max_runs = delivery.groupby(['batsman'])['batsman_runs'].sum()
ax = max_runs.sort_values(ascending=False)[:10].plot.bar(width=0.8, color=sns.color_palette('winter_r', 20))
for p in ax.patches:
    ax.annotate(format(p.get_height()), (p.get_x() + 0.1, p.get_height() + 50), fontsize=15)
mlt.show()

Top Batsmen with 1’s, 2’s, 4’s and 6’s
# number of deliveries on which each batsman scored 0, 1, 2, ... runs
toppers = delivery.groupby(['batsman', 'batsman_runs'])['total_runs'].count().reset_index()
# keyword arguments are required by recent pandas versions
toppers = toppers.pivot(index='batsman', columns='batsman_runs', values='total_runs')
fig, ax = mlt.subplots(2, 2, figsize=(18, 12))
toppers[1].sort_values(ascending=False)[:5].plot(kind='barh', ax=ax[0, 0], color='#45ff45', width=0.8)
ax[0, 0].set_title("Most 1's")
ax[0, 0].set_ylabel('')
toppers[2].sort_values(ascending=False)[:5].plot(kind='barh', ax=ax[0, 1], color='#df6dfd', width=0.8)
ax[0, 1].set_title("Most 2's")
ax[0, 1].set_ylabel('')
toppers[4].sort_values(ascending=False)[:5].plot(kind='barh', ax=ax[1, 0], color='#fbca5f', width=0.8)
ax[1, 0].set_title("Most 4's")
ax[1, 0].set_ylabel('')
toppers[6].sort_values(ascending=False)[:5].plot(kind='barh', ax=ax[1, 1], color='#ffff00', width=0.8)
ax[1, 1].set_title("Most 6's")
ax[1, 1].set_ylabel('')
mlt.show()

Observations:
- Kohli has scored the most 1’s.
- Dhoni has the most 2’s. Those strong legs :p
- Gambhir has the most 4’s.
- C Gayle has the most 6’s, and he leads by a big margin.
Top Individual Scores
# runs scored by each batsman in each match
top_scores = delivery.groupby(["match_id", "batsman", "batting_team"])["batsman_runs"].sum().reset_index()
#top_scores=top_scores[top_scores['batsman_runs']>100]
# the ten highest individual scores
top_scores.nlargest(10, 'batsman_runs')

Here too the Jamaican leads the table. Not only Gayle but there are many RCB players on the top scores list. Looks like RCB is a very formidable batting side.
Individual Scores by Top Batsmen in Each Innings
swarm = ['CH Gayle', 'V Kohli', 'G Gambhir', 'SK Raina', 'YK Pathan', 'MS Dhoni', 'AB de Villiers', 'DA Warner']
scores = delivery.groupby(["match_id", "batsman", "batting_team"])["batsman_runs"].sum().reset_index()
# filter on the frame's own column rather than on top_scores
scores = scores[scores['batsman'].isin(swarm)]
sns.swarmplot(x='batsman', y='batsman_runs', data=scores, hue='batting_team', palette='Set1')
fig = mlt.gcf()
fig.set_size_inches(14, 8)
mlt.ylim(-10, 200)
mlt.show()

Observations:
- Chris Gayle has the highest individual score, 175, and the highest number of centuries, i.e. 5.
- MS Dhoni and Gautam Gambhir have never scored a century.
- V Kohli has played for only one IPL team in all seasons, i.e. RCB.
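Observations about centuries can be checked by counting the innings of 100 or more runs per batsman. The sketch below works on a made-up ball-by-ball frame. The helper name `count_centuries` is hypothetical, and it assumes that one `match_id` and `batsman` pair corresponds to one innings.

```python
import pandas as pd

def count_centuries(delivery):
    """Number of innings of 100+ runs per batsman."""
    # assumption: a batsman bats once per match, so (match_id, batsman) is one innings
    innings = delivery.groupby(['match_id', 'batsman'])['batsman_runs'].sum()
    hundreds = innings[innings >= 100]
    return hundreds.groupby(level='batsman').size().sort_values(ascending=False)

# Made-up deliveries: A scores 102 twice, B scores 100 once, C scores 40
rows = ([(1, 'A', 6)] * 17 + [(2, 'A', 6)] * 17 +
        [(1, 'B', 4)] * 25 + [(2, 'C', 4)] * 10)
toy = pd.DataFrame(rows, columns=['match_id', 'batsman', 'batsman_runs'])
centuries = count_centuries(toy)
print(centuries)
```

Batsmen without a century, such as C in the toy data, simply do not appear in the result.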
Runs Scored By Batsman Across Seasons
# runs by each batsman in each season, reshaped to one row per batsman
a = batsmen.groupby(['season', 'batsman'])['batsman_runs'].sum().reset_index()
a = a.groupby(['season', 'batsman'])['batsman_runs'].sum().unstack().T
a['Total'] = a.sum(axis=1)
# keep the five batsmen with the most runs overall
a = a.sort_values(by='Total', ascending=False)[:5]
a.drop('Total', axis=1, inplace=True)
a.T.plot(color=['red', 'blue', '#772272', 'green', '#f0ff00'], marker='o')
fig = mlt.gcf()
fig.set_size_inches(16, 6)
mlt.show()

David Warner’s form looks to be improving season by season. There has been a sharp decline in Kohli’s Runs in the last season.
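Trends like these can also be read as numbers by taking the change in each batsman's runs between the last two seasons. This is only a sketch on made-up rows. The function name `last_season_change` is an assumption, and it expects the `season`, `batsman` and `batsman_runs` columns of the merged `batsmen` frame.

```python
import pandas as pd

def last_season_change(batsmen):
    """Change in runs per batsman between the last two seasons."""
    per_season = (batsmen.groupby(['season', 'batsman'])['batsman_runs']
                  .sum().unstack(fill_value=0))
    # diff() subtracts the previous season's row; the last row is the latest change
    return per_season.diff().iloc[-1].sort_values()

# Made-up deliveries: K drops from 10 to 4 runs, W rises from 5 to 8
rows = [(2016, 'K', 6), (2016, 'K', 4), (2017, 'K', 4),
        (2016, 'W', 4), (2016, 'W', 1), (2017, 'W', 6), (2017, 'W', 2)]
toy = pd.DataFrame(rows, columns=['season', 'batsman', 'batsman_runs'])
change = last_season_change(toy)
print(change)
```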
I hope you liked this data analysis of the IPL. You can explore other data sets in the same way.