Olympic Games Analysis with Python

Today, we will explore a dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016.

The data have been scraped from www.sports-reference.com in May 2018.

Download the data sets

Let’s Start with importing the libraries

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

Collecting information about both the data sets

We are going to:

  1. Review the first lines of the data;
  2. Use the describe and info functions to collect statistical information, datatypes, column names and other information.
data = pd.read_csv('athlete_events.csv')
data.head()
data.describe()
data.info()
#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
regions = pd.read_csv('datasets_31029_40943_noc_regions.csv')
regions.head()

Joining the data frames

merged = pd.merge(data, regions, on='NOC', how='left')
merged.head()

Distribution of the age of gold medalists

Let’s start creating a new data frame including only gold medalists.

goldMedals = merged[(merged.Medal == 'Gold')]
goldMedals.head()

I would like to have a plot of the Age to see the distribution but I need to check first if the Age column contains NaN values.

goldMedals.isnull().any()
#Output
ID        False
Name      False
Sex       False
Age        True
Height     True
Weight     True
Team      False
NOC       False
Games     False
Year      False
Season    False
City      False
Sport     False
Event     False
Medal     False
region     True
notes      True
dtype: bool

Let’s take only the values that are different from NaN.

goldMedals = goldMedals[np.isfinite(goldMedals['Age'])]

We can now create a countplot to see the result of our work:

plt.figure(figsize=(20, 10))
plt.tight_layout()
sns.countplot(goldMedals['Age'])
plt.title('Distribution of Gold Medals')

It seems that we have people with Age greater that 50 with a gold medal: Let’s know more about those people!

goldMedals['ID'][goldMedals['Age'] > 50].count()

#Output
65

65 people: Great! But which disciplines allows you to land a gold medal after your fifties?

We will now create a new dataframe called masterDisciplines in which we will insert this new set of people and then create a visualization with it.

masterDisciplines = goldMedals['Sport'][goldMedals['Age'] > 50]
plt.figure(figsize=(20, 10))
plt.tight_layout()
sns.countplot(masterDisciplines)
plt.title('Gold Medals for Athletes Over 50')

It seems that our senior gold medalists are shooters, archers, sailors and, above all, horse riders!

It makes sense: I cannot imagine a sprinter making 100 meters in 10 seconds at 55, but who knows!

Women in Athletics

Studying the data we can try to understand how much medals we have only for women in the recent history of the Summer Games. Let’s create a filtered dataset :

womenInOlympics = merged[(merged.Sex == 'F') & (merged.Season == 'Summer')]
womenInOlympics.head(10)

To plot the curve over time, let’s create a plot in which we put the year (on the x-axis) and count of the number of medals per edition of the games (consider that we will have more medals for the same athlete).

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=womenInOlympics)
plt.title('Women medals per edition of the Games')

Usually I cross-check the data: below I tried to review only the medalists for the 1900 Summer edition to see if the visualization is correct.

womenInOlympics.loc[womenInOlympics['Year'] == 1900].head(10)

Okay, let’s count the rows (same code as above adding the count() function and filtering only for ID).

womenInOlympics['ID'].loc[womenInOlympics['Year'] == 1900].count()

#Output
33

So we have 33 records (with repetitions, for example ‘Marion Jones (-Farquhar)’ won a medal both for Tennis Women’s Singles and Tennis Mixed Doubles – To be sure I cross-checked also with Wikipedia and the outcome seems correct).

Medals per country

Let’s now review the top 5 gold medal countries:

goldMedals.region.value_counts().reset_index(name='Medal').head()
#Output
index	   Medal
0	USA	    2627
1	Russia	1599
2	Germany	1293
3	UK	    657
4	Italy	567

Let’s plot this:

totalGoldMedals = goldMedals.region.value_counts().reset_index(name='Medal').head(5)
g = sns.catplot(x="index", y="Medal", data=totalGoldMedals,
                height=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_xlabels("Top 5 countries")
g.set_ylabels("Number of Medals")
plt.title('Medals per Country')

The USA seems to be the most winning country.

But which are the most awarded disciplines of American Athletes?

Disciplines with the greatest number of Gold Medals

Let’s create a dataframe to filter the gold medals only for the USA.

goldMedalsUSA = goldMedals.loc[goldMedals['NOC'] == 'USA']

Done! Now, we can count the medals per discipline:

goldMedalsUSA.Event.value_counts().reset_index(name='Medal').head(20)

Let’s slice the dataframe using only the data of male athletes to better review it:

basketballGoldUSA = goldMedalsUSA.loc[(goldMedalsUSA['Sport'] == 'Basketball') & (goldMedalsUSA['Sex'] == 'M')].sort_values(['Year'])
basketballGoldUSA.head(15)

What we supposed is true: the medals are not grouped by Edition/Team but we were counting the gold medals of each member of the team!

Let’s proceed grouping by year the athletes – the idea is to create a new dataframe to make a pre-filter using only the first record for each member of the team.

groupedBasketUSA = basketballGoldUSA.groupby(['Year']).first()
groupedBasketUSA
groupedBasketUSA['ID'].count()

What is the median height/weight of an Olympic medalist?

Let’s try to plot a scatterplot of height vs weight to see the distribution of values (without grouping by discipline).

First of all, we have to take again the goldMedals dataframe

goldMedals.head()

We can see that we have NaN values both in height and weight columns.

At this point, we can act as follows:

  1. Using only the rows that has a value in the Height and Weight columns;
  2. Replace the value with the mean of the column.

Solution 2 in my opinion it is not the best way to go: we are talking about data of athletes of different ages and different disciplines (that have done different training).

Let’s go with solution 1.

The first thing to do is to collect general information about the dataframe that we have to use: goldMedals.

goldMedals.info()
#Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13224 entries, 3 to 271076
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      13224 non-null  int64  
 1   Name    13224 non-null  object 
 2   Sex     13224 non-null  object 
 3   Age     13224 non-null  float64
 4   Height  10532 non-null  float64
 5   Weight  10248 non-null  float64
 6   Team    13224 non-null  object 
 7   NOC     13224 non-null  object 
 8   Games   13224 non-null  object 
 9   Year    13224 non-null  int64  
 10  Season  13224 non-null  object 
 11  City    13224 non-null  object 
 12  Sport   13224 non-null  object 
 13  Event   13224 non-null  object 
 14  Medal   13224 non-null  object 
 15  region  13223 non-null  object 
 16  notes   171 non-null    object 
dtypes: float64(3), int64(2), object(12)
memory usage: 2.4+ MB

Okay, we have more than 13.000 rows.

We will now create a dataframe filtering only the rows that has the column Height and Weight populated.

notNullMedals = goldMedals[(goldMedals['Height'].notnull()) & (goldMedals['Weight'].notnull())]
plt.figure(figsize=(12, 10))
ax = sns.scatterplot(x="Height", y="Weight", data=notNullMedals)
plt.title('Height vs Weight of Olympic Medalists')

The vast majority of the samples show a linear relation between height and weight (the more the weight, the more the height).

We have exceptions and I am willing to know more!

For example, let’s see which is the athlete that weighs more than 160 kilograms.

notNullMedals.loc[notNullMedals['Weight'] > 160]

Evolution of the Olympics over time

A great thank you to Rodolfo Mendes for giving me the idea for this paragraph.

We will now try to answer the following questions:

  • How the number of athletes/countries varied along time ?
  • How the proportion of Men/Women varied with time ?
  • How about mean age, weight and height along time ?

Variation of male/female athletes over time (Summer Games) *

We will now create two dataframes dividing the population of our dataset using Sex and Season (we would like to review only the summer games)

MenOverTime = merged[(merged.Sex == 'M') & (merged.Season == 'Summer')]
WomenOverTime = merged[(merged.Sex == 'F') & (merged.Season == 'Summer')]

Done, let’s check the head of one of the new dataframes to see the result:

MenOverTime.head()

Okay, at this time we are ready to create the plots. The first one is for men, the second for women:

part = MenOverTime.groupby('Year')['Sex'].value_counts()
plt.figure(figsize=(20, 10))
part.loc[:,'M'].plot()
plt.title('Variation of Male Athletes over time')
part = WomenOverTime.groupby('Year')['Sex'].value_counts()
plt.figure(figsize=(20, 10))
part.loc[:,'F'].plot()
plt.title('Variation of Female Athletes over time')

What I immediately saw is that for women:

  1. We have a steep increase in the population;
  2. The grow is constant.

On the other hand, the grow for men seems less strong:

  1. After the 1990 we can see a relevant decrease in the number of male athletes at the summer games;
  2. The growth has slowly restarted recently.
plt.figure(figsize=(20, 10))
sns.boxplot('Year', 'Age', data=MenOverTime)
plt.title('Variation of Age for Male Athletes over time')

What is strange for me is the age of some athletes in the games between the 1924 and the 1948: let’s check all the people with age greater than 80.

MenOverTime.loc[MenOverTime['Age'] > 80].head(10)
plt.figure(figsize=(20, 10))
sns.boxplot('Year', 'Age', data=WomenOverTime)
plt.title('Variation of Age for Female Athletes over time')

Interesting points for me:

  • Generally, the age distribution starts has a lower minimum and a lower maximum;
  • In 1904 the age distribution is strongly different from the other Olympics: let’s know more about this point:
WomenOverTime.loc[WomenOverTime['Year'] == 1904]

We will now try using a pointplot to visualize the variation in weight over athletes.

The first graph will show data for men, the second for women:

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=MenOverTime)
plt.title('Variation of Weight for Male Athletes over time')
plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=WomenOverTime)
plt.title('Variation of Weight for Female Athletes over time')

What we can see is that it seems that we do not have data for women before 1924. Let’s try filtering all the women athletes for that period to review this point:

womenInOlympics.loc[womenInOlympics['Year'] < 1924].head(20)

Using the same pointplot (with a different palette) we can plot the weight change along time.

The first graph will show the information for men, the second for women:

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=MenOverTime, palette='Set2')
plt.title('Variation of Height for Male Athletes over time')
plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=WomenOverTime, palette='Set2')
plt.title('Variation of Height for Female Athletes over time')

What we may see:

  • For both men and women, the height is incrementing over time but it is decreasing between the 2012 and the 2016.
  • For women we have a peak between 1928 and 1948, let’s deepen this point:
WomenOverTime.loc[(WomenOverTime['Year'] > 1924) & (WomenOverTime['Year'] < 1952)].head(10)

Let’s see the age over time for Italian athletes.

I will start reviewing the dataset MenOverTime to refresh the columns:

MenOverTime.head()

Let’s create a sliced dataframe including only male athletes from Italy

itMenOverTime = MenOverTime.loc[MenOverTime['region'] == 'Italy']

Okay, now we can plot the change over time:

sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=itMenOverTime, palette='Set2')
plt.title('Variation of Age for Italian Male Athletes over time')

Okay, we can quickly do the same operation for women:

itWomenOverTime = WomenOverTime.loc[WomenOverTime['region'] == 'Italy']
sns.set(style="darkgrid")
plt.figure(figsize=(20, 10))
sns.countplot(x='Year', data=itWomenOverTime, palette='Set2')
plt.title('Variation of Age for Italian Female Athletes over time')

Let’s first of all isolate all the discipline of the Olympics dataframe.

My idea is to see if Gymnastics is called differently or if there is any type.

MenOverTime['Sport'].unique().tolist()
#Output
['Basketball',
 'Judo',
 'Football',
 'Tug-Of-War',
 'Swimming',
 'Badminton',
 'Gymnastics',
 'Athletics',
 'Art Competitions',
 'Wrestling',
 'Water Polo',
 'Sailing',
 'Rowing',
 'Fencing',
 'Equestrianism',
 'Shooting',
 'Boxing',
 'Taekwondo',
 'Cycling',
 'Weightlifting',
 'Diving',
 'Canoeing',
 'Handball',
 'Tennis',
 'Modern Pentathlon',
 'Hockey',
 'Volleyball',
 'Baseball',
 'Table Tennis',
 'Archery',
 'Trampolining',
 'Beach Volleyball',
 'Golf',
 'Rugby Sevens',
 'Triathlon',
 'Rugby',
 'Lacrosse',
 'Polo',
 'Cricket',
 'Ice Hockey',
 'Racquets',
 'Motorboating',
 'Croquet',
 'Figure Skating',
 'Jeu De Paume',
 'Roque',
 'Basque Pelota',
 'Alpinism',
 'Aeronautics']

Okay, the string to use to filter is ‘Gymnastics’: let’s create two new dataframes for men and women.

gymMenOverTime = MenOverTime.loc[MenOverTime['Sport'] == 'Gymnastics']
gymWomenOverTime = WomenOverTime.loc[WomenOverTime['Sport'] == 'Gymnastics']

Okay: let’s now create our plot for male and female athletes and then we can make our observations

plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Weight', data=gymMenOverTime)
plt.title('Weight over year for Male Gymnasts')
plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Height', data=gymMenOverTime)
plt.title('Height over year for Male Gymnasts')
plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Weight', data=gymWomenOverTime)
plt.title('Weight over year for Female Gymnasts')
plt.figure(figsize=(20, 10))
sns.barplot('Year', 'Height', data=gymWomenOverTime)
plt.title('Height over year for Female Gymnasts')

A few things I noticed:

  • The weight for female Gymnasts has go down for 60 to 50 kilograms on average;
  • The weight for men has been more or less stable since 1964;
  • The height is more stable for both men and women.

Also, men weight data from 1924 seems missing: let’s check.

gymMenOverTime['Weight'].loc[gymMenOverTime['Year'] == 1924].isnull().all()

Weightlifting

Let’s work on an analysis similar to what we have done for Gymnastics also for the Lifters.

We can start creating a new, dedicated dataframe.

wlMenOverTime = MenOverTime.loc[MenOverTime['Sport'] == 'Weightlifting']
wlWomenOverTime = WomenOverTime.loc[WomenOverTime['Sport'] == 'Weightlifting']

Okay: let’s now create our plot for male and female athletes and then we can make our observations

plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=wlMenOverTime, palette='Set2')
plt.title('Weight over year for Male Lifters')
plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=wlMenOverTime, palette='Set2')
plt.title('Height over year for Male Lifters')
plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Weight', data=wlWomenOverTime)
plt.title('Weight over year for Female Lifters')
plt.figure(figsize=(20, 10))
sns.pointplot('Year', 'Height', data=wlWomenOverTime)
plt.title('Height over year for Female Lifters')

Follow us on Instagram for all your Queries

Aman Kharwal
Aman Kharwal

I'm a writer and data scientist on a mission to educate others about the incredible power of data📈.

Articles: 1534

Leave a Reply