One of the biggest reasons athletes make so much money is that we love to watch their games. In this article, I will take you through a Data Science project on the Highest-Paid Athletes Analysis with Python.
Highest-Paid Athletes Analysis with Python
I will start this task of analyzing the highest-paid athletes by importing the necessary Python libraries and the dataset:
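A minimal sketch of this step is shown below. The file name in the comment is hypothetical, and to keep the example self-contained it reads three rows from an in-memory string rather than the real file, assuming the CSV uses the same seven column names as the preview:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file. With the actual data you would call
# something like pd.read_csv("forbes_athletes.csv") (file name is an assumption).
sample_csv = io.StringIO(
    "Rank,Name,Pay,Salary/Winnings,Endorsements,Sport,Year\n"
    "#1,Lionel Messi,$127 M,$92 M,$35 M,Soccer,2019\n"
    "#2,Cristiano Ronaldo,$109 M,$65 M,$44 M,Soccer,2019\n"
    "#3,Neymar,$105 M,$75 M,$30 M,Soccer,2019\n"
)

df_sample = pd.read_csv(sample_csv)
print(df_sample.head())   # first rows of the data
print(df_sample.shape)    # (number of rows, number of columns)
```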
| | Rank | Name | Pay | Salary/Winnings | Endorsements | Sport | Year |
|---|---|---|---|---|---|---|---|
| 0 | #1 | Lionel Messi | $127 M | $92 M | $35 M | Soccer | 2019 |
| 1 | #2 | Cristiano Ronaldo | $109 M | $65 M | $44 M | Soccer | 2019 |
| 2 | #3 | Neymar | $105 M | $75 M | $30 M | Soccer | 2019 |
| 3 | #4 | Canelo Alvarez | $94 M | $92 M | $2 M | Boxing | 2019 |
| 4 | #5 | Roger Federer | $93.4 M | $7.4 M | $86 M | Tennis | 2019 |
The dataset contains 7 columns and 795 rows. Here is a short description of each column:
- Rank: annual ranking based on total pay
- Name: the athlete's name
- Pay: total pay, the sum of salary/winnings and endorsements
- Salary/Winnings: the athlete's salary or winnings
- Endorsements: revenue from advertising, social media, sponsors, etc.
- Sport: the athlete's sport
- Year: year of the payroll
The dataset we are using is from Forbes. Some columns are inconsistent because Forbes did not always put "#" before the rank value over the years. Let's fix that and remove the dollar signs and the "M" from the money columns. Let's also rename "Soccer" to "Football" and "Football" to "American Football":
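The cleaning described above could be sketched roughly as follows. The small frame is made up to mimic the raw format, the helper `money_to_float` is a hypothetical name, and the sport mapping (Soccer to Football, Football to American Football) is an assumption about the intended renaming:

```python
import pandas as pd

# Tiny made-up frame mimicking the raw format (not the real Forbes data).
raw = pd.DataFrame({
    "Rank": ["#1", 2, "#3"],                      # "#" present only in some rows
    "Pay": ["$127 M", "$93.4 M", "$89.3 M"],
    "Salary/Winnings": ["$92 M", "$7.4 M", "-"],  # "-" stands for a missing value
    "Endorsements": ["$35 M", "$86 M", "$9 M"],
    "Sport": ["Soccer", "Tennis", "Football"],
})

def money_to_float(column):
    """Strip '$', 'M' and spaces, then convert to numbers (unparsable -> NaN)."""
    cleaned = column.astype(str).str.replace(r"[$M\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

clean = raw.copy()

# Drop the optional leading "#" and store the rank as an integer.
clean["Rank"] = clean["Rank"].astype(str).str.lstrip("#").astype(int)

# Convert every money column to a float (values are in millions of dollars).
for col in ["Pay", "Salary/Winnings", "Endorsements"]:
    clean[col] = money_to_float(clean[col])

# Rename sports in one pass so "Soccer" does not end up as "American Football".
sport_names = {"Soccer": "Football", "Football": "American Football"}
clean["Sport"] = clean["Sport"].map(lambda s: sport_names.get(s, s))

print(clean)
```

Looking each sport up once in the dictionary keeps the two renamings from chaining into each other.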
Now let's see the breakdown of athletes in the dataset based on their sport type:
df.groupby("Name").first()["Sport"].value_counts().plot(kind="pie", autopct="%.0f%%", figsize=(8, 8), wedgeprops=dict(width=0.4), pctdistance=0.8)
plt.ylabel(None)
plt.title("Breakdown of Athletes by Sport", fontweight="bold")
plt.show()

Racing Bar Animation for Highest-Paid Athletes Analysis with Python
Let's visualize the athletes' cumulative pay in a racing bar animation. First, we'll convert the Year column to datetime values:
df.Year = pd.to_datetime(df.Year,format="%Y")
Next, prepare a pivot table where the columns are the athletes and the index is the years:
racing_bar_data = df.pivot_table(values="Pay",index="Year",columns="Name")
The athletes listed below are the only ones who appear in the top 100 list every year since 2012; every other athlete has NaN values for the years they are missing. We will later fill those NaNs, first by linear interpolation and then by backfilling whatever remains. The following line lists the athletes with no missing years:
racing_bar_data.columns[racing_bar_data.isnull().sum() == 0]
Index(['Carmelo Anthony', 'Cristiano Ronaldo', 'Dwight Howard', 'Justin Verlander', 'LeBron James', 'Lionel Messi', 'Phil Mickelson', 'Rafael Nadal', 'Roger Federer', 'Tiger Woods'], dtype='object', name='Name')
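To make the pivot and the "no missing years" check concrete, here is a toy sketch with made-up athletes A, B and C (hypothetical names and values, not taken from the dataset):

```python
import pandas as pd

# Made-up long-format data: athlete C only appears in 2019.
toy = pd.DataFrame({
    "Name": ["A", "B", "A", "B", "C"],
    "Year": [2018, 2018, 2019, 2019, 2019],
    "Pay": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per year, one column per athlete; missing combinations become NaN.
toy_pivot = toy.pivot_table(values="Pay", index="Year", columns="Name")

# Keep only the columns (athletes) that have a value in every year.
complete = toy_pivot.columns[toy_pivot.isnull().sum() == 0]

print(toy_pivot)
print(list(complete))
```

Athlete C has no 2018 entry, so that cell is NaN and C is left out of `complete`.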
Now fill the missing values and convert the data to a cumulative pay sum over the years:
racing_bar_filled = racing_bar_data.interpolate(method="linear").bfill()
racing_bar_filled = racing_bar_filled.cumsum()
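The effect of these two steps is easier to see on a small made-up table (the values below are illustrative only):

```python
import numpy as np
import pandas as pd

years = pd.to_datetime(["2012", "2013", "2014", "2015"], format="%Y")
toy = pd.DataFrame(
    {
        "A": [10.0, np.nan, 30.0, 40.0],   # gap in the middle
        "B": [np.nan, np.nan, 5.0, 7.0],   # missing at the start
    },
    index=years,
)

# Linear interpolation fills the gap inside A; the leading NaNs in B stay,
# so backfilling copies B's first known value into the earlier years.
filled = toy.interpolate(method="linear").bfill()

# Running total per athlete over the years.
cumulative = filled.cumsum()

print(filled)
print(cumulative)
```

Note that backfilling credits an athlete with their first known pay in years when they were not on the list, so the cumulative totals for late entrants are estimates rather than reported figures.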
Now, let's upsample the dataset with linear interpolation, resampling to daily values and keeping every seventh row, so that the animation frames transition smoothly:
racing_bar_filled = racing_bar_filled.resample("1D").interpolate(method="linear")[::7]
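As a small illustration of what this resampling does, consider two made-up yearly points one leap year apart:

```python
import pandas as pd

years = pd.to_datetime(["2012", "2013"], format="%Y")
yearly = pd.DataFrame({"A": [0.0, 366.0]}, index=years)

# Insert one row per day between the yearly points and fill them linearly.
daily = yearly.resample("1D").interpolate(method="linear")

# Keep every seventh row, i.e. roughly one animation frame per week.
weekly = daily[::7]

print(len(daily), len(weekly))
print(weekly.head())
```

The step slice keeps every seventh row counted from the first one, so the final date only survives if it happens to fall on that stride.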
Creating and Saving a Bar Chart Animation with Python
Now let's import the Python packages needed to create and save animations and to draw the chart's elements (lines, bars, text, etc.). The code below will generate an animation for the 10 highest-paid athletes between 2012 and 2019:
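As a minimal sketch of how such an animation can be built (not necessarily the exact code behind the original chart), the example below uses matplotlib's `FuncAnimation` on made-up cumulative data. The frame `frames_df`, the athlete labels, the styling and the output file name are all assumptions, and path effects are left out for brevity:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.animation import FuncAnimation

# Made-up cumulative data standing in for racing_bar_filled (dates x athletes).
dates = pd.date_range("2012-01-01", periods=20, freq="7D")
rng = np.random.default_rng(0)
frames_df = pd.DataFrame(
    rng.uniform(0.5, 2.0, size=(20, 12)).cumsum(axis=0),
    index=dates,
    columns=[f"Athlete {i}" for i in range(12)],
)

TOP_N = 10
fig, ax = plt.subplots(figsize=(9, 6))

def draw_frame(i):
    """Redraw the axes with the TOP_N largest cumulative values at frame i."""
    ax.clear()
    # Sort ascending so the largest bar is drawn at the top of the chart.
    top = frames_df.iloc[i].nlargest(TOP_N).sort_values()
    bars = ax.barh(top.index, top.values, color="steelblue")
    for bar, value in zip(bars, top.values):
        ax.text(value, bar.get_y() + bar.get_height() / 2, f" {value:.1f}", va="center")
    ax.set_title(f"Cumulative pay up to {frames_df.index[i]:%Y-%m-%d}", fontweight="bold")
    ax.set_xlabel("Cumulative pay (million USD)")
    return bars

anim = FuncAnimation(fig, draw_frame, frames=len(frames_df), interval=100, repeat=False)

# Saving needs an installed writer, e.g. Pillow for GIFs (file name is hypothetical):
# anim.save("athletes_race.gif", writer="pillow", fps=10)
```

Clearing and redrawing the axes on every frame is the simplest approach; it is slower than updating existing bars in place but keeps the ranking logic easy to follow.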

I hope you liked this article on analyzing the highest-paid athletes with Python. Feel free to ask your questions in the comments section below.