Highest-Paid Athletes Analysis with Python

One of the biggest reasons athletes make so much money is that we love to watch their games. In this article, I will take you through a data science project on analyzing the highest-paid athletes with Python.

I will start this task of analyzing the highest-paid athletes by importing the necessary Python libraries and the dataset. The first five rows look like this:

   Rank               Name      Pay  Salary/Winnings  Endorsements   Sport  Year
0    #1       Lionel Messi   $127 M            $92 M         $35 M  Soccer  2019
1    #2  Cristiano Ronaldo   $109 M            $65 M         $44 M  Soccer  2019
2    #3             Neymar   $105 M            $75 M         $30 M  Soccer  2019
3    #4     Canelo Alvarez    $94 M            $92 M          $2 M  Boxing  2019
4    #5      Roger Federer  $93.4 M           $7.4 M         $86 M  Tennis  2019
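
The import and loading code is not reproduced in this article. As a minimal sketch, assuming the Forbes data sits in a CSV file with the seven columns shown in the preview, loading could look like the following. Three rows from the preview are inlined through io.StringIO so the snippet runs on its own, and the file name in the comment is a placeholder, not the article's actual path:

```python
import io

import pandas as pd

# Stand-in for the real file: three rows copied from the preview.
# With the actual dataset you would call pd.read_csv("forbes_athletes.csv"),
# where the file name is a hypothetical placeholder.
csv_text = """Rank,Name,Pay,Salary/Winnings,Endorsements,Sport,Year
#1,Lionel Messi,$127 M,$92 M,$35 M,Soccer,2019
#2,Cristiano Ronaldo,$109 M,$65 M,$44 M,Soccer,2019
#5,Roger Federer,$93.4 M,$7.4 M,$86 M,Tennis,2019
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the money columns load as text until they are cleaned
```

Checking df.dtypes right after loading is a quick way to see which columns still need cleaning before any arithmetic or plotting.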

The dataset contains 7 columns and 795 rows. Here is a short description of each column:

  1. Rank: annual ranking based on pay
  2. Name: athlete’s name
  3. Pay: total pay (salary/winnings plus endorsements)
  4. Salary/Winnings: athlete’s salary or prize money
  5. Endorsements: revenue from advertising, social media, sponsors, etc.
  6. Sport: the athlete’s sport
  7. Year: year of the payroll

The dataset we are using is from Forbes. Some columns are not consistent across the dataset because Forbes changed over time whether it puts a “#” before the rank value. Let’s fix this and remove the dollar signs and the “M” from the money columns. Let’s also change “Soccer” to “Football” and “Football” to “American Football”.
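
The cleaning code itself is not shown in this article, so here is a minimal sketch of how these steps might be done with pandas. It uses two rows from the preview plus one invented row ("Example Player", with made-up numbers) that exists only to exercise the rank without "#" and the "Football" label. The names money_cols and sport_names are illustrative, not the author's:

```python
import pandas as pd

# Two rows from the preview plus one invented row ("Example Player").
df = pd.DataFrame({
    "Rank": ["#1", "#5", "7"],
    "Name": ["Lionel Messi", "Roger Federer", "Example Player"],
    "Pay": ["$127 M", "$93.4 M", "$50 M"],
    "Salary/Winnings": ["$92 M", "$7.4 M", "$40 M"],
    "Endorsements": ["$35 M", "$86 M", "$10 M"],
    "Sport": ["Soccer", "Tennis", "Football"],
    "Year": [2019, 2019, 2019],
})

# Rank: drop the optional leading "#" and store the value as an integer.
df["Rank"] = df["Rank"].astype(str).str.lstrip("#").astype(int)

# Money columns: strip "$" and "M", leaving millions of dollars as floats.
money_cols = ["Pay", "Salary/Winnings", "Endorsements"]
for col in money_cols:
    df[col] = (
        df[col]
        .str.replace("$", "", regex=False)
        .str.replace("M", "", regex=False)
        .str.strip()
        .astype(float)
    )

# Look each label up once, so a "Soccer" row that becomes "Football"
# is not renamed a second time to "American Football".
sport_names = {"Soccer": "Football", "Football": "American Football"}
df["Sport"] = df["Sport"].map(lambda s: sport_names.get(s, s))

print(df)
```

Renaming through a single lookup per row matters here: applying the two renames one after the other would turn every soccer player into an American football player.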

Now let’s see the breakdown of athletes in the dataset by sport:

df.groupby("Name").first()["Sport"].value_counts().plot(kind="pie",autopct="%.0f%%",figsize=(8,8),wedgeprops=dict(width=0.4),pctdistance=0.8)
plt.ylabel(None)
plt.title("Breakdown of Athletes by Sport",fontweight="bold")
plt.show()

Racing Bar Animation for Highest-Paid Athletes Analysis with Python

Let’s visualize the cumulative pay of the athletes in a racing bar animation. First, we’ll convert the Year column to datetime values:

df.Year = pd.to_datetime(df.Year,format="%Y")

Next, prepare a pivot table where the columns are the athletes and the index is the years:

racing_bar_data = df.pivot_table(values="Pay",index="Year",columns="Name")
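
To see what this pivot produces, here is a small sketch on toy data (athletes "A" and "B" and their pay values are invented for illustration). Note that pivot_table aggregates with the mean by default, which makes no difference when there is one row per athlete per year:

```python
import pandas as pd

# Toy long-format data: athlete "B" is missing from the 2013 list.
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A", "B"],
    "Year": pd.to_datetime(["2012", "2012", "2013", "2014", "2014"], format="%Y"),
    "Pay": [10.0, 20.0, 12.0, 14.0, 30.0],
})

# One row per year, one column per athlete.
wide = df.pivot_table(values="Pay", index="Year", columns="Name")
print(wide)

# "B" has a NaN in the 2013 row because there was no source row for it.
print(wide.isnull().sum())
```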

The athletes listed below are the only ones who appear in the Top 100 list every year since 2012. The rest of the athletes have NaN values for at least one year, which we will fill by linear interpolation followed by backfilling of any NaNs that remain:

racing_bar_data.columns[racing_bar_data.isnull().sum() == 0]
Index(['Carmelo Anthony', 'Cristiano Ronaldo', 'Dwight Howard',
       'Justin Verlander', 'LeBron James', 'Lionel Messi', 'Phil Mickelson',
       'Rafael Nadal', 'Roger Federer', 'Tiger Woods'],
      dtype='object', name='Name')

Now fill the NaNs and convert the data to a cumulative sum of pay over the years:

# Linear interpolation first, then backfill any leading NaNs
racing_bar_filled = racing_bar_data.interpolate(method="linear").bfill()
racing_bar_filled = racing_bar_filled.cumsum()
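
The effect of these two steps is easier to see on a tiny table. The sketch below uses invented athletes "A" and "B" with made-up pay values; "B" has one leading gap and one interior gap:

```python
import numpy as np
import pandas as pd

years = pd.to_datetime(["2012", "2013", "2014", "2015"], format="%Y")
wide = pd.DataFrame(
    {
        "A": [10.0, 12.0, 14.0, 16.0],      # present every year
        "B": [np.nan, 20.0, np.nan, 40.0],  # leading gap and interior gap
    },
    index=years,
)

# Linear interpolation fills the interior gap from its two neighbours.
# The leading NaN has no earlier value to interpolate from, so it is
# left for the backfill, which copies the next valid value upward.
filled = wide.interpolate(method="linear").bfill()

# Running total of pay per athlete.
cumulative = filled.cumsum()
print(filled)
print(cumulative)
```

Keep in mind that the filled values are estimates: an athlete who was outside the list in a given year is credited with interpolated pay for that year, so the cumulative totals are approximate for everyone except the athletes with no missing years.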

Now, let’s upsample the data to daily frequency with linear interpolation and keep every seventh row, so that the animation frames transition smoothly:

racing_bar_filled = racing_bar_filled.resample("1D").interpolate(method="linear")[::7]
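
As a rough illustration of what the resampling does, the sketch below applies the same call to a toy table with two yearly points (the values are invented) and counts the resulting rows:

```python
import pandas as pd

years = pd.to_datetime(["2012", "2013"], format="%Y")
cumulative = pd.DataFrame({"A": [0.0, 366.0]}, index=years)

# Upsample to one row per day and draw a straight line between the
# yearly points, then keep every 7th row as an animation frame.
daily = cumulative.resample("1D").interpolate(method="linear")
frames = daily[::7]

print(len(daily), len(frames))
print(frames.head(3))
```

Each yearly step is spread over roughly 52 weekly frames, so the bars grow gradually instead of jumping once per year.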

Creating and Saving a Bar Chart Animation with Python

The last step is to import the Python packages needed to create and save animations and to draw the chart elements (lines, bars, text, etc.), and then generate an animation of the 10 highest-paid athletes between 2012 and 2019.
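
The animation code is not reproduced in this article. The following is only a sketch of one common way to build a racing bar chart with matplotlib's FuncAnimation, not the author's implementation. The toy table stands in for racing_bar_filled, and TOP_N, draw_frame, and the output file name are names chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.animation import FuncAnimation

# Toy stand-in for racing_bar_filled: cumulative pay per athlete per frame.
idx = pd.date_range("2012-01-01", periods=5, freq="7D")
racing_bar_filled = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5], "B": [2, 3, 3, 3, 3], "C": [0, 1, 3, 6, 9]},
    index=idx,
    dtype=float,
)
TOP_N = 2  # the article shows 10; kept small for the toy data

fig, ax = plt.subplots(figsize=(8, 5))


def draw_frame(i):
    """Redraw the chart for frame i with the largest bar on top."""
    ax.clear()
    # Pick the TOP_N leaders, sorted ascending so the leader is drawn last.
    top = racing_bar_filled.iloc[i].nlargest(TOP_N).sort_values()
    bars = ax.barh(top.index, top.values)
    ax.set_title(f"Cumulative pay up to {racing_bar_filled.index[i]:%Y-%m-%d}")
    ax.set_xlabel("Cumulative pay ($ M)")
    return bars


anim = FuncAnimation(fig, draw_frame, frames=len(racing_bar_filled), interval=100)

# Saving needs an installed writer, for example Pillow for GIF output:
# anim.save("racing_bar.gif", writer="pillow", fps=10)
```

Clearing and redrawing the axes on every frame is the simplest approach. It is slower than updating existing bars in place, but it handles athletes entering and leaving the top group without extra bookkeeping.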

I hope you liked this article on a data science project on Highest-paid Athletes analysis with Python. Feel free to ask your valuable questions in the comments section below.

Aman Kharwal

Data Strategist at Statso. My aim is to decode data science for the real world in the most simple words.
