Best Streaming Service Analysis with Python

The major streaming services, such as Netflix, Prime Video, Hulu, and Disney+, compete fiercely with one another. As a data scientist, you may find it an interesting task to work out which of them is the best. In this article, I’m going to introduce you to a data science project on the best streaming service analysis with Python.

Best Streaming Service Analysis

To find out which streaming service is the best, I will use the ratings of the shows on four major platforms: Netflix, Prime Video, Hulu, and Disney+.

The dataset that I will use for this analysis contains a comprehensive list of the TV shows available on the four platforms that we are comparing.

I am using this dataset to find the best streaming service, but as a beginner, you can also use it for tasks such as:

  1. Analyzing the streaming platforms
  2. Analyzing the IMDb and Rotten Tomatoes ratings of all the shows
  3. Analyzing the target age group of most of the TV shows

Best Streaming Service Analysis with Python

Now let’s get started with the task of Best Streaming service analysis with Python. I will start this task by importing all the necessary libraries and the dataset:
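The loading step might look like the sketch below. It is only an illustration under stated assumptions: the inline sample stands in for the real CSV file, whose name is not given here, and the column names (Title, IMDb, Rotten Tomatoes, Netflix, Hulu, Prime Video, Disney+) are assumed from the columns referred to later in this article.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file; replace the
# StringIO object with the path to your own copy of the dataset.
sample_csv = io.StringIO(
    "Title,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+\n"
    "Show A,8.7,96%,1,0,0,0\n"
    "Show B,,,0,1,0,0\n"
    "Show C,7.1,81%,1,1,0,0\n"
)

# Empty fields are read as NaN, which is what the preparation steps fill later.
tv_shows = pd.read_csv(sample_csv)
print(tv_shows.head())
```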

[Figure: best streaming service dataset]

Since we are only analyzing the data, we don’t need machine learning algorithms here. Most of the work can be done by visualizing and analyzing the ratings of the shows on the streaming platforms.

Data Preparation

Let’s prepare the dataset so that we can easily analyze the data. I will start preparing the data by dropping the duplicate values based on the title of the shows:

tv_shows.drop_duplicates(subset='Title',
                         keep='first',inplace=True)

Now, in the code section below, I will fill the null values in the data with zeroes and then convert them into integer data types:
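A minimal sketch of this step, assuming the Rotten Tomatoes column holds strings and the IMDb column holds floats on a 0 to 10 scale. The sample values are made up, and rescaling IMDb to 0 to 100 is an assumption made here so that both ratings share one scale. Because some copies of the dataset reportedly store values such as "100/100" rather than "100%", the sketch keeps only the leading digits before converting.

```python
import pandas as pd

# Made-up sample rows; the real dataset has many more columns.
tv_shows = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "IMDb": [8.7, None, 7.1],
    "Rotten Tomatoes": ["96%", None, "100/100"],
})

# Rotten Tomatoes: keep the leading digits ("96%" -> 96, "100/100" -> 100).
# Missing values stay NaN at this point and are filled with 0 afterwards.
rt_digits = tv_shows["Rotten Tomatoes"].str.extract(r"^(\d+)", expand=False)
tv_shows["Rotten Tomatoes"] = (
    pd.to_numeric(rt_digits, errors="coerce").fillna(0).astype(int)
)

# IMDb: fill missing values with 0, rescale 0-10 to 0-100 (an assumption),
# and round before converting so float noise cannot drop a point.
tv_shows["IMDb"] = (tv_shows["IMDb"].fillna(0) * 10).round().astype(int)

print(tv_shows)
```

After this step a rating of 0 means "no rating available", so later filters can exclude those rows.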

Visualizing the data will be easier if we bring the 1s and 0s in the columns named Netflix, Hulu, Disney+, and Prime Video into a categorical format. Keep in mind that the same show may be available on more than one platform:
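One way to do this, sketched here with pd.melt on made-up rows, is to turn the four indicator columns into a single categorical column. The name StreamingOn matches the column used in the scatter plot later in this article; the helper column Present and the exact indicator column name "Disney+" are assumptions.

```python
import pandas as pd

# Made-up sample: Show C is available on two platforms.
tv_shows = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show C"],
    "Netflix": [1, 0, 1],
    "Hulu": [0, 1, 1],
    "Prime Video": [0, 0, 0],
    "Disney+": [0, 0, 0],
})
platforms = ["Netflix", "Hulu", "Prime Video", "Disney+"]

# Wide to long: one row per (title, platform) pair with its 0/1 flag.
tv_shows_long = pd.melt(
    tv_shows[["Title"] + platforms],
    id_vars="Title",
    var_name="StreamingOn",
    value_name="Present",
)

# Keep only the pairs where the show is actually available.
tv_shows_long = (
    tv_shows_long[tv_shows_long["Present"] == 1]
    .drop(columns="Present")
    .reset_index(drop=True)
)
print(tv_shows_long)
```

A show that is on two platforms appears twice in the result, once per platform.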

Now I will merge this data with the dataset we started with and drop some unwanted columns:
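The merge can be sketched as an inner join on the Title column. Which columns count as unwanted is an assumption here: the sketch drops the platform indicator columns, since the StreamingOn column already carries that information. All values are made up.

```python
import pandas as pd

# Made-up prepared data: ratings plus two of the indicator columns.
tv_shows = pd.DataFrame({
    "Title": ["Show A", "Show B"],
    "IMDb": [87, 71],
    "Rotten Tomatoes": [96, 81],
    "Netflix": [1, 1],
    "Hulu": [0, 1],
})

# Long-format table with one row per (title, platform) pair.
tv_shows_long = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show B"],
    "StreamingOn": ["Netflix", "Netflix", "Hulu"],
})

# Attach the ratings to every (title, platform) row.
tv_shows_combined = tv_shows_long.merge(tv_shows, on="Title", how="inner")

# Drop the indicator columns, which are now redundant.
tv_shows_combined = tv_shows_combined.drop(columns=["Netflix", "Hulu"])
print(tv_shows_combined)
```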

Now let’s plot the data where the ratings are more than 1 to see the quantity of TV shows available on each platform:
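The filtering and counting behind such a plot can be sketched as follows, using made-up rows. The sketch assumes that both ratings must exceed 1 and that the filtered table is the tv_shows_both_ratings frame used in the scatter plot later on.

```python
import pandas as pd

# Made-up merged data; Show C has no ratings (filled with 0).
tv_shows_combined = pd.DataFrame({
    "Title": ["Show A", "Show B", "Show B", "Show C"],
    "StreamingOn": ["Netflix", "Netflix", "Hulu", "Hulu"],
    "IMDb": [87, 71, 71, 0],
    "Rotten Tomatoes": [96, 81, 81, 0],
})

# Each comparison needs its own parentheses because & binds tighter than >.
has_both = (tv_shows_combined["IMDb"] > 1) & (
    tv_shows_combined["Rotten Tomatoes"] > 1
)
tv_shows_both_ratings = tv_shows_combined[has_both]

# Number of rated shows per platform.
counts = tv_shows_both_ratings.groupby("StreamingOn")["Title"].count()
print(counts)
```

Calling counts.plot(kind="bar") on the result draws the bar chart, provided matplotlib is installed.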

[Figure: quantity of content in streaming platforms]

Final Step: Finding Best Streaming Service

Now let’s visualize the data to find the best streaming service based on their ratings. I will first use the violin charts to gauge the content ratings and the freshness of the streaming platform:
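A violin chart shows how the ratings on each platform are distributed. As a numeric companion to the charts, the sketch below computes the minimum, median, and maximum of both ratings per platform. The sample values are made up, and the summary is an illustration of what the charts convey rather than part of the original analysis.

```python
import pandas as pd

# Made-up ratings on a 0-100 scale for two platforms.
tv_shows_both_ratings = pd.DataFrame({
    "StreamingOn": ["Netflix"] * 3 + ["Hulu"] * 3,
    "IMDb": [60, 70, 80, 50, 70, 90],
    "Rotten Tomatoes": [55, 75, 95, 40, 60, 80],
})

# One row per platform, with min / median / max for each rating column.
summary = tv_shows_both_ratings.groupby("StreamingOn")[
    ["IMDb", "Rotten Tomatoes"]
].agg(["min", "median", "max"])
print(summary)
```

With plotly installed, px.violin(tv_shows_both_ratings, x='StreamingOn', y='IMDb', color='StreamingOn') draws the corresponding violin chart for the IMDb ratings, and the same call with y='Rotten Tomatoes' draws the one for freshness.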

[Figure: best streaming service analysis]

Now let’s use a scatter plot of the IMDb and Rotten Tomatoes ratings to see which streaming platform is rated best on both rating sites:

import plotly.express as px

px.scatter(tv_shows_both_ratings, x='IMDb',
           y='Rotten Tomatoes', color='StreamingOn')

[Figure: best streaming service analysis using ratings]

Conclusion:

By using the violin chart we can observe that:

  1. Hulu, Netflix, and Prime Video all have a large amount of data. For all three, quality decreases as the amount of content increases.
  2. Prime Video is denser in the top half of the IMDb ratings and also performs well on freshness.
  3. Disney+, despite being new, has also been very successful in this area.

Using the scatter plot, we can observe that Prime Video performs very well in the fourth quadrant. The bar plot also shows that Prime Video has a large quantity of content. So, looking at all the streaming platforms, we can conclude that Prime Video is better in both quality and quantity.

I hope you liked this article on a data science project on the best streaming service analysis with Python. Feel free to ask your questions in the comments section below.

Aman Kharwal

7 Comments

  1. I think the rating data is not independent across platforms. For example, when a show is available on several platforms, the same rating is recorded for each of them, even if platform A performs much better than platform B. So the given data cannot support a good inference about which platform performs best. One would need a per-platform column such as click rate for each title; then we could normalize the rates and multiply them by the given rating to get a more reliable metric.

      • Thanks for the quick reply. I originally used the full file path and got an error, which I thought was caused by the path. But the exact same error shows up when the .py and .csv files are in the same directory and I use your exact code.

        This is the error:

        Traceback (most recent call last):
          File "pandas\_libs\lib.pyx", line 2305, in pandas._libs.lib.maybe_convert_numeric
        ValueError: Unable to parse string "100/100"

        During handling of the above exception, another exception occurred:

        Traceback (most recent call last):
          File "C:\Users\Chinmay\Downloads\chin.py", line 18, in <module>
            tv_shows['Rotten Tomatoes'] = pd.to_numeric(tv_shows['Rotten Tomatoes'])
          File "C:\Users\Chinmay\PycharmProjects\pythonProject\venv\lib\site-packages\pandas\core\tools\numeric.py", line 183, in to_numeric
            values, _ = lib.maybe_convert_numeric(
          File "pandas\_libs\lib.pyx", line 2347, in pandas._libs.lib.maybe_convert_numeric
        ValueError: Unable to parse string "100/100" at position 0

        Process finished with exit code 1

        Thank You very much.

      • Tried reading your file this way:
        pd.read_csv('file.csv', error_bad_lines=False)

        Still same error.

        This particular line of code seems to be the culprit.

        tv_shows['Rotten Tomatoes'] = pd.to_numeric(tv_shows['Rotten Tomatoes'])

        I added parameter errors='coerce' to convert bad non-numeric values to NaN but got different and more errors.
