Netflix is a subscription-based streaming platform that allows users to watch movies and TV shows without advertisements. One of the reasons behind the popularity of Netflix is its recommendation system. Its recommendation system recommends movies and TV shows based on the user’s interest. If you are a Data Science student and want to learn how to create a Netflix recommendation system, this article is for you. This article will take you through how to build a Netflix recommendation system using Python.
Here’s How Netflix Recommendation System Works
The recommendation system of Netflix shows you movies and TV shows according to your interests. Netflix has a lot of data because of its user base. Its recommendation system predicts a personalised catalogue for you based on factors like:
- your viewing history
- the viewing history of other users with similar tastes and preferences as yours
- genres, category, description, and more information about the content that you watched in the past
The genre of the content is one of the most valuable factors that helps Netflix recommend more content even to new users. I hope you have understood how Netflix recommends content to its users. You can learn more about it here. In the section below, I will take you through how to build a Netflix recommendation system using Python.
Netflix Recommendation System using Python
The dataset I am using to build a Netflix recommendation system using Python is downloaded from Kaggle. The dataset contains information about all the movies and TV shows on Netflix as of 2021. You can download the dataset from here.
Now let’s import the necessary Python libraries and the dataset we need for this task:
import numpy as np import pandas as pd from sklearn.feature_extraction import text from sklearn.metrics.pairwise import cosine_similarity data = pd.read_csv("netflixData.csv") print(data.head())
Show Id Title \ 0 cc1b6ed9-cf9e-4057-8303-34577fb54477 (Un)Well 1 e2ef4e91-fb25-42ab-b485-be8e3b23dedb #Alive 2 b01b73b7-81f6-47a7-86d8-acb63080d525 #AnneFrank - Parallel Stories 3 b6611af0-f53c-4a08-9ffa-9716dc57eb9c #blackAF 4 7f2d4170-bab8-4d75-adc2-197f7124c070 #cats_the_mewvie Description \ 0 This docuseries takes a deep dive into the luc... 1 As a grisly virus rampages a city, a lone man ... 2 Through her diary, Anne Frank's story is retol... 3 Kenya Barris and his family navigate relations... 4 This pawesome documentary explores how our fel... Director \ 0 NaN 1 Cho Il 2 Sabina Fedeli, Anna Migotto 3 NaN 4 Michael Margolis Genres \ 0 Reality TV 1 Horror Movies, International Movies, Thrillers 2 Documentaries, International Movies 3 TV Comedies 4 Documentaries, International Movies Cast Production Country \ 0 NaN United States 1 Yoo Ah-in, Park Shin-hye South Korea 2 Helen Mirren, Gengher Gatti Italy 3 Kenya Barris, Rashida Jones, Iman Benson, Genn... United States 4 NaN Canada Release Date Rating Duration Imdb Score Content Type Date Added 0 2020.0 TV-MA 1 Season 6.6/10 TV Show NaN 1 2020.0 TV-MA 99 min 6.2/10 Movie September 8, 2020 2 2019.0 TV-14 95 min 6.4/10 Movie July 1, 2020 3 2020.0 TV-MA 1 Season 6.6/10 TV Show NaN 4 2020.0 TV-14 90 min 5.1/10 Movie February 5, 2020
In the first impressions on the dataset, I can see that the Title column needs preparation as it contains # before the name of the movies or tv shows. I will get back to it. For now, let’s have a look at whether the data contains null values or not:
print(data.isnull().sum())
Show Id 0 Title 0 Description 0 Director 2064 Genres 0 Cast 530 Production Country 559 Release Date 3 Rating 4 Duration 3 Imdb Score 608 Content Type 0 Date Added 1335 dtype: int64
The dataset contains null values, but before removing the null values, let’s select the columns that we can use to build a Netflix recommendation system:
data = data[["Title", "Description", "Content Type", "Genres"]] print(data.head())
Title \ 0 (Un)Well 1 #Alive 2 #AnneFrank - Parallel Stories 3 #blackAF 4 #cats_the_mewvie Description Content Type \ 0 This docuseries takes a deep dive into the luc... TV Show 1 As a grisly virus rampages a city, a lone man ... Movie 2 Through her diary, Anne Frank's story is retol... Movie 3 Kenya Barris and his family navigate relations... TV Show 4 This pawesome documentary explores how our fel... Movie Genres 0 Reality TV 1 Horror Movies, International Movies, Thrillers 2 Documentaries, International Movies 3 TV Comedies 4 Documentaries, International Movies
As the name suggests:
- The title column contains the titles of movies and TV shows on Netflix
- Description column describes the plot of the TV shows and movies
- The Content Type column tells us if it’s a movie or a TV show
- The Genre column contains all the genres of the TV show or the movie
Now let’s drop the rows containing null values and move further:
data = data.dropna()
Now I will clean the Title column as it contains some data preparation:
import nltk import re nltk.download('stopwords') stemmer = nltk.SnowballStemmer("english") from nltk.corpus import stopwords import string stopword=set(stopwords.words('english')) def clean(text): text = str(text).lower() text = re.sub('\[.*?\]', '', text) text = re.sub('https?://\S+|www\.\S+', '', text) text = re.sub('<.*?>+', '', text) text = re.sub('[%s]' % re.escape(string.punctuation), '', text) text = re.sub('\n', '', text) text = re.sub('\w*\d\w*', '', text) text = [word for word in text.split(' ') if word not in stopword] text=" ".join(text) text = [stemmer.stem(word) for word in text.split(' ')] text=" ".join(text) return text data["Title"] = data["Title"].apply(clean)
Now let’s have a look at some samples of the Titles before moving forward:
print(data.Title.sample(10))
3111 miniforc super dino power 1822 girl reveng 910 casino tycoon 4075 sand castl 2760 lock 3406 nightflyer 536 bangkok love stori object affect 4365 special 1733 full 2343 jeff dunham map Name: Title, dtype: object
Now I will use the Genres column as the feature to recommend similar content to the user. I will use the concept of cosine similarity here (used to find similarities in two documents):
feature = data["Genres"].tolist() tfidf = text.TfidfVectorizer(input=feature, stop_words="english") tfidf_matrix = tfidf.fit_transform(feature) similarity = cosine_similarity(tfidf_matrix)
Now I will set the Title column as an index so that we can find similar content by giving the title of the movie or TV show as an input:
indices = pd.Series(data.index, index=data['Title']).drop_duplicates()
Now here’s how to write a function to recommend Movies and TV shows on Netflix:
def netFlix_recommendation(title, similarity = similarity): index = indices[title] similarity_scores = list(enumerate(similarity[index])) similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True) similarity_scores = similarity_scores[0:10] movieindices = [i[0] for i in similarity_scores] return data['Title'].iloc[movieindices] print(netFlix_recommendation("girlfriend"))
3 blackaf 285 washington 417 arrest develop 434 astronomi club sketch show 451 aunti donna big ol hous fun 656 big mouth 752 bojack horseman 805 brew brother 935 champion 937 chappell show Name: Title, dtype: object
So this is how you can build a Netflix Recommendation System using the Python programming language.
Summary
The recommendation system of Netflix predicts a personalised catalogue for you based on factors like your viewing history, the viewing history of other users with similar tastes and preferences, and the genres, category, descriptions, and more information of the content you watched. I hope you liked this article on building a Netflix Recommendation System using Python. Feel free to ask valuable questions in the comments section below.
Excellent one but I noticed the code did not work for some other movie titles
it will work with the movies mentioned in the data, only choose the movie title mentioned in the data or update the data with more titles and their description