
Recommendation systems are among the most popular applications of data science. They are used to predict the rating or preference that a user would give to an item.
Almost every major company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.
Let’s Build Our Own Recommendation System
In this data science project, you will see how to build basic versions of simple and content-based recommendation systems.
While these models will be nowhere close to the industry standard in terms of complexity, quality, or accuracy, they will help you get started with building more complex models that produce better results.
Download the data sets you need to build this movie recommendation model from here:
import pandas as pd
import numpy as np

# Load the two TMDB data sets
credits = pd.read_csv("tmdb_5000_credits.csv")
movies = pd.read_csv("tmdb_5000_movies.csv")

credits.head()

movies.head()

print("Credits:", credits.shape)
print("Movies Dataframe:", movies.shape)
#Output-
[5 rows x 20 columns]
Credits: (4803, 4)
Movies Dataframe: (4803, 20)
# Rename the key column so both dataframes can be merged on 'id'
credits_column_renamed = credits.rename(index=str, columns={"movie_id": "id"})
movies_merge = movies.merge(credits_column_renamed, on='id')
print(movies_merge.head())

# Drop the columns we do not need
movies_cleaned = movies_merge.drop(columns=['homepage', 'title_x', 'title_y', 'status', 'production_countries'])
print(movies_cleaned.head())
movies_cleaned.info()  # info() prints its summary itself and returns None
print(movies_cleaned.head(1)['overview'])
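To see where the title_x and title_y columns dropped above come from, here is a minimal sketch on two tiny made-up dataframes. The names toy_movies, toy_credits and toy_merged, and the rows in them, are hypothetical stand-ins and not the real TMDB data.

```python
import pandas as pd

# Hypothetical stand-ins for the two TMDB files (made-up rows)
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "overview": ["space war", None, "heist crew"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["A", "B", "C"],
    "cast": ["[]", "[]", "[]"],
})

# Rename the key so both frames share an "id" column, then join on it
toy_merged = toy_movies.merge(
    toy_credits.rename(columns={"movie_id": "id"}), on="id"
)

# "title" exists in both frames, so pandas suffixes it with _x and _y
print(list(toy_merged.columns))
```

Because both files carry a title column, the merge keeps both copies under suffixed names, which is why they can be dropped afterwards.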
Content Based Recommendation System
Now let’s make recommendations based on the movies’ plot summaries given in the overview column. If our user gives us a movie title, our goal is to recommend movies that share similar plot summaries.
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF over unigrams, bigrams and trigrams, ignoring English stop words
tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 3),
                      stop_words='english')
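# Fitting the TF-IDF on the 'overview' text
# Missing overviews are NaN, which fit_transform rejects, so replace them with empty strings
tfv_matrix = tfv.fit_transform(movies_cleaned['overview'].fillna(''))
print(tfv_matrix)
print(tfv_matrix.shape)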
#Output
<4803x10417 sparse matrix of type '<class 'numpy.float64'>'
	with 127220 stored elements in Compressed Sparse Row format>
(4803, 10417)
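If the overview column contains missing values, fitting the vectorizer fails with a ValueError about np.nan. The sketch below shows one way to handle that on a made-up series; toy_overviews and the other toy_ names are hypothetical, and filling with an empty string is an assumption about what you want for movies without a summary.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews; the middle one is missing, as read_csv would produce for an empty cell
toy_overviews = pd.Series([
    "A marine fights aliens on a distant moon",
    float("nan"),
    "A hacker learns the world is a simulation",
])

# Replace missing overviews with empty strings before vectorizing
toy_clean = toy_overviews.fillna("")

toy_tfv = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfv.fit_transform(toy_clean)

print(toy_matrix.shape)    # one row per overview
print(toy_matrix[1].nnz)   # the empty overview has no non-zero terms
```

An empty overview becomes an all-zero row, so that movie simply shares no terms with any other movie.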
from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)
print(sig[0])
#Output-
array([0.76163447, 0.76159416, 0.76159416, …, 0.76159416, 0.76159416, 0.76159416])
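Almost every score in the output above is close to 0.7616. As far as I can tell this follows from scikit-learn's documented defaults for sigmoid_kernel, which computes tanh(gamma * <x, y> + coef0) with gamma = 1 / n_features and coef0 = 1: two overviews with no terms in common score tanh(1), which is about 0.7616. The sketch below checks this on three made-up sentences; toy_docs, toy_X, toy_sig and manual are hypothetical names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel

# Made-up documents: the first two share words, the third shares none with the first
toy_docs = [
    "space marines fight aliens",
    "aliens invade a space station",
    "a chef opens a restaurant",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

# Kernel with default parameters
toy_sig = sigmoid_kernel(toy_X, toy_X)

# Same quantity computed by hand from the dot products
gamma = 1.0 / toy_X.shape[1]
manual = np.tanh(gamma * linear_kernel(toy_X, toy_X) + 1.0)

print(np.allclose(toy_sig, manual))
# Documents with no shared terms have dot product 0, so they score tanh(1)
print(np.isclose(toy_sig[0, 2], np.tanh(1.0)))
```

Since tanh is increasing, sorting by these scores gives the same order as sorting by the plain dot products, so the small differences in the later decimals still rank the movies.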
Reverse mapping of indices and movie titles
# Reverse mapping of indices and movie titles
indices = pd.Series(movies_cleaned.index, index=movies_cleaned['original_title']).drop_duplicates()
print(indices)
print(indices['Newlyweds'])
print(sig[4799])
print(list(enumerate(sig[indices['Newlyweds']])))
print(sorted(list(enumerate(sig[indices['Newlyweds']])), key=lambda x: x[1], reverse=True))
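One thing to watch in the reverse mapping: drop_duplicates() on a Series compares the values (here the row numbers), not the titles in the index. If a title appears twice, looking it up returns several rows instead of one. A minimal sketch with made-up titles, where toy_titles, toy_indices and toy_unique are hypothetical names:

```python
import pandas as pd

# Made-up titles with one repeat
toy_titles = pd.Series(["Avatar", "Batman", "Batman"], name="original_title")

# Same construction as the reverse mapping: title -> row number
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

# The values 0, 1, 2 are already unique, so nothing is dropped
print(len(toy_indices.drop_duplicates()))

# A repeated title maps to more than one row
print(len(toy_indices["Batman"]))

# Keep only the first row for each title by filtering on the index instead
toy_unique = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(toy_unique["Batman"])
```

Keeping the first occurrence is an arbitrary choice for this sketch; you may prefer to tell repeated titles apart by release year.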
def give_recomendations(title, sig=sig):
    # Get the index corresponding to original_title
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies by similarity, highest first
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies (position 0 is the movie itself)
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return movies_cleaned['original_title'].iloc[movie_indices]
Testing our content-based recommendation system with the film Avatar
print(give_recomendations('Avatar'))
#Output-
1341 Obitaemyy Ostrov
634 The Matrix
3604 Apollo 18
2130 The American
775 Supernova
529 Tears of the Sun
151 Beowulf
311 The Adventures of Pluto Nash
847 Semi-Pro
942 The Book of Life
Name: original_title, dtype: object
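If you want to try the whole idea without downloading the data, the following is a small self-contained sketch on four made-up movies. Everything in it (the toy dataframe, toy_sim, toy_lookup and the toy_recommend helper) is hypothetical and only mirrors the steps above. Unlike the function above, it removes the queried movie explicitly instead of assuming it is ranked first, and it returns an empty list for an unknown title.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up movies and overviews
toy = pd.DataFrame({
    "original_title": ["Star Raid", "Moon Siege", "Pasta Night", "Bistro"],
    "overview": [
        "soldiers defend a space colony from aliens",
        "aliens attack a colony on the moon",
        "a chef cooks pasta in a small restaurant",
        "a young chef opens a restaurant",
    ],
})

# TF-IDF vectors and pairwise sigmoid-kernel scores
toy_vectors = TfidfVectorizer(stop_words="english").fit_transform(toy["overview"])
toy_sim = sigmoid_kernel(toy_vectors, toy_vectors)

# Title -> row number
toy_lookup = pd.Series(toy.index, index=toy["original_title"])

def toy_recommend(title, n=1):
    """Return the n titles most similar to `title`, or [] if the title is unknown."""
    if title not in toy_lookup.index:
        return []
    idx = toy_lookup[title]
    # Rank all rows by score, highest first, then drop the queried row itself
    ranked = toy_sim[idx].argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:n]
    return toy["original_title"].iloc[ranked].tolist()

print(toy_recommend("Star Raid"))
print(toy_recommend("Missing Title"))
```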
Hey, I’m new to this kind of project. Would you mind telling me where and what kind of machine learning concept is involved in this project?
It is based on scikit-learn’s TfidfVectorizer.
TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.
Thanks a lot. Is this considered as supervised or unsupervised learning?
It is unsupervised: the TF-IDF vectors and similarity scores are computed without any labels.
Ohh, got it. There is an error when I try to execute this line of code: “tfv_matrix = tfv.fit_transform(movies_cleaned_df[‘overview’])”. It says NameError: name ‘movies_cleaned_df’ is not defined.
If I change the name to movies_cleaned, then it gives me a ValueError: np.nan is an invalid document, expected byte or unicode string.
I have sent you the complete project on Movies Recommendation System so that you can rectify your mistakes.
Keep visiting us.
Okay, I will email you this project so that you can rectify
Received. Thanks a lot.
Keep visiting ⭐
There is an error when I try to execute this line of code: “tfv_matrix = tfv.fit_transform(movies_cleaned_df[‘overview’])”. It says NameError: name ‘movies_cleaned_df’ is not defined.
If I change the name to movies_cleaned, then it gives me a ValueError: np.nan is an invalid document, expected byte or unicode string.
If you are following the same code and still getting an error, check the variable names: the dataframe is called movies_cleaned, so you may have typed movies_cleaned_df somewhere by mistake. If you then get the np.nan ValueError, the overview column has missing values; fill them with empty strings, for example with fillna(''), before fitting.