Movie Recommendation System with Machine Learning

Recommendation systems are among the most popular applications of data science. They are used to predict the rating or preference that a user would give to an item.

Almost every major company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.

Let’s Build Our Own Recommendation System

In this data science project, you will see how to build basic versions of both a simple and a content-based recommendation system.

While these models will be nowhere close to the industry standard in terms of complexity, quality, or accuracy, they will help you get started with building more complex models that produce better results.

Download the data sets you need to build this movie recommendation model from here:

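import pandas as pd
import numpy as np

# Load the TMDB 5000 credits and movies datasets
credits = pd.read_csv("tmdb_5000_credits.csv")
movies = pd.read_csv("tmdb_5000_movies.csv")
print(movies.head())
print("Credits:", credits.shape)
print("Movies Dataframe:", movies.shape)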

[5 rows x 20 columns]
Credits: (4803, 4)
Movies Dataframe: (4803, 20)

credits_column_renamed = credits.rename(index=str, columns={"movie_id": "id"})
movies_merge = movies.merge(credits_column_renamed, on='id')
movies_cleaned = movies_merge.drop(columns=['homepage', 'title_x', 'title_y', 'status','production_countries'])

Content Based Recommendation System

Now let’s make recommendations based on the movies’ plot summaries given in the overview column. If a user gives us a movie title, our goal is to recommend movies with similar plot summaries.

from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3, max_features=None,
                      strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                      ngram_range=(1, 3),
                      stop_words='english')

# Replace missing overviews with empty strings; NaN is not a valid document
movies_cleaned['overview'] = movies_cleaned['overview'].fillna('')

# Fit the TF-IDF vectorizer on the 'overview' text
tfv_matrix = tfv.fit_transform(movies_cleaned['overview'])

<4803x10417 sparse matrix of type '<class 'numpy.float64'>'
with 127220 stored elements in Compressed Sparse Row format>
(4803, 10417)
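To make the TF-IDF step less of a black box, here is a minimal hand-rolled sketch on three made-up overviews. It uses the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is what scikit-learn applies by default, but it is only an illustration: the toy stop-word list, the whitespace tokeniser, and the names `tfidf_vectors` and `dot` are assumptions for this example, and it ignores n-grams and `min_df`.

```python
import math
from collections import Counter

# Hypothetical, tiny stop-word list (the real 'english' list is much longer)
STOP_WORDS = {"a", "is", "the", "to", "are"}

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [[t for t in doc.lower().split() if t not in STOP_WORDS]
                 for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def dot(u, v):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Made-up overviews, for illustration only
overviews = [
    "a marine is sent to a distant moon",
    "a hacker learns the world is a simulation",
    "astronauts are sent to the moon",
]
vecs = tfidf_vectors(overviews)
sim_0_1 = dot(vecs[0], vecs[1])  # no shared terms once stop words are removed
sim_0_2 = dot(vecs[0], vecs[2])  # shares "sent" and "moon"
print(sim_0_1, sim_0_2)
```

The first and third overviews end up with a positive similarity because they share informative words, while the first and second share only stop words and score zero. This is the kind of signal the TF-IDF matrix carries into the next step.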

from sklearn.metrics.pairwise import sigmoid_kernel

# Compute the sigmoid kernel
sig = sigmoid_kernel(tfv_matrix, tfv_matrix)

array([0.76163447, 0.76159416, 0.76159416, …, 0.76159416, 0.76159416, 0.76159416])
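The scores above all sit very close to 0.7616, and a small sketch may help explain why. scikit-learn's `sigmoid_kernel` computes tanh(gamma * <x, y> + coef0), with `gamma` defaulting to 1 / n_features and `coef0` defaulting to 1. The helper `sigmoid_kernel_score` below is a hypothetical stand-in written for this illustration, and the two-element vectors are made up; only the feature count (10417) is taken from the matrix shape shown earlier.

```python
import math

def sigmoid_kernel_score(u, v, gamma, coef0=1.0):
    """tanh(gamma * <u, v> + coef0) for two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.tanh(gamma * dot + coef0)

n_features = 10417          # number of TF-IDF columns reported above
gamma = 1.0 / n_features    # the default when gamma is not passed

# Two unit-length vectors with no terms in common: dot product is 0
no_overlap = sigmoid_kernel_score([1.0, 0.0], [0.0, 1.0], gamma)

# A unit-length vector compared with itself: dot product is 1
identical = sigmoid_kernel_score([1.0, 0.0], [1.0, 0.0], gamma)

print(round(no_overlap, 8))  # tanh(1)
print(round(identical, 8))   # tanh(1 + 1/10417)
```

Because the TF-IDF rows are L2-normalised, every dot product lies between 0 and 1, and after scaling by such a small gamma the kernel values are squeezed into a narrow band just above tanh(1). The ranking still follows the dot product, so the ordering of recommendations is unaffected by the compression.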

Reverse mapping of indices and movie titles

# Reverse mapping of indices and movie titles
indices = pd.Series(movies_cleaned.index, index=movies_cleaned['original_title']).drop_duplicates()

# Inspect the sorted (index, score) pairs for one example title
print(sorted(list(enumerate(sig[indices['Newlyweds']])), key=lambda x: x[1], reverse=True))

def give_recomendations(title, sig=sig):
    # Get the index corresponding to original_title
    idx = indices[title]

    # Get the pairwise similarity scores
    sig_scores = list(enumerate(sig[idx]))

    # Sort the movies
    sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)

    # Scores of the 10 most similar movies
    sig_scores = sig_scores[1:11]

    # Movie indices
    movie_indices = [i[0] for i in sig_scores]

    # Top 10 most similar movies
    return movies_cleaned['original_title'].iloc[movie_indices]
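The function above drops the first sorted entry on the assumption that the queried movie always ranks first against itself. That usually holds, but it may not when scores tie (for example, a movie with an empty overview would score tanh(1) against everything, including itself). A minimal sketch of an alternative is to exclude the query index explicitly and pick the top entries with `heapq.nlargest`. The helper name `top_k_similar` and the score values are hypothetical, chosen only for illustration.

```python
import heapq

def top_k_similar(scores_row, query_idx, k=10):
    """Indices of the k highest scores in scores_row, skipping query_idx."""
    candidates = ((score, i) for i, score in enumerate(scores_row)
                  if i != query_idx)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [i for _, i in best]

# Hypothetical similarity scores of movie 0 against movies 0..3
row = [0.7616, 0.7620, 0.7616, 0.7618]
result = top_k_similar(row, query_idx=0, k=2)
print(result)  # [1, 3]
```

In the article's setting, `scores_row` would be `sig[idx]` and the returned indices could be passed to `movies_cleaned['original_title'].iloc[...]` in the same way as `movie_indices`.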

Testing our content-based recommendation system with the seminal film Spy Kids



1341 Obitaemyy Ostrov
634 The Matrix
3604 Apollo 18
2130 The American
775 Supernova
529 Tears of the Sun
151 Beowulf
311 The Adventures of Pluto Nash
847 Semi-Pro
942 The Book of Life
Name: original_title, dtype: object


Aman Kharwal



  1. Hey, I’m new to this kind of project. Would you mind telling me where and what kind of machine learning concept is involved in this project?

  2. There is an error when I try to execute this line of code: "tfv_matrix = tfv.fit_transform(movies_cleaned_df['overview'])". It says NameError: name 'movies_cleaned_df' is not defined.
    If I change the name to movies_cleaned, it gives me a ValueError: np.nan is an invalid document, expected byte or unicode string.

    • If you are following the same code and still getting an error, just check the variable names. In the code it is movies_cleaned; maybe somewhere you have typed movies_cleaned_df by mistake. Otherwise the code runs fine.
