Content-Based Filtering in Machine Learning

Most recommendation systems use content-based filtering, collaborative filtering, or a combination of the two to show users recommendations that improve their experience. Content-based filtering generates recommendations from a user’s own behaviour and from the attributes of the content they have interacted with. In this article, I will walk you through what content-based filtering is in machine learning and how to implement it using Python.

What is a Recommendation System?

A recommendation system generates personalized recommendations by learning a user’s preferences from data such as their history and the time they spend viewing or reading. Many applications are built on recommendation systems. The most common categories are:

Online Shopping (Amazon, Zomato, etc.)
Audio (Songs, Audiobooks, Podcasts, etc.)
Video Recommendations (YouTube, Netflix, Amazon Prime, etc.)

There are two main types of recommendation systems:

Collaborative Filtering
Content-Based Filtering

Collaborative filtering looks at the behaviour of other users whose interests are similar to yours and recommends what those users liked. A recommendation system based on the content-based method instead recommends content that resembles what you have already viewed or rated. In the section below, I’ll walk you through how content-based filtering in machine learning works in detail, and then we’ll see how to implement it using Python.


Content-Based Filtering

A recommendation system based on content-based filtering provides recommendations to the user by analyzing the descriptions of the content that the user has rated. In this method, the algorithm learns which keywords characterize each piece of content and then looks for other content with similar keywords, so that it can recommend content of the same kind to that user.

Let’s understand the process of content-based filtering by looking at all the steps that are involved in this method for generating recommendations for the user:

It begins by identifying the keywords that describe each piece of content. In this step, it ignores words that carry little meaning, such as stop words.
Then it represents each piece of content by its keywords so that different pieces of content can be compared. To measure how similar two pieces of content are, the content-based method typically uses cosine similarity.
It compares the content that the user has already viewed or rated with all the other content. (Analyzing the correlation between users is what collaborative filtering does, not content-based filtering.)
Finally, it generates recommendations by ranking the other content by similarity and showing the user the closest matches.
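The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used later in this article: the three descriptions, the short stop-word list, and the function names `keyword_counts`, `cosine_similarity` and `recommend` are all made up for this example, and it compares raw keyword counts rather than weighted ones.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "for", "is", "on", "with"}

# Hypothetical item descriptions, invented for this example.
descriptions = {
    "Space Rescue": "An astronaut is stranded in space and waits for a rescue mission",
    "Lost in Orbit": "A rescue crew searches for an astronaut lost in orbit",
    "Baking Duel": "Two pastry chefs compete in a baking contest",
}


def keyword_counts(text):
    """Step 1: lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine_similarity(a, b):
    """Step 2: cosine of the angle between two keyword-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_title, top_n=2):
    """Steps 3 and 4: score every other item against the liked one and rank them."""
    liked = keyword_counts(descriptions[liked_title])
    scores = {
        title: cosine_similarity(liked, keyword_counts(text))
        for title, text in descriptions.items()
        if title != liked_title
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("Space Rescue"))
```

Here "Lost in Orbit" ranks first for "Space Rescue" because the two descriptions share the keywords "astronaut" and "rescue", while "Baking Duel" shares none and gets a similarity of zero.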

I hope you now understand how content-based filtering works. In the section below, I will walk you through how to implement it using the Python programming language.

Content-Based Filtering with Python

So far, we have covered what recommendation systems are and how the content-based method generates recommendations for a user. Now let’s see how to implement the content-based method with Python. For this task, I will use the dataset provided by MovieLens to create a movie recommendation system.

Let’s start this task by importing the necessary Python libraries and the dataset:
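A minimal sketch of this loading step with pandas might look like the following. So that it runs without the downloaded file, it reads a few invented rows from an in-memory CSV; the file name `movies_metadata.csv` in the commented-out line is an assumption, as are the three columns used here. The output shown below comes from the full dataset, not from this sketch.

```python
import io

import pandas as pd

# Stand-in for the real file so that this sketch runs anywhere.
# The rows are invented; the real dataset has far more rows and columns.
sample_csv = io.StringIO(
    "title,overview,vote_average\n"
    "Space Rescue,An astronaut is stranded in space and waits for a rescue mission,7.1\n"
    "Lost in Orbit,A rescue crew searches for an astronaut lost in orbit,6.4\n"
    "Baking Duel,Two pastry chefs compete in a baking contest,5.9\n"
)
movies = pd.read_csv(sample_csv)

# With the downloaded dataset, the call would look like this instead
# (the file name is an assumption):
# movies = pd.read_csv("movies_metadata.csv", low_memory=False)

# Show the first rows to check that the data loaded as expected.
print(movies.head())
```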

Dataset

   adult                              belongs_to_collection    budget  ...  video vote_average vote_count
0  False  {'id': 10194, 'name': 'Toy Story Collection', ...  30000000  ...  False          7.7     5415.0
1  False                                                NaN  65000000  ...  False          6.9     2413.0
2  False  {'id': 119050, 'name': 'Grumpy Old Men Collect...         0  ...  False          6.5       92.0
3  False                                                NaN  16000000  ...  False          6.1       34.0
4  False  {'id': 96871, 'name': 'Father of the Bride Col...         0  ...  False          5.7      173.0

Now I’m going to implement the steps of the content-based filtering process described above using Python. I will first prepare the data and select the column that describes the content of each movie. Then I will remove the stop words, and finally compute the cosine similarities that are used to generate recommendations:
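One way to write these steps with pandas and scikit-learn is sketched below. It assumes the descriptions live in a column named `overview` and uses a few invented rows in place of the real dataset, so the variable names and data are illustrative. The output shown below comes from the full dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in rows; the column names are assumptions.
movies = pd.DataFrame({
    "title": ["Space Rescue", "Lost in Orbit", "Baking Duel", "Untitled"],
    "overview": [
        "An astronaut is stranded in space and waits for a rescue mission",
        "A rescue crew searches for an astronaut lost in orbit",
        "Two pastry chefs compete in a baking contest",
        None,  # real data often has missing descriptions
    ],
})

# Prepare the data: replace missing descriptions with empty strings.
movies["overview"] = movies["overview"].fillna("")

# Turn each description into a TF-IDF vector, dropping English stop words.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(movies["overview"])

# TfidfVectorizer L2-normalizes each row by default, so the plain dot
# product computed by linear_kernel equals the cosine similarity.
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each title to its row number so that a title can be looked up later.
indices = pd.Series(movies.index, index=movies["title"])
print(indices)
```

A movie with an empty description ends up with a similarity of zero to every other movie, so it is never recommended on the basis of its text.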

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64

Now let’s create a function and see how the recommendation system works:
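A function of this kind could be sketched as follows. The name `get_recommendations`, its `top_n` parameter, and the invented rows are assumptions made for illustration; the block rebuilds the similarity matrix so that it runs on its own. The output shown below comes from the full dataset, not from this sketch.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical stand-in rows; the column names are assumptions.
movies = pd.DataFrame({
    "title": ["Space Rescue", "Lost in Orbit", "Baking Duel", "Mars Colony"],
    "overview": [
        "An astronaut is stranded in space and waits for a rescue mission",
        "A rescue crew searches for an astronaut lost in orbit",
        "Two pastry chefs compete in a baking contest",
        "Settlers build a colony on Mars after a long space mission",
    ],
})

# Cosine similarity between the TF-IDF vectors of all descriptions.
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["overview"])
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

# Lookup table from title to row number.
indices = pd.Series(movies.index, index=movies["title"])


def get_recommendations(title, top_n=2):
    """Return the titles whose descriptions are most similar to `title`."""
    idx = indices[title]
    # Similarity of the chosen movie to every movie, excluding itself.
    scores = pd.Series(similarity[idx], index=movies.index).drop(idx)
    top = scores.sort_values(ascending=False).head(top_n)
    return movies["title"].loc[top.index]


print(get_recommendations("Space Rescue"))
```

This sketch assumes that titles are unique. If two movies share a title, `indices[title]` returns several row numbers and the function would need to pick one of them first.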

23530                      Andy Hardy Meets Debutante
21422                                 A Family Affair
26304                          You're Only Young Once
10301                          The 40 Year Old Virgin
29369                  Andy Hardy's Private Secretary
23843                     Andy Hardy's Blonde Trouble
15348                                     Toy Story 3
43427                Andy Kaufman Plays Carnegie Hall
38476    Superstar: The Life and Times of Andy Warhol
42721    Andy Peters: Exclamation Mark Question Point
8327                                        The Champ
28128                       The Mayor of Casterbridge
21359                        Andy Hardy's Double Life
32086                                Brother's Keeper
Name: title, dtype: object

I hope you liked this article on what the content-based method is in machine learning and how to implement it using Python. Feel free to ask your questions in the comments section below.