
Recommendation systems are among the most popular applications of data science. They are used to predict the rating or preference that a user would give to an item.
Almost every major company has applied them in some form: Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow.
Let’s Build our own recommendation system
In this data science project, you will see how to build a book recommendation system using machine learning techniques.
You can download the data sets we need for this task from here:
Let’s start with this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
# which was removed in pandas 2.0; malformed rows are skipped either way.
books = pd.read_csv('BX-Books.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
users = pd.read_csv('BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']
ratings = pd.read_csv('BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']
print(ratings.shape)
print(list(ratings.columns))
#Output
(1149780, 3)
['userID', 'ISBN', 'bookRating']
plt.rc("font", size=15)
ratings.bookRating.value_counts(sort=False).plot(kind='bar')
plt.title('Rating Distribution\n')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

print(books.shape)
print(list(books.columns))
#Output
(271360, 8)
['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']
print(users.shape)
print(list(users.columns))
#Output
(278858, 3)
['userID', 'Location', 'Age']
users.Age.hist(bins=[0, 10, 20, 30, 40, 50, 100])
plt.title('Age Distribution\n')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

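To ensure statistical significance, the aim is to exclude users with fewer than 200 ratings and books with fewer than 100 ratings.
counts1 = ratings['userID'].value_counts()
ratings = ratings[ratings['userID'].isin(counts1[counts1 >= 200].index)]
# Note: the next two lines count occurrences of each rating value (bookRating),
# not of each book, so they do not drop rarely rated books. Counting
# ratings['ISBN'] instead would. The outputs further down (many titles with a
# single rating) are consistent with the code as written here.
counts = ratings['bookRating'].value_counts()
ratings = ratings[ratings['bookRating'].isin(counts[counts >= 100].index)]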
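The count-based filtering described above can be sketched on a toy table. This is only an illustration: the data is made up, the thresholds `min_user_ratings` and `min_book_ratings` are hypothetical stand-ins for 200 and 100, and the book step counts ratings per `ISBN`, which is an assumption about the intent of the filter.

```python
import pandas as pd

# Toy ratings table; column names follow the article, the values are made up.
toy = pd.DataFrame({
    'userID':     [1, 1, 1, 2, 2, 3],
    'ISBN':       ['a', 'b', 'c', 'a', 'b', 'a'],
    'bookRating': [5, 0, 8, 7, 0, 10],
})

min_user_ratings = 2   # hypothetical stand-in for 200
min_book_ratings = 2   # hypothetical stand-in for 100

# Keep only users who have rated at least min_user_ratings books.
user_counts = toy['userID'].value_counts()
toy = toy[toy['userID'].isin(user_counts[user_counts >= min_user_ratings].index)]

# Keep only books that still have at least min_book_ratings ratings.
book_counts = toy['ISBN'].value_counts()
toy = toy[toy['ISBN'].isin(book_counts[book_counts >= min_book_ratings].index)]

print(toy)
```

User 3 is dropped first (one rating), after which book `c` has a single rating left and is dropped as well. The order of the two steps matters, because the book counts are computed on the already filtered users.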
Collaborative Filtering Using k-Nearest Neighbors (kNN)
kNN is a machine learning algorithm that finds clusters of similar users based on common book ratings and makes predictions using the average rating of the top-k nearest neighbors.
For example, we first arrange the ratings in a matrix with one row for each item (book) and one column for each user.
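To make the matrix layout concrete, here is a small sketch of how "nearness" between two books can be measured once each book is a row of user ratings. It uses cosine distance, the metric passed to the kNN model later in this article; the matrix values and the helper `cosine_distance` are made up for illustration.

```python
import numpy as np

# Rows are books, columns are users; 0 means "no rating". Values are made up.
ratings_matrix = np.array([
    [5.0, 0.0, 8.0, 0.0],   # book A
    [4.0, 0.0, 9.0, 0.0],   # book B, rated by the same users as A
    [0.0, 7.0, 0.0, 6.0],   # book C, rated by a disjoint set of users
])

def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two rating vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

d_ab = cosine_distance(ratings_matrix[0], ratings_matrix[1])
d_ac = cosine_distance(ratings_matrix[0], ratings_matrix[2])
print(d_ab, d_ac)
```

Books A and B were rated similarly by the same users, so their distance is close to 0. Books A and C share no raters, so their vectors are orthogonal and the distance is 1.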
combine_book_rating = pd.merge(ratings, books, on='ISBN')
columns = ['yearOfPublication', 'publisher', 'bookAuthor', 'imageUrlS', 'imageUrlM', 'imageUrlL']
combine_book_rating = combine_book_rating.drop(columns, axis=1)
print(combine_book_rating.head())
#Output
   userID  ...                                          bookTitle
0  277427  ...  Politically Correct Bedtime Stories: Modern Ta...
1    3363  ...  Politically Correct Bedtime Stories: Modern Ta...
2   11676  ...  Politically Correct Bedtime Stories: Modern Ta...
3   12538  ...  Politically Correct Bedtime Stories: Modern Ta...
4   13552  ...  Politically Correct Bedtime Stories: Modern Ta...

[5 rows x 4 columns]
Now we will group by book titles and create a new column for total rating count.
combine_book_rating = combine_book_rating.dropna(axis = 0, subset = ['bookTitle'])
book_ratingCount = (combine_book_rating.
     groupby(by = ['bookTitle'])['bookRating'].
     count().
     reset_index().
     rename(columns = {'bookRating': 'totalRatingCount'})
     [['bookTitle', 'totalRatingCount']]
    )
print(book_ratingCount.head())
#Output
                                           bookTitle  totalRatingCount
0   A Light in the Storm: The Civil War Diary of ...                 2
1                              Always Have Popsicles                 1
2               Apple Magic (The Collector's series)                 1
3   Beyond IBM: Leadership Marketing and Finance ...                 1
4   Clifford Visita El Hospital (Clifford El Gran...                 1
Now we will combine the rating data with the total rating count data. This gives us exactly what we need to find out which books are popular and to filter out lesser-known books.
rating_with_totalRatingCount = combine_book_rating.merge(book_ratingCount, left_on = 'bookTitle', right_on = 'bookTitle', how = 'left')
print(rating_with_totalRatingCount.head())
#Output
   userID  ...  totalRatingCount
0  277427  ...                82
1    3363  ...                82
2   11676  ...                82
3   12538  ...                82
4   13552  ...                82

[5 rows x 5 columns]
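The count-then-merge step above can be tried on a toy table. The sketch below uses made-up titles and ratings, and also shows `groupby(...).transform('count')` as an equivalent one-step alternative; that alternative is not the article's code, only a common pandas idiom for the same result.

```python
import pandas as pd

# Toy ratings; column names follow the article, the values are made up.
toy = pd.DataFrame({
    'userID':     [1, 2, 3, 1, 2],
    'bookTitle':  ['Dune', 'Dune', 'Dune', 'Emma', 'Ulysses'],
    'bookRating': [9, 7, 0, 8, 5],
})

# Count ratings per title, then merge the counts back onto each rating row.
counts = (toy.groupby('bookTitle')['bookRating']
             .count()
             .reset_index()
             .rename(columns={'bookRating': 'totalRatingCount'}))
merged = toy.merge(counts, on='bookTitle', how='left')

# Alternative: transform broadcasts each group's count back to its rows.
toy['totalRatingCount'] = toy.groupby('bookTitle')['bookRating'].transform('count')

print(merged)
```

Both forms attach the same per-title count to every rating row. The merge form keeps the counts available as a separate table, which the article uses for the summary statistics that follow.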
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(book_ratingCount['totalRatingCount'].describe())
#Output
count   160576.000
mean         3.044
std          7.428
min          1.000
25%          1.000
50%          1.000
75%          2.000
max        365.000
Name: totalRatingCount, dtype: float64
print(book_ratingCount['totalRatingCount'].quantile(np.arange(.9, 1, .01)))
#Output
0.900    5.000
0.910    6.000
0.920    7.000
0.930    7.000
0.940    8.000
0.950   10.000
0.960   11.000
0.970   14.000
0.980   19.000
0.990   31.000
Name: totalRatingCount, dtype: float64
popularity_threshold = 50
rating_popular_book = rating_with_totalRatingCount.query('totalRatingCount >= @popularity_threshold')
print(rating_popular_book.head())
#Output
   userID  ...  totalRatingCount
0  277427  ...                82
1    3363  ...                82
2   11676  ...                82
3   12538  ...                82
4   13552  ...                82

[5 rows x 5 columns]
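The quantile table above puts the 99th percentile at 31 ratings, so a threshold of 50 keeps fewer than 1% of titles. The sketch below shows the same `query` pattern on a made-up count table, together with a quick check of how much of the catalogue a threshold retains; `share_kept` is a name introduced here for illustration.

```python
import pandas as pd

# Made-up rating counts per title.
counts = pd.DataFrame({
    'bookTitle': ['A', 'B', 'C', 'D', 'E'],
    'totalRatingCount': [1, 1, 2, 5, 60],
})

popularity_threshold = 50

# '@' lets query() read a Python variable from the enclosing scope.
popular = counts.query('totalRatingCount >= @popularity_threshold')

# Fraction of titles that survive the threshold.
share_kept = len(popular) / len(counts)

print(popular)
print(share_kept)
```

Checking the retained share before committing to a threshold is a cheap way to see the trade-off between denser rating vectors and a smaller set of recommendable books.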
Filter to users in US and Canada only
combined = rating_popular_book.merge(users, left_on = 'userID', right_on = 'userID', how = 'left')
us_canada_user_rating = combined[combined['Location'].str.contains("usa|canada")]
us_canada_user_rating = us_canada_user_rating.drop('Age', axis=1)
print(us_canada_user_rating.head())
#Output
   userID        ISBN  ...  totalRatingCount                       Location
0  277427  002542730X  ...                82          gilbert, arizona, usa
1    3363  002542730X  ...                82      knoxville, tennessee, usa
3   12538  002542730X  ...                82          byron, minnesota, usa
4   13552  002542730X  ...                82        cordova, tennessee, usa
5   16795  002542730X  ...                82  mechanicsville, maryland, usa

[5 rows x 6 columns]
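The location filter relies on `str.contains` with a regular expression. The sketch below runs the same pattern on a made-up users table and adds `na=False`, which is an addition for illustration (the article's code does not pass it) so that a missing `Location` counts as no match.

```python
import pandas as pd

# Made-up users; the last one has no location recorded.
toy_users = pd.DataFrame({
    'userID': [1, 2, 3, 4],
    'Location': ['gilbert, arizona, usa', 'toronto, ontario, canada',
                 'porto, v.n.gaia, portugal', None],
})

# 'usa|canada' is a regex: a row matches if either substring occurs anywhere.
# na=False maps missing locations to False instead of propagating NaN.
mask = toy_users['Location'].str.contains('usa|canada', na=False)
north_america = toy_users[mask]

print(north_america)
```

Because this is substring matching, any location string that happens to contain "usa" or "canada" is kept, so the filter is approximate rather than an exact country match.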
Implementing kNN
We convert our table to a 2D matrix and fill the missing values with zeros (since we will calculate distances between rating vectors).
We then transform the values (ratings) of the matrix dataframe into a SciPy sparse matrix for more efficient calculations.
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

us_canada_user_rating = us_canada_user_rating.drop_duplicates(['userID', 'bookTitle'])
us_canada_user_rating_pivot = us_canada_user_rating.pivot(index = 'bookTitle', columns = 'userID', values = 'bookRating').fillna(0)
us_canada_user_rating_matrix = csr_matrix(us_canada_user_rating_pivot.values)

model_knn = NearestNeighbors(metric = 'cosine', algorithm = 'brute')
model_knn.fit(us_canada_user_rating_matrix)
print(model_knn)
#Output
NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)
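The pivot and sparse conversion can be inspected on a tiny example. The sketch below uses made-up ratings and shows what the book-by-user table looks like and how many entries the sparse matrix actually stores.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up ratings in long format: one row per (user, book) pair.
toy = pd.DataFrame({
    'userID':     [1, 2, 1, 3],
    'bookTitle':  ['Dune', 'Dune', 'Emma', 'Emma'],
    'bookRating': [9, 7, 8, 6],
})

# One row per book, one column per user; unrated cells become 0.
pivot = toy.pivot(index='bookTitle', columns='userID', values='bookRating').fillna(0)

# The CSR format stores only the non-zero entries.
sparse = csr_matrix(pivot.values)

print(pivot)
print(sparse.shape, sparse.nnz)
```

The dense table has 6 cells, but only the 4 actual ratings are stored. On the real data, where most users have rated only a small fraction of the books, the saving is much larger. `pivot` raises an error on duplicate (book, user) pairs, which is why the article calls `drop_duplicates` first.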
query_index = np.random.choice(us_canada_user_rating_pivot.shape[0])
print(query_index)
print(us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1,-1))
distances, indices = model_knn.kneighbors(
    us_canada_user_rating_pivot.iloc[query_index,:].values.reshape(1, -1),
    n_neighbors = 6)
us_canada_user_rating_pivot.index[query_index]
#Output (truncated; the full row is mostly zeros, with a handful of ratings between 3 and 10)
[[0. 0. 0. ... 0. 0. 0.]]
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(us_canada_user_rating_pivot.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, us_canada_user_rating_pivot.index[indices.flatten()[i]], distances.flatten()[i]))
#Output
Recommendations for Flesh and Blood:

1: The Murder Book, with distance of 0.596243943613731:
2: Choke, with distance of 0.6321092998573327:
3: Easy Prey, with distance of 0.704010041374638:
4: 2nd Chance, with distance of 0.7292664430521165:
5: The Empty Chair, with distance of 0.7432121818110763:
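The lookup above picks a random book. A possible next step is to wrap it in a function that takes a title. The sketch below is one way to do that under stated assumptions: the function name `recommend`, the toy matrix, and the titles are all made up for illustration, and only the `NearestNeighbors` settings are taken from the article.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy book-by-user matrix (rows: titles, columns: users); values are made up.
pivot = pd.DataFrame(
    [[9, 8, 0, 0],
     [8, 9, 0, 0],
     [0, 0, 7, 9],
     [0, 1, 8, 9]],
    index=['Book A', 'Book B', 'Book C', 'Book D'],
    columns=[101, 102, 103, 104],
    dtype=float,
)

model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(csr_matrix(pivot.values))

def recommend(title, k=2):
    """Return up to k nearest titles to `title` as (title, distance) pairs."""
    row = pivot.loc[[title]].values  # shape (1, n_users)
    # Ask for k + 1 neighbors because the query book is its own nearest neighbor.
    distances, indices = model.kneighbors(row, n_neighbors=k + 1)
    pairs = zip(pivot.index[indices.flatten()], distances.flatten())
    # Drop the query title itself and keep at most k results.
    return [(t, float(d)) for t, d in pairs if t != title][:k]

result = recommend('Book A', k=1)
print(result)
```

Here Book B comes back as the closest match to Book A, since the same two users rated both highly. On the real pivot table, the same function would take `us_canada_user_rating_pivot` and `model_knn` in place of the toy objects.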