
Product reviews are becoming more important as retail shifts from traditional brick-and-mortar stores to online shopping.
Consumers post reviews directly on product pages in real time. This large volume of reviews creates an opportunity to see how the market reacts to a specific product.
We will try to predict the sentiment of a product review using Python and machine learning.
Let’s import the necessary modules and take a look at the data:
You can download this dataset from here.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import math
import warnings

warnings.filterwarnings('ignore')  # Hides warning
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
sns.set_style("whitegrid")  # Plotting style
np.random.seed(7)  # seeding random number generator

df = pd.read_csv('amazon.csv')
print(df.head())

Describing the Dataset
data = df.copy()
data.describe()

data.info()

We need to clean up the name column by referencing asins (unique products) since we have 7000 missing values:
data["asins"].unique()

asins_unique = len(data["asins"].unique())
print("Number of Unique ASINs: " + str(asins_unique))
#Output-
Number of Unique ASINs: 42
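One way the name cleanup could look is sketched below. It is only an illustration under stated assumptions: the toy rows are invented, and choosing the most frequent non-missing name per ASIN is an assumed strategy, not necessarily the one used in the full project.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "asins": ["A1", "A1", "A1", "B2", "B2"],
    "name": ["Echo (White)", None, "Echo (White)", None, "Fire Tablet"],
})

def most_common_name(names):
    """Return the most frequent non-missing name in a group, or None if there is none."""
    counts = names.dropna().value_counts()
    return counts.index[0] if len(counts) else None

# One canonical name per ASIN (assumed strategy: the most frequent name wins).
canonical = toy.groupby("asins")["name"].agg(most_common_name)

# Fill only the missing names, looking each one up through its ASIN.
toy["name_filled"] = toy["name"].fillna(toy["asins"].map(canonical))
print(toy)
```

Rows whose ASIN never appears with a name would stay missing under this approach, so the count of remaining gaps is worth checking afterwards.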
Visualizing the distributions of numerical variables:
data.hist(bins=50, figsize=(20,15))
plt.show()

Outliers in this case are valuable, so we may want to give more weight to reviews that more than 50 people found helpful.
The majority of examples were rated highly (see the rating distribution): there are twice as many 5-star ratings as all other ratings combined.
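The idea of weighting helpful reviews could be sketched roughly as follows. The column name `reviews.numHelpful` and the log-based formula are assumptions made for illustration; the article does not specify how the weights would be computed.

```python
import numpy as np
import pandas as pd

# Toy helpful-vote counts; "reviews.numHelpful" is an assumed column name.
toy = pd.DataFrame({"reviews.numHelpful": [0, 3, 60, np.nan]})

# Treat a missing count as zero helpful votes.
helpful = toy["reviews.numHelpful"].fillna(0)

# Assumed scheme: every review starts at weight 1, and helpful votes add a
# logarithmic bonus so that a handful of very popular reviews do not dominate.
toy["weight"] = 1.0 + np.log1p(helpful)
print(toy)
```

Many scikit-learn estimators accept such a column through the `sample_weight` argument of `fit`, which is one place these weights could later be used.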
Split the data into Train and Test
Before we explore the dataset, we will split it into a training set and a test set. Eventually our goal is to train a sentiment analysis classifier.
Since the majority of reviews are positive (5 stars), we will do a stratified split on the review rating so that the training and test sets keep the same rating proportions as the full dataset.
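from sklearn.model_selection import StratifiedShuffleSplit

print("Before {}".format(len(data)))
# Removes all NAN in reviews.rating; .copy() so the column assignment below is safe
dataAfter = data.dropna(subset=["reviews.rating"]).copy()
print("After {}".format(len(dataAfter)))
dataAfter["reviews.rating"] = dataAfter["reviews.rating"].astype(int)

split = StratifiedShuffleSplit(n_splits=5, test_size=0.2)
# split() yields positional indices, so rows are selected with .iloc;
# reindex() looks up index labels and would bring back all-NaN rows for dropped labels.
# Only the last of the 5 splits is kept.
for train_index, test_index in split.split(dataAfter, dataAfter["reviews.rating"]):
    strat_train = dataAfter.iloc[train_index].copy()
    strat_test = dataAfter.iloc[test_index].copy()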
#Output-
Before 34660
After 34627
We need to see if train and test sets were stratified proportionately in comparison to raw data:
print(len(strat_train))
print(len(strat_test))
print(strat_test["reviews.rating"].value_counts()/len(strat_test))
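To make the comparison with the raw data explicit, the proportions can be placed side by side. The sketch below uses an invented toy rating column and a fixed `random_state` (both assumptions for reproducibility) rather than the real dataset.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy ratings with a skew toward 5 stars; the counts are invented for illustration.
toy = pd.DataFrame({"reviews.rating": [5] * 60 + [4] * 20 + [3] * 10 + [2] * 5 + [1] * 5})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(split.split(toy, toy["reviews.rating"]))

# The split returns positional indices, hence .iloc.
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

# Share of each rating in the raw data, the training set and the test set.
comparison = pd.DataFrame({
    "raw": toy["reviews.rating"].value_counts(normalize=True),
    "train": train["reviews.rating"].value_counts(normalize=True),
    "test": test["reviews.rating"].value_counts(normalize=True),
}).sort_index(ascending=False)
print(comparison)
```

If the split is stratified correctly, the three columns should be nearly identical for every rating.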

Data Exploration (Training Set)
We will use regular expressions to clean out any unwanted characters in the dataset, and then preview what the data looks like after cleaning.
reviews = strat_train.copy()
reviews.head()
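A regular-expression cleaning step might look something like the sketch below. Which characters count as unwanted is an assumption here (everything except letters, digits and whitespace), and `clean_text` is a hypothetical helper name, not a function from the original project.

```python
import re

def clean_text(text):
    """Lowercase, replace anything but letters, digits and whitespace, collapse spaces."""
    text = text.lower()
    # Replace punctuation and symbols with a space so words do not get glued together.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace (including \r\n) into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Great  tablet!!! Works\r\nperfectly :)"))
# prints: great tablet works perfectly
```

On a DataFrame, the same function could be applied to a text column with `.astype(str).apply(clean_text)`; the exact column to clean depends on the dataset.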

print(len(reviews["name"].unique()), len(reviews["asins"].unique()))
print(reviews.info())
print(reviews.groupby("asins")["name"].unique())

Let's see all the different names recorded for the product listing that has 2 ASINs:
different_names = reviews[reviews["asins"] == "B00L9EPT8O,B01E6AO69U"]["name"].unique()
for name in different_names:
    print(name)
print(reviews[reviews["asins"] == "B00L9EPT8O,B01E6AO69U"]["name"].value_counts())
#Output-
Echo (White),,,
Echo (White),,,
Amazon Fire Tv,,,
Amazon Fire Tv,,,
nan
Amazon - Amazon Tap Portable Bluetooth and Wi-Fi Speaker - Black,,,
Amazon - Amazon Tap Portable Bluetooth and Wi-Fi Speaker - Black,,,
Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum,,,
Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum,,,
Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
Amazon Kindle Fire 5ft USB to Micro-USB Cable (works with most Micro-USB Tablets),,,
Amazon Kindle Fire 5ft USB to Micro-USB Cable (works with most Micro-USB Tablets),,,
Kindle Dx Leather Cover, Black (fits 9.7 Display, Latest and 2nd Generation Kindle Dxs),,
Amazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,
Amazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,
Amazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,
Amazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
New Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,
New Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,
Amazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
Amazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
Echo (White),,,
Fire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Tangerine"
Echo (Black),,,
Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,
Echo (Black),,,
Echo (Black),,,
Amazon Fire Tv,,,
Kindle Dx Leather Cover, Black (fits 9.7 Display, Latest and 2nd Generation Kindle Dxs)",,
New Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,

Echo (White),,,\r\nEcho (White),,,    2318
Amazon Fire Tv,,,\r\nAmazon Fire Tv,,,    2029
Amazon - Amazon Tap Portable Bluetooth and Wi-Fi Speaker - Black,,,\r\nAmazon - Amazon Tap Portable Bluetooth and Wi-Fi Speaker - Black,,,    259
Amazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum,,,\r\nAmazon Fire Hd 10 Tablet, Wi-Fi, 16 Gb, Special Offers - Silver Aluminum,,,    106
Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,\r\nAmazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,    28
Kindle Dx Leather Cover, Black (fits 9.7 Display, Latest and 2nd Generation Kindle Dxs),,    7
Amazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,\r\nAmazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,    5
Amazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,\r\nAmazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,    5
New Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,\r\nNew Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,    5
Amazon Kindle Fire 5ft USB to Micro-USB Cable (works with most Micro-USB Tablets),,,\r\nAmazon Kindle Fire 5ft USB to Micro-USB Cable (works with most Micro-USB Tablets),,,    4
Echo (Black),,,\r\nEcho (Black),,,    3
Echo (White),,,\r\nFire Tablet, 7 Display, Wi-Fi, 8 GB - Includes Special Offers, Tangerine"    1
Amazon Fire Hd 6 Standing Protective Case(4th Generation - 2014 Release), Cayenne Red,,,\r\nAmazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,    1
Echo (Black),,,\r\nAmazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,    1
New Amazon Kindle Fire Hd 9w Powerfast Adapter Charger + Micro Usb Angle Cable,,,\r\n    1
Amazon Fire Tv,,,\r\nKindle Dx Leather Cover, Black (fits 9.7 Display, Latest and 2nd Generation Kindle Dxs)",,    1
Name: name, dtype: int64
The output confirmed that each ASIN can have multiple names. Therefore we should only really concern ourselves with which ASINs do well, not the product names.
fig = plt.figure(figsize=(16,10))
ax1 = plt.subplot(211)
ax2 = plt.subplot(212, sharex = ax1)
reviews["asins"].value_counts().plot(kind="bar", ax=ax1, title="ASIN Frequency")
np.log10(reviews["asins"].value_counts()).plot(kind="bar", ax=ax2, title="ASIN Frequency (Log10 Adjusted)")
plt.show()

Entire training dataset average rating
print(reviews["reviews.rating"].mean())

asins_count_ix = reviews["asins"].value_counts().index
plt.subplots(2,1,figsize=(16,12))
plt.subplot(2,1,1)
reviews["asins"].value_counts().plot(kind="bar", title="ASIN Frequency")
plt.subplot(2,1,2)
sns.pointplot(x="asins", y="reviews.rating", order=asins_count_ix, data=reviews)
plt.xticks(rotation=90)
plt.show()
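The same per-ASIN view can also be read as a table, which helps when an average rests on very few reviews. This is a minimal sketch on invented toy rows; the column names `n_reviews` and `mean_rating` are chosen here for illustration.

```python
import pandas as pd

# Toy reviews; ASIN labels and ratings are invented for illustration.
toy = pd.DataFrame({
    "asins": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "reviews.rating": [5, 4, 5, 3, 4, 1],
})

# Review count and mean rating per ASIN, most-reviewed ASINs first.
summary = (
    toy.groupby("asins")["reviews.rating"]
    .agg(n_reviews="count", mean_rating="mean")
    .sort_values("n_reviews", ascending=False)
)
print(summary)
```

Reading the two columns together shows which averages are backed by many reviews and which come from only one or two.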

Sentiment Analysis
Using the features in place, we will build a classifier that can determine a review’s sentiment.
def sentiments(rating):
    if (rating == 5) or (rating == 4):
        return "Positive"
    elif rating == 3:
        return "Neutral"
    elif (rating == 2) or (rating == 1):
        return "Negative"

# Add sentiments to the data
strat_train["Sentiment"] = strat_train["reviews.rating"].apply(sentiments)
strat_test["Sentiment"] = strat_test["reviews.rating"].apply(sentiments)
print(strat_train["Sentiment"][:20])
#Output-
4349     Positive
30776    Positive
28775     Neutral
1136     Positive
17803    Positive
7336     Positive
32638    Positive
13995    Positive
6728     Negative
22009    Positive
11047    Positive
22754    Positive
5578     Positive
11673    Positive
19168    Positive
14903    Positive
30843    Positive
5440     Positive
28940    Positive
31258    Positive
Name: Sentiment, dtype: object
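With the sentiment labels in place, a classifier could be trained on the review text. The sketch below is one possible setup and not necessarily the model used in the full project: it assumes TF-IDF features with logistic regression, and the toy sentences and labels are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented corpus standing in for the review text and Sentiment columns.
texts = [
    "love it great tablet",
    "great sound love it",
    "terrible broke after a week",
    "awful waste of money",
    "it is okay nothing special",
    "average works fine",
]
labels = ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"]

# Assumed model choice: TF-IDF word features fed into logistic regression.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

prediction = model.predict(["love this great tablet"])[0]
print(prediction)
```

On the real data, the training text and labels would come from `strat_train` and the evaluation from `strat_test`; because Positive reviews dominate, per-class metrics would be more informative than plain accuracy.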