
A supermarket is a self-service shop offering a wide variety of food, beverages and household products, organized into sections. It is larger and has a wider selection than earlier grocery stores, but is smaller and more limited in its range of merchandise than a hypermarket or big-box market.
In this data science project, I use different techniques to analyse a supermarket sales data set.
What will you discover from this analysis?
1. The relation of customers with the supermarket.
2. The payment methods used in the supermarket.
3. The relation between products and quantities.
4. The types of product and their sales.
5. The products and their ratings.
Let’s start by importing the libraries.
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
You can download the data set you need for this project from here:
data = pd.read_csv("market.csv")
print(data.shape)
#Output
(1000, 17)
data.head()

Data Cleaning
data.isnull().sum()
#Output
Invoice ID                 0
Branch                     0
City                       0
Customer type              0
Gender                     0
Product line               0
Unit price                 0
Quantity                   0
Tax 5%                     0
Total                      0
Date                       0
Time                       0
Payment                    0
cogs                       0
gross margin percentage    0
gross income               0
Rating                     0
dtype: int64
There are no missing values, so the data set needs no cleaning and we can continue with data visualization.
Checking the data set information
data.info()
#Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Invoice ID               1000 non-null   object
 1   Branch                   1000 non-null   object
 2   City                     1000 non-null   object
 3   Customer type            1000 non-null   object
 4   Gender                   1000 non-null   object
 5   Product line             1000 non-null   object
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object
 11  Time                     1000 non-null   object
 12  Payment                  1000 non-null   object
 13  cogs                     1000 non-null   float64
 14  gross margin percentage  1000 non-null   float64
 15  gross income             1000 non-null   float64
 16  Rating                   1000 non-null   float64
dtypes: float64(7), int64(1), object(9)
memory usage: 132.9+ KB
data.describe()

Checking number of rows and columns
print("Dataset contains {} rows and {} columns".format(data.shape[0], data.shape[1]))
#Output
Dataset contains 1000 rows and 17 columns
Visualization
Now we use different visualization tools to look at different aspects of the supermarket's sales.
Let’s start with the gender count.
plt.figure(figsize=(14, 6))
plt.style.use('fivethirtyeight')
# Pass the column as x=...; recent seaborn versions no longer accept it positionally
ax = sns.countplot(x='Gender', data=data, palette='copper')
ax.set_xlabel(xlabel="Gender", fontsize=18)
ax.set_ylabel(ylabel="Gender count", fontsize=18)
ax.set_title(label="Gender count in supermarket", fontsize=20)
plt.show()

Here we can see that the numbers of male and female customers are almost equal. A plot alone can be misleading, so let’s cross-check it against the total sales for each gender.
data.groupby(['Gender']).agg({'Total': 'sum'})
#Output
             Total
Gender
Female  167882.925
Male    155083.824
The totals for the two groups are also fairly close, so the visualization looks reasonable. Let’s carry on.
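The count plot shows the number of transactions per gender, whereas the groupby sums the Total column, so the two answer slightly different questions. A minimal sketch of checking the counts themselves is given here. The `sample` DataFrame is a small made-up stand-in for `data`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the supermarket DataFrame (not the real data)
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Total": [100.0, 50.0, 25.0, 75.0, 10.0],
})

# Number of rows (transactions) per gender: the quantity a count plot draws
gender_counts = sample["Gender"].value_counts()

# Share of transactions per gender, as fractions that sum to 1
gender_share = sample["Gender"].value_counts(normalize=True)

print(gender_counts)
print(gender_share)
```

On the real data set, replacing `sample` with `data` should give the bar heights of the count plot directly.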
Customer type
plt.style.use('ggplot')
plt.figure(figsize=(14, 6))
ax = sns.countplot(x="Customer type", data=data, palette="rocket_r")
ax.set_title("Type of customers", fontsize=25)
ax.set_xlabel("Customer type", fontsize=16)
ax.set_ylabel("Customer Count", fontsize=16)

The two bars look almost identical, so let’s check the numeric data.
data.groupby(['Customer type']).agg({'Total': 'sum'})
#Output
                    Total
Customer type
Member         164223.444
Normal         158743.305
Above we can see the types of customer across all branches combined. Now let’s check each branch separately.
plt.figure(figsize=(14, 6))
plt.style.use('classic')
ax = sns.countplot(x="Customer type", hue="Branch", data=data, palette="rocket_r")
ax.set_title(label="Customer type in different branches", fontsize=25)
# The x axis holds the customer type; the branches are shown by the hue
ax.set_xlabel(xlabel="Customer type", fontsize=16)
ax.set_ylabel(ylabel="Customer Count", fontsize=16)
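The grouped count plot can also be read as a table. The sketch below uses `pd.crosstab` to count transactions for each combination of customer type and branch. The `sample` rows are made up for illustration and are not taken from the data set.

```python
import pandas as pd

# Hypothetical rows standing in for the real data
sample = pd.DataFrame({
    "Customer type": ["Member", "Normal", "Member", "Normal", "Member", "Normal"],
    "Branch": ["A", "A", "B", "B", "C", "A"],
})

# Rows: customer type, columns: branch, cells: number of transactions
customer_by_branch = pd.crosstab(sample["Customer type"], sample["Branch"])

print(customer_by_branch)
```

Each cell corresponds to the height of one bar in the grouped plot, which makes small differences between branches easier to compare than by eye.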

Checking the different payment methods used.
plt.figure(figsize=(14, 6))
ax = sns.countplot(x="Payment", data=data, palette="tab20")
ax.set_title(label="Payment methods of customers", fontsize=25)
ax.set_xlabel(xlabel="Payment method", fontsize=16)
ax.set_ylabel(ylabel="Customer Count", fontsize=16)

Payment method distribution in all branches
plt.figure(figsize=(14, 6))
plt.style.use('classic')
ax = sns.countplot(x="Payment", hue="Branch", data=data, palette="tab20")
ax.set_title(label="Payment distribution in all branches", fontsize=25)
ax.set_xlabel(xlabel="Payment method", fontsize=16)
ax.set_ylabel(ylabel="People Count", fontsize=16)

Now let’s see the rating distribution in the three branches.
plt.figure(figsize=(14, 6))
ax = sns.boxplot(x="Branch", y="Rating", data=data, palette="RdYlBu")
ax.set_title("Rating distribution between branches", fontsize=25)
ax.set_xlabel(xlabel="Branches", fontsize=16)
ax.set_ylabel(ylabel="Rating distribution", fontsize=16)

We can see that the typical rating (the center line of each box, which marks the median) is above 7 for branches A and C and below 7 for branch B.
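A box plot marks the median rather than the mean, so it can help to compute both per branch. The following is a minimal sketch on a made-up `sample` DataFrame. The ratings are invented for illustration and do not come from the data set.

```python
import pandas as pd

# Hypothetical ratings standing in for the real data
sample = pd.DataFrame({
    "Branch": ["A", "A", "A", "B", "B", "B"],
    "Rating": [6.0, 7.0, 9.5, 5.0, 6.0, 10.0],
})

# Mean and median rating per branch, side by side
rating_summary = sample.groupby("Branch")["Rating"].agg(["mean", "median"])

print(rating_summary)
```

In this toy example the mean and median differ noticeably for each branch, which is why it matters which of the two a sentence about the "average" refers to.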
Max sales time
data["Time"] = pd.to_datetime(data["Time"])
data["Hour"] = data["Time"].dt.hour
plt.figure(figsize=(14, 6))
plt.style.use('classic')
SalesTime = sns.lineplot(x="Hour", y="Quantity", data=data).set_title("Product sales per hour")

The line plot shows the mean quantity per transaction for each hour, and it peaks at around 14:00 local time.
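By default, `sns.lineplot` aggregates repeated x values with the mean, so the plot reflects the average quantity per transaction rather than the total sold in each hour. A minimal sketch of totalling the quantity per hour follows. The `sample` rows are made up, and the `%H:%M` time format is an assumption about how the Time column is stored.

```python
import pandas as pd

# Hypothetical transactions standing in for the real data
sample = pd.DataFrame({
    "Time": ["10:29", "13:08", "14:05", "14:42", "19:21"],
    "Quantity": [7, 5, 8, 6, 10],
})

# Parse with an explicit format (assumes HH:MM strings) and extract the hour
sample["Hour"] = pd.to_datetime(sample["Time"], format="%H:%M").dt.hour

# Total quantity sold in each hour, and the hour with the largest total
hourly_quantity = sample.groupby("Hour")["Quantity"].sum()
busiest_hour = hourly_quantity.idxmax()

print(hourly_quantity)
print("Busiest hour:", busiest_hour)
```

Comparing the hourly totals with the hourly means shows whether a peak comes from more transactions or from larger baskets.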
Rating vs sales
plt.figure(figsize=(14, 6))
plt.style.use('classic')
rating_vs_sales = sns.lineplot(x="Total", y="Rating", data=data)

Using boxen plot
plt.figure(figsize=(10, 6))
plt.style.use('classic')
ax = sns.boxenplot(x="Quantity", y="Product line", data=data)
ax.set_title(label="Average sales of different lines of products", fontsize=25)
ax.set_xlabel(xlabel="Quantity Sales", fontsize=16)
ax.set_ylabel(ylabel="Product Line", fontsize=16)

Here we can see the average sales of the different product lines. Health and beauty makes the highest sales, whereas Fashion accessories makes the lowest.
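A ranking read off a boxen plot can be confirmed with a per-line summary table. The sketch below computes the mean, median and total quantity for each product line and sorts by the mean. The `sample` rows and their quantities are invented for illustration, so the resulting order says nothing about the real data.

```python
import pandas as pd

# Hypothetical rows standing in for the real data
sample = pd.DataFrame({
    "Product line": ["Health and beauty", "Health and beauty",
                     "Fashion accessories", "Fashion accessories",
                     "Sports and travel"],
    "Quantity": [8, 6, 3, 5, 6],
})

# Mean, median and total quantity per product line, highest mean first
quantity_by_line = (
    sample.groupby("Product line")["Quantity"]
    .agg(["mean", "median", "sum"])
    .sort_values("mean", ascending=False)
)

print(quantity_by_line)
```

Keeping the sum next to the mean helps separate product lines that sell often from those that sell in large quantities per transaction.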
Let’s see the sales count of these products.
plt.figure(figsize=(14, 6))
ax = sns.countplot(y='Product line', data=data, order=data['Product line'].value_counts().index)
ax.set_title(label="Sales count of products", fontsize=25)
ax.set_xlabel(xlabel="Sales count", fontsize=16)
ax.set_ylabel(ylabel="Product Line", fontsize=16)

We can see the top-selling product lines from the above figure.
Total sales of product using boxenplot
plt.figure(figsize=(14, 6))
plt.style.use('classic')
ax = sns.boxenplot(y="Product line", x="Total", data=data)
ax.set_title(label="Total sales of product", fontsize=25)
ax.set_xlabel(xlabel="Total sales", fontsize=16)
ax.set_ylabel(ylabel="Product Line", fontsize=16)

Now let’s see the average ratings of the products.
plt.figure(figsize=(14, 6))
plt.style.use('classic')
ax = sns.boxenplot(y="Product line", x="Rating", data=data)
ax.set_title("Average rating of product line", fontsize=25)
ax.set_xlabel("Rating", fontsize=16)
ax.set_ylabel("Product line", fontsize=16)

Product sales on the basis of gender
plt.style.use('classic')
plt.figure(figsize=(14, 6))
ax = sns.stripplot(y="Product line", x="Total", hue="Gender", data=data)
ax.set_title(label="Product sales on the basis of gender")
ax.set_xlabel(xlabel="Total sales of products")
ax.set_ylabel(ylabel="Product Line")

Product and gross income
plt.style.use('classic')
# relplot is a figure-level function: it creates its own figure and returns a
# FacetGrid, so the size is set with height/aspect and labels through the grid
g = sns.relplot(y="Product line", x="gross income", data=data, height=6, aspect=14/6)
g.set_axis_labels("Total gross income", "Product line")
g.fig.suptitle("Products and Gross income")
