Many Business activities are seasonal in nature, where most of the business are dependent on a particular time of festival and holidays. Every business uses sales promotion techniques to increase the demand for their products and services, in order to stay in the market for a longer period. In this article, I am going to do sales forecasting with machine learning by analyzing the historical data with techniques like Time Series Forecasting.
Sales Forecast with Time Series Forecasting
The data I will use here to predict sales, is a weekly sales data of nine stores and three products. At the end of this article, I will predict sales for next 50 weeks, now to move further with time series forecasting you can download this data that I will use below.
Now, lets start with importing the standard libraries and reading the dataset:
import plotly.express as px
from fbprophet import Prophet
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.distributions.empirical_distribution import ECDF
import datetime
import pandas as pd
import numpy as np
df = pd.read_csv('Sales_Product_Price_by_Store.csv')
df['Date'] = pd.to_datetime(df['Date'])
df['weekly_sales'] = df['Price'] * df['Weekly_Units_Sold']
df.head()
Code language: Python (python)
Store | Product | Date | Is_Holiday | Base Price | Price | Weekly_Units_Sold | weekly_sales | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | False | 9.99 | 7.99 | 245 | 1957.55 |
1 | 1 | 1 | 2010-02-12 | True | 9.99 | 7.99 | 453 | 3619.47 |
2 | 1 | 1 | 2010-02-19 | False | 9.99 | 7.99 | 409 | 3267.91 |
3 | 1 | 1 | 2010-02-26 | False | 9.99 | 7.99 | 191 | 1526.09 |
4 | 1 | 1 | 2010-03-05 | False | 9.99 | 9.99 | 145 | 1448.55 |
df.set_index('Date', inplace=True)
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['week_of_year'] = df.index.weekofyear
df.head()
Code language: Python (python)
Store | Product | Is_Holiday | Base Price | Price | Weekly_Units_Sold | weekly_sales | year | month | day | week_of_year | |
---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||
2010-02-05 | 1 | 1 | False | 9.99 | 7.99 | 245 | 1957.55 | 2010 | 2 | 5 | 5 |
2010-02-12 | 1 | 1 | True | 9.99 | 7.99 | 453 | 3619.47 | 2010 | 2 | 12 | 6 |
2010-02-19 | 1 | 1 | False | 9.99 | 7.99 | 409 | 3267.91 | 2010 | 2 | 19 | 7 |
2010-02-26 | 1 | 1 | False | 9.99 | 7.99 | 191 | 1526.09 | 2010 | 2 | 26 | 8 |
2010-03-05 | 1 | 1 | False | 9.99 | 9.99 | 145 | 1448.55 | 2010 | 3 | 5 | 9 |
Exploratory Data Analysis
To get some insights about the continuous variables in data, I will plot and empirical distribution function (ECDF):
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style = "ticks")
c = '#386B7F'
figure, axes = plt.subplots(nrows=2, ncols=2)
figure.tight_layout(pad=2.0)
plt.subplot(211)
cdf = ECDF(df['Weekly_Units_Sold'])
plt.plot(cdf.x, cdf.y, label = "statmodels", color = c);
plt.xlabel('Weekly Units Sold'); plt.ylabel('ECDF');
plt.subplot(212)
cdf = ECDF(df['weekly_sales'])
plt.plot(cdf.x, cdf.y, label = "statmodels", color = c);
plt.xlabel('Weekly sales');
Code language: Python (python)

The figure above clearly shows that, in a best week for sales, a store managed to sell 2500 units, but about 80 percent of the time, the weekly sales did not crossed 500 units.
To see this with numbers let’s look at the statistics of our sales data:
df.groupby('Store')['weekly_sales'].describe()
Code language: Python (python)
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Store | ||||||||
1 | 429.0 | 1789.414172 | 900.074226 | 769.65 | 1208.90 | 1659.17 | 1957.20 | 6816.59 |
2 | 429.0 | 2469.447413 | 1328.162884 | 1143.48 | 1579.21 | 2215.08 | 2756.55 | 9110.00 |
3 | 429.0 | 670.924009 | 366.816321 | 229.77 | 459.77 | 619.69 | 730.78 | 2650.00 |
4 | 429.0 | 3078.462145 | 1746.147872 | 1099.45 | 1818.18 | 2626.61 | 3837.51 | 13753.12 |
5 | 429.0 | 588.922984 | 242.628977 | 285.87 | 461.23 | 519.74 | 613.53 | 2264.97 |
6 | 429.0 | 2066.705082 | 1163.284768 | 890.19 | 1418.58 | 1758.40 | 2156.40 | 7936.03 |
7 | 429.0 | 955.115058 | 489.084883 | 389.61 | 649.35 | 857.61 | 1041.51 | 3270.00 |
8 | 429.0 | 1352.094056 | 811.326288 | 516.53 | 846.23 | 1275.87 | 1491.51 | 6656.67 |
10 | 429.0 | 4093.407249 | 3130.087191 | 1483.65 | 2462.88 | 3707.81 | 4510.47 | 25570.00 |
df.groupby('Store')['Weekly_Units_Sold'].sum()
Code language: Python (python)
Store 1 86699 2 121465 3 31689 4 158718 5 27300 6 97698 7 44027 8 65273 10 200924 Name: Weekly_Units_Sold, dtype: int64
Based on the above statistics, we can clearly see that, the store 10 has the highest average weekly sales, and store 5 has the lower average weekly sales among all the stores. The statistics say that store 10 has the most total weekly sales which simply convey that store 10 is the most crowded store among all the stores.
g = sns.FacetGrid(df, col="Is_Holiday", height=4, aspect=.8)
g.map(sns.barplot, "Product", "Price")
Code language: Python (python)

g = sns.FacetGrid(df, col="Is_Holiday", height=4, aspect=.8)
g.map(sns.barplot, "Product", "Weekly_Units_Sold")
Code language: Python (python)

Product 2 is the cheapest product among these three products, so, it sells the most. Product 3 is the most expensive product among these three. Product price did not change during holidays.
Because we have recorded holidays sales, so we will analyze if holiday also contributed to the sales.
g = sns.FacetGrid(df, row="Is_Holiday",
height=1.7, aspect=4,)
g.map(sns.distplot, "Weekly_Units_Sold", hist=False, rug=True)
Code language: Python (python)

sns.factorplot(data= df,
x= 'Is_Holiday',
y= 'Weekly_Units_Sold',
hue= 'Store')
Code language: Python (python)

sns.factorplot(data= df,
x= 'Is_Holiday',
y= 'Weekly_Units_Sold',
hue= 'Product')
Code language: Python (python)

From the above figures we can see that holidays do not have a positive impact on the business. For most of the stores, weekly unit sales on the holidays is as same as the normal days, while store 10 also face a decrease in sales during the holidays.
Weekly units sold for product 1 had a slightly increase during the holidays, while product 2 and product 3 had a decrease during the holidays.
g = sns.FacetGrid(df, col="Product", row="Is_Holiday", margin_titles=True, height=3)
g.map(plt.scatter, "Price", "Weekly_Units_Sold", color="#338844", edgecolor="white", s=50, lw=1)
g.set(xlim=(0, 30), ylim=(0, 2600));
Code language: Python (python)

Every product has more than one price, both in holidays and normal days. One price is regular price, and another is a promotional price. However, the price gap for product 3 is huge, it was slashed to almost 50% off during promotions.
Product 3 made the most sales during normal days.
g = sns.FacetGrid(df, col="Store", hue="Product", margin_titles=True, col_wrap=3)
g.map(plt.scatter, 'Price', 'Weekly_Units_Sold', alpha=.7)
g.add_legend()
Code language: Python (python)

All the stores have the similar price promotion pattern, for some reason, Store 10 sells the most during the promotions. All the products have the regular price and promotion price. Product 3 has the highest discount and sells the most during the promotions.
df.groupby(['Product', 'promotion'])['Price', 'Weekly_Units_Sold'].mean()
Code language: Python (python)

Now, let’s create a heatmap for concluding our all observations:
corr_all = df.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr_all, dtype = np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (11, 9))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_all, mask = mask,
square = True, linewidths = .5, ax = ax, cmap = "BuPu")
plt.show();
Code language: Python (python)

We have a strong positive correlation between price and Base price, weekly units sold and weekly sales, base price and product, price and product. We can also observe a positive correlation between month and week of the year.
Observations of our EDA:
- The most selling and crowded Store is Store 10, and the least crowded store is Store 5.
- In terms of number of units sold, the most selling product is product 2. In terms of sales dollars, Product 3 posts the highest sales during normal days.
- Stores do not necessarily run product promotions during holidays. Holidays do not seem to have an impact on Stores’ performance.
- Product 1 sells a little more during holidays, however, Product 2 and Product 3 sell less on holidays.
- Product 2 seems to be the cheapest product, and Product 3 is the most expensive product.
- Most stores have some kind of seasonality and they make the highest sales around July.
- Product 1 sells a little more in February than the other months, Product 2 sells the most around April and July, and Product 3 sells the most around July.
- In general, product 2 sells the most at Store 10, but in July, Product 3 has the highest sales in this store.
- Each product has its regular price and promotional price. There isn’t significant gap between regular price and promotional price on Product 1 and Product 2, however, Product 3’s promotional price can be slashed to 50% of its original price. Although every store makes this kind of price cut for product 3, Store 10 is the one made the highest sales during the price cut.
- It is nothing unusual to sell more during promotion than the normal days. Store 10’s made Product 3 the best selling product around July.
Time Series Forecasting and Sales Prediction
Now let’s move to the Time Series Forecasting Part of this article, here we will forecast sales, according to our above observations of exploratory data analysis.
# store types
sales_1 = df[df.Store == 1]['weekly_sales']
sales_2 = df[df.Store == 2]['weekly_sales']
sales_3 = df[df.Store == 3]['weekly_sales']
sales_4 = df[df.Store == 4]['weekly_sales']
sales_5 = df[df.Store == 5]['weekly_sales']
sales_6 = df[df.Store == 6]['weekly_sales']
sales_7 = df[df.Store == 7]['weekly_sales']
sales_8 = df[df.Store == 8]['weekly_sales']
sales_10 = df[df.Store == 10]['weekly_sales']
f, (ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8, ax9) = plt.subplots(9, figsize = (20, 15))
# store types
sales_1.plot(color = c, ax = ax1)
sales_2.plot(color = c, ax = ax2)
sales_3.plot(color = c, ax = ax3)
sales_4.plot(color = c, ax = ax4)
sales_5.plot(color = c, ax = ax5)
sales_6.plot(color = c, ax = ax6)
sales_7.plot(color = c, ax = ax7)
sales_8.plot(color = c, ax = ax8)
sales_10.plot(color = c, ax = ax9)
Code language: Python (python)

Time Series Forecasting
Time Series of the weekly sales:
store_10_pro_3 = df[(df.Store == 10) & (df.Product == 3)].loc[:, ['Base Price', 'Price', 'Weekly_Units_Sold', 'weekly_sales']]
store_10_pro_3.reset_index(level=0, inplace=True)
fig = px.line(store_10_pro_3, x='Date', y='weekly_sales')
fig.update_layout(title_text='Time Series of weekly sales')
fig.show()
Code language: Python (python)

Product 2’s seasonality at store 10 is obvious. The sales always peak between July and September during school holiday. Below we are implementing prophet model, forecasting the weekly sales for the future 50 weeks.
model = Prophet(interval_width = 0.95)
model.fit(store_10_pro_3)
future_dates = model.make_future_dataframe(periods = 50, freq='W')
future_dates.tail(7)
Code language: Python (python)
forecast = model.predict(future_dates)
# preditions for last week
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(7)
Code language: Python (python)
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
186 | 2013-08-25 | 7160.453669 | 4742.937710 | 9559.615673 |
187 | 2013-09-01 | 5542.434739 | 3249.762712 | 7887.321785 |
188 | 2013-09-08 | 3702.168377 | 1355.902566 | 5824.555193 |
189 | 2013-09-15 | 2427.279755 | 189.552142 | 4693.158976 |
190 | 2013-09-22 | 2386.972428 | 7.973471 | 4673.053027 |
191 | 2013-09-29 | 3020.451351 | 759.252236 | 5227.695107 |
192 | 2013-10-06 | 3157.655085 | 756.079499 | 5603.923897 |
model.plot(forecast)
Code language: Python (python)

model.plot_components(forecast)
Code language: Python (python)

metric_df = forecast.set_index('ds')[['yhat']].join(store_10_pro_3.set_index('ds').y).reset_index()
metric_df.dropna(inplace=True)
error = mean_squared_error(metric_df.y, metric_df.yhat)
print('The RMSE is {}'. format(sqrt(error)))
Code language: Python (python)
The RMSE is 1190.0962582193933
Also, Read – TensorFlow Tutorial for Machine Learning.
I hope you liked this article on Time Series Forecasting on Sales Prediction. Feel free to ask your questions about Time Series Forecasting and Analysis or any other topic that you want in the comments section below.
I am working through this one and I got caught up with this:
model = Prophet(interval_width = 0.95)
model.fit(store_10_pro_3)
future_dates = model.make_future_dataframe(periods = 50, freq=’W’)
future_dates.tail(7)
My output says:
“Dataframe must have columns “ds” and “y” with the dates and values respectively”
I checked around online and attempted to add a new variable to “rename” the columns but it did not work.
You can refer the same from this notebook – https://colab.research.google.com/drive/1p_MQBz3SdC5LXB54qwvaVw_tz3cwh0A_?usp=sharing
I suggest you to use google colab.