Time Series Forecasting

Many business activities are seasonal, with sales depending heavily on particular festivals and holidays. Businesses also run sales promotions to increase demand for their products and services and to stay competitive in the market. In this article, I will forecast sales with machine learning by analyzing historical data with time series forecasting.

Sales Forecast with Time Series Forecasting

The data I will use to predict sales is weekly sales data for nine stores and three products. By the end of this article, I will forecast sales for the next 50 weeks. To follow along, you can download the data that I use below.

Now, let's start by importing the standard libraries and reading the dataset:

import datetime
from math import sqrt

import numpy as np
import pandas as pd
import plotly.express as px
from prophet import Prophet  # the package was named fbprophet before version 1.0
from sklearn.metrics import mean_squared_error
from statsmodels.distributions.empirical_distribution import ECDF

df = pd.read_csv('Sales_Product_Price_by_Store.csv')
df['Date'] = pd.to_datetime(df['Date'])
# Weekly revenue = unit price x units sold
df['weekly_sales'] = df['Price'] * df['Weekly_Units_Sold']
df.head()
   Store  Product        Date  Is_Holiday  Base Price  Price  Weekly_Units_Sold  weekly_sales
0      1        1  2010-02-05       False        9.99   7.99                245       1957.55
1      1        1  2010-02-12        True        9.99   7.99                453       3619.47
2      1        1  2010-02-19       False        9.99   7.99                409       3267.91
3      1        1  2010-02-26       False        9.99   7.99                191       1526.09
4      1        1  2010-03-05       False        9.99   9.99                145       1448.55

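
The derived weekly_sales column is simply Price times Weekly_Units_Sold. As a quick sanity check, the sketch below recomputes it in plain Python for three rows copied from the table above. The helper name `weekly_revenue` is made up for this illustration and is not part of the original code.

```python
import math


def weekly_revenue(price, units_sold):
    """Revenue for one store/product/week: unit price times units sold."""
    return price * units_sold


# (Price, Weekly_Units_Sold, weekly_sales) copied from rows 0, 1 and 4 of df.head()
rows = [
    (7.99, 245, 1957.55),
    (7.99, 453, 3619.47),
    (9.99, 145, 1448.55),
]

for price, units, expected in rows:
    computed = weekly_revenue(price, units)
    # Compare with a tolerance because prices are stored as binary floats
    matches = math.isclose(computed, expected, abs_tol=0.005)
    print(f"{price} x {units} = {computed:.2f} (matches table: {matches})")
```
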
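df.set_index('Date', inplace=True)
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
# ISO week number; DatetimeIndex.weekofyear was removed in pandas 2.0
df['week_of_year'] = df.index.isocalendar().week.astype(int).to_numpy()
df.head()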
            Store  Product  Is_Holiday  Base Price  Price  Weekly_Units_Sold  weekly_sales  year  month  day  week_of_year
Date
2010-02-05      1        1       False        9.99   7.99                245       1957.55  2010      2    5             5
2010-02-12      1        1        True        9.99   7.99                453       3619.47  2010      2   12             6
2010-02-19      1        1       False        9.99   7.99                409       3267.91  2010      2   19             7
2010-02-26      1        1       False        9.99   7.99                191       1526.09  2010      2   26             8
2010-03-05      1        1       False        9.99   9.99                145       1448.55  2010      3    5             9
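
The week_of_year values above are consistent with ISO 8601 week numbering. As a small illustration using only the standard library, the sketch below derives the same four calendar features for two of the dates shown. The function name `calendar_features` is hypothetical.

```python
from datetime import date


def calendar_features(d):
    """Return (year, month, day, ISO week number) for a date."""
    iso_week = d.isocalendar()[1]  # second element is the ISO week number
    return d.year, d.month, d.day, iso_week


# Two dates taken from the table above
first_row = calendar_features(date(2010, 2, 5))
fifth_row = calendar_features(date(2010, 3, 5))
print(first_row)
print(fifth_row)
```
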

Exploratory Data Analysis

To get some insight into the continuous variables in the data, I will plot an empirical cumulative distribution function (ECDF):

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="ticks")
c = '#386B7F'  # line color, reused in later plots

# One panel per variable, stacked vertically
figure, axes = plt.subplots(nrows=2, ncols=1)
figure.tight_layout(pad=2.0)

cdf = ECDF(df['Weekly_Units_Sold'])
axes[0].plot(cdf.x, cdf.y, label="statsmodels", color=c)
axes[0].set_xlabel('Weekly Units Sold')
axes[0].set_ylabel('ECDF')

cdf = ECDF(df['weekly_sales'])
axes[1].plot(cdf.x, cdf.y, label="statsmodels", color=c)
axes[1].set_xlabel('Weekly sales')

Figure: ECDF of weekly units sold (top) and weekly sales (bottom)

The figure above shows that in the best week for sales a store managed to sell about 2,500 units, but about 80 percent of the time weekly unit sales did not exceed 500.
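
To make the "80 percent of the time" reading concrete: an ECDF evaluated at x is the share of observations less than or equal to x. The sketch below builds one by hand on made-up numbers (the `units` list is hypothetical and not taken from the dataset), so treat it as an illustration of the idea rather than of the statsmodels implementation.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the share of observations <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


# Hypothetical weekly unit counts, NOT taken from the dataset
units = [120, 150, 200, 245, 300, 409, 453, 480, 900, 2500]
F = ecdf(units)
share_at_or_below_500 = F(500)
print(share_at_or_below_500)
```

Reading the plotted curve at 500 on the x-axis gives the same kind of number for the real data.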

To see this with numbers let’s look at the statistics of our sales data:

df.groupby('Store')['weekly_sales'].describe()

       count         mean          std      min      25%      50%      75%       max
Store
1      429.0  1789.414172   900.074226   769.65  1208.90  1659.17  1957.20   6816.59
2      429.0  2469.447413  1328.162884  1143.48  1579.21  2215.08  2756.55   9110.00
3      429.0   670.924009   366.816321   229.77   459.77   619.69   730.78   2650.00
4      429.0  3078.462145  1746.147872  1099.45  1818.18  2626.61  3837.51  13753.12
5      429.0   588.922984   242.628977   285.87   461.23   519.74   613.53   2264.97
6      429.0  2066.705082  1163.284768   890.19  1418.58  1758.40  2156.40   7936.03
7      429.0   955.115058   489.084883   389.61   649.35   857.61  1041.51   3270.00
8      429.0  1352.094056   811.326288   516.53   846.23  1275.87  1491.51   6656.67
10     429.0  4093.407249  3130.087191  1483.65  2462.88  3707.81  4510.47  25570.00

df.groupby('Store')['Weekly_Units_Sold'].sum()
Store
1      86699
2     121465
3      31689
4     158718
5      27300
6      97698
7      44027
8      65273
10    200924
Name: Weekly_Units_Sold, dtype: int64

Based on the above statistics, Store 10 has the highest average weekly sales and Store 5 has the lowest. Store 10 also sold the most units in total (200,924), which suggests that it is the busiest of all the stores.

g = sns.FacetGrid(df, col="Is_Holiday", height=4, aspect=.8)
g.map(sns.barplot, "Product", "Price")

Figure: average price per product, split by Is_Holiday

g = sns.FacetGrid(df, col="Is_Holiday", height=4, aspect=.8)
g.map(sns.barplot, "Product", "Weekly_Units_Sold")

Figure: average weekly units sold per product, split by Is_Holiday

Product 2 is the cheapest of the three products and sells the most units. Product 3 is the most expensive. Product prices did not change during holidays.

Because the data records which weeks were holidays, we can check whether holidays also contributed to sales.

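g = sns.FacetGrid(df, row="Is_Holiday", height=1.7, aspect=4)
# kdeplot plus rugplot replaces the deprecated sns.distplot(hist=False, rug=True)
g.map(sns.kdeplot, "Weekly_Units_Sold")
g.map(sns.rugplot, "Weekly_Units_Sold")

Figure: distribution of weekly units sold, holiday vs. non-holiday weeks

# sns.factorplot was renamed to catplot; kind='point' matches the old default
sns.catplot(data=df, x='Is_Holiday', y='Weekly_Units_Sold', hue='Store', kind='point')

Figure: weekly units sold by Is_Holiday, one line per store

sns.catplot(data=df, x='Is_Holiday', y='Weekly_Units_Sold', hue='Product', kind='point')

Figure: weekly units sold by Is_Holiday, one line per product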

From the above figures we can see that holidays do not have a positive impact on the business. For most stores, weekly unit sales in holiday weeks are about the same as in normal weeks, and Store 10 even saw a decrease in sales during the holidays.

Weekly units sold for Product 1 increased slightly during the holidays, while Product 2 and Product 3 saw a decrease.

g = sns.FacetGrid(df, col="Product", row="Is_Holiday", margin_titles=True, height=3)
g.map(plt.scatter, "Price", "Weekly_Units_Sold", color="#338844", edgecolor="white", s=50, lw=1)
g.set(xlim=(0, 30), ylim=(0, 2600))

Figure: weekly units sold against price, by product (columns) and Is_Holiday (rows)

Every product has more than one price, in both holiday and normal weeks: one is the regular price and the other is a promotional price. The price gap is largest for Product 3, which was slashed to almost 50% off during promotions.
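
The size of a price cut can be expressed as the fraction taken off the base price. The sketch below is a minimal illustration in plain Python. Both helper names are hypothetical, and treating any week with Price below Base Price as a promotion is my assumption, not a definition given in the data.

```python
def discount_depth(base_price, price):
    """Fraction taken off the base price; 0.0 means no price cut."""
    return (base_price - price) / base_price


def is_promotion(base_price, price):
    """Assumption: a week is promotional when Price is below Base Price."""
    return price < base_price


# Base Price and Price from the first row of the data shown earlier
depth = discount_depth(9.99, 7.99)
print(f"{depth:.1%} off, promotion: {is_promotion(9.99, 7.99)}")
```

A depth close to 0.5 would correspond to the "almost 50% off" cut described for Product 3.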

Product 3 made the most sales during normal days.

g = sns.FacetGrid(df, col="Store", hue="Product", margin_titles=True, col_wrap=3)
g.map(plt.scatter, 'Price', 'Weekly_Units_Sold', alpha=.7)
g.add_legend()

Figure: weekly units sold against price for each store, colored by product

All the stores have a similar price promotion pattern, but for some reason Store 10 sells the most during promotions. Every product has a regular price and a promotional price. Product 3 has the deepest discount and sells the most during promotions.

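df['promotion'] = (df['Price'] < df['Base Price']).astype(int)  # assumption: this column is never defined above; 1 marks weeks priced below Base Price
df.groupby(['Product', 'promotion'])[['Price', 'Weekly_Units_Sold']].mean()

Output: mean Price and Weekly_Units_Sold for each product, with and without promotion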

Now, let's create a heatmap to summarize our observations:

corr_all = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr_all, dtype=bool)  # plain bool works across NumPy versions, unlike the np.bool alias
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_all, mask=mask, square=True, linewidths=.5, ax=ax, cmap="BuPu")
plt.show()

Figure: correlation heatmap of all numeric columns

We have strong positive correlations between Price and Base Price, weekly units sold and weekly sales, Base Price and Product, and Price and Product. We can also observe a positive correlation between month and week of the year, which is expected because both measure position within the year.
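
The coefficients in the heatmap are Pearson correlations, which is what `DataFrame.corr()` computes by default. As a minimal sketch of the formula, the hypothetical helper `pearson_r` below is applied to the first four rows of the dataset, where the price is constant at 7.99, so weekly sales are an exact multiple of units sold and the correlation comes out at essentially 1.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Weekly_Units_Sold and weekly_sales from the first four rows of the data
units = [245, 453, 409, 191]
sales = [1957.55, 3619.47, 3267.91, 1526.09]
r = pearson_r(units, sales)
print(round(r, 6))
```

Across the full dataset the price varies, so the correlation between units and sales is strong but not perfect.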

Observations from our EDA:

  • The best-selling and busiest store is Store 10, and the least busy store is Store 5.
  • In terms of units sold, the best-selling product is Product 2. In terms of sales dollars, Product 3 posts the highest sales in normal weeks.
  • Stores do not necessarily run product promotions during holidays. Holidays do not seem to have an impact on store performance.
  • Product 1 sells a little more during holidays, whereas Product 2 and Product 3 sell less.
  • Product 2 is the cheapest product, and Product 3 is the most expensive.
  • Most stores show some seasonality and make their highest sales around July.
  • Product 1 sells a little more in February than in other months, Product 2 sells the most around April and July, and Product 3 sells the most around July.
  • In general, Product 2 sells the most at Store 10, but in July Product 3 has the highest sales in this store.
  • Each product has a regular price and a promotional price. The gap between the two is small for Product 1 and Product 2, but Product 3's promotional price can be slashed to 50% of its original price. Every store makes this kind of price cut for Product 3, but Store 10 is the one that made the highest sales during the price cut.
  • It is not unusual to sell more during a promotion than in normal weeks. Store 10 made Product 3 its best-selling product around July.

Time Series Forecasting and Sales Prediction

Now let's move to the time series forecasting part of this article, where we will forecast sales, guided by the observations from our exploratory data analysis.

# Weekly sales per store, one subplot each
stores = [1, 2, 3, 4, 5, 6, 7, 8, 10]
f, axes = plt.subplots(len(stores), figsize=(20, 15))

for ax, store in zip(axes, stores):
    df[df.Store == store]['weekly_sales'].plot(color=c, ax=ax)

Figure: weekly sales over time, one panel per store


Time Series of the weekly sales:

store_10_pro_3 = df[(df.Store == 10) & (df.Product == 3)].loc[:, ['Base Price', 'Price', 'Weekly_Units_Sold', 'weekly_sales']]
store_10_pro_3.reset_index(level=0, inplace=True)  # turn the Date index back into a column

fig = px.line(store_10_pro_3, x='Date', y='weekly_sales')
fig.update_layout(title_text='Time Series of weekly sales')
fig.show()

Figure: time series of weekly sales for Product 3 at Store 10

Product 3's seasonality at Store 10 is obvious: sales always peak between July and September, during the school holiday. Below, we fit a Prophet model and forecast weekly sales for the next 50 weeks.

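# Prophet expects the date column to be named 'ds' and the target 'y'
store_10_pro_3 = store_10_pro_3.rename(columns={'Date': 'ds', 'weekly_sales': 'y'})

model = Prophet(interval_width=0.95)
model.fit(store_10_pro_3)

future_dates = model.make_future_dataframe(periods=50, freq='W')
future_dates.tail(7)

forecast = model.predict(future_dates)
# predictions for the last seven forecast weeks
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(7)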
             ds         yhat   yhat_lower   yhat_upper
186  2013-08-25  7160.453669  4742.937710  9559.615673
187  2013-09-01  5542.434739  3249.762712  7887.321785
188  2013-09-08  3702.168377  1355.902566  5824.555193
189  2013-09-15  2427.279755   189.552142  4693.158976
190  2013-09-22  2386.972428     7.973471  4673.053027
191  2013-09-29  3020.451351   759.252236  5227.695107
192  2013-10-06  3157.655085   756.079499  5603.923897

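
Note that the forecast dates above fall on Sundays, while the dates at the top of the dataset (for example 2010-02-05) are Fridays. That is because the pandas alias 'W' means weeks ending on Sunday. The sketch below reconstructs the 50-week horizon from the last forecast date using only the standard library; the helper `weekly_dates_ending` is hypothetical. Passing freq='W-FRI' would keep future dates on Fridays, which is a suggestion on my part and not what the code above does.

```python
from datetime import date, timedelta


def weekly_dates_ending(last, periods):
    """Return `periods` dates spaced 7 days apart, the final one being `last`."""
    return [last - timedelta(weeks=periods - 1 - i) for i in range(periods)]


# Last forecast date taken from the table above, horizon of 50 weeks
horizon = weekly_dates_ending(date(2013, 10, 6), 50)
print(horizon[0], horizon[-1])
print({d.strftime("%A") for d in horizon})   # weekday names in the horizon
print(date(2010, 2, 5).strftime("%A"))       # weekday of the first data row
```
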
model.plot(forecast)

Figure: Prophet forecast of weekly sales with its uncertainty interval

model.plot_components(forecast)

Figure: components of the forecast
# Join predictions to actuals; future weeks have no actual value and are dropped
metric_df = forecast.set_index('ds')[['yhat']].join(store_10_pro_3.set_index('ds').y).reset_index()
metric_df.dropna(inplace=True)

error = mean_squared_error(metric_df.y, metric_df.yhat)
print('The RMSE is {}'.format(sqrt(error)))
The RMSE is 1190.0962582193933
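
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as weekly sales. Because the rows without an actual value are dropped before scoring, the figure above is measured on the same weeks the model was fitted on. The sketch below shows the calculation on made-up numbers; the `rmse` helper and both lists are hypothetical and are not model output.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual vs. predicted weekly sales, NOT model output
actual = [3000.0, 5000.0, 7000.0]
predicted = [2000.0, 5000.0, 8000.0]
score = rmse(actual, predicted)
print(round(score, 2))
```

To estimate how the model performs on unseen weeks, one option is to hold out the last weeks of the history, fit on the rest, and compute the same metric on the held-out part.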


I hope you liked this article on Time Series Forecasting on Sales Prediction. Feel free to ask your questions about Time Series Forecasting and Analysis or any other topic that you want in the comments section below.


2 Comments

  1. I am working through this one and I got caught up with this:

    model = Prophet(interval_width = 0.95)
    model.fit(store_10_pro_3)

    future_dates = model.make_future_dataframe(periods = 50, freq='W')

    future_dates.tail(7)

    My output says:
    “Dataframe must have columns “ds” and “y” with the dates and values respectively”

    I checked around online and attempted to add a new variable to “rename” the columns but it did not work.
