An End-to-End Project On Time Series Analysis and Forecasting With Python
Time series analysis comprises methods for analyzing time series data in
order to extract meaningful statistics and other characteristics of the data.
Time series forecasting is the use of a model to predict future values based
on previously observed values.
Time series are widely used for non-stationary data, such as economic data,
weather, stock prices, and retail sales. In this post, we will demonstrate
different approaches for forecasting retail sales time series. Let’s get started!
https://towardsdatascience.com/an-end-to-end-project-on-time-series-analysis-and-forecasting-with-python-4835e6bf050b Page 1 of 19
An End-to-End Project on Time Series Analysis and Forecasting with Python 19/08/19, 8(50 PM
The Data
We are using Superstore sales data that can be downloaded from here.
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight')
import pandas as pd
import statsmodels.api as sm
import matplotlib
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['text.color'] = 'k'
There are several categories in the Superstore sales data. We start with time
series analysis and forecasting for furniture sales.
df = pd.read_excel("Superstore.xls")
furniture = df.loc[df['Category'] == 'Furniture']
Data Preprocessing
This step includes removing columns we do not need, checking for missing
values, aggregating sales by date, and so on.
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
        'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
        'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
furniture.isnull().sum()
Figure 1
Figure 2
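The aggregation by date mentioned in this step can be sketched as follows. This is a minimal illustration, assuming the two remaining columns are `Order Date` and `Sales`; the toy orders are made up and are not taken from the Superstore file.

```python
import pandas as pd

# Hypothetical orders: two fall on the same day, so dates are not unique.
orders = pd.DataFrame({
    "Order Date": pd.to_datetime(["2017-01-03", "2017-01-03", "2017-01-10"]),
    "Sales": [100.0, 50.0, 30.0],
})

# Total the sales per day; the grouped dates become a sorted DatetimeIndex.
daily = orders.groupby("Order Date")["Sales"].sum().to_frame()

print(daily)
```

After this step there is one row per order date, which is the shape a time-based resample needs.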
Our current datetime data can be tricky to work with. Therefore, we will use
the average daily sales value for each month instead, with the start of each
month as the timestamp.
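# resample() needs a DatetimeIndex, so first total the sales per order date
furniture = furniture.groupby('Order Date')['Sales'].sum().to_frame()
y = furniture['Sales'].resample('MS').mean()
y['2017':]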
Figure 3
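To see what resampling with `'MS'` does, here is a small sketch on made-up daily totals (the values are hypothetical). Each calendar month is averaged and labeled with its first day.

```python
import pandas as pd

# Hypothetical daily totals spanning two months (made-up values).
daily = pd.Series(
    [100.0, 300.0, 250.0],
    index=pd.to_datetime(["2017-01-05", "2017-01-20", "2017-02-14"]),
    name="Sales",
)

# 'MS' groups by calendar month and labels each group with the month start.
monthly = daily.resample("MS").mean()

print(monthly)
```

January averages its two observations, and February keeps its single value, so the result has one row per month stamped 2017-01-01 and 2017-02-01.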
y.plot(figsize=(15, 6))
plt.show()
Figure 4
Some distinguishable patterns appear when we plot the data. The time series
has a seasonal pattern: sales are always low at the beginning of the year and
high at the end of the year. There is always an upward trend within any single
year, with a couple of low months in the middle of the year.
Time-series decomposition lets us split the series into three distinct
components: trend, seasonality, and noise.
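To make the three components concrete, here is a hand-rolled additive decomposition on a synthetic monthly series. This is only a sketch of the classical moving-average approach under the assumption of a 12-month cycle; it is not necessarily the routine that produced the figure, and the series is generated, not the Superstore data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (not the Superstore data): linear trend + yearly cycle.
idx = pd.date_range("2014-01-01", periods=48, freq="MS")
t = np.arange(48)
y = pd.Series(100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / 12), index=idx)

# Trend: centered 2x12 moving average (undefined for the first and last 6 months).
trend = y.rolling(12).mean().rolling(2).mean().shift(-6)

# Seasonality: average detrended value for each calendar month.
detrended = y - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")

# Noise: whatever trend and seasonality do not explain.
noise = y - trend - seasonal

print(trend.dropna().head())
print(noise.dropna().abs().max())
```

Because the synthetic series contains no random component, the noise term is essentially zero here; on real sales data it would hold the irregular fluctuations.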
Figure 5
The plot above clearly shows that furniture sales are unstable and have an
obvious seasonality.
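ARIMA models are denoted with the notation ARIMA(p, d, q). These three
parameters are the autoregressive order (p), the degree of differencing (d),
and the moving-average order (q).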
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in
list(itertools.product(p, d, q))]
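To get a feel for the size of this search space, the same construction can be run on its own. The uppercase names `P`, `D`, `Q` are just illustrative loop variables for the seasonal terms.

```python
import itertools

p = d = q = range(0, 2)  # each of p, d, q is tried at 0 and 1

# All non-seasonal (p, d, q) combinations.
pdq = list(itertools.product(p, d, q))

# The same combinations for the seasonal part, with a 12-month period.
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]

print(len(pdq), len(seasonal_pdq), len(pdq) * len(seasonal_pdq))  # 8 8 64
print(pdq[:3])           # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
print(seasonal_pdq[-1])  # (1, 1, 1, 12)
```

With two values per parameter, a full search over both lists fits 64 candidate models.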
Figure 6
This step is parameter selection for our furniture sales ARIMA time series
model. Our goal here is to use a “grid search” to find the optimal set of
parameters that yields the best performance for our model.
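for param in pdq:
    for param_seasonal in seasonal_pdq:
        mod = sm.tsa.statespace.SARIMAX(y, order=param,
                                        seasonal_order=param_seasonal,
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)
        results = mod.fit()
        print('ARIMA{}x{} - AIC:{}'.format(param, param_seasonal, results.aic))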
Figure 7
The above output suggests that SARIMAX(1, 1, 1)x(1, 1, 0, 12) yields the
lowest AIC value of 297.78. Therefore, we should consider this to be the
optimal option.
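AIC rewards a good fit but penalizes extra parameters, and lower is better. A minimal sketch of picking the winner, where the model labels, log-likelihoods, and parameter counts are all hypothetical placeholders and not results from this dataset:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: label -> (log-likelihood, number of estimated parameters).
fits = {
    "A": (-150.0, 2),
    "B": (-146.0, 4),
    "C": (-145.5, 7),
}

# Score every candidate and keep the one with the smallest AIC.
scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)

print(scores)  # {'A': 304.0, 'B': 300.0, 'C': 305.0}
print(best)    # B
```

Model C has the highest likelihood, yet B wins because C's small gain in fit does not justify its three extra parameters.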
mod = sm.tsa.statespace.SARIMAX(y,
order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12),
enforce_stationarity=False,
enforce_invertibility=False)
results = mod.fit()
print(results.summary().tables[1])
Figure 8
results.plot_diagnostics(figsize=(16, 8))
plt.show()
Figure 9
It is not perfect; however, our model diagnostics suggest that the model
residuals are nearly normally distributed.
Validating forecasts
To help us understand the accuracy of our forecasts, we compare predicted
sales to real sales of the time series, and we set forecasts to start at
2017-01-01 and run to the end of the data.
pred = results.get_prediction(start=pd.to_datetime('2017-01-01'),
dynamic=False)
pred_ci = pred.conf_int()
ax = y['2014':].plot(label='observed')
pred.predicted_mean.plot(ax=ax, label='One-step ahead Forecast',
alpha=.7, figsize=(14, 7))
ax.fill_between(pred_ci.index,
pred_ci.iloc[:, 0],
pred_ci.iloc[:, 1], color='k', alpha=.2)
ax.set_xlabel('Date')
ax.set_ylabel('Furniture Sales')
plt.legend()
plt.show()
Figure 10
The line plot shows the observed values compared to the rolling forecast
predictions. Overall, our forecasts align with the true values very well,
showing an upward trend that starts at the beginning of the year and
capturing the seasonality toward the end of the year.
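y_forecasted = pred.predicted_mean
y_truth = y['2017-01-01':]
mse = ((y_forecasted - y_truth) ** 2).mean()
print('The Mean Squared Error of our forecasts is {}'.format(round(mse, 2)))
The Mean Squared Error of our forecasts is 22993.58
print('The Root Mean Squared Error of our forecasts is {}'.format(round(np.sqrt(mse), 2)))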
The smaller the MSE, the closer we are to finding the line of best fit.
Root Mean Square Error (RMSE) tells us that our model was able to forecast
the average daily furniture sales in the test set within 151.64 of the real
sales. Our furniture daily sales range from around 400 to over 1200. In my
opinion, this is a pretty good model so far.
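As a quick check of how the two error measures relate, here is a small sketch. The helper names `mse` and `rmse` and the three toy observations are illustrative assumptions; only the 22993.58 figure comes from the output reported above.

```python
import math
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mse(y_true, y_pred))

# Toy values, made up for illustration.
actual = [400.0, 800.0, 1200.0]
predicted = [500.0, 700.0, 1200.0]
print(mse(actual, predicted), rmse(actual, predicted))

# The reported RMSE is just the square root of the reported MSE.
reported_rmse = round(math.sqrt(22993.58), 2)
print(reported_rmse)  # 151.64
```

RMSE is usually easier to interpret than MSE because it is expressed in sales units instead of squared sales units.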
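pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = y.plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index, pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
plt.legend()
plt.show()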
Figure 11
The above time series analysis for furniture makes me curious about other
categories and how they compare with each other over time. Therefore, we are
going to compare the time series of furniture and office supplies.
Data Exploration
We are going to compare the two categories’ sales over the same time period.
This means combining two data frames into one and plotting both categories’
time series in one plot.
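furniture = df.loc[df['Category'] == 'Furniture']
office = df.loc[df['Category'] == 'Office Supplies']
cols = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
        'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code',
        'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name',
        'Quantity', 'Discount', 'Profit']
furniture.drop(cols, axis=1, inplace=True)
office.drop(cols, axis=1, inplace=True)
# Total the sales per order date so that resample() has a DatetimeIndex
furniture = furniture.groupby('Order Date')['Sales'].sum().to_frame()
office = office.groupby('Order Date')['Sales'].sum().to_frame()
y_furniture = furniture['Sales'].resample('MS').mean()
y_office = office['Sales'].resample('MS').mean()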
Figure 12
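The plotting code that follows reads from a combined data frame named `store` with columns `Order Date`, `furniture_sales`, and `office_sales`. One way such a frame could be built is sketched below; this is an assumption about the elided step, and the monthly values are toy numbers.

```python
import pandas as pd

# Toy monthly averages standing in for y_furniture and y_office (made-up values).
idx = pd.date_range("2014-01-01", periods=3, freq="MS")
y_furniture = pd.Series([30.0, 40.0, 20.0], index=idx)
y_office = pd.Series([10.0, 35.0, 25.0], index=idx)

# Turn each series into a two-column frame keyed by date.
furniture = pd.DataFrame({"Order Date": y_furniture.index, "Sales": y_furniture.values})
office = pd.DataFrame({"Order Date": y_office.index, "Sales": y_office.values})

# Inner-join on the date; pandas suffixes the clashing 'Sales' columns with _x and _y.
store = furniture.merge(office, how="inner", on="Order Date")
store = store.rename(columns={"Sales_x": "furniture_sales", "Sales_y": "office_sales"})

print(store)
```

An inner join keeps only the months present in both categories, so the two lines in the plot share the same dates.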
plt.figure(figsize=(20, 8))
plt.plot(store['Order Date'], store['furniture_sales'], 'b-', label='furniture')
plt.plot(store['Order Date'], store['office_sales'], 'r-', label='office supplies')
plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Sales of Furniture and Office Supplies')
plt.legend();
Figure 13
Office supplies first produced higher sales than furniture on 2014-07-01.
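A date like this can be found with a boolean mask over the combined frame. The sketch below assumes a `store` frame with `Order Date`, `furniture_sales`, and `office_sales` columns and uses toy numbers, so its result is not the date reported above.

```python
import pandas as pd

# Toy combined frame (made-up values).
store = pd.DataFrame({
    "Order Date": pd.date_range("2014-01-01", periods=4, freq="MS"),
    "furniture_sales": [30.0, 40.0, 20.0, 50.0],
    "office_sales": [10.0, 35.0, 25.0, 60.0],
})

# Months in which office supplies outsold furniture.
mask = store["office_sales"] > store["furniture_sales"]

# Earliest such month, or None if it never happens.
first_date = store.loc[mask, "Order Date"].min() if mask.any() else None

print(first_date)
```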
# furniture_model and office_model are assumed to be already-fitted Prophet
# models; their setup is not shown here.
furniture_forecast = furniture_model.make_future_dataframe(periods=36, freq='MS')
furniture_forecast = furniture_model.predict(furniture_forecast)
office_forecast = office_model.make_future_dataframe(periods=36, freq='MS')
office_forecast = office_model.predict(office_forecast)
plt.figure(figsize=(18, 6))
furniture_model.plot(furniture_forecast, xlabel='Date', ylabel='Sales')
plt.title('Furniture Sales');
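Judging by the method names, the models above appear to be Prophet models; that is an assumption, since the fitting code is not shown. Prophet works with a data frame that has a `ds` date column and a `y` value column, and a future frame with `periods=36, freq='MS'` extends the history by 36 month-start dates. The sketch below only emulates that date logic with plain pandas on placeholder data; it does not call Prophet.

```python
import pandas as pd

# Hypothetical monthly history in the two-column ds/y layout.
history = pd.DataFrame({
    "ds": pd.date_range("2014-01-01", "2017-12-01", freq="MS"),
    "y": range(48),  # placeholder values, not real sales
})

# 36 month-start dates following the last observed month.
last = history["ds"].max()
extra = pd.date_range(last, periods=37, freq="MS")[1:]

# History dates followed by the new horizon.
future = pd.DataFrame({"ds": history["ds"].tolist() + extra.tolist()})

print(len(history), len(future))
print(future["ds"].iloc[-1])
```

With four years of monthly history, the forecast frame therefore covers 84 months, ending three years after the last observation.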
Figure 14
plt.figure(figsize=(18, 6))
office_model.plot(office_forecast, xlabel = 'Date', ylabel =
'Sales')
plt.title('Office Supplies Sales');
Figure 15
Compare Forecasts
We now have three-year forecasts for both categories. We will join them
together to compare their future forecasts.
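# Prefix every column with its category so the two forecasts can sit side by side
furniture_names = ['furniture_%s' % column for column in furniture_forecast.columns]
office_names = ['office_%s' % column for column in office_forecast.columns]
merge_furniture_forecast = furniture_forecast.copy()
merge_office_forecast = office_forecast.copy()
merge_furniture_forecast.columns = furniture_names
merge_office_forecast.columns = office_names
forecast = pd.merge(merge_furniture_forecast, merge_office_forecast,
                    how='inner', left_on='furniture_ds', right_on='office_ds')
forecast = forecast.rename(columns={'furniture_ds': 'Date'}).drop('office_ds', axis=1)
forecast.head()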
Figure 16
plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_trend'], 'b-', label='furniture')
plt.plot(forecast['Date'], forecast['office_trend'], 'r-', label='office supplies')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Sales Trend');
Figure 17
plt.figure(figsize=(10, 7))
plt.plot(forecast['Date'], forecast['furniture_yhat'], 'b-', label='furniture')
plt.plot(forecast['Date'], forecast['office_yhat'], 'r-', label='office supplies')
plt.legend(); plt.xlabel('Date'); plt.ylabel('Sales')
plt.title('Furniture vs. Office Supplies Estimate');
Figure 18
furniture_model.plot_components(furniture_forecast);
Figure 19
office_model.plot_components(office_forecast);
Figure 20
It is good to see that sales for both furniture and office supplies have been
increasing linearly over time and are expected to keep growing, although
office supplies’ growth seems slightly stronger.
The worst month for furniture is April, and the worst month for office
supplies is February. The best month for furniture is December, and the best
month for office supplies is October.
There are many time-series analyses we can explore from here, such as
forecasting with uncertainty bounds, change point and anomaly detection, and
forecasting time series with external data sources. We have only just started.