
PriyankaSharma TSF Rose

The document analyzes rose wine sales data from 1980 to 1995 to build forecasting models. It performs EDA on the time series data, splits it into training and test sets, and builds exponential smoothing and ARIMA models to forecast sales. The best models are selected based on error metrics to predict future sales.

Uploaded by

Priyanka Sharma

Time Series Forecasting-Rose Wine

Priyanka Sharma

PGP-DSBA

INDEX

S.No  Title

1   Read the data as an appropriate Time Series data and plot the data.
2   Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3   Split the data into training and test. The test data should start in 1991.
4   Build all the exponential smoothing models on the training data and evaluate the model using RMSE on the test data.
5   Check for the stationarity of the data on which the model is being built using appropriate statistical tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
6   Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using RMSE.
7   Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and evaluate this model on the test data using RMSE.
8   Build a table (create a data frame) with all the models built along with their corresponding parameters and the respective RMSE values on the test data.
9   Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
10  Comment on the model thus built and report your findings and suggest the measures that the company should be taking for future sales.

List of Plots:

1. Line plot of dataset
2. Boxplot of dataset
3. Line plot of sales
4. Boxplot of yearly data
5. Boxplot of monthly data
6. Boxplot weekday-wise
7. Graph of monthly sales over the years
8. Correlation
9. ECDF plot
10. Decomposition additive
11. Decomposition multiplicative
12. Train and test dataset
13. Linear regression
14. Naive approach
15. Simple average
16. Moving average
17. Simple exponential smoothing
18. Double exponential smoothing
19. Triple exponential smoothing
20. Dickey-Fuller test
21. Dickey-Fuller test after diff
22. SARIMA plots
23. PACF and ACF plot
24. PACF and ACF plot train dataset
25. Manual ARIMA plot
26. Manual SARIMA plot
27. Prediction plot

Problem Statement:

ABC Estate Wines has been a leader in the rose wine industry for many years,
offering high-quality wines to consumers all around the world. As the company
continues to expand its reach and grow its customer base, it is essential to
analyze market trends and forecast future sales to ensure continued success.

In this report, we will focus on analyzing the sales data for rose wine in the
20th century. As an analyst for ABC Estate Wines, I have been tasked with
reviewing this data to identify patterns, trends, and opportunities for growth
in the wine market. This knowledge will help us to make informed decisions
about how to position our products in the market, optimize our sales
strategies, and forecast future sales trends.

Overall, this report aims to provide valuable insights into the wine market and
how ABC Estate Wines can continue to succeed in this highly competitive
industry.

1. Read the data as an appropriate Time Series data and plot the data.
Data Dictionary:

Table 1: data dictionary

Column      Details
YearMonth   Dates of sales
Sparkling   Sales of rose wine

Data set is read using the pandas library.
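As a minimal sketch of this step, the snippet below reads a CSV with the two columns from the data dictionary and turns YearMonth into a DatetimeIndex. The three sample rows are made up for illustration, and the "YYYY-MM" date format and the helper name read_monthly_series are assumptions, not details taken from the report.

```python
import io
import pandas as pd

# Made-up three-month sample in the layout of the data dictionary (not real data).
SAMPLE_CSV = "YearMonth,Sparkling\n1980-01,100\n1980-02,110\n1980-03,120\n"

def read_monthly_series(source, date_col="YearMonth"):
    """Read a monthly CSV and index it by the parsed month-start timestamp."""
    df = pd.read_csv(source)
    # Assumption: dates are stored as "YYYY-MM" strings.
    df[date_col] = pd.to_datetime(df[date_col], format="%Y-%m")
    return df.set_index(date_col)

df = read_monthly_series(io.StringIO(SAMPLE_CSV))
print(df.shape)  # (rows, columns)
```

Passing a file path instead of the StringIO object would read the real file in the same way.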

Rows of data set:

Table 2: rows of dataset

Top Few Rows : Last Few Rows :

Number of Rows and Columns of Dataset:

The dataset has 187 rows and 1 column.

Plot of the dataset:

Plot 1 : dataset

Post Ingestion of Dataset:

We then extracted Month and Year columns from the YearMonth column and
renamed the Sparkling column to Sales for easier analysis of the dataset.
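The column preparation described above could look roughly like this. The tiny frame is a made-up stand-in for the real data, and it assumes the YearMonth values have already been parsed into a DatetimeIndex.

```python
import pandas as pd

# Made-up stand-in for the ingested data, indexed by month start.
idx = pd.date_range("1980-01-01", periods=3, freq="MS")
df = pd.DataFrame({"Sparkling": [100, 110, 120]}, index=idx)

# Rename the value column and derive calendar columns from the index.
df = df.rename(columns={"Sparkling": "Sales"})
df["Month"] = df.index.month
df["Year"] = df.index.year
```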

Rows of new data set:

Table 3: new rows of dataset

Top Few Rows : Last Few Rows :

Number of Rows and Columns of Dataset: The dataset has 187 rows and 3 columns.

2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.

Data Type:

Index: DateTime
Sales: integer
Month: integer
Year: integer

Statistical summary:

Table 4: statistical summary

Null Values:

There are 2 null values in the Sales column of the dataset.

The values for July and August 1994 were missing.

Treating null values is important before any further analysis. We used the
following approach to impute the data.

Mean - Before & After

In this approach, instead of taking the mean of the 7th month across all the
years, we took the mean of the 7th-month values from the year before and the
year after the missing value.

The same steps were followed for the 8th month.
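A minimal sketch of this "before and after" imputation, assuming a Series with a monthly DatetimeIndex. The function name and the synthetic numbers are illustrative only and are not taken from the report.

```python
import numpy as np
import pandas as pd

def impute_from_adjacent_years(series):
    """Fill each NaN with the mean of the same month one year before and one year after."""
    filled = series.copy()
    for ts in series.index[series.isna()]:
        before = series.get(ts - pd.DateOffset(years=1), np.nan)
        after = series.get(ts + pd.DateOffset(years=1), np.nan)
        filled[ts] = np.nanmean([before, after])
    return filled

# Synthetic series 0, 1, 2, ... with July and August 1994 blanked out.
idx = pd.date_range("1993-01-01", periods=36, freq="MS")
sales = pd.Series(np.arange(36, dtype=float), index=idx)
sales.loc[[pd.Timestamp("1994-07-01"), pd.Timestamp("1994-08-01")]] = np.nan

sales_filled = impute_from_adjacent_years(sales)
```

Using np.nanmean means that a gap in the first or last year of the data would fall back to the single neighbouring year that exists.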

Boxplot of dataset:

Plot 2: boxplot of data

The box plot shows:

● The Sales boxplot has outliers. We could treat them, but we chose not
to, as they have little effect on the time series model.

Line plot of sales:

Plot 3: line plot of sales

The line plot shows both trend and seasonality, and also shows a peak in the
year 1981.

Boxplot Yearly:

Plot 4: boxplot yearly

The yearly box plot shows a consistent pattern over the years, with a peak
in 1980-1981. Outliers are present in almost all years.

Boxplot Monthly:

Plot 5: boxplot monthly

The plot shows that sales are highest in December and lowest in January.
Sales are consistent from January to July and start to increase from August.
Outliers are present in June, July, August, September and December.

Boxplot Weekday-wise:

Plot 6: boxplot weekday-wise

Tuesday has higher sales than the other days, and Wednesday has the lowest
sales of the week. Outliers are present on all days except Thursday and Friday.

Graph of Monthly Sales over the years:

Plot 7: graph of monthly sales over the years

This plot shows that December has the highest sales across the years and that
1981 was the year with the highest sales.

CORRELATION PLOT

Plot 8: correlation plot

This heat map shows little correlation between Sales and Year, and
significantly more correlation between Month and Sales. This clearly indicates
a seasonal pattern in the Sales data: certain months have higher sales, while
others have lower sales.

PLOT ECDF: EMPIRICAL CUMULATIVE DISTRIBUTION FUNCTION

This graph shows the distribution of the data.

Plot 9: ECDF plot

This plot shows:

● 50% of the sales values are less than 100.
● The highest value is 250.
● Approximately 90% of the sales values are less than 150.

Decomposition - Additive

Plot 10: decomposition additive

The plots show:

● The peak year is 1981.
● The trend has declined over the years after 1981.
● The residuals are spread out and do not lie along a straight line.
● Both trend and seasonality are present.

Decomposition - Multiplicative

Plot 11: decomposition multiplicative

The plots show:

● The peak year is 1981.
● The trend has declined over the years after 1981.
● The residuals lie approximately along a straight line.
● Both trend and seasonality are present.
● The residuals range from 0 to 1, while for the additive model they range from 0 to 50.
● The multiplicative model is therefore selected, owing to a more stable
residual plot and a lower range of residuals.

3. Split the data into training and test. The test data should start in 1991.

Plot 12: training and test dataset

Data from 1980 to 1990 is the training data; data from 1991 to 1995 is the test data.

Rows and Columns:

The train dataset has 132 rows and 3 columns.
The test dataset has 55 rows and 3 columns.
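The split can be sketched as a filter on the year of the index. The frame below is synthetic; its date range (January 1980 to July 1995) is an assumption chosen so that the row counts match the 187, 132 and 55 rows reported here.

```python
import numpy as np
import pandas as pd

# Synthetic monthly frame: January 1980 to July 1995 gives 187 rows.
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Sales": np.arange(len(idx), dtype=float)}, index=idx)

# Everything before 1991 is training data; 1991 onwards is test data.
train = df[df.index.year < 1991]
test = df[df.index.year >= 1991]

print(len(train), len(test))
```

Splitting by date rather than at random keeps the test period strictly after the training period, which is what a forecasting evaluation needs.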

Few Rows of datasets:

Table 5: train and test dataset rows

Train dataset Test dataset

4. Build all the exponential smoothing models on the training data and evaluate the model using RMSE on
the test data. Other models such as regression, naïve forecast models, and simple average models
should also be built on the training data and checked for performance on the test data using RMSE.
a. Model 1: Linear Regression
b. Model 2: Naive Approach
c. Model 3: Simple Average
d. Model 4: Moving Average(MA)
e. Model 5: Simple Exponential Smoothing
f. Model 6: Double Exponential Smoothing (Holt's Model)
g. Model 7: Triple Exponential Smoothing (Holt - Winter's Model)
MODEL 1: LINEAR REGRESSION

Plot 13: linear regression

The green line indicates the predictions made by the model, while the orange values are the actual test
values. It is clear that the predicted values are very far from the actual values.

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Linear Regression 51.080941
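For reference, RMSE and a straight-line trend forecast can be sketched with NumPy alone. This assumes the regressor is a simple time index 1, 2, 3, ...; the report does not state which regressor or library it used, and the helper names and numbers are illustrative.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def linear_trend_forecast(train_values, horizon):
    """Fit sales = intercept + slope * t on the training period and extend the line."""
    n = len(train_values)
    t_train = np.arange(1, n + 1)
    slope, intercept = np.polyfit(t_train, train_values, deg=1)
    t_test = np.arange(n + 1, n + horizon + 1)
    return intercept + slope * t_test

# Made-up, perfectly linear training values.
train_values = np.array([10.0, 12.0, 14.0, 16.0])
forecast = linear_trend_forecast(train_values, horizon=2)
```

A straight line cannot follow the seasonal swings, which is consistent with the large error reported for this model.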


MODEL 2: NAIVE APPROACH

Plot 14: naive approach

The green line indicates the predictions made by the model, while the orange values are the actual test values. It
is clear that the predicted values are very far from the actual values.

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Naive Model 79.304391


MODEL 3: SIMPLE AVERAGE

Plot 15: simple average

The green line indicates the predictions made by the model, while the orange values are the actual test
values. It is clear that the predicted values are very far from the actual values.

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Simple Average Model 53.049755
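The naive and simple average models both produce a flat forecast, which the following sketch makes explicit. The function names and numbers are illustrative, not the report's code.

```python
import numpy as np

def naive_forecast(train_values, horizon):
    """Repeat the last observed training value for every future period."""
    return np.full(horizon, train_values[-1], dtype=float)

def simple_average_forecast(train_values, horizon):
    """Repeat the mean of the whole training period for every future period."""
    return np.full(horizon, np.mean(train_values), dtype=float)

train_values = np.array([1.0, 2.0, 6.0])  # made-up values
naive = naive_forecast(train_values, horizon=3)
average = simple_average_forecast(train_values, horizon=3)
```

Because both forecasts are constant, neither can follow trend or seasonality in the test period.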


MODEL 4: MOVING AVERAGE (MA)

Plot 16: moving average

The model was evaluated using the RMSE metric. Below is the RMSE calculated for
each rolling window.

2-point Trailing Moving Average 11.589082
4-point Trailing Moving Average 14.506190
6-point Trailing Moving Average 14.558008
9-point Trailing Moving Average 14.797139

We created multiple moving average models with rolling windows of 2, 4, 6 and 9 points. A rolling average
is a better method than a simple average because it uses only the previous n values to make the
prediction, where n is the rolling window. It therefore reflects recent trends and is in
general more accurate. The larger the rolling window, the smoother the curve, since more values are
being averaged.
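A trailing moving average can be computed with pandas' rolling window, as in this small sketch on made-up numbers; the window sizes 2 and 4 are two of the windows reported above.

```python
import pandas as pd

sales = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up values

# One trailing (backward-looking) average per window size.
trailing = {w: sales.rolling(window=w).mean() for w in (2, 4)}

print(trailing[2].tolist())
print(trailing[4].tolist())
```

The first w - 1 entries of each result are NaN because a full window is not yet available there.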
MODEL 5: SIMPLE EXPONENTIAL SMOOTHING

Plot 17: simple exponential smoothing

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Alpha=0.1, SimpleExponentialSmoothing 36.429535
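Simple exponential smoothing keeps a single level that is updated as level = alpha * y + (1 - alpha) * level, and forecasts that level for every future period. The hand-written sketch below uses alpha = 0.1 as reported above; starting the level at the first observation is an assumption, and a library implementation may initialise it differently.

```python
import numpy as np

def ses_forecast(values, alpha, horizon):
    """Simple exponential smoothing: flat forecast at the final smoothed level."""
    level = float(values[0])  # assumption: start the level at the first observation
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    return np.full(horizon, level)

forecast = ses_forecast(np.array([10.0, 20.0]), alpha=0.1, horizon=3)
```

A small alpha such as 0.1 gives most of the weight to the accumulated level, so the forecast reacts slowly to recent observations.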
MODEL 6: DOUBLE EXPONENTIAL SMOOTHING (HOLT'S MODEL)

Plot 18: double exponential smoothing

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Alpha=0.1, Beta=0.1, DoubleExponentialSmoothing 36.510010


MODEL 7: TRIPLE EXPONENTIAL SMOOTHING (HOLT-WINTERS MODEL)

Plot 19: triple exponential smoothing

The output for the best alpha, beta and gamma values is shown by the green
line in the above plot. The best model had both multiplicative trend and
multiplicative seasonality. So far this is the best model.

The model was evaluated using the RMSE metric. Below is the RMSE calculated for this
model.

Alpha=0.4, Beta=0.1, Gamma=0.3, TripleExponentialSmoothing 8.992350
5. Check for the stationarity of the data on which the model is being built using
appropriate statistical tests and also mention the hypothesis for the statistical
test. If the data is found to be non-stationary, take appropriate steps to make it
stationary. Check the new data for stationarity and comment. Note: Stationarity
should be checked at alpha = 0.05.

Check for stationarity of the whole Time Series data.

The Augmented Dickey-Fuller test is a unit root test which determines whether
there is a unit root and subsequently whether the series is non-stationary.

The hypothesis in a simple form for the ADF test is:

● H0 : The Time Series has a unit root and is thus non-stationary.


● H1 : The Time Series does not have a unit root and is thus stationary.

We would want the series to be stationary for building ARIMA models and thus
we would want the p-value of this test to be less than the α value.

We see that at the 5% significance level the Time Series is non-stationary.

Plot 20: dickey fuller test

Results of Dickey-Fuller Test:

Test Statistic -1.892338

p-value 0.335674

Since the p-value (0.335674) is greater than 0.05, we failed to reject the null
hypothesis, which implies that the series is not stationary. To make the series
stationary we used differencing. We applied the .diff() function to the existing
series without any argument, which uses the default of 1 period, and dropped the
NaN values, since first-order differencing leaves the first value as NaN.

Plot 21: dickey fuller test after diff

Results of Dickey-Fuller Test:

Test Statistic -8.032729e+00

p-value 1.938803e-12

With a p-value of about 1.94e-12, well below 0.05, the null hypothesis that the
series is non-stationary was rejected for the first-differenced series, which
implies that the series became stationary after differencing.

We could now proceed ahead with ARIMA/ SARIMA models, since we had made the series stationary.

6. Build an automated version of the ARIMA/SARIMA model in which the
parameters are selected using the lowest Akaike Information Criteria (AIC) on
the training data and evaluate this model on the test data using RMSE.

AUTO - ARIMA model

We employed a for loop to determine the optimum values of p, d and q, where p is the order of
the AR (Auto-Regressive) part of the model, q is the order of the MA (Moving Average) part, and d
is the order of differencing required to make the series stationary. Values of p and q from range(0, 4),
i.e. 0 to 3, were given to the for loop, while d was fixed at 1, since the ADF test had already shown
that first-order differencing makes the series stationary.

Some parameter combinations for the model:

Model: (0, 1, 1)
Model: (0, 1, 2)
Model: (0, 1, 3)
Model: (1, 1, 0)
Model: (1, 1, 1)
Model: (1, 1, 2)
Model: (1, 1, 3)
Model: (2, 1, 0)
Model: (2, 1, 1)
Model: (2, 1, 2)
Model: (2, 1, 3)
Model: (3, 1, 0)
Model: (3, 1, 1)
Model: (3, 1, 2)
Model: (3, 1, 3)

Akaike information criterion (AIC) value was evaluated for each of these
models and the model with least AIC value was selected.
The summary report for the ARIMA model with values (p=2, d=1, q=3):

The RMSE value is as below:

36.42079120523518

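AUTO - SARIMA Model

A for loop similar to the one used for the auto ARIMA model was employed with
the values below, resulting in the models shown below.

    import itertools

    p = q = range(0, 4)   # AR and MA orders: 0 to 3
    d = range(0, 2)       # non-seasonal differencing order: 0 or 1
    D = range(0, 2)       # seasonal differencing order: 0 or 1
    pdq = list(itertools.product(p, d, q))
    # seasonal orders (P, D, Q, 12) with a 12-month season
    model_pdq = [(x[0], x[1], x[2], 12) for x in itertools.product(p, D, q)]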

Examples of some parameter combinations for the model:

Model: (0, 1, 1)(0, 0, 1, 12)
Model: (0, 1, 2)(0, 0, 2, 12)
Model: (0, 1, 3)(0, 0, 3, 12)
Model: (1, 1, 0)(1, 0, 0, 12)
Model: (1, 1, 1)(1, 0, 1, 12)
Model: (1, 1, 2)(1, 0, 2, 12)
Model: (1, 1, 3)(1, 0, 3, 12)
Model: (2, 1, 0)(2, 0, 0, 12)
Model: (2, 1, 1)(2, 0, 1, 12)
Model: (2, 1, 2)(2, 0, 2, 12)
Model: (2, 1, 3)(2, 0, 3, 12)
Model: (3, 1, 0)(3, 0, 0, 12)
Model: (3, 1, 1)(3, 0, 1, 12)
Model: (3, 1, 2)(3, 0, 2, 12)
Model: (3, 1, 3)(3, 0, 3, 12)

The Akaike information criterion (AIC) value was evaluated for each of these
models and the model with the lowest AIC value was selected. Here only the top 5
models are shown.

The summary report for the best SARIMA model with values (3,1,1)(3,0,2,12):

We also plotted the graphs for the residual to determine if any further
information can be extracted or all the usable information has already been
extracted. Below were the plots for the best auto SARIMA model.

Plot 22: SARIMA plots

RMSE of Model:

18.53502803217281

7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on
the training data and evaluate this model on the test data using RMSE.

Manual - ARIMA Model

PACF and ACF plots of the data:

Plot 23: PACF and ACF plots

The PACF and ACF plots for the training data follow.

Plot 24: PACF and ACF plots of train data

Hence the values selected for the manual ARIMA model are p=2, d=1, q=2.

Summary from this manual ARIMA model:

Plot 25: manual ARIMA model plots

Model Evaluation: RMSE

RMSE: 36.47322487814613

Manual SARIMA Model

Looking at the ACF and PACF plots for the training data, we can clearly see
significant spikes at lags 12, 24, 36, 48, etc., indicating a seasonality of 12. The
parameters used for the manual SARIMA model are as below.

SARIMAX(2, 1, 2)x(2, 1, 2, 12)

Below is the summary of the manual SARIMA model:

Plot 26: manual SARIMA plots

Model Evaluation: RMSE

14.975041301618377

8. Build a table (create a data frame) with all the models built along with their corresponding parameters and
the respective RMSE values on the test data.

We can clearly see that the triple exponential smoothing model with alpha 0.4,
beta 0.1 and gamma 0.3 is the best, as it has the lowest RMSE score (8.992350).

9. Based on the model-building exercise, build the most optimum model(s) on the
complete data and predict 12 months into the future with appropriate confidence
intervals/bands.

Based on the above comparison of all the models that we built, we can conclude that the triple
exponential smoothing (Holt-Winters) model gives the lowest RMSE and is therefore the
most optimum model.

Sales predictions made by this model:

The sales predictions are plotted below along with the confidence intervals.

Plot 27: prediction plot

Predictions one year into the future are shown in orange, while the
confidence interval is shown in grey.
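The report does not say how the bands were computed. One common approximation, shown here purely as an assumption, places the band at the point forecast plus or minus 1.96 times the standard deviation of the in-sample residuals, which gives a roughly 95% band if the residuals are approximately normal.

```python
import numpy as np

def forecast_bands(point_forecast, residuals, z=1.96):
    """Approximate bands: forecast +/- z * standard deviation of the residuals."""
    point_forecast = np.asarray(point_forecast, dtype=float)
    spread = z * np.std(residuals, ddof=1)
    return point_forecast - spread, point_forecast + spread

# Made-up forecast and residuals, for illustration only.
lower, upper = forecast_bands([10.0, 20.0], residuals=[-1.0, 1.0])
```

This band has the same width at every horizon; model-based intervals usually widen as the forecast horizon grows.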
10. Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales.
● The analysis of the wine sales data indicates a clear downward trend for the company's Rose wine
variety, which has been declining in popularity for more than a decade.
● Based on the predictions of the most optimal model, this trend is expected to continue in the
future.
● Wine sales are highly influenced by seasonal changes, with sales increasing during the festival
season and dropping during peak winter, i.e. January.
● The company should consider running campaigns to boost consumption of the wine during
the rest of the year, as sales are subdued during this period.
● Campaigns during the lean period (April to June) might yield maximum results for the company,
as sales are low during this period, and boosting them would improve the overall
performance of the wine in the market across the year.
● Running campaigns during peak periods (such as festivals) might not have a significant
impact on sales, as they are already high at this time of the year.
● Campaigns during peak winter (January) are not recommended, as people are less likely to
purchase wine for climatic reasons, and running campaigns during this period may not
change people's opinion.
● The company should also explore the reasons behind the decline in popularity of the
Rose wine variety and, if needed, revamp its production and marketing strategies to regain
market share.
