ISHA 26MayRoseWine
Rose
Isha Shukla
25 May 2024
TSF-Coded-Rose 1
INDEX
SL No.  Title
2       Data Pre-processing
Plots
1. Line plot of dataset
2. Boxplot of dataset
3. Lineplot of sales
4. Boxplot of yearly data
5. Boxplot of monthly data
6. Weekly boxplot
7. Graph of monthly sales over the year
8. Correlation
9. ECDF plot
10. Decomposition additive
11. Decomposition multiplicative
12. Train and test dataset
13. Linear regression
14. Moving average
15. Simple exponential smoothing
16. Double exponential smoothing
17. Naive approach
18. Simple average
19. Triple exponential smoothing
20. Dickey-Fuller test
21. Dickey-Fuller test after differencing
22. Auto ARIMA plot
23. Auto SARIMA plots
24. Manual ARIMA
25. Manual SARIMA
26. PACF and ACF plot
27. PACF and ACF plot train dataset
28. Manual ARIMA plot
29. Manual SARIMA plot
30. Prediction plot
Problem Statement
The main goal of this project is to study and predict wine sales trends from
the 20th century using historical data from ABC Estate Wines. We want to
give ABC Estate Wines useful insights to improve sales, take advantage of
new market opportunities, and stay competitive in the wine industry.
Table 1 - Rows of the dataset
Comprehensive Summary of
the dataset.
II. Plot the data
Plot 1 - Plot the data
IV. Perform Exploratory Data Analysis (EDA)
• We'll resample the data to aggregate values at a monthly level from the
daily-level data, computing the average for each month.
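The monthly aggregation described above can be sketched in plain Python. This is only an illustration: the function name `monthly_mean` and the sample readings are assumptions made for the example, not values from the Rose dataset or code from the original analysis.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean(observations):
    """Average (date, value) pairs per calendar month, in time order."""
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[(day.year, day.month)].append(value)
    return {key: mean(values) for key, values in sorted(buckets.items())}


# Hypothetical daily readings, used only to show the aggregation.
daily = [
    (date(1980, 1, 1), 100.0),
    (date(1980, 1, 2), 120.0),
    (date(1980, 2, 1), 90.0),
]
monthly = monthly_mean(daily)
```

Each key is a (year, month) pair, so the result stays in chronological order and can be plotted or split by year directly.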
Plot 2 - The trend of Rose at year level
• Sales peaked in 1981. The plot shows both a trend and seasonality.
Plot 3 - Yearly Box-plot
Plot 4 - Monthly Boxplot
Weekly Box-plot
Outliers Observation
• Yearly Boxplot - Outliers persist across nearly all years. There was a
peak in 1981.
Pivot Table
• There are missing values for August, September, October, November,
and December in 1995.
Plot 6 - Pivot table plot for monthly wine sale across year
• The negative correlation between Rose wine sales and Year indicates a
consistent decrease in sales as the years progress, implying a long-term
downward trend in Rose sales.
• The moderate positive correlation between Rose wine sales and Month
indicates some seasonality. Sales tend to rise toward the later months of
each year, although this relationship is weaker than the downward trend
observed across years.
From the ECDF plot of wine sales observations, we can observe the following:
• The x-axis represents the range of wine sales observations, while the y-axis
represents the cumulative probability.
• The slope of the curve reflects the density of observations at different
values. Steeper slopes indicate a higher density of observations, while
flatter slopes indicate a lower density.
• The ECDF plot provides a comprehensive overview of the distribution of
wine sales observations, allowing us to assess characteristics such as
central tendency, spread, and percentiles.
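To make the reading of an ECDF concrete, here is a minimal sketch of how one is built and how a percentile is read from it. The helper names `ecdf` and `percentile_from_ecdf` and the sample sales figures are assumptions for illustration only.

```python
def ecdf(values):
    """Return the sorted values and the cumulative proportion at each one."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


def percentile_from_ecdf(values, q):
    """Smallest observation whose cumulative proportion reaches q (0 < q <= 1)."""
    xs, ps = ecdf(values)
    for x, p in zip(xs, ps):
        if p >= q:
            return x
    return xs[-1]


# Hypothetical monthly sales figures, not taken from the Rose dataset.
sales = [30, 45, 45, 60, 120]
xs, ps = ecdf(sales)
median = percentile_from_ecdf(sales, 0.5)
```

Where many observations share similar values the cumulative proportion climbs quickly, which is the steep part of the curve; an isolated large value such as 120 produces a long flat stretch before the final step.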
IV. Decomposition
a. Additive
Decomposition - Additive
• The peak year was 1981. Sales decrease over time afterward.
Additive Decomposition - Trend, Seasonality and Residual in year
Multiplicative Decomposition - Trend, Seasonality and Residual in year
b. Multiplicative - Plot 10
Decomposition - Multiplicative
• Trend and seasonality are both present.
• The peak year was 1981. Sales decrease over time afterward.
2. Data Pre-processing
I. Train-test split
• The data from 1980 to 1990 is used as the training set, while the data from
1991 to 1995 is used as the testing set. This separation allows us to use the
earlier data for training models and the later data for testing their
performance.
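A chronological split like this one can be sketched as follows. The function `split_by_year` and the placeholder series are assumptions for illustration; the point is only that the split follows time order and is never shuffled.

```python
def split_by_year(series, last_train_year):
    """Split a {(year, month): value} mapping by year, without shuffling."""
    train = {k: v for k, v in series.items() if k[0] <= last_train_year}
    test = {k: v for k, v in series.items() if k[0] > last_train_year}
    return train, test


# Placeholder values keyed by (year, month); real sales would go here.
series = {(year, month): 0.0
          for year in range(1980, 1996)
          for month in range(1, 13)}
train, test = split_by_year(series, last_train_year=1990)
```

With monthly data from 1980 to 1990 the training set holds 132 months, which matches the observation count reported for the fitted models.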
Table 5 - Train and Test rows and columns
3. Model Building
(1) Linear Regression
Plot 12 - Linear Regression
(2) Moving Average (MA)
For the moving average model, we calculate rolling means (moving averages)
over several window lengths. The best window is the one that gives the
lowest error (RMSE) on the test data. The rolling means are computed over
the entire series.
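The window selection described above can be sketched in plain Python. The trailing-window convention, the helper names `rolling_mean` and `rmse`, and the sample values are assumptions for illustration, not the exact procedure or data used in the report.

```python
from math import sqrt


def rolling_mean(values, window):
    """Trailing mean over `window` points; None until a full window exists."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def rmse(actual, predicted):
    """Root mean squared error, skipping positions with no prediction."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


# Hypothetical series; the comparison would normally use the test period only.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
scores = {w: rmse(series, rolling_mean(series, w)) for w in (2, 4)}
best_window = min(scores, key=scores.get)
```

Shorter windows track the data more closely, so they often score the lowest RMSE in this kind of comparison.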
RMSE calculated for Moving Average:
(3) Simple Exponential Smoothing Model - Plot 14
A Simple Exponential Smoothing (SES) model is a time series forecasting
technique that applies weighted averages of past observations to make
future predictions. In SES, more recent observations are given exponentially
more weight compared to older observations, allowing the model to adapt
quickly to changes in the data.
The SES model is particularly useful for data with no clear trend or seasonal
pattern, as it effectively smooths out short-term fluctuations to reveal
longer-term trends or patterns.
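The weighting scheme can be sketched as a single recursive update. The names `ses_level` and `ses_forecast`, the choice of the first observation as the starting level, and the sample values are assumptions for illustration; libraries may initialize the level differently.

```python
def ses_level(values, alpha):
    """Final smoothed level after one pass: l_t = alpha*y_t + (1-alpha)*l_{t-1}."""
    level = values[0]  # assumption: start from the first observation
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def ses_forecast(values, alpha, horizon):
    """SES forecasts are flat: every future step equals the last level."""
    return [ses_level(values, alpha)] * horizon


# Hypothetical history, not taken from the Rose dataset.
history = [100.0, 110.0, 90.0]
forecast = ses_forecast(history, alpha=0.5, horizon=3)
```

Because every forecast step repeats the last level, SES cannot follow a trend or a seasonal cycle, which is why it suits series without either.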
RMSE
(4) Double Exponential Smoothing (Holt's Model)
Double Exponential Smoothing (DES), also known as Holt's Exponential
Smoothing, is an extension of Simple Exponential Smoothing that
incorporates both level and trend components to handle time series data
with trends.
The DES method helps in capturing both the level and the trend in the time
series, making it suitable for datasets where trends are present, thus
providing more accurate forecasts compared to Simple Exponential
Smoothing when trends exist in the data.
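A minimal sketch of the level and trend updates follows. The function name `holt_forecast`, the initialization (first value as level, first difference as trend), and the sample values are assumptions for illustration; other initializations are common. Here `alpha` smooths the level and `beta` smooths the trend.

```python
def holt_forecast(values, alpha, beta, horizon):
    """Holt's linear method: smooth a level and a trend, then extrapolate."""
    level = values[0]               # assumption: first value as initial level
    trend = values[1] - values[0]   # assumption: first difference as initial trend
    for y in values[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # The h-step forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical, perfectly linear history.
history = [10.0, 12.0, 14.0, 16.0]
forecast = holt_forecast(history, alpha=0.5, beta=0.5, horizon=2)
```

On a perfectly linear series the method continues the line, which is the behaviour that Simple Exponential Smoothing lacks.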
• Two smoothing parameters are estimated in this model: α for the level and
β for the trend.
Holt’s Model
RMSE
(5) Naive Approach
The Naive Approach is a simple and straightforward time series forecasting
method where the forecast for any future period is assumed to be equal to
the most recent actual observation.
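As a sketch, the whole method fits in one line. The function name `naive_forecast` and the sample values are assumptions for illustration.

```python
def naive_forecast(train, horizon):
    """Repeat the last observed value for every future period."""
    return [train[-1]] * horizon


# Hypothetical training values, not taken from the Rose dataset.
train = [120.0, 95.0, 80.0]
forecast = naive_forecast(train, horizon=4)
```

The forecast is a flat line at the final training value, so it serves mainly as a baseline that other models should beat.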
(6) Simple Average
The Simple Average Time Series Forecasting (TSF) model is a basic yet
effective method for predicting future values in a time series. It operates on
the principle that the forecasted value for a given period is the simple
average (arithmetic mean) of all previous observations. This model is
particularly useful for data with a consistent level over time and without
significant trends or seasonal patterns.
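A minimal sketch of this forecast, assuming the average is taken over the whole training set. The function name `simple_average_forecast` and the sample values are illustrative only.

```python
from statistics import mean


def simple_average_forecast(train, horizon):
    """Forecast every future period as the mean of all training observations."""
    return [mean(train)] * horizon


# Hypothetical training values, not taken from the Rose dataset.
train = [120.0, 90.0, 60.0]
forecast = simple_average_forecast(train, horizon=2)
```

On a series with a downward trend this forecast sits above the most recent values, which is why the method suits only series with a stable level.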
RMSE = 11.76
RMSE values, sorted, for all the models built
Dickey-Fuller Test
Conclusion:
• The test statistic (-1.933803) is higher (less negative) than the critical
values at the 1%, 5%, and 10% significance levels.
• The p-value (0.316330) is greater than typical significance thresholds (e.g.,
0.01, 0.05, 0.10).
• As a result, we fail to reject the null hypothesis that the time series has a
unit root (i.e., it is non-stationary).
• This suggests that the time series is likely non-stationary, meaning its
statistical properties such as mean and variance may change over time.
The Dickey-Fuller test, performed after differencing the data, is used to test
the null hypothesis that a unit root is present in the time series sample. Here
are the key points regarding the null hypothesis and the interpretation of the
test results:
1. Null Hypothesis (H0): The time series has a unit root (i.e., it is non-
stationary).
2. Alternative Hypothesis (H1): The time series does not have a unit root (i.e.,
it is stationary).
• The computed test statistic (-7.855944) is compared against the critical
values to determine whether to reject the null hypothesis.
• The p-value is extremely small, much less than typical significance levels
(e.g., 0.01, 0.05, 0.10). This indicates strong evidence against the null
hypothesis.
Comparison:
• The test statistic (-7.855944) is more negative than all the critical values at
the 1%, 5%, and 10% levels.
• Since the test statistic is much lower (more negative) than the critical
values, and the p-value is extremely small, we reject the null hypothesis.
Conclusion:
• Given that the test statistic (-7.855944) is much lower than the critical
values and the p-value is extremely small, we reject the null hypothesis
that the differenced series has a unit root.
• This indicates that the differenced series is stationary, meaning its
statistical properties such as mean and variance remain constant over time.
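The differencing step that precedes the second test can be sketched as below. The function name `difference` and the sample values are assumptions for illustration; the test itself is not reproduced here.

```python
def difference(values, lag=1):
    """Return y_t - y_(t-lag); the result is `lag` items shorter than the input."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


# Hypothetical series with a steady downward trend.
trending = [100.0, 95.0, 90.0, 85.0, 80.0]
diffed = difference(trending)
```

The original values fall steadily, so their mean changes over time, while the differenced values are constant at -5.0. Removing a trend this way is what lets a series pass a stationarity test, and the one observation lost corresponds to a differencing order of 1.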
Plot the Autocorrelation and the Partial Autocorrelation function plots on the whole data.
(1) Auto ARIMA (Auto-Regressive Integrated Moving Average)
For Rose wine sales analysis, the parameter d represents the differencing
required to render the series stationary. The for loop iterates over p and q
values ranging from 0 to 3, while a fixed value of 1 is assigned to d. This
choice is made because we had previously determined through the
Augmented Dickey-Fuller (ADF) test that a differencing order of 1 was
necessary to achieve stationarity.
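The search loop can be sketched as follows. This is only an outline of the grid and the selection step: `candidate_orders`, `select_by_aic`, and `placeholder_score` are hypothetical names, and `placeholder_score` is a stand-in that does not fit any model. In a real run it would be replaced by a function that fits an ARIMA model for the given order and returns its AIC.

```python
from itertools import product


def candidate_orders(max_p, max_q, d=1):
    """All (p, d, q) combinations with p in 0..max_p, q in 0..max_q, d fixed."""
    return [(p, d, q) for p, q in product(range(max_p + 1), range(max_q + 1))]


def select_by_aic(orders, aic_of):
    """Return the order for which aic_of(order) is lowest."""
    return min(orders, key=aic_of)


def placeholder_score(order):
    """Stand-in for a real AIC; it exists only to make the sketch runnable."""
    p, _, q = order
    return abs(p - 1) + abs(q - 2)


orders = candidate_orders(max_p=3, max_q=3, d=1)
best = select_by_aic(orders, placeholder_score)
```

With p and q each ranging from 0 to 3 and d fixed at 1, the grid contains 16 candidate orders.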
Some parameter combinations for the Model:
The summary report for the Auto ARIMA model gives an overview of the model's
fit and diagnostics. The dependent variable is "Rose" and 132 observations
were used. The chosen model is ARIMA(2, 1, 3), that is p = 2, d = 1 and
q = 3: auto-regressive order 2, differencing order 1 and moving average
order 3. The report gives the log likelihood together with an AIC (Akaike
Information Criterion) of 1274.695 and a BIC (Bayesian Information
Criterion) of 1291.946; lower AIC and BIC values indicate a better fit. It
also lists the estimated coefficients with their standard errors and
statistical significance.
The diagnostics include the Ljung-Box (Q) statistic (0.02), which checks for
autocorrelation left in the residuals, the Jarque-Bera (JB) statistic
(24.44), which checks whether the residuals are normally distributed, and
the Heteroskedasticity (H) statistic (0.4), which checks whether the
residual variance is constant. Skewness and kurtosis describe the shape of
the residual distribution. Together these figures show how well the Auto
ARIMA model captures the underlying patterns in the time series data.
Diagnostic Plot for auto ARIMA
Plot 23 - Diagnostic plot for auto ARIMA for the best auto ARIMA model
Components of SARIMA
1. Auto-Regressive (AR) part: Represents the correlation between the
current observation and a lagged (past) observation within the same
series.
2. Integrated (I) part: Involves differencing the raw observations to make
the time series stationary. This accounts for trends present in the data.
3. Moving Average (MA) part: Represents the correlation between the
current observation and a residual error from a moving average model
applied to lagged observations.
Additionally, SARIMA includes seasonal components:
1. Seasonal Auto-Regressive (SAR) part: Represents the correlation
between the current observation and a lagged observation within the
same series, but over seasonal intervals.
2. Seasonal Integrated (SI) part: Involves seasonal differencing to remove
seasonal trends from the data.
3. Seasonal Moving Average (SMA) part: Represents the correlation
between the current observation and a residual error from a moving
average model applied to lagged observations over seasonal intervals.
Overall, Auto SARIMA automates the search for the combination of seasonal
and non-seasonal parameters that fits the data best, judged here by the
lowest AIC.
For Rose wine sales analysis, the parameter d represents the differencing
required to render the series stationary; the Augmented Dickey-Fuller (ADF)
test had previously shown that a differencing order of 1 was necessary to
achieve stationarity. The for loop iterates over p, d and q values ranging
from 0 to 2. The parameter m represents the number of months in one
seasonal cycle and is kept at 12.
The minimum AIC value, 716.793, is obtained with p=0, d=1 and q=2 and
seasonal p=2, d=2, q=2 and m=12.
Top rows for Auto SARIMA based on the minimum AIC value
The log likelihood is -351.396, indicating how well the model aligns with
the data, while the AIC and BIC stand at 716.793 and 733.467, respectively,
serving as measures of model fit and complexity. The data used spans
January 31, 1980, to December 31, 1990. The covariance type is "opg" (outer
product of gradients). Parameter estimates give the coefficients of the
model terms with their standard errors and statistical significance. The
variance of the residuals, sigma2, is 274.1713. Diagnostic tests comprise
the Ljung-Box (Q) test, the Jarque-Bera (JB) test, and a test for
heteroskedasticity (H), each checking an assumption underlying the model.
Skewness and kurtosis further characterize the distribution of the
residuals. Overall, the summary covers the SARIMAX model's fit to the data,
the significance of its parameters, and its adherence to the underlying
assumptions.
RMSE
Plot 24 - Diagnostic plot for auto SARIMA for the best auto SARIMA model
Diagnostic Plot for Auto SARIMA
• Manual ARIMA
In manual ARIMA, the user manually selects appropriate values for p, d and q.
This approach requires a deep understanding of the data and the underlying
patterns to choose the most suitable parameters. Manual ARIMA is often
used when automated methods like Auto ARIMA or Auto SARIMA are not
available or when users prefer a more hands-on approach to model selection.
However, it can be time-consuming and may not always yield the best results
compared to automated approaches.
ACF plot on train data
PACF (Partial Auto-correlation) plot on train data
Values selected for manual ARIMA: p=1, d=1 and q=1
The manual ARIMA model, denoted as ARIMA(1, 1, 1), was fitted to the "Rose"
dataset comprising 132 observations. The model has a first-order
auto-regressive component (p=1), a first-order moving average component
(q=1) and first-order differencing (d=1). The log likelihood is -637.287,
with AIC and BIC values of 1280.574 and 1289.200, respectively, and an HQIC
of 1284.079. The estimated coefficients are 0.1814 for the auto-regressive
term and -0.9192 for the moving average term. The variance of the residuals
(sigma2) is 972.5964. The Ljung-Box test for autocorrelation has a p-value
of 0.98, indicating no significant autocorrelation in the residuals. The
Jarque-Bera test for normality has a p-value of 0.00, indicating that the
residuals are not normally distributed, and the heteroskedasticity test has
a p-value of 0.00, suggesting heteroskedasticity is present. Overall, the
residuals show no remaining autocorrelation, but they are neither normal
nor of constant variance.
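The reported AIC and BIC can be recomputed from the log likelihood using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L). The sketch below rests on two assumptions that the report does not state: k = 3 estimated parameters (the AR coefficient, the MA coefficient and sigma2) and n = 131 effective observations (132 minus the one lost to first differencing).

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * log(n_obs) - 2 * log_likelihood


log_likelihood = -637.287  # reported for ARIMA(1, 1, 1)
n_params = 3               # assumption: AR, MA and sigma2
n_obs = 131                # assumption: 132 observations minus 1 for differencing

model_aic = aic(log_likelihood, n_params)
model_bic = bic(log_likelihood, n_params, n_obs)
```

Under these assumptions the results agree with the reported 1280.574 and 1289.200 to the precision shown. BIC penalizes each extra parameter by ln(n) rather than 2, so it favours smaller models than AIC on a series of this length.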
RMSE value for Manual ARIMA
• Manual SARIMA
The manual SARIMA (Seasonal Auto-Regressive Integrated Moving Average)
model is a technique for time series forecasting where the user manually
selects the values of the SARIMA parameters to capture both the seasonal
and non-seasonal patterns present in the data.
Once the parameters are selected, the manual SARIMA model is fitted to the
data, and forecasts can be generated for future time points. Diagnostic tests
and evaluation metrics are then used to assess the model's performance and
determine if adjustments to the parameter values are necessary.
Manual SARIMA offers flexibility and control over the modeling process, but it
requires expertise in time series analysis and a deep understanding of the
data to select appropriate parameter values that result in accurate forecasts.
6. Compare the performance of the models
• Rebuild the best model using the entire data - Make a forecast for
the next 12 months
After comparing all the models we built, the triple exponential smoothing
(Holt-Winters) model yields the lowest RMSE and is therefore the best
choice. We rebuild this model on the entire data and use it to forecast the
next 12 months.
Forecasts and confidence intervals collected into a DataFrame.
Future predicted plot
• July Onwards: Sales begin to increase.
• The highest wine sale was recorded in the year 1981.
3. Seasonal Influence:
Wine sales are significantly influenced by seasonal changes, with an increase
during the festival season and a drop during peak winter (January).
Recommendations
• Focus on marketing campaigns from April to June, when sales are low, to
boost overall annual performance.
• Running campaigns during peak periods might not significantly impact
sales, as they are already high.
• Explore reasons behind the decline in Rose wine popularity and adjust
production and marketing strategies as needed to regain market share.