TSF – ROSE
REPORT
DSBA
Contents
Problem:
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build all the exponential smoothing models on the training data and evaluate them using RMSE on the test data. Other models such as regression, naïve forecast and simple average models should also be built on the training data, and their performance checked on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built using appropriate statistical tests, and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take appropriate steps to make it stationary. Check the new data for stationarity and comment. Note: stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the lowest Akaike Information Criterion (AIC) on the training data, and evaluate this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data, and evaluate these models on the test data using RMSE.
8. Build a table (create a data frame) with all the models built along with their corresponding parameters and the respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built, report your findings and suggest the measures that the company should be taking for future sales.
Problem
You as an analyst have been tasked with performing a thorough analysis of the data
and coming up with insights to improve the marketing campaign. For this particular
assignment, the data of different types of wine sales in the 20th century is to be
analysed. Both datasets are from the same company but for different wines. As
an analyst at ABC Estate Wines, you are tasked to analyse and forecast Wine
Sales in the 20th century.
Data set for the Problem: [Link] and [Link]
1. Read the data as an appropriate Time Series data and plot the data.
Before modelling, an analyst needs to understand the formal properties of the
dataset: what kind of data it contains and how it behaves when plotted as a
time series.
Solution:
a) Dimensions of the Dataset = 187 Rows x 2 Columns.
b) Nature of Variables (Datum) present in Dataset:
Nature of Datum - Table
S/No Column Datatype
1 YearMonth Integer
2 Rose Integer
c) Displaying First and Last 5 Rows including last date of each month:
Top Five Rows Last Five Rows
Functions & Methods Used:
a) shape is used to get the size of the Dataset, i.e. its dimensions.
b) info() is used to get the nature (datatype) of all variables (datum) present
in the Dataset.
c) head() is used to display the first 5 rows of data by default.
d) tail() is used to display the last 5 rows of data by default.
d) Plotting of Data-Line Plot:
Before plotting, the 'YearMonth' column is split into separate 'Year' and
'Month' columns, and the 'Rose' column is renamed 'Rose Sales' for better
clarity and understanding.
So, the new dataset consists of 187 Rows and 3 Columns.
Top Five Rows Last Five Rows
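The preprocessing described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the sample values are made up, and the real dataset has 187 monthly rows loaded from a file.

```python
import pandas as pd

# Made-up sample rows standing in for the real Rose dataset (187 rows).
raw = pd.DataFrame({
    "YearMonth": ["1980-01", "1980-02", "1980-03"],
    "Rose": [112, 118, 129],
})

# Parse YearMonth into a month-end DatetimeIndex so pandas treats it as a time series.
raw["YearMonth"] = pd.to_datetime(raw["YearMonth"]) + pd.offsets.MonthEnd(0)
df = raw.set_index("YearMonth").rename(columns={"Rose": "Rose Sales"})

# Derive the extra Year and Month columns used in the later EDA plots.
df["Year"] = df.index.year
df["Month"] = df.index.month_name()
```

Setting a proper DatetimeIndex is what lets the later resampling, decomposition and train/test splitting work on dates directly.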
2. Perform appropriate Exploratory Data Analysis to understand the data and
also perform decomposition.
A preliminary analysis of the variables in the Dataset yields summary
statistics such as standard deviation, mean, median and mode, and in
particular includes a check for null values.
Solution:
a) Description for newly plotted Dataset:
b) Null Value Check:
Based on the results we can conclude that there are 2 null values present
in the given dataset, so the mean of the column is used to fill them.
c) Plotting of Complete Dataset Variables-Box Plot:
From the box plots we are able to identify the outliers present in the
variables. Since the outliers do not have much impact on our required data,
we have decided not to treat them.
Box Plot – Weekday Wise:
It shows that 'Tuesday' has higher sales compared to the other days,
'Wednesday' stands at the lowest, and outliers are absent on 'Friday' and
'Thursday'.
d) Plot for Monthly Rose Sales over the Years:
Month vs Year Graph
This plot shows that Rose Sales are highest in the year 1981 compared to
the other years, and month-wise they are highest in December.
e) Correlation Map:
Analysis can be done by correlating the variables with one another so that
we can gain deeper insights into the relationships among the variables in
the dataset.
Heat Map
From the above heatmap we can conclude that there is a high Correlation
among the variables ‘Month & Rose Sales’ and low Correlation between
‘Year & Rose Sales’ in the dataset.
f) Empirical Cumulative Distribution Function:
In statistics, an empirical distribution function (commonly also called an
empirical Cumulative Distribution Function, eCDF) is the distribution
function associated with the empirical measure of a sample. This cumulative
distribution function is a step function that jumps up by 1/n at each of
the n data points. Its value at any specified value of the measured variable
is the fraction of observations of the measured variable that are less than or
equal to the specified value.
The empirical distribution function is an estimate of the cumulative
distribution function that generated the points in the sample.
Observations:
This graph shows:
50% of sales have been less than 100;
the highest value is 250;
and almost 90% of sales have been less than 150.
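The eCDF definition above can be computed directly with NumPy. This is a sketch on made-up sales values (not the report's data), showing the 1/n step function and how to read off the fraction of observations at or below a threshold.

```python
import numpy as np

# Made-up sales values for illustration only.
sales = np.array([45, 80, 95, 110, 130, 150, 210, 250])

x = np.sort(sales)
y = np.arange(1, len(x) + 1) / len(x)  # step function jumping by 1/n at each point

# Fraction of observations <= 150 (read off the eCDF at x = 150).
frac_le_150 = y[np.searchsorted(x, 150, side="right") - 1]
```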
g) Decomposition Plot :
Additive Model:
In an additive decomposition, the series is expressed as the sum of its
trend, seasonal and residual components:
Y(t) = T(t) + S(t) + R(t)
This model is appropriate when the magnitude of the seasonal fluctuations
stays roughly constant regardless of the level of the series.
Observations:
a) It shows the Rose Sales & Trend peaks at 1981.
b) After 1981, Trend & Rose Sales are at normal rate.
c) Here Seasonal & Residue are present to the satisfactory level.
Multiplicative Model:
In a multiplicative decomposition, the seasonal pattern (and the variance)
grows as the level of the series increases. The trend, seasonal and
residual components are multiplied together:
Y(t) = S(t) x T(t) x R(t)
The model is non-linear; the trend can be a curved line and the seasonality
can have increasing or decreasing amplitude over time.
Observations:
a) It shows the Rose Sales & Trend peaks at 1981.
b) After 1981, Trend & Rose Sales are at normal rate.
c) Here Seasonal & Residue are present to a satisfactory level. Since the
residuals are in a lower range, this model is selected for further
analysis.
Functions & Methods & Plots Used:
a) isnull().sum() helps to count missing values in each column by
default, and in each row with axis=1.
b) describe().T is used to view some basic statistical details like
percentile, mean, std, etc. of a data frame or a series of numeric
values.
c) Box Plot - A box plot, also known as a whisker plot, displays a
summary of a set of data values through the minimum, first quartile,
median, third quartile and maximum. A box is drawn from the first
quartile to the third quartile, with a vertical line through the box at
the median. Here the x-axis denotes the data being plotted while the
y-axis shows the frequency distribution. (Reference:
[Link])
d) Heatmap - A heatmap is a graphical representation of data that uses
colours to visualize the values of a matrix. Brighter (typically
reddish) colours represent more common values or higher activity, and
darker colours represent less common values or lower activity. A
heatmap is also known as a shading matrix. (Reference:
[Link])
3. Split the data into training and test. The test data should start in 1991.
Solution:
a) Plot for the Test and Training data:
Based on the plot we can see that the segregated test data starts from
January of 1991 and continues till the end.
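The date-based split can be sketched as below, on a toy monthly index standing in for the Rose series: the test set starts in January 1991, everything earlier is training data.

```python
import pandas as pd

# Toy monthly frame standing in for the Rose dataset (illustration only).
idx = pd.date_range("1980-01-01", "1995-07-01", freq="MS")
df = pd.DataFrame({"Rose Sales": range(len(idx))}, index=idx)

# Everything before 1991 trains the model; 1991 onward is held out for testing.
train = df.loc[:"1990-12-31"]
test = df.loc["1991-01-01":]
```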
b) Description for the Test and Training data:
4. Build all the exponential smoothing models on the training data and evaluate
them using RMSE on the test data. Other models such as regression,
naïve forecast and simple average models should also be built on the
training data, and their performance checked on the test data using RMSE.
• Model 1: Linear Regression
• Model 2: Naive Approach
• Model 3: Simple Average
• Model 4: Moving Average(MA)
• Model 5: Simple Exponential Smoothing
• Model 6: Double Exponential Smoothing (Holt's Model)
• Model 7: Triple Exponential Smoothing (Holt - Winter's Model)
Solution:
a) Model 1: Linear Regression:
Observations:
The model's predictions are shown by the green line, while the test results are shown by the
orange values. It is evident that the real values are substantially different from the expected
ones.
The RMSE measure was used to assess the model. The RMSE for this model is shown
below.
b) Model 2: Naïve Approach:
Observations:
The model's predictions are shown by the green line, while the test results
are shown by the orange values. It is evident that the real values are
substantially different from the expected ones.
The RMSE measure was used to assess the model. The RMSE for this
model is shown below.
c) Model 3: Simple Average:
Observations:
a) While the orange numbers represent the actual test results, the
green line represents the model's predictions. It is obvious that the
anticipated numbers are wildly different from the actual values.
The RMSE metric was employed to evaluate the model. The
model's RMSE is displayed below.
d) Model 4: Moving Average (MA):
Observations:
The RMSE measure was used to evaluate the model. The RMSE for this
model is shown below.
We developed a number of moving average models with rolling windows
ranging from 2 to 9. A rolling average is superior to a simple
average since it predicts using only the previous n values, where n is the
defined rolling window. This takes recent developments into account
and is generally more accurate. The higher the rolling window, the smoother
the curve, because more values are taken into account.
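The trailing moving-average forecast described above can be sketched in a couple of lines: the mean of the last n training points (n = rolling window) serves as the flat forecast. The values here are made up.

```python
import pandas as pd

# Made-up training values (illustration only).
train = pd.Series([100, 120, 90, 110, 130, 105])

# The last rolling-window mean is the flat forecast for every future step.
forecasts = {w: train.rolling(w).mean().iloc[-1] for w in (2, 4)}
# forecasts[2] averages the last 2 points; forecasts[4] the last 4.
```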
e) Model 5: Simple Exponential Smoothing:
Observations:
The RMSE measure was used to evaluate the model. The RMSE for this model is shown
below.
f) Model 6: Double Exponential Smoothing (Holt's Model):
Observations:
RMSE was used to evaluate the model. This model's RMSE is shown below.
g) Model 7: Triple Exponential Smoothing (Holt-Winter's Model):
Observations:
The green colour line in the above plot represents the output for the best alpha, beta, and
gamma values. The best model had both a multiplicative trend and seasonality.
So far, this is the most effective model.
The RMSE measure was used to evaluate the model. The RMSE for this model is shown
above.
5. Check for the stationarity of the data on which the model is being built on
using appropriate statistical tests and also mention the hypothesis for the
statistical test. If the data is found to be non-stationary, take appropriate steps
to make it stationary. Check the new data for stationarity and comment. Note:
Stationarity should be checked at alpha = 0.05.
Solution:
Check for stationarity of the whole Time Series data.
The Augmented Dickey-Fuller test is a unit root test that determines whether
a unit root is present and hence whether the series is non-stationary.
The hypothesis in a simple form for the ADF test is:
H0 : The Time Series has a unit root and is thus non-stationary.
H1 : The Time Series does not have a unit root and is thus stationary.
We would want the series to be stationary for building ARIMA models and thus we
would want the p-value of this test to be less than the α value.
We see that at the 5% significance level the Time Series is non-stationary.
6. Build an automated version of the ARIMA/SARIMA model in which the
parameters are selected using the lowest Akaike Information Criteria (AIC) on
the training data and evaluate this model on the test data using RMSE.
Solution:
a) Auto ARIMA Model:
We used a for loop to find the best values of p,d,q, where p represents the
order of the AR (Auto-Regressive) part of the model and q represents the
order of the MA (Moving Average) part of the model. d is the amount of
differencing required to make the series stationary. The for loop was given
p,q values in the range (0,4), but d was given a fixed value of 1 because we
had already found d to be 1 while checking for stationarity using the ADF
test.
Some model parameter combinations are shown below. For each of these
models, the Akaike information criterion (AIC) value was calculated, and
the model with the lowest AIC value was chosen.
b) Auto SARIMA Model:
A for loop identical to the auto ARIMA one was used with the values
provided below, resulting in the models shown below.
p = q = range(0, 4)
d = D = range(0, 2)
pdq = list(itertools.product(p, d, q))
model_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, D, q))]
For each of these models, the Akaike information criterion (AIC) value was
calculated, and the model with the lowest AIC value was chosen. Only the
top 5 models are shown here.
The summary report for the best SARIMA model, with values (3,1,1)
(3,0,2,12), is shown below.
We also plotted the residual diagnostics to check whether any additional
information could be extracted or whether all relevant information had already
been captured. The diagnostic plots for the best auto SARIMA model are shown below.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF
on the training data and evaluate this model on the test data using RMSE.
Solution:
a) Manual ARIMA Model:
Below are the PACF and ACF plots for the training data.
Based on the cut-off points in these plots, the values selected for the
manual ARIMA model are p=2, d=1, q=2. The summary of this manual ARIMA
model follows.
b) Manual SARIMA Model:
Looking at the ACF and PACF plots for the training data, we can clearly see significant
spikes at lags 12, 24, 36, 48, etc., indicating a seasonality of 12. The parameters used
for the manual SARIMA model are as below: SARIMAX(2, 1, 2)x(2, 1, 2, 12)
Below is the summary of the manual SARIMA model
The triple exponential smoothing model with alpha 0.1, beta 0.7, and
gamma 0.2 is clearly the best because it has the lowest RMSE score.
8. Build a table (create a data frame) with all the models built along with their
corresponding parameters and the respective RMSE values on the test data.
Solution:
We can clearly see that the triple exponential smoothing model with alpha 0.1, beta 0.7
and gamma 0.2 is the best, as it has the lowest RMSE score.
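Building the comparison table amounts to collecting each model's test RMSE into a data frame and sorting it. The model names below follow the report, but the RMSE numbers are placeholders, not the report's actual scores.

```python
import pandas as pd

# Placeholder RMSE values for illustration; the real table uses the scores
# computed on the Rose test data.
results = pd.DataFrame(
    {"Test RMSE": [51.4, 79.7, 53.5, 14.3]},
    index=[
        "Linear Regression",
        "Naive Approach",
        "Simple Average",
        "Triple Exponential Smoothing (alpha=0.1, beta=0.7, gamma=0.2)",
    ],
)
results = results.sort_values("Test RMSE")  # best (lowest RMSE) model first
```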
9. Based on the model-building exercise, build the most optimum model(s) on
the complete data and predict 12 months into the future with appropriate
confidence intervals/bands.
Solution:
Based on the above comparison of all the various models that we built, we can
conclude that the triple exponential smoothing (Holt-Winters) model gives the
lowest RMSE, hence it is the most optimum model. The sales predictions made
by this model are plotted below along with the confidence intervals.
Predictions one year into the future are shown in orange, while the confidence
interval is shown in grey.
10. Comment on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.
Solution:
The review of wine sales data shows a definite negative trend for the company's
Rose wine variety, which has been dropping in popularity for more than a decade.
Seasonal fluctuations have a significant impact on wine sales, with sales increasing
during festival season and decreasing during peak winter months, such as January.
Because sales are low during this time of year, the company should consider
advertising campaigns to increase wine consumption for the rest of the year.
Campaigns during the lean season (April to June) may offer the best benefits for the
firm because sales are low at this time, and increasing them would improve the
wine's overall success in the market throughout the year.
Running advertisements during peak seasons (such as during festivals) may have
little impact on sales because they are already high at this time of year.
Advertising during the peak winter months (January) is not advised because people
are less inclined to purchase wine owing to climatic factors, and running
advertisements during this period may not change people's opinions.
The corporation should also investigate the reasons for the drop in popularity of the
Rose wine varietal, and if necessary, overhaul its production and marketing strategy
to reclaim market share.
THANK YOU