
Unit-5 PREDICTIVE ANALYTICS

Linear Least Square

The least squares method is the process of finding the best-fitting curve, or line of best fit, for a set of data points by minimizing the sum of the squares of the offsets (residuals) of the points from the curve. In the process of finding the relation between two variables, the trend of the outcomes is estimated quantitatively; this process is termed regression analysis. Curve fitting is one approach to regression analysis, and the least squares method is the standard way of fitting an equation that approximates a curve to the given raw data.

It is quite obvious that the fitting of curves for a particular data set is not always unique. Thus, it is required to find a curve having minimal deviation from all the measured data points. This is known as the best-fitting curve, and it is found by using the least-squares method.

Least Square Method Definition

The least-squares method is a fundamental statistical method used to find a regression line, or best-fit line, for a given pattern of data. The method is described by an equation with specific parameters. It is widely used in estimation and regression; in regression analysis, it is the standard approach for approximately solving sets of equations that contain more equations than unknowns.

The method of least squares defines the solution as the one that minimizes the sum of squares of the deviations, or errors, in the result of each equation. The formula for the sum of squared errors, given below, also measures the variation in the observed data.

The least-squares method is often applied in data fitting. The best-fit result minimizes the sum of squared errors, or residuals, which are the differences between the observed or experimental values and the corresponding fitted values given by the model.

There are two basic categories of least-squares problems:

 Ordinary or linear least squares


 Nonlinear least squares
These depend upon the linearity or nonlinearity of the residuals. Linear least-squares problems occur frequently in regression analysis in statistics. Nonlinear least-squares problems, on the other hand, are generally solved by iterative refinement, in which the model is approximated by a linear one at each iteration.

Least Square Method Graph


In linear regression, the line of best fit is a straight line.

The line is fitted by minimizing the residuals, or offsets, of each data point from the line. Vertical offsets are the ones generally minimized in common practice for line, polynomial, surface and hyperplane fitting, while perpendicular offsets are used in other settings such as total least squares.

Least Square Method Formula

The least-squares method states that the curve that best fits a given set of observations is the curve having the minimum sum of squared residuals (deviations or errors) from the given data points. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in which all x's are independent variables, while all y's are dependent ones. Also, suppose that f(x) is the fitting curve and d represents the error, or deviation, of the curve from each given point.

Now, we can write:

d1 = y1 − f(x1)

d2 = y2 − f(x2)

d3 = y3 − f(x3)

…..

dn = yn – f(xn)

The least-squares criterion states that the best-fitting curve has the property that the sum of squares of all the deviations from the given values is a minimum, i.e.:

d1² + d2² + d3² + … + dn² = minimum

Suppose when we have to determine the equation of line of best fit for the given data, then we
first use the following formula.

The equation of least square line is given by Y = a + bX

Normal equation for ‘a’:

∑Y = na + b∑X

Normal equation for ‘b’:

∑XY = a∑X + b∑X²

Solving these two normal equations gives the values of a and b, and hence the required trend line (line of best fit) Y = a + bX.


Solved Example

The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) passes through
the point (xa, ya) where xa is the average of the xi‘s and ya is the average of the yi‘s. The below
example explains how to find the equation of a straight line or a least square line using the least
square method.

Question:

Consider the time series data given below:

xi 8 3 2 10 11 3 6 5 6 8

yi 4 12 1 12 9 4 9 6 1 14

Use the least square method to determine the equation of line of best fit for the data. Then plot
the line.

Solution:

Mean of xi values = (8 + 3 + 2 + 10 + 11 + 3 + 6 + 5 + 6 + 8)/10 = 62/10 = 6.2

Mean of yi values = (4 + 12 + 1 + 12 + 9 + 4 + 9 + 6 + 1 + 14)/10 = 72/10 = 7.2

Straight line equation is y = a + bx.

The normal equations are

∑y = an + b∑x

∑xy = a∑x + b∑x²

x      y      x²      xy
8      4      64      32
3      12     9       36
2      1      4       2
10     12     100     120
11     9      121     99
3      4      9       12
6      9      36      54
5      6      25      30
6      1      36      6
8      14     64      112

∑x = 62    ∑y = 72    ∑x² = 468    ∑xy = 503

Substituting these values in the normal equations,

10a + 62b = 72….(1)

62a + 468b = 503….(2)

(1) × 62 – (2) × 10,

620a + 3844b – (620a + 4680b) = 4464 – 5030

-836b = -566

b = 566/836

b = 283/418

b = 0.677

Substituting b = 0.677 in equation (1),

10a + 62(0.677) = 72

10a + 41.974 = 72

10a = 72 – 41.974

10a = 30.026

a = 30.026/10
a = 3.0026

Therefore, the equation becomes,

y = a + bx

y = 3.0026 + 0.677x

This is the required trend line equation.

Now, we can find the sum of squares of deviations from the obtained values as:

d1 = [4 – (3.0026 + 0.677*8)] = (-4.4186)

d2 = [12 – (3.0026 + 0.677*3)] = (6.9664)

d3 = [1 – (3.0026 + 0.677*2)] = (-3.3566)

d4 = [12 – (3.0026 + 0.677*10)] = (2.2274)

d5 = [9 – (3.0026 + 0.677*11)] =(-1.4496)

d6 = [4 – (3.0026 + 0.677*3)] = (-1.0336)

d7 = [9 – (3.0026 + 0.677*6)] = (1.9354)

d8 = [6 – (3.0026 + 0.677*5)] = (-0.3876)


d9 = [1 – (3.0026 + 0.677*6)] = (-6.0646)

d10 = [14 – (3.0026 + 0.677*8)] = (5.5814)

∑d² = (-4.4186)² + (6.9664)² + (-3.3566)² + (2.2274)² + (-1.4496)² + (-1.0336)² + (1.9354)² + (-0.3876)² + (-6.0646)² + (5.5814)² = 159.2799
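
As a quick check, the same fit can be reproduced in R with the built-in lm() function. A minimal sketch (small differences in the intercept arise because b was rounded to three decimals in the hand calculation above):

R
# Data from the worked example above
x <- c(8, 3, 2, 10, 11, 3, 6, 5, 6, 8)
y <- c(4, 12, 1, 12, 9, 4, 9, 6, 1, 14)

# Fit the least-squares line y = a + b*x
fit <- lm(y ~ x)
coef(fit)
# Intercept is approximately 3.002 and the slope approximately 0.677

# Sum of squared residuals; compare with the 159.28 computed by hand
sum(resid(fit)^2)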

Limitations for Least-Square Method

The least-squares method is a very beneficial method of curve fitting. Despite many benefits,
it has a few shortcomings too. One of the main limitations is discussed here.

In regression analysis, which utilizes the least-squares method for curve fitting, it is implicitly assumed that the errors in the independent variables are negligible or zero. When those errors are non-negligible, the data are subject to measurement error and an ordinary least-squares fit is no longer appropriate: the parameter estimates, hypothesis tests and confidence intervals can all be distorted by the errors in the independent variables, and measurement-error (errors-in-variables) models are needed instead.

The goodness-of-fit test is used to check whether sample data are consistent with a hypothesized population distribution, for example a normal distribution or a Weibull distribution. In simple words, it signifies whether the sample data adequately represent the data we would expect to find in the actual population. The following tests are generally used by statisticians:

 Chi-square
 Kolmogorov-Smirnov
 Anderson-Darling
 Shapiro-Wilk

Example

A toy company builds football player toys. It claims that 30% of the toys are mid-fielders, 60% are defenders, and 10% are forwards. A random sample of 100 toys contains 50 mid-fielders, 45 defenders, and 5 forwards. At the 0.05 level of significance, can you justify the company's claim?

Solution:

Determine Hypotheses

 Null hypothesis H0 - The proportions of mid-fielders, defenders, and forwards are 30%, 60% and 10%, respectively.
 Alternative hypothesis H1 - At least one of the proportions in the null hypothesis is false.
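
The test statistic is the chi-square goodness-of-fit statistic: χ² = Σ(O − E)²/E = (50 − 30)²/30 + (45 − 60)²/60 + (5 − 10)²/10 ≈ 13.33 + 3.75 + 2.50 = 19.58, which exceeds the critical value of 5.99 for 2 degrees of freedom at the 0.05 level, so the company's claim is rejected. A minimal sketch of the same test in R, using the built-in chisq.test function:

R
# Observed counts and claimed proportions from the example above
observed <- c(midfielders = 50, defenders = 45, forwards = 5)
claimed  <- c(0.30, 0.60, 0.10)

# Chi-square goodness-of-fit test at the 0.05 significance level
chisq.test(x = observed, p = claimed)
# X-squared is about 19.58 on 2 degrees of freedom; the p-value is well
# below 0.05, so the claimed proportions are rejected.
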
Once the degree of relationship between variables has been established using correlation analysis, it is natural to delve into the nature of that relationship. Regression analysis helps in determining the cause-and-effect relationship between variables: the value of one variable (the dependent variable) can be predicted from the values of the independent variables, using either a graphical method or an algebraic method.

Graphical Method

It involves drawing a scatter diagram with independent variable on X-axis and dependent
variable on Y-axis. After that a line is drawn in such a manner that it passes through most of
the distribution, with remaining points distributed almost evenly on either side of the line.

A regression line, known as the line of best fit, summarizes the general movement of the data. It shows the best mean values of one variable corresponding to mean values of the other. The regression line is based on the criterion that it is the straight line that minimizes the sum of squared deviations between the predicted and observed values of the dependent variable.

Algebraic Method

The algebraic method develops two regression equations: Y on X, and X on Y.

Regression equation of Y on X

Y = a + bX

Where −

 Y = Dependent variable
 X = Independent variable
 a = Constant showing the Y-intercept
 b = Constant showing the slope of the line

The values of a and b are obtained from the following normal equations:

∑Y = Na + b∑X

∑XY = a∑X + b∑X²

Where −

 N = Number of observations

Regression equation of X on Y

X = a + bY

Where −

 X = Dependent variable
 Y = Independent variable
 a = Constant showing the intercept (the value of X when Y = 0)
 b = Constant showing the slope of the line

The values of a and b are obtained from the following normal equations:

∑X = Na + b∑Y

∑XY = a∑Y + b∑Y²

Where −

 N = Number of observations
For a worked example (data not shown), the regression equation of Y on X works out as

Y = 19.96 − 0.713X

and the regression equation of X on Y as

X = 22.58 + 0.653Y

Testing a linear model

The very first step after building a linear regression model is to check whether the model meets the assumptions of linear regression. These assumptions are a vital part of assessing whether the model is correctly specified. This section goes over what the assumptions of linear regression are and how to test whether they are met using R.
What are the Assumptions of Linear Regression?

There are primarily five assumptions of linear regression. They are:

1. There is a linear relationship between the predictors (x) and the outcome (y)

2. Predictors (x) are independent and observed with negligible error

3. Residual Errors have a mean value of zero

4. Residual Errors have constant variance

5. Residual Errors are independent from each other and predictors (x)

How to Test the Assumptions of Linear Regression?

Assumption One: Linearity of the Data

We can check the linearity of the data by looking at the Residuals vs Fitted plot. Ideally, this plot shows no distinct pattern, and the red line (a loess smoother) is approximately horizontal at zero.

Here is the code: plot(model, 1)
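
For concreteness, here is a minimal sketch that fits a hypothetical model to R's built-in mtcars data; the object name model used here is what the diagnostic calls in the rest of this section refer to:

R
# Fit an illustrative linear regression model (hypothetical example data)
model <- lm(mpg ~ wt, data = mtcars)

# Residuals vs Fitted plot (diagnostic plot 1) to check linearity
plot(model, 1)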

This is what we don’t want to see:


In the above plot, we can see that there is a clear pattern in the residual plot. This would indicate
that we failed to meet the assumption that there is a linear relationship between the predictors
and the outcome variable.

Assumption Two: Predictors (x) are Independent & Observed with Negligible Error

The easiest way to check the assumption of independence is the Durbin-Watson test. We can conduct this test using the durbinWatsonTest function from the car package on our model. Running the test gives an output with a p-value, which helps determine whether the assumption is met or not.

Here is the code: durbinWatsonTest(model)

The null hypothesis states that the errors are not auto-correlated with themselves (they are
independent). Thus, if we achieve a p-value > 0.05, we would fail to reject the null hypothesis.
This would give us enough evidence to state that our independence assumption is met!
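
A minimal sketch, continuing with the illustrative model fitted above (durbinWatsonTest comes from the car package, which must be installed and loaded first):

R
# Durbin-Watson test for autocorrelation of the residuals
library(car)
durbinWatsonTest(model)
# A p-value greater than 0.05 means we fail to reject the null hypothesis
# of independent (non-autocorrelated) errors, so the assumption is met.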

Assumption Three: Residual Errors have a Mean Value of Zero


We can easily check this assumption by looking at the same residual vs fitted plot. We would
ideally want to see the red line flat on 0, which would indicate that the residual errors have a
mean value of zero.

In the above plot, we can see that the red line is above 0 for low fitted values and high fitted
values. This indicates that the residual errors don’t always have a mean value of 0.

Assumption Four: Residual Errors have Constant Variance

We can check this assumption using the Scale-Location plot. In this plot we can see the fitted
values vs the square root of the standardized residuals. Ideally, we would want to see the
residual points equally spread around the red line, which would indicate constant variance.

Here is the code: plot(model, 3)

This is what we want to see:


This is what we don't want to see:

In the above plot, we can see that the residual points are not all equally spread out. Thus, this
assumption is not met. One common solution to this problem is to calculate the log or square
root transformation of the outcome variable.
We can also use the non-constant variance (NCV) test, via the ncvTest function from the car package, to check this assumption. Make sure you install the car package prior to running the test.

Here is the code: ncvTest(model)

This will output a p-value which will help you determine whether your model follows the
assumption or not. The null hypothesis states that there is constant variance. Thus, if you get a
p-value> 0.05, you would fail to reject the null. This means you have enough evidence to state
that your assumption is met!
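
A minimal sketch, again using the illustrative model fitted earlier:

R
# Scale-Location plot (diagnostic plot 3) to inspect the spread of residuals
plot(model, 3)

# Non-constant variance (NCV) test from the car package
library(car)
ncvTest(model)
# A p-value greater than 0.05 means we fail to reject the null hypothesis
# of constant error variance.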

Assumption Five: Residual Errors are Independent from Each Other & Predictors (x)

Checking this assumption requires knowledge of the study design or the data-collection process; it cannot usually be verified from the fitted model alone.

Weighted resampling

Resampling is a fundamental technique in data science, widely used to improve the accuracy
and efficiency of statistical models. This method is particularly significant when dealing with
small or imbalanced datasets. By understanding and applying resampling methods, data
scientists can gain more insights from their data, make better predictions, and enhance the
generalisability of their models.

This section covers the intricacies of resampling in data science, its goals, the different types available, and common errors encountered. For anyone venturing into the realm of data science or looking to sharpen their analytical skills, grasping the concept of resampling is essential.

Let's explore resampling in more detail and consider how it can be a powerful tool in a data
scientist's arsenal.

What is Resampling in Data Science?

Resampling in data science refers to repeatedly drawing samples from a given data set and
recalculating statistics on these samples. This technique is used to estimate the accuracy of
sample statistics by using subsets of accessible data or drawing randomly with replacement.

Resampling provides a flexible and robust method for making statistical inferences or
predictions when the traditional assumptions of classical statistical tests cannot be satisfied
or when sample sizes are too small for conventional methods.
The method is central to many modern statistical techniques, including bootstrapping and
cross-validation, which are instrumental in validating models and making them reliable for
predictive analytics.

Resampling helps assess the stability of the models and their performance metrics by
simulating the sampling process from the underlying population multiple times. This allows
data scientists to understand variability and bias more comprehensively, thus enhancing the
decision-making process in predictive modelling and hypothesis testing.

Goals of Data Resampling

The primary goals of resampling methodologies in data science encompass a variety of objectives, each aiming to fortify the model's predictive accuracy and reliability:

1. Estimation of Sampling Distribution: Resampling enables the estimation of the distribution of sample statistics without requiring complex mathematical formulas or assumptions about the population. This is particularly useful for generating confidence intervals and testing hypotheses.
2. Model Validation: Validating a model’s performance in various simulated
environments is crucial. Resampling techniques like cross-validation help understand
how a model generalises to an independent data set, which is essential for practical
applications.
3. Handling Overfitting: Data scientists can detect overfitting in predictive models by
using resampling methods. Techniques like k-fold cross-validation force the model to
prove its effectiveness on multiple train-test splits, ensuring robustness.
4. Improving Model Accuracy: Resampling can improve model accuracy by allowing
multiple iterations and tweaks based on consistently updated data samples, refining
the model progressively.
5. Mitigating Imbalance in Data: In cases of class imbalance, resampling techniques
such as up-sampling the minority class or down-sampling the majority class can help
create a more balanced data environment, leading to fairer and more accurate model
predictions.
6. Feature Selection: By repeatedly resampling the data, it is possible to identify which
features consistently contribute to predictive accuracy, helping in effective feature
selection, which is a critical step in building efficient models.
7. Estimation of Model Uncertainty: Through resampling, data scientists can estimate
the uncertainty or variability in their model predictions, providing a range within
which the true outcome is expected to lie, thereby adding a layer of transparency to
predictions.
8. Algorithm Testing: Different resampling plans can be employed to test various
algorithms under diverse conditions, aiding in selecting the most appropriate
algorithm for the data.
9. Cost Reduction: Virtual simulations of various scenarios using resampling can
significantly reduce the cost associated with physical or more extensive experiments.
10. Enhanced Decision Making: Ultimately, resampling aids in making more informed,
reliable, and data-driven decisions, which is crucial for business intelligence and
strategic planning.

Types of Resampling

In the domain of data science, several resampling techniques are used, each with specific
applications and benefits:

1. Bootstrapping: Involves sampling with replacement from the data set, creating thousands of replicas, and calculating the desired statistical measures on each (see the sketch after this list).
2. Cross-Validation: Frequently used in machine learning, cross-validation involves
dividing the data into subsets, using one subset to test the model and the others to
train it.

Cross-Validation is used to estimate the test error associated with a model to evaluate its
performance.

Validation set approach:

This is the most basic approach. It simply involves randomly dividing the dataset into two parts: a training set and a validation set (or hold-out set). The model is fit on the training set, and the fitted model is used to make predictions on the validation set.

Validation Set Approach in Resampling Method



Leave-one-out-cross-validation:
LOOCV is a better option than the validation set approach. Instead of splitting the entire dataset into two halves, only one observation is used for validation and the rest are used to fit the model.

Leave one out of cross-validation

k-fold cross-validation
This approach involves randomly dividing the set of observations into k folds of nearly equal
size. The first fold is treated as a validation set and the model is fit on the remaining folds.
The procedure is then repeated k times, where a different group each time is treated as the
validation set.

3. Jackknife: A precursor to bootstrapping, the jackknife technique systematically


omits one observation at a time from the sample to estimate a statistic.

The jackknife works by sequentially deleting one observation from the data set and recomputing the desired statistic. It is computationally simpler than bootstrapping, and more orderly (i.e. the procedural steps are the same over and over again). This means that, unlike bootstrapping, it can theoretically be performed by hand. However, it is still fairly computationally intensive, so although in the past it was common to use by-hand calculations, computers are normally used today. One area where it does not perform well is for non-smooth statistics (like the median) and nonlinear statistics (e.g. the correlation coefficient).

The main application for the Jackknife is to reduce bias and evaluate variance for an
estimator. It can also be used to:

 Find the standard error of a statistic,


 Estimate precision for an estimator θ.

4. Permutation Tests: Permutation tests are used primarily for hypothesis testing. They
involve calculating all possible values of the test statistic under rearrangements of the
labels on observed data points.
5. Random Subsampling: Similar to cross-validation, the splits are random and can
overlap. This technique is often simpler and faster but may include bias if not
managed carefully.
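
To make bootstrapping and k-fold cross-validation concrete, here is a minimal base-R sketch; the sample and the mtcars model are purely illustrative:

R
set.seed(123)

## 1. Bootstrapping: estimate the standard error of the sample mean
x <- rnorm(50, mean = 10, sd = 2)            # hypothetical sample
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
mean(boot_means)                              # bootstrap estimate of the mean
sd(boot_means)                                # bootstrap standard error

## 2. k-fold cross-validation: estimate the test MSE of a simple linear model
k <- 5
n <- nrow(mtcars)
folds <- sample(rep(1:k, length.out = n))     # randomly assign each row to a fold

cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt, data = train)         # fit on k-1 folds
  mean((test$mpg - predict(fit, test))^2)     # MSE on the held-out fold
})
mean(cv_mse)                                  # cross-validated estimate of test error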

Errors in Resampling

While resampling is a powerful tool, it is not without potential pitfalls:

1. Overfitting: Although resampling can help mitigate overfitting, inappropriate use, especially with too many iterations, can lead to models that are overly complex and too specific to the sample data.
2. Underfitting: Conversely, insufficient resampling can result in underfitting, where
models are too simplistic and unable to capture underlying patterns in the data.
3. Bias: Certain resampling techniques, especially those involving non-random
methods, can introduce bias into the model, affecting its generalisability.
4. Variance: High variance in resampling results can make it difficult to discern a
model's true performance, particularly in cases with small data sets or high variability.
5. Computational Expense: Some resampling methods are computationally intensive and require significant resources, which can be a limitation in large-scale applications.

Time Series Analysis


Time series analysis is a specific way of analyzing a sequence of data points collected over an
interval of time. In time series analysis, analysts record data points at consistent intervals over
a set period of time rather than just recording the data points intermittently or randomly.
However, this type of analysis is not merely the act of collecting data over time.

What sets time series data apart from other data is that the analysis can show how variables
change over time. In other words, time is a crucial variable because it shows how the data
adjusts over the course of the data points as well as the final results. It provides an additional
source of information and a set order of dependencies between the data.

Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for
forecasting—predicting future data based on historical data.

Why organizations use time series data analysis

Time series analysis helps organizations understand the underlying causes of trends or systemic
patterns over time. Using data visualizations, business users can see seasonal trends and dig
deeper into why these trends occur. With modern analytics platforms, these visualizations can
go far beyond line graphs.

When organizations analyze data over consistent intervals, they can also use time series
forecasting to predict the likelihood of future events. Time series forecasting is part
of predictive analytics. It can show likely changes in the data, like seasonality or cyclic
behavior, which provides a better understanding of data variables and helps forecast better.
For example, Des Moines Public Schools analyzed five years of student achievement data to
identify at-risk students and track progress over time. Today’s technology allows us to collect
massive amounts of data every day and it’s easier than ever to gather enough consistent data
for comprehensive analysis.

Time series analysis examples

Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading
algorithms. Likewise, time series analysis is ideal for forecasting weather changes, helping
meteorologists predict everything from tomorrow’s weather report to future years of climate
change. Examples of time series analysis in action include:

 Weather data
 Rainfall measurements
 Temperature readings
 Heart rate monitoring (EKG)
 Brain monitoring (EEG)
 Quarterly sales
 Stock prices
 Automated stock trading
 Industry forecasts
 Interest rates
Time Series Analysis Types

Because time series analysis includes many categories or variations of data, analysts sometimes
must make complex models. However, analysts can’t account for all variances, and they can’t
generalize a specific model to every sample. Models that are too complex or that try to do too
many things can lead to a lack of fit. Models that lack fit or that overfit fail to distinguish between random error and true relationships, leaving the analysis skewed and forecasts incorrect.

Models of time series analysis include:

 Classification: Identifies and assigns categories to the data.


 Curve fitting: Plots the data along a curve to study the relationships of variables within
the data.
 Descriptive analysis: Identifies patterns in time series data, like trends, cycles, or
seasonal variation.
 Explanative analysis: Attempts to understand the data and the relationships within it,
as well as cause and effect.
 Exploratory analysis: Highlights the main characteristics of the time series data,
usually in a visual format.
 Forecasting: Predicts future data. This type is based on historical trends. It uses the
historical data as a model for future data, predicting scenarios that could happen along
future plot points.
 Intervention analysis: Studies how an event can change the data.
 Segmentation: Splits the data into segments to show the underlying properties of the
source information.
Data classification

Further, time series data can be classified into two main categories:

 Stock time series data means measuring attributes at a certain point in time, like a
static snapshot of the information as it was.
 Flow time series data means measuring the activity of the attributes over a certain
period, which is generally part of the total whole and makes up a portion of the results.
Data variations

In time series data, variations can occur sporadically throughout the data:

 Functional analysis can pick out the patterns and relationships within the data to
identify notable events.
 Trend analysis means determining consistent movement in a certain direction. There
are two types of trends: deterministic, where we can find the underlying cause, and
stochastic, which is random and unexplainable.
 Seasonal variation describes events that occur at specific and regular intervals during
the course of a year. Serial dependence occurs when data points close together in time
tend to be related.

Time series analysis and forecasting models must define the types of data relevant to answering
the business question. Once analysts have chosen the relevant data they want to analyze, they
choose what types of analysis and techniques are the best fit.

Important Considerations for Time Series Analysis


While time series data is data collected over time, there are different types of data that describe
how and when that time data was recorded. For example:

 Time series data is data that is recorded over consistent intervals of time.
 Cross-sectional data consists of several variables recorded at the same time.
 Pooled data is a combination of both time series data and cross-sectional data.

Moving Averages

Time series analysis can be used to analyse historic data and establish any underlying trend and
seasonal variations within the data. The trend refers to the general direction the data is heading
in and can be upward or downward. The seasonal variation refers to the regular variations
which exist within the data. This could be a weekly variation with certain days traditionally
experiencing higher or lower sales than other days, or it could be monthly or quarterly
variations.

The trend and seasonal variations can be used to help make predictions about the future – and
as such can be very useful when budgeting and forecasting.

Calculating moving averages

One method of establishing the underlying trend (smoothing out peaks and troughs) in a set of
data is using the moving averages technique. Other methods, such as regression analysis can
also be used to estimate the trend. Regression analysis is dealt with in a separate article.

A moving average is a series of averages, calculated from historic data. Moving averages can
be calculated for any number of time periods, for example a three-month moving average, a
seven-day moving average, or a four-quarter moving average. The basic calculations are the
same.

The following simplified example will take us through the calculation process.

Monthly sales revenue data were collected for a company for 20X2:

Month:        Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Sales ($000): 125  145  186  131  151  192  137  157  198  143  163  204
From this data, we will calculate a three-month moving average, as we can see a basic cycle
that follows a three-monthly pattern (increases January – March, drops for April then increases
April – June, drops for July and so on). In an exam, the question will state what time period to
use for this cycle/pattern in order to calculate the averages required.

Step 1 – Create a table

Create a table with 5 columns, shown below, and list the data items given in columns one and
two. The first three rows from the data given above have been input in the table:

Step 2 – Calculate the three-month moving average.

Add together the first three sets of data, for this example it would be January, February and
March. This gives a total of (125+145+186) = 456. Put this total in the middle of the data you
are adding, so in this case across from February. Then calculate the average of this total, by
dividing this figure by 3 (the figure you divide by will be the same as the number of time
periods you have added in your total column). Our three-month moving average is therefore
(456 ÷ 3) = 152.

The average needs to be calculated for each three-month period. To do this you move your
average calculation down one month, so the next calculation will involve February, March and
April. The total for these three months would be (145+186+131) = 462 and the average would
be (462 ÷ 3) = 154.
Continue working down the data until you no longer have three items to add together. Note:
you will have fewer averages than the original observations as you will lose the beginning and
end observations in the averaging process.
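
The same trend can be computed in R; a minimal sketch using the rollmean function from the zoo package (the seasonal variations shown anticipate Step 4 below):

R
library(zoo)

# Monthly sales revenue ($000) for 20X2, from the table above
sales <- c(125, 145, 186, 131, 151, 192, 137, 157, 198, 143, 163, 204)

# Centred three-month moving average (the trend)
trend <- rollmean(sales, k = 3, align = "center")
trend
# 152 154 156 158 160 162 164 166 168 170

# Seasonal variation (additive model): actual sales minus trend,
# aligning each average with the middle month of its window
sales[2:11] - trend
# -7 32 -25 -7 32 -25 -7 32 -25 -7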

Step 3 – Calculate the trend

The three-month moving average represents the trend. From our example we can see a clear
trend in that each moving average is $2,000 higher than the preceding month moving average.
This suggests that the sales revenue for the company is, on average, growing at a rate of $2,000
per month.

This trend can now be used to predict future underlying sales values.

Step 4 – Calculate the seasonal variation

Once a trend has been established, any seasonal variation can be calculated. The seasonal
variation can be assumed to be the difference between the actual sales and the trend (three-
month moving average) value. Seasonal variations can be calculated using the additive or
multiplicative models.

Using the additive model: To calculate the seasonal variation, go back to the table and, for each average calculated, compare the average to the actual sales figure for that period. A negative variation means that the actual figure in that period is less than the trend, and a positive figure means that the actual is more than the trend.

From the data we can see a clear three-month cycle in the seasonal variation. Every first month
has a variation of -7, suggesting that this month is usually $7,000 below the average. Every
second month has a variation of 32 suggesting that this month is usually $32,000 above the
average. In month 3, the variation suggests that every third month, the actual will be $25,000
below the average.

It is assumed that this pattern of seasonal adjustment will be repeated for each three-month
period going forward.

Using the multiplicative model:


If we had used the multiplicative model, the variations would have been expressed as a percentage of the average figure, rather than as an absolute amount. For example, 145/152 ≈ 95%, 186/154 ≈ 121% and 131/156 ≈ 84%.
This suggests that month 1 is usually 95% of the trend, month 2 is 121% and month 3 is 84%.
The multiplicative model is a better method to use when the trend is increasing or decreasing
over time, as the seasonal variation is also likely to be increasing or decreasing.

Note that with the additive model the three seasonal variations must add up to zero (32-25-7 =
0). Where this is not the case, an adjustment must be made. With the multiplicative model the
three seasonal variations add to three (0.95 + 1.21 + 0.84 = 3). (If it was four-month average,
the four seasonal variations would add to four etc). Again, if this is not the case, an adjustment
must be made.

In this simplified example the trend shows an increase of exactly $2,000 each month, and the
pattern of seasonal variations is exactly the same in each three-month period. In reality a time
series is unlikely to give such a perfect result.

Step 5 – Using time series to forecast the future

Now that the trend and the seasonal variations have been calculated, these can be used to predict
the likely level of sales revenue for the future.

Types of Moving Averages

Simple Moving Average

A simple moving average (SMA) is calculated by taking the arithmetic mean of a given set of
values over a specified period. A set of numbers, or prices of stocks, are added together and
then divided by the number of prices in the set. The formula for calculating the simple moving
average of a security is as follows:

SMA = (A1 + A2 + … + An) / n

where:
An = the value (e.g. price) in period n
n = the number of time periods

Charting stock prices over 50 days using a simple moving average may look like this:
Exponential Moving Average (EMA)

The exponential moving average gives more weight to recent prices in an attempt to make
them more responsive to new information. To calculate an EMA, the simple moving average
(SMA) over a particular period is calculated first.

Then calculate the multiplier for weighting the EMA, known as the "smoothing factor," which
typically follows the formula: [2/(selected time period + 1)].

For a 20-day moving average, the multiplier would be [2/(20+1)] = 0.0952. The smoothing factor is combined with the previous EMA to arrive at the current value: EMA today = (Price today × multiplier) + (EMA yesterday × (1 − multiplier)). The EMA thus gives a higher weighting to recent prices, while the SMA assigns an equal weighting to all values.

Simple Moving Average (SMA) vs. Exponential Moving Average (EMA)

The calculation for EMA puts more emphasis on the recent data points. Because of this, EMA
is considered a weighted average calculation.

In the figure below, the number of periods used in each average is 15, but the EMA responds
more quickly to the changing prices than the SMA. The EMA has a higher value when the
price is rising than the SMA and it falls faster than the SMA when the price is declining. This
responsiveness to price changes is the main reason why some traders prefer to use the EMA
over the SMA.

Example of a Moving Average

The moving average is calculated differently depending on the type: SMA or EMA. Below,
we look at a simple moving average (SMA) of a security with the following closing prices
over 15 days:

 Week 1 (5 days): 20, 22, 24, 25, 23


 Week 2 (5 days): 26, 28, 26, 29, 27
 Week 3 (5 days): 28, 30, 27, 29, 28

A 10-day moving average would average out the closing prices for the first 10 days as the first
data point. The next data point would drop the earliest price, add the price on day 11, and take
the average.
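
A minimal R sketch of these calculations using the 15 closing prices above (rollmean is from the zoo package; the EMA loop uses the standard recursion with smoothing factor 2/(n + 1), seeded with the first SMA value):

R
library(zoo)

prices <- c(20, 22, 24, 25, 23,      # week 1
            26, 28, 26, 29, 27,      # week 2
            28, 30, 27, 29, 28)      # week 3

# 10-day simple moving average: the first value is the mean of days 1-10,
# the next drops day 1 and adds day 11, and so on
sma10 <- rollmean(prices, k = 10, align = "right")
sma10
# 25.0 25.8 26.6 26.9 27.3 27.8

# 10-day exponential moving average, seeded with the first SMA value
mult <- 2 / (10 + 1)                 # smoothing factor
ema10 <- numeric(length(prices) - 9)
ema10[1] <- sma10[1]
for (i in 2:length(ema10)) {
  price_today <- prices[i + 9]
  ema10[i] <- price_today * mult + ema10[i - 1] * (1 - mult)
}
round(ema10, 2)
# The EMA reacts more quickly than the SMA to the recent rise in prices.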

Missing Values

Handling missing values in time series data in R is a crucial step in the data preprocessing
phase. Time series data often contains gaps or missing observations due to various reasons such
as sensor malfunctions, human errors, or other external factors. In R Programming
Language dealing with missing values appropriately is essential to ensure the accuracy and
reliability of analyses and models built on time series data. Here are some common strategies
for handling missing values in time series data.
Understanding Missing Values in Time Series Data
In general Time Series data is a type of data where observations are collected over some time
at successive intervals. Time series are used in various fields such as finance, engineering, and
biological sciences, etc,
 Missing values disrupt the order of the data, which results in an inaccurate representation of trends and patterns over time.
 By imputing missing values we can ensure that the statistical analysis done on the time series data is reliable and based on the patterns we observed.
 As with other models, handling missing values in time series data improves model performance.
In R there are various ways to handle missing values in time series data using functions from the zoo package.
It's important to note that the choice of method depends on the nature of the data and the
underlying reasons for missing values. A combination of methods or a systematic approach to
evaluating different imputation strategies may be necessary to determine the most suitable
approach for a given time series dataset. Additionally, care should be taken to assess the impact
of missing value imputation on the validity of subsequent analyses and models.
Step 1: Load Necessary Libraries and Dataset
R
# Load necessary libraries
library(zoo)
library(ggplot2)

# Generate sample time series data with missing values
set.seed(789)
dates <- seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "days")
time_series_data <- zoo(sample(c(50:100, NA), length(dates), replace = TRUE),
                        order.by = dates)
head(time_series_data)
Output:
2022-01-01 2022-01-02 2022-01-03 2022-01-04 2022-01-05 2022-01-06
94 97 61 NA 91 75

Step 2: Visualize Original Time Series


R
# Visualize the original time series as a line chart
original_line_plot <- ggplot(data.frame(time = index(time_series_data),
                                        values = coredata(time_series_data)),
                             aes(x = time, y = values)) +
  geom_line(color = "blue") +
  ggtitle("Original Time Series Data (Line Chart)")

original_line_plot
Output:

Handling Missing Values in Time Series Data

Step 3: Identify Missing Values


R
# Check for missing values
missing_values <- which(is.na(coredata(time_series_data)))
print(paste("Indices of Missing Values: ", missing_values))
Output:
[1] "Indices of Missing Values: 4" "Indices of Missing Values: 15"

 "Indices of Missing Values: 4": This means that at index (or position) 4 in the time series data,
there is a missing value. In R, indexing usually starts from 1, so this refers to the fourth
observation in our dataset.
 "Indices of Missing Values: 15": Similarly, at index 15 in the time series data, there is another
missing value. This corresponds to the fifteenth observation in our dataset.
Step 4: Handle Missing Values
1. Linear Imputation
Linear interpolation imputes a missing value that lies between two known values in the time series by joining the preceding and succeeding observations with a straight line and reading off the value at the missing time point. To achieve this, the zoo package in R provides the na.approx() function, which interpolates missing values.
R
# Load necessary libraries
library(zoo)
library(ggplot2)

# Assuming time_series_data is already defined and contains missing values

# Linear interpolation using na.approx
linear_imputations <- na.approx(time_series_data)

# Visualize the linearly imputed series as a line plot
Linear_imputation_plot <- ggplot(data.frame(time = index(linear_imputations),
                                            values = coredata(linear_imputations)),
                                 aes(x = time, y = values)) +
  geom_line(color = "blue", size = 0.5) +             # Adjust line colour and size
  geom_point(color = "red", size = 1, alpha = 0.7) +
  theme_minimal() +                                   # Use a minimal theme
  labs(title = "Time Series with Linear Imputation",  # Add title
       x = "Time",                                    # Label for x-axis
       y = "Values") +                                # Label for y-axis
  scale_x_date(date_labels = "%b %d", date_breaks = "1 week") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Linear_imputation_plot
Output:

2. Forward Filling
Forward filling involves filling missing values with the most recent observed value,
R
# Forward fill
time_series_data_fill <- na.locf(time_series_data)

# Forward fill with line plot and points
fill_line_point_plot <- ggplot(data.frame(time = index(time_series_data_fill),
                                          values = coredata(time_series_data_fill)),
                               aes(x = time, y = values)) +
  geom_line(color = "darkgreen", size = 1) +
  geom_point(color = "red", size = 1.5) +
  ggtitle("Time Series with Forward Fill (Line Plot with Points)")

fill_line_point_plot

Autocorrelation

Autocorrelation refers to the degree of correlation of the same variables between two
successive time intervals. It measures how the lagged version of the value of a variable is
related to the original version of it in a time series.

Autocorrelation, as a statistical concept, is also known as serial correlation. It is often used with
the autoregressive-moving-average model (ARMA) and autoregressive-integrated-moving-
average model (ARIMA). The analysis of autocorrelation helps to find repeating periodic
patterns, which can be used as a tool for technical analysis in the capital markets.

How It Works

In many cases, the value of a variable at a point in time is related to the value of it at a previous
point in time. Autocorrelation analysis measures the relationship of the observations between
the different points in time, and thus seeks a pattern or trend over the time series. For example,
the temperatures on different days in a month are autocorrelated.

Similar to correlation, autocorrelation can be either positive or negative. It ranges from -1 (perfectly negative autocorrelation) to 1 (perfectly positive autocorrelation). Positive autocorrelation means that an increase observed in one time interval leads to a proportionate increase in the lagged time interval.

The example of temperature discussed above demonstrates a positive autocorrelation. The temperature the next day tends to rise when it’s been increasing and tends to drop when it’s been decreasing during the previous days.

The observations with positive autocorrelation can be plotted into a smooth curve. By adding
a regression line, it can be observed that a positive error is followed by another positive one,
and a negative error is followed by another negative one.
Conversely, negative autocorrelation represents that the increase observed in a time interval
leads to a proportionate decrease in the lagged time interval. By plotting the observations with
a regression line, it shows that a positive error will be followed by a negative one and vice
versa.
Autocorrelation can be applied to different numbers of time gaps, known as lags. A lag-1 autocorrelation measures the correlation between observations that are one time period apart. For example, to learn the correlation between the temperature on one day and on the corresponding day in the next month, a lag-30 autocorrelation should be used (assuming 30 days in that month).
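
In R, sample autocorrelations at successive lags can be inspected with the built-in acf() function; a minimal sketch on one of R's built-in monthly series:

R
# Autocorrelation of a built-in monthly time series (UK lung-disease deaths)
acf(ldeaths, lag.max = 30, main = "Sample autocorrelation of monthly deaths")

# The autocorrelation at a single lag, e.g. lag 12 (the same month one year apart)
acf(ldeaths, plot = FALSE)$acf[13]   # element 1 is lag 0, so lag 12 is element 13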

Test for Autocorrelation

The Durbin-Watson statistic is commonly used to test for autocorrelation. It can be applied to
a data set by statistical software. The outcome of the Durbin-Watson test ranges from 0 to 4.
An outcome closely around 2 means a very low level of autocorrelation. An outcome closer to
0 suggests a stronger positive autocorrelation, and an outcome closer to 4 suggests a stronger
negative autocorrelation.

It is necessary to test for autocorrelation when analyzing a set of historical data. For example,
in the equity market, the stock prices on one day can be highly correlated to the prices on
another day. However, it provides little information for statistical data analysis and does not
tell the actual performance of the stock.

Therefore, it is necessary to test for the autocorrelation of the historical prices to identify to
what extent the price change is merely a pattern or caused by other factors. In finance, an
ordinary way to eliminate the impact of autocorrelation is to use percentage changes in asset
prices instead of historical prices themselves.

Autocorrelation and Technical Analysis

Although autocorrelation should be avoided in order to apply further data analysis more
accurately, it can still be useful in technical analysis, as it looks for a pattern from historical
data. The autocorrelation analysis can be applied together with the momentum factor analysis.

A technical analyst can learn how the stock price of a particular day is affected by those of
previous days through autocorrelation. Thus, he can estimate how the price will move in the
future.

If the price of a stock with strong positive autocorrelation has been increasing for several days,
the analyst can reasonably estimate the future price will continue to move upward in the recent
future days. The analyst may buy and hold the stock for a short period of time to profit from
the upward price movement.

The autocorrelation analysis only provides information about short-term trends and tells little
about the fundamentals of a company. Therefore, it can only be applied to support the trades
with short holding periods.

Serial Correlation:

Serial correlation is used in statistics to describe the relationship between observations of the
same variable over specific periods. If a variable's serial correlation is measured as zero, there
is no correlation, and each of the observations is independent of one another. Conversely, if a
variable's serial correlation skews toward one, the observations are serially correlated, and
future observations are affected by past values. Essentially, a variable that is serially correlated
has a pattern and is not random.

Error terms occur when a model is not completely accurate and results in differing results
during real-world applications. When error terms from different (usually adjacent) periods (or
cross-section observations) are correlated, the error term is serially correlated. Serial
correlation occurs in time-series studies when the errors associated with a given period carry
over into future periods. For example, when predicting the growth of stock dividends, an
overestimate in one year will lead to overestimates in succeeding years.

The Concept of Serial Correlation

Serial correlation was originally used in engineering to determine how a signal, such as a
computer signal or radio wave, varies compared to itself over time. The concept grew in
popularity in economic circles as economists and practitioners of econometrics used the
measure to analyze economic data over time.

Almost all large financial institutions now have quantitative analysts, known as quants, on
staff. These financial trading analysts use technical analysis and other statistical inferences to
analyze and predict the stock market. These modelers attempt to identify the structure of the
correlations to improve forecasts and the potential profitability of a strategy. In addition,
identifying the correlation structure improves the realism of any simulated time series based
on the model. Accurate simulations reduce the risk of investment strategies.

Quants are integral to the success of many of these financial institutions since they provide
market models that the institution then uses as the basis for its investment strategy.

Introduction to survival analysis

Survival analysis is a collection of statistical procedures for data analysis where the outcome
variable of interest is time until an event occurs. Because of censoring–the nonobservation of
the event of interest after a period of follow-up–a proportion of the survival times of interest
will often be unknown.

What is censoring?
The notion of censoring is fundamental to survival analysis and is used when computing our survival functions (more on that in the next part of the series). But what do I mean by censoring? Strictly speaking, censoring is a condition in which only part of an observation or measurement is known: the subject is observed for some time, but the time to the event itself is not observed.
For example, death in office of a president, or someone leaving a medical study before the
study formally concludes. In the case of the latter, you can see this is really important for the
analysis in medical trials, but in both cases the underlying principle is the same – we made
some observations until a given time, but we cannot measure the event. If a president dies after
one year in office, how can we possibly know that they would have served two terms?

Left and right censoring in Survival Analysis


There are different types of censoring, two commonly discussed ones are left and right
censoring (two others that come to mind are interval censoring and random censoring, but are
not discussed here).
 Left censoring is when the event has occurred before the data is collected (or study has
started) – that is we only know the upper bound of the time. For example, in a medical study
someone dies before the drug trial begins (which is normally not considered).
 Whereas, right censoring is when only a lower limit of the time is known, for example, if
a subject leaves a study before the end, or the study ends before the event occurs.
You can think of this as events that happened to the left of time (in the past) are left censored,
and events that may happen to the right of time (in the future) are right censored.

In the case of turnover, we are only considering right censoring, where a person may leave at
some point in the future, but we don’t know when they will (if at all). Hopefully, the diagram
below will help demonstrate this.

The notion of Censoring in Survival Analysis – Right Censoring for employee churn
Example of censoring: medical study
In the above example, we have 10 subjects in a medical study that begins at time t=0 and ends
at time t=20 (don’t worry about units in this example, you can imagine weeks, months, years
if it helps). Each subject is recorded until either the event happens (circle) or the end of the
study is reached (the black vertical line at t=20).
As you can see, we observe the event during the study for the red subjects, while the blue lines represent participants for whom no event occurred during the study period. Notice that some of the blue subjects do eventually experience the event, but only after the end of the study period, and this is the critical thing: they are right-censored! If we did not include this in our analysis we would be underestimating the true average time to event for our subjects.
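
A minimal sketch of how such right-censored data could be analysed in R with the survival package; the follow-up times and event indicators below are hypothetical, loosely mimicking the 10-subject study described above, and survfit() estimates a Kaplan-Meier survival curve that accounts for the censored observations:

R
library(survival)

# Hypothetical follow-up times for 10 subjects in a study ending at t = 20.
# status = 1 means the event was observed; status = 0 means the subject
# was still event-free at t = 20 and is therefore right-censored.
time   <- c(5, 8, 12, 20, 20, 14, 20, 9, 20, 17)
status <- c(1, 1,  1,  0,  0,  1,  0, 1,  0,  1)

# Surv() pairs each time with its censoring indicator;
# survfit() estimates the Kaplan-Meier survival curve
fit <- survfit(Surv(time, status) ~ 1)
summary(fit)
plot(fit, xlab = "Time", ylab = "Survival probability")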

Example of censoring: virus testing


Another example, which is much more fitting in today’s climate, is one that concerns virus
testing. Let us imagine that some proportion of the population has been exposed to a virus and
individuals are tested at a given point in time to see whether they have the virus or not. We will
assume that these tests are
 unrealistically accurate
 produce no false positives or negatives
 therefore anyone who tests positive has the virus and similarly, anyone testing negative does
not have the virus (at the time of testing).

The notion of Censoring in Survival Analysis – Left and Right Censoring


Now we can say that people with a positive test have been exposed to the virus at some point
leading up to the test, but we don’t know exactly when they contracted the virus. Therefore,
they are left-censored, since the event is when the individual contracted the virus, not the
positive test. Similarly, anyone who tests negative is right-censored. In this rather unique case,
our dataset is filled with only left and right censored cases, we actually never observe the event
directly and only have lower and upper bounds for individuals’ time of contracting the virus.
In reality, the situation becomes even more complex given the testing accuracy, and we would
need to consider interval censoring as well.

Conversely, if we define the event as the positive test then we have no left censoring and have
the case as described previously, observing the event and all negative tests are then right
censored, as shown in the diagram below.

The notion of Censoring in Survival Analysis – Left and Right Censoring
