Unit 5 - Notes
The least squares method is the process of finding the best-fitting curve, or line of best fit, for a
set of data points by minimising the sum of the squares of the offsets (residuals) of the points
from the curve. When finding the relation between two variables, the trend of outcomes is
estimated quantitatively; this process is termed regression analysis. Curve fitting is an approach
to regression analysis, and least squares is the standard method of fitting an equation that
approximates a curve to the given raw data.
It is quite obvious that the curve fitted to a particular data set is not always unique. Thus,
it is required to find a curve having minimal deviation from all the measured data points. This
is known as the best-fitting curve and is found by using the least squares method.
The least squares method is a crucial statistical method used to find a regression line, or best-fit
line, for a given pattern of data. The fitted line is described by an equation with specific
parameters. The method of least squares is widely used in estimation and regression analysis,
where it is the standard approach for approximating the solution of sets of equations having
more equations than unknowns.
The method of least squares defines the solution as the one that minimises the sum of the
squares of the deviations, or errors, in the result of each equation. The sum of squared errors
defined below also measures the variation in the observed data.
The least squares method is often applied in data fitting. The best-fit result minimises the
sum of squared errors, or residuals, which are the differences between an observed
(experimental) value and the corresponding fitted value given by the model.
The fit is obtained by minimising the residuals, or offsets, of each data point from the line.
Vertical offsets are generally used in practice when fitting lines, polynomials, surfaces and
hyperplanes, while perpendicular offsets are used in orthogonal (total least squares) fitting.
The least squares method states that the curve that best fits a given set of observations is the
curve having the minimum sum of squared residuals (deviations or errors) from the given data
points. Let us assume that the given data points are (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), in
which all x's are independent variables while all y's are dependent ones. Also, suppose that
f(x) is the fitting curve and d represents the error, or deviation, at each given point:
d1 = y1 − f(x1)
d2 = y2 − f(x2)
d3 = y3 − f(x3)
…..
dn = yn – f(xn)
The least squares criterion says that the best-fitting curve has the property that the sum of the
squares of all the deviations from the given values must be a minimum, i.e.:
S = d1² + d2² + d3² + … + dn² = ∑(yi − f(xi))² is a minimum.
When we have to determine the equation of the line of best fit, y = a + bx, for the given data,
we use the following two normal equations:
∑Y = na + b∑X
∑XY = a∑X + b∑X²
Solving these two normal equations, we can get the required trend line equation.
The Least Squares Model for a set of data (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) passes through
the point (xa, ya), where xa is the average of the xi's and ya is the average of the yi's. The below
example explains how to find the equation of a straight line or a least square line using the least
square method.
Question:
xi 8 3 2 10 11 3 6 5 6 8
yi 4 12 1 12 9 4 9 6 1 14
Use the least square method to determine the equation of line of best fit for the data. Then plot
the line.
Solution:
The normal equations for the line y = a + bx are:
∑y = na + b∑x
∑xy = a∑x + b∑x²
The required sums are calculated in the table below:
x y x² xy
8 4 64 32
3 12 9 36
2 1 4 2
10 12 100 120
11 9 121 99
3 4 9 12
6 9 36 54
5 6 25 30
6 1 36 6
8 14 64 112
∑x = 62, ∑y = 72, ∑x² = 468, ∑xy = 503
Substituting n = 10 and the totals into the normal equations gives:
72 = 10a + 62b … (1)
503 = 62a + 468b … (2)
Multiplying (1) by 62 and (2) by 10 and subtracting eliminates a:
4464 − 5030 = 3844b − 4680b
-836b = -566
b = 566/836
b = 283/418
b = 0.677
10a + 62(0.677) = 72
10a + 41.974 = 72
10a = 72 – 41.974
10a = 30.026
a = 30.026/10
a = 3.0026
y = a + bx
y = 3.0026 + 0.677x
Now, the sum of squares of deviations from the fitted line can be found by substituting each
xi into y = 3.0026 + 0.677x and summing the squares of the residuals yi − ŷi.
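As a quick cross-check, the same line can be fitted in R with the built-in lm() function; a minimal sketch using the data from the question above (object names are illustrative):
R
# Data from the worked example above
x <- c(8, 3, 2, 10, 11, 3, 6, 5, 6, 8)
y <- c(4, 12, 1, 12, 9, 4, 9, 6, 1, 14)

# Fit the least squares line y = a + b*x
fit <- lm(y ~ x)
coef(fit)            # intercept (a) approx. 3.00, slope (b) approx. 0.68

# Sum of squares of the deviations (residual sum of squares)
sum(resid(fit)^2)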
The least squares method is a very beneficial method of curve fitting. Despite its many benefits,
it has a few shortcomings too. One of the main limitations is discussed here.
Regression analysis using the least squares method implicitly assumes that the errors in the
independent variable are negligible or zero. When the errors in the independent variables are
non-negligible, the model suffers from measurement error. In that case ordinary least squares
can give misleading results, and the parameter estimates, confidence intervals and hypothesis
tests based on them must be treated with caution.
The goodness of fit test is used to check whether sample data fit a distribution from a
population, such as a normal distribution or a Weibull distribution. In simple words, it signifies
whether the sample data represent the data we would expect to find in the actual population.
The following tests are generally used by statisticians:
Chi-square
Kolmogorov-Smirnov
Anderson-Darling
Shapiro-Wilk
Example
A toy company builds football player toys. It claims that 30% of the toys are mid-fielders, 60%
defenders, and 10% forwards. A random sample of 100 toys has 50 mid-fielders, 45 defenders,
and 5 forwards. At the 0.05 level of significance, can you justify the company's claim?
Solution:
Determine the hypotheses: H0: the proportions are 30% mid-fielders, 60% defenders and 10%
forwards, as claimed; H1: at least one proportion differs from the claim.
The expected counts in a sample of 100 are 30, 60 and 10, so the chi-square statistic is
χ² = (50 − 30)²/30 + (45 − 60)²/60 + (5 − 10)²/10 = 13.33 + 3.75 + 2.50 = 19.58.
With 3 − 1 = 2 degrees of freedom, the critical value at the 0.05 level is 5.99. Since
19.58 > 5.99, we reject H0: the company's claim is not justified.
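A minimal R sketch of this goodness-of-fit test, using the observed counts and claimed proportions from the question (object names are illustrative):
R
# Observed counts and the company's claimed proportions
observed <- c(midfielders = 50, defenders = 45, forwards = 5)
claimed  <- c(0.30, 0.60, 0.10)

# Chi-square goodness-of-fit test at the 0.05 significance level
chisq.test(x = observed, p = claimed)
# X-squared is about 19.58 on 2 df, with a p-value well below 0.05,
# so the company's claim is rejected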
Graphical Method
It involves drawing a scatter diagram with independent variable on X-axis and dependent
variable on Y-axis. After that a line is drawn in such a manner that it passes through most of
the distribution, with remaining points distributed almost evenly on either side of the line.
A regression line is known as the line of best fit that summarizes the general movement of data.
It shows the best mean values of one variable corresponding to mean values of the other. The
regression line is based on the criteria that it is a straight line that minimizes the sum of squared
deviations between the predicted and observed values of the dependent variable.
Algebraic Method
Algebraic method develops two regression equations of X on Y, and Y on X.
Regression equation of Y on X
Y = a + bX
Where −
Y = Dependent variable
X = Independent variable
a = Constant showing the Y-intercept
b = Constant showing the slope of the line
The values of a and b are obtained from the normal equations:
∑Y = Na + b∑X
∑XY = a∑X + b∑X²
Where −
N = Number of observations
Regression equation of X on Y
X = a + bY
Where −
X = Dependent variable
Y = Independent variable
a = Constant showing the X-intercept
b = Constant showing the slope of the line
The values of a and b are obtained from the normal equations:
∑X = Na + b∑Y
∑XY = a∑Y + b∑Y²
Where −
N = Number of observations
Hence the regression equation of Y on X can be written as:
Y = 19.96 − 0.713X
and the regression equation of X on Y is:
X = 22.58 + 0.653Y
The very first step after building a linear regression model is to check whether your model
meets the assumptions of linear regression. These assumptions are a vital part of assessing
whether the model is correctly specified. In this blog I will go over what the assumptions of
linear regression are and how to test if they are met using R.
What are the Assumptions of Linear Regression?
1. There is a linear relationship between the predictors (x) and the outcome (y)
2. Predictors (x) are independent and observed with negligible error
3. Residual errors have a mean value of zero
4. Residual errors have constant variance
5. Residual errors are independent from each other and the predictors (x)
Assumption One: There is a Linear Relationship Between the Predictors (x) and the Outcome (y)
We can check the linearity of the data by looking at the Residuals vs Fitted plot. Ideally, this
plot would show no distinct pattern, with the red line (a loess smoother) approximately
horizontal at zero.
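A minimal sketch of this check; the model below is fitted on R's built-in cars dataset purely for illustration:
R
# Hypothetical model for illustration; 'model' is assumed to be an lm() fit
model <- lm(dist ~ speed, data = cars)   # 'cars' is a built-in R dataset

# Residuals vs Fitted plot (the red line is a loess smoother)
plot(model, which = 1)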
Assumption Two: Predictors (x) are Independent & Observed with Negligible Error
The easiest way to check the assumption of independence is the Durbin-Watson test. We
can conduct this test with the durbinWatsonTest() function from the car package on our model.
Running this test will give you an output with a p-value, which will help you determine whether
the assumption is met or not.
The null hypothesis states that the errors are not auto-correlated with themselves (they are
independent). Thus, if we achieve a p-value > 0.05, we would fail to reject the null hypothesis.
This would give us enough evidence to state that our independence assumption is met!
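A sketch of how this test might be run on the illustrative model fitted above (durbinWatsonTest() comes from the car package, not base R):
R
# install.packages("car")   # if not already installed
library(car)

durbinWatsonTest(model)   # p-value > 0.05 -> fail to reject independence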
Assumption Three: Residual Errors Have a Mean Value of Zero
This can also be read from the Residuals vs Fitted plot. If the red line sits above 0 for low and
high fitted values, the residual errors do not always have a mean value of 0 and the assumption
is violated.
Assumption Four: Residual Errors Have Constant Variance
We can check this assumption using the Scale-Location plot, which shows the fitted values
against the square root of the standardized residuals. Ideally, we would want to see the
residual points equally spread around the red line, which would indicate constant variance.
If the residual points are not all equally spread out, this assumption is not met. One common
solution to this problem is to apply a log or square root transformation to the outcome variable.
We can also use the Non-Constant Variance (NCV) test, via the ncvTest() function from the
car package, to check this assumption. Make sure you install the car package prior to running
the test.
This will output a p-value which will help you determine whether your model follows the
assumption or not. The null hypothesis states that there is constant variance. Thus, if you get a
p-value> 0.05, you would fail to reject the null. This means you have enough evidence to state
that your assumption is met!
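A sketch of both checks on the illustrative model fitted earlier:
R
# Scale-Location plot: fitted values vs sqrt of standardized residuals
plot(model, which = 3)

# Non-constant variance (NCV) test from the 'car' package
library(car)
ncvTest(model)   # p-value > 0.05 -> constant variance assumption holds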
Assumption Five: Residual Errors are Independent from Each Other & Predictors (x)
Establishing the validity of this assumption requires knowledge of the study design and of how
the data were collected.
Weighted resampling
Resampling is a fundamental technique in data science, widely used to improve the accuracy
and efficiency of statistical models. This method is particularly significant when dealing with
small or imbalanced datasets. By understanding and applying resampling methods, data
scientists can gain more insights from their data, make better predictions, and enhance the
generalisability of their models.
This blog is about the intricacies of resampling in data science, its goals, the different types
available, and common errors encountered. It summarises the importance of this technique:
for anyone venturing into the realm of data science or looking to sharpen their analytical
skills, grasping the concept of resampling is essential.
Let's explore resampling in more detail and consider how it can be a powerful tool in a data
scientist's arsenal.
Resampling in data science refers to repeatedly drawing samples from a given data set and
recalculating statistics on these samples. This technique is used to estimate the accuracy of
sample statistics by using subsets of accessible data or drawing randomly with replacement.
Resampling provides a flexible and robust method for making statistical inferences or
predictions when the traditional assumptions of classical statistical tests cannot be satisfied
or when sample sizes are too small for conventional methods.
The method is central to many modern statistical techniques, including bootstrapping and
cross-validation, which are instrumental in validating models and making them reliable for
predictive analytics.
Resampling helps assess the stability of the models and their performance metrics by
simulating the sampling process from the underlying population multiple times. This allows
data scientists to understand variability and bias more comprehensively, thus enhancing the
decision-making process in predictive modelling and hypothesis testing.
Types of Resampling
In the domain of data science, several resampling techniques are used, each with specific
applications and benefits:
1. Bootstrapping: involves sampling with replacement from the data set, creating
thousands of replicas, and calculating the desired statistical measure on each (a short
R sketch is given after this list).
2. Cross-Validation: Frequently used in machine learning, cross-validation involves
dividing the data into subsets, using one subset to test the model and the others to
train it.
Cross-Validation is used to estimate the test error associated with a model to evaluate its
performance.
The validation set approach: This is the most basic approach. It simply involves randomly
dividing the dataset into two parts: a training set and a validation (hold-out) set. The model is
fit on the training set and the fitted model is used to make predictions on the validation set.
Leave-one-out-cross-validation:
LOOCV is a better option than the validation set approach. Instead of splitting the entire
dataset into two halves, only one observation is used for validation and the rest is used to fit
the model.
k-fold cross-validation
This approach involves randomly dividing the set of observations into k folds of nearly equal
size. The first fold is treated as a validation set and the model is fit on the remaining folds.
The procedure is then repeated k times, where a different group each time is treated as the
validation set.
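A minimal sketch of k-fold cross-validation in base R, using an illustrative linear model on the built-in cars dataset:
R
# k-fold cross-validation sketch (the model and k are illustrative)
set.seed(1)
k     <- 5
data  <- cars[sample(nrow(cars)), ]            # shuffle the rows
folds <- cut(seq_len(nrow(data)), breaks = k, labels = FALSE)

cv_mse <- sapply(1:k, function(i) {
  test  <- data[folds == i, ]
  train <- data[folds != i, ]
  fit   <- lm(dist ~ speed, data = train)      # fit on the k-1 training folds
  mean((test$dist - predict(fit, test))^2)     # MSE on the held-out fold
})
mean(cv_mse)   # cross-validated estimate of the test error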
3. Jackknife: The Jackknife works by sequentially deleting one observation from the data set
and then recomputing the desired statistic. It is computationally simpler than bootstrapping,
and more orderly (i.e. the procedural steps are the same over and over again). This means
that, unlike bootstrapping, it can theoretically be performed by hand. However, it is still
fairly computationally intensive, so although in the past by-hand calculations were common,
computers are normally used today. One area where it does not perform well is for
non-smooth statistics (like the median) and non-linear statistics (e.g. the correlation
coefficient).
The main application of the Jackknife is to reduce bias and evaluate the variance of an
estimator. It can also be used to estimate standard errors and construct approximate
confidence intervals.
4. Permutation Tests: Permutation tests are used primarily for hypothesis testing. They
involve calculating all possible values of the test statistic under rearrangements of the
labels on observed data points.
5. Random Subsampling: Similar to cross-validation, but the splits are random and may
overlap. This technique is often simpler and faster but may introduce bias if not
managed carefully.
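The bootstrap sketch promised under item 1 above: a minimal example in R estimating the standard error of a sample mean (the data and number of replicas are illustrative):
R
# Bootstrap estimate of the standard error of the mean
set.seed(42)
data <- rnorm(30, mean = 10, sd = 2)      # illustrative sample data

boot_means <- replicate(1000, {
  resample <- sample(data, size = length(data), replace = TRUE)
  mean(resample)                          # statistic recomputed on each replica
})

sd(boot_means)   # bootstrap estimate of the standard error of the mean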
Time Series Analysis
What sets time series data apart from other data is that the analysis can show how variables
change over time. In other words, time is a crucial variable because it shows how the data
adjusts over the course of the data points as well as the final results. It provides an additional
source of information and a set order of dependencies between the data.
Time series analysis typically requires a large number of data points to ensure consistency and
reliability. An extensive data set ensures you have a representative sample size and that analysis
can cut through noisy data. It also ensures that any trends or patterns discovered are not outliers
and can account for seasonal variance. Additionally, time series data can be used for
forecasting—predicting future data based on historical data.
Time series analysis helps organizations understand the underlying causes of trends or systemic
patterns over time. Using data visualizations, business users can see seasonal trends and dig
deeper into why these trends occur. With modern analytics platforms, these visualizations can
go far beyond line graphs.
When organizations analyze data over consistent intervals, they can also use time series
forecasting to predict the likelihood of future events. Time series forecasting is part
of predictive analytics. It can show likely changes in the data, like seasonality or cyclic
behavior, which provides a better understanding of data variables and helps forecast better.
For example, Des Moines Public Schools analyzed five years of student achievement data to
identify at-risk students and track progress over time. Today’s technology allows us to collect
massive amounts of data every day and it’s easier than ever to gather enough consistent data
for comprehensive analysis.
Time series analysis is used for non-stationary data—things that are constantly fluctuating over
time or are affected by time. Industries like finance, retail, and economics frequently use time
series analysis because currency and sales are always changing. Stock market analysis is an
excellent example of time series analysis in action, especially with automated trading
algorithms. Likewise, time series analysis is ideal for forecasting weather changes, helping
meteorologists predict everything from tomorrow’s weather report to future years of climate
change. Examples of time series analysis in action include:
Weather data
Rainfall measurements
Temperature readings
Heart rate monitoring (EKG)
Brain monitoring (EEG)
Quarterly sales
Stock prices
Automated stock trading
Industry forecasts
Interest rates
Time Series Analysis Types
Because time series analysis includes many categories or variations of data, analysts sometimes
must make complex models. However, analysts can’t account for all variances, and they can’t
generalize a specific model to every sample. Models that are too complex, or that try to do too
many things, can lead to a lack of fit or to overfitting; such models fail to distinguish between
random error and true relationships, leaving the analysis skewed and the forecasts incorrect.
Further, time series data can be classified into two main categories:
Stock time series data means measuring attributes at a certain point in time, like a
static snapshot of the information as it was.
Flow time series data means measuring the activity of the attributes over a certain
period, which is generally part of the total whole and makes up a portion of the results.
Data variations
In time series data, variations can occur sporadically throughout the data:
Functional analysis can pick out the patterns and relationships within the data to
identify notable events.
Trend analysis means determining consistent movement in a certain direction. There
are two types of trends: deterministic, where we can find the underlying cause, and
stochastic, which is random and unexplainable.
Seasonal variation describes events that occur at specific and regular intervals during
the course of a year. Serial dependence occurs when data points close together in time
tend to be related.
Time series analysis and forecasting models must define the types of data relevant to answering
the business question. Once analysts have chosen the relevant data they want to analyze, they
choose what types of analysis and techniques are the best fit.
Time series data is data that is recorded over consistent intervals of time.
Cross-sectional data consists of several variables recorded at the same time.
Pooled data is a combination of both time series data and cross-sectional data.
Moving Averages
Time series analysis can be used to analyse historic data and establish any underlying trend and
seasonal variations within the data. The trend refers to the general direction the data is heading
in and can be upward or downward. The seasonal variation refers to the regular variations
which exist within the data. This could be a weekly variation with certain days traditionally
experiencing higher or lower sales than other days, or it could be monthly or quarterly
variations.
The trend and seasonal variations can be used to help make predictions about the future – and
as such can be very useful when budgeting and forecasting.
One method of establishing the underlying trend (smoothing out peaks and troughs) in a set of
data is using the moving averages technique. Other methods, such as regression analysis can
also be used to estimate the trend. Regression analysis is dealt with in a separate article.
A moving average is a series of averages, calculated from historic data. Moving averages can
be calculated for any number of time periods, for example a three-month moving average, a
seven-day moving average, or a four-quarter moving average. The basic calculations are the
same.
The following simplified example will take us through the calculation process.
Monthly sales revenue data were collected for a company for 20X2:
Sales ($000):
Jan 125, Feb 145, Mar 186, Apr 131, May 151, Jun 192,
Jul 137, Aug 157, Sep 198, Oct 143, Nov 163, Dec 204
From this data, we will calculate a three-month moving average, as we can see a basic cycle
that follows a three-monthly pattern (increases January – March, drops for April then increases
April – June, drops for July and so on). In an exam, the question will state what time period to
use for this cycle/pattern in order to calculate the averages required.
Create a table with five columns (month, sales in $000, three-month total, three-month moving
average and seasonal variation) and list the data items given in the first two columns.
Add together the first three sets of data, for this example it would be January, February and
March. This gives a total of (125+145+186) = 456. Put this total in the middle of the data you
are adding, so in this case across from February. Then calculate the average of this total, by
dividing this figure by 3 (the figure you divide by will be the same as the number of time
periods you have added in your total column). Our three-month moving average is therefore
(456 ÷ 3) = 152.
The average needs to be calculated for each three-month period. To do this you move your
average calculation down one month, so the next calculation will involve February, March and
April. The total for these three months would be (145+186+131) = 462 and the average would
be (462 ÷ 3) = 154.
Continue working down the data until you no longer have three items to add together. Note:
you will have fewer averages than the original observations as you will lose the beginning and
end observations in the averaging process.
The three-month moving average represents the trend. From our example we can see a clear
trend in that each moving average is $2,000 higher than the preceding month moving average.
This suggests that the sales revenue for the company is, on average, growing at a rate of $2,000
per month.
This trend can now be used to predict future underlying sales values.
Once a trend has been established, any seasonal variation can be calculated. The seasonal
variation can be assumed to be the difference between the actual sales and the trend (three-
month moving average) value. Seasonal variations can be calculated using the additive or
multiplicative models.
Using the additive model: to calculate the seasonal variation, go back to the table and, for each
average calculated, compare the average to the actual sales figure for that period.
A negative variation means that the actual figure in that period is less than the trend and a
positive figure means that the actual is more than the trend.
From the data we can see a clear three-month cycle in the seasonal variation. Every first month
has a variation of -7, suggesting that this month is usually $7,000 below the average. Every
second month has a variation of 32 suggesting that this month is usually $32,000 above the
average. In month 3, the variation suggests that every third month, the actual will be $25,000
below the average.
It is assumed that this pattern of seasonal adjustment will be repeated for each three-month
period going forward.
Note that with the additive model the three seasonal variations must add up to zero (32-25-7 =
0). Where this is not the case, an adjustment must be made. With the multiplicative model the
three seasonal variations add to three (0.95 + 1.21 + 0.84 = 3). (If it was four-month average,
the four seasonal variations would add to four etc). Again, if this is not the case, an adjustment
must be made.
In this simplified example the trend shows an increase of exactly $2,000 each month, and the
pattern of seasonal variations is exactly the same in each three-month period. In reality a time
series is unlikely to give such a perfect result.
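The trend and seasonal variations above can be reproduced in R, for example with rollmean() from the zoo package (also used later in these notes); a minimal sketch:
R
# Three-month moving average (trend) for the sales data above
library(zoo)

sales <- c(125, 145, 186, 131, 151, 192, 137, 157, 198, 143, 163, 204)  # $000

trend <- rollmean(sales, k = 3, align = "center")   # 152, 154, 156, ..., 170
trend

# Seasonal variation (additive model): actual minus trend, aligned against
# the middle month of each three-month block (Feb to Nov)
seasonal <- sales[2:11] - trend                     # -7, 32, -25, -7, 32, ...
seasonal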
Now that the trend and the seasonal variations have been calculated, these can be used to predict
the likely level of sales revenue for the future.
A simple moving average (SMA) is calculated by taking the arithmetic mean of a given set of
values over a specified period: a set of numbers, such as stock prices, is added together and
then divided by the number of values in the set. The formula for the simple moving average
of a security is:
SMA = (A1 + A2 + … + An) / n
where A1, A2, …, An are the values (e.g. prices) in each of the n periods.
Charting stock prices over 50 days with a simple moving average smooths the day-to-day
fluctuations into a clearer trend line.
Exponential Moving Average (EMA)
The exponential moving average gives more weight to recent prices in an attempt to make
them more responsive to new information. To calculate an EMA, the simple moving average
(SMA) over a particular period is calculated first.
Then calculate the multiplier for weighting the EMA, known as the "smoothing factor," which
typically follows the formula: [2/(selected time period + 1)].
For a 20-day moving average, the multiplier would be [2/(20+1)]= 0.0952. The smoothing
factor is combined with the previous EMA to arrive at the current value. The EMA thus gives
a higher weighting to recent prices, while the SMA assigns an equal weighting to all values.
The calculation for EMA puts more emphasis on the recent data points. Because of this, EMA
is considered a weighted average calculation.
In the figure below, the number of periods used in each average is 15, but the EMA responds
more quickly to the changing prices than the SMA. The EMA has a higher value when the
price is rising than the SMA and it falls faster than the SMA when the price is declining. This
responsiveness to price changes is the main reason why some traders prefer to use the EMA
over the SMA.
The moving average is calculated differently depending on the type, SMA or EMA. Consider,
for example, a simple moving average (SMA) of a security calculated from its closing prices
over 15 days.
A 10-day moving average would average out the closing prices for the first 10 days as the first
data point. The next data point would drop the earliest price, add the price on day 11, and take
the average.
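A rough sketch of SMA and EMA calculations in R on simulated closing prices (the prices, the 10-day period and the seeding of the EMA with the first SMA are illustrative choices):
R
# SMA and EMA on 15 illustrative closing prices
library(zoo)

set.seed(7)
prices <- round(100 + cumsum(rnorm(15)), 2)   # simulated daily closes

# 10-day simple moving average: mean of each 10-day window
sma10 <- rollmean(prices, k = 10, align = "right")
sma10

# 10-day exponential moving average, smoothing factor 2/(n + 1)
n      <- 10
alpha  <- 2 / (n + 1)
ema    <- numeric(length(prices))
ema[n] <- mean(prices[1:n])                   # seed the EMA with the first SMA
for (i in (n + 1):length(prices)) {
  ema[i] <- alpha * prices[i] + (1 - alpha) * ema[i - 1]
}
ema[n:length(prices)]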
Missing Values
Handling missing values in time series data in R is a crucial step in the data preprocessing
phase. Time series data often contains gaps or missing observations due to various reasons such
as sensor malfunctions, human errors, or other external factors. In R Programming
Language dealing with missing values appropriately is essential to ensure the accuracy and
reliability of analyses and models built on time series data. Here are some common strategies
for handling missing values in time series data.
Understanding Missing Values in Time Series Data
In general, time series data is a type of data in which observations are collected over time at
successive intervals. Time series are used in various fields such as finance, engineering and the
biological sciences.
Missing values disrupt the continuity of the data, which can result in an inaccurate
representation of trends and patterns over time.
By imputing missing values, we can ensure that statistical analysis of the time series data
remains reliable and reflects the patterns we observe.
As with other models, handling missing values in time series data improves model
performance.
In R there are various ways to handle missing values in time series data using functions from
the zoo package.
It's important to note that the choice of method depends on the nature of the data and the
underlying reasons for missing values. A combination of methods or a systematic approach to
evaluating different imputation strategies may be necessary to determine the most suitable
approach for a given time series dataset. Additionally, care should be taken to assess the impact
of missing value imputation on the validity of subsequent analyses and models.
Step 1: Load Necessary Libraries and Dataset
R
# Load necessary libraries
library(zoo)
library(ggplot2)
# Generate sample time series data with missing values
set.seed(789)
dates <- seq(as.Date("2022-01-01"), as.Date("2022-01-31"), by = "days")
time_series_data <- zoo(sample(c(50:100, NA), length(dates), replace = TRUE),
order.by = dates)
head(time_series_data)
Output:
2022-01-01 2022-01-02 2022-01-03 2022-01-04 2022-01-05 2022-01-06
94 97 61 NA 91 75
Plotting the series produces a line chart (original_line_plot) with gaps at the missing observations.
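One way to locate the missing observations is to combine which() and is.na(); a minimal sketch, assuming the time_series_data object created in Step 1:
R
# Identify the positions of missing observations in the zoo series
missing_idx <- which(is.na(coredata(time_series_data)))
cat("Indices of Missing Values:", missing_idx, "\n")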
Output:
"Indices of Missing Values: 4": This means that at index (or position) 4 in the time series data,
there is a missing value. In R, indexing usually starts from 1, so this refers to the fourth
observation in our dataset.
"Indices of Missing Values: 15": Similarly, at index 15 in the time series data, there is another
missing value. This corresponds to the fifteenth observation in our dataset.
Step 4: Handle Missing Values
1. Linear Imputation
Linear interpolation imputes a missing value that lies between two known values in the time
series by joining the preceding and succeeding observations with a straight line. To achieve
this, the zoo package in R provides the na.approx() function, which is used to interpolate
missing values.
R
# Load necessary libraries
library(zoo)
library(ggplot2)
# Interpolate the missing values linearly using na.approx() from zoo
time_series_data_approx <- na.approx(time_series_data)
head(time_series_data_approx)
Output: a line plot of the interpolated series (Linear_imputation_plot), with the previously missing values filled in.
2. Forward Filling
Forward filling involves replacing each missing value with the most recent observed value
(last observation carried forward).
R
# Forward fill
time_series_data_fill <- na.locf(time_series_data)
Output: a line-and-point plot of the forward-filled series (fill_line_point_plot).
Autocorrelation
Autocorrelation refers to the degree of correlation of the same variables between two
successive time intervals. It measures how the lagged version of the value of a variable is
related to the original version of it in a time series.
Autocorrelation, as a statistical concept, is also known as serial correlation. It is often used with
the autoregressive-moving-average model (ARMA) and autoregressive-integrated-moving-
average model (ARIMA). The analysis of autocorrelation helps to find repeating periodic
patterns, which can be used as a tool for technical analysis in the capital markets.
How It Works
In many cases, the value of a variable at a point in time is related to the value of it at a previous
point in time. Autocorrelation analysis measures the relationship of the observations between
the different points in time, and thus seeks a pattern or trend over the time series. For example,
the temperatures on different days in a month are autocorrelated.
The observations with positive autocorrelation can be plotted into a smooth curve. By adding
a regression line, it can be observed that a positive error is followed by another positive one,
and a negative error is followed by another negative one.
Conversely, negative autocorrelation represents that the increase observed in a time interval
leads to a proportionate decrease in the lagged time interval. By plotting the observations with
a regression line, it shows that a positive error will be followed by a negative one and vice
versa.
Autocorrelation can be applied to different numbers of time gaps, known as lags. A lag 1
autocorrelation measures the correlation between observations that are one time interval apart.
For example, to learn the correlation between the temperature on one day and the temperature
on the corresponding day in the next month, a lag 30 autocorrelation should be used (assuming
30 days in that month).
The Durbin-Watson statistic is commonly used to test for autocorrelation. It can be applied to
a data set by statistical software. The outcome of the Durbin-Watson test ranges from 0 to 4.
An outcome closely around 2 means a very low level of autocorrelation. An outcome closer to
0 suggests a stronger positive autocorrelation, and an outcome closer to 4 suggests a stronger
negative autocorrelation.
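A quick way to inspect autocorrelation at different lags in R is the correlogram produced by acf(); a minimal sketch on simulated data:
R
# Autocorrelation of an illustrative (simulated) temperature series
set.seed(3)
temps <- 20 + arima.sim(model = list(ar = 0.7), n = 120)

acf(temps, lag.max = 30)   # correlogram: bars outside the dashed bands
                           # indicate significant autocorrelation at that lag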
It is necessary to test for autocorrelation when analyzing a set of historical data. For example,
in the equity market, the stock prices on one day can be highly correlated to the prices on
another day. However, it provides little information for statistical data analysis and does not
tell the actual performance of the stock.
Therefore, it is necessary to test for the autocorrelation of the historical prices to identify to
what extent the price change is merely a pattern or caused by other factors. In finance, a
common way to eliminate the impact of autocorrelation is to use percentage changes in asset
prices instead of the historical prices themselves.
Although autocorrelation should be avoided in order to apply further data analysis more
accurately, it can still be useful in technical analysis, as it looks for a pattern from historical
data. The autocorrelation analysis can be applied together with the momentum factor analysis.
A technical analyst can learn how the stock price of a particular day is affected by those of
previous days through autocorrelation. Thus, he can estimate how the price will move in the
future.
If the price of a stock with strong positive autocorrelation has been increasing for several days,
the analyst can reasonably estimate the future price will continue to move upward in the recent
future days. The analyst may buy and hold the stock for a short period of time to profit from
the upward price movement.
The autocorrelation analysis only provides information about short-term trends and tells little
about the fundamentals of a company. Therefore, it can only be applied to support the trades
with short holding periods.
Serial Correlation:
Serial correlation is used in statistics to describe the relationship between observations of the
same variable over specific periods. If a variable's serial correlation is measured as zero, there
is no correlation, and each of the observations is independent of one another. Conversely, if a
variable's serial correlation skews toward one, the observations are serially correlated, and
future observations are affected by past values. Essentially, a variable that is serially correlated
has a pattern and is not random.
Error terms occur when a model is not completely accurate and results in differing results
during real-world applications. When error terms from different (usually adjacent) periods (or
cross-section observations) are correlated, the error term is serially correlated. Serial
correlation occurs in time-series studies when the errors associated with a given period carry
over into future periods. For example, when predicting the growth of stock dividends, an
overestimate in one year will lead to overestimates in succeeding years.
Serial correlation was originally used in engineering to determine how a signal, such as a
computer signal or radio wave, varies compared to itself over time. The concept grew in
popularity in economic circles as economists and practitioners of econometrics used the
measure to analyze economic data over time.
Almost all large financial institutions now have quantitative analysts, known as quants, on
staff. These financial trading analysts use technical analysis and other statistical inferences to
analyze and predict the stock market. These modelers attempt to identify the structure of the
correlations to improve forecasts and the potential profitability of a strategy. In addition,
identifying the correlation structure improves the realism of any simulated time series based
on the model. Accurate simulations reduce the risk of investment strategies.
Quants are integral to the success of many of these financial institutions since they provide
market models that the institution then uses as the basis for its investment strategy.
Survival analysis is a collection of statistical procedures for data analysis where the outcome
variable of interest is time until an event occurs. Because of censoring–the nonobservation of
the event of interest after a period of follow-up–a proportion of the survival times of interest
will often be unknown.
What is censoring?
The notion of censoring is fundamental to survival analysis and is used when computing our
survival functions (more on that in the next part of the series). But what do I mean by
censoring? Strictly speaking, censoring is a condition in which only part of an observation or
measurement is known: it gives us the ability to take missing data into account, where the
time to the event is not fully observed.
For example, death in office of a president, or someone leaving a medical study before the
study formally concludes. In the case of the latter, you can see this is really important for the
analysis in medical trials, but in both cases the underlying principle is the same – we made
some observations until a given time, but we cannot measure the event. If a president dies after
one year in office, how can we possibly know that they would have served two terms?
In the case of turnover, we are only considering right censoring, where a person may leave at
some point in the future, but we don’t know when they will (if at all). Hopefully, the diagram
below will help demonstrate this.
Figure: the notion of censoring in survival analysis – right censoring for employee churn.
Example of censoring: medical study
In the above example, we have 10 subjects in a medical study that begins at time t=0 and ends
at time t=20 (don’t worry about units in this example, you can imagine weeks, months, years
if it helps). Each subject is recorded until either the event happens (circle) or the end of the
study is reached (the black vertical line at t=20).
As you can see, we observe the event during the study for the red subjects, while the blue lines
represent participants for whom no event occurred during the study period. Notice that for
some of the blue subjects the event does eventually occur, but only after the end of the study
period, and this is the critical thing: they are right-censored. If we did not account for this in
our analysis, we would underestimate the true average time to event for our subjects.
Conversely, if we define the event as a positive test, then we have no left censoring and we are
back in the case described previously: we observe the event when a test comes back positive,
and subjects with only negative tests are right-censored.
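A minimal sketch of right-censored data and a Kaplan-Meier fit using the survival package (the times and censoring indicators below are made up for illustration):
R
# Right-censored survival data and a Kaplan-Meier estimate
library(survival)

time   <- c(5, 8, 12, 20, 20, 3, 15, 20, 9, 20)   # follow-up time for each subject
status <- c(1, 1, 1, 0, 0, 1, 1, 0, 1, 0)         # 1 = event observed,
                                                  # 0 = right-censored at study end (t = 20)

km_fit <- survfit(Surv(time, status) ~ 1)
summary(km_fit)    # estimated survival probabilities, accounting for censoring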