Business Statistics


ANALYSIS OF VARIANCE
AREAS OF APPLICATION
REGRESSION AND CORRELATION
I. What is linear correlation?
II. Discuss the coefficient of correlation
III. Discuss Rank correlation coefficient and Simple linear regression
Sample Distribution of the Difference of Two Proportions
Reference
Analysis of Variance (ANOVA) is a statistical method used to compare variances across the means (or averages) of different groups. It is used in a range of scenarios to determine if there is any difference between the means of those groups.

For example, to study the effectiveness of different diabetes medications, scientists design an experiment to explore the relationship between the type of medicine and the resulting blood
sugar level. The sample population is a set of people. We divide the sample population into
multiple groups, and each group receives a particular medicine for a trial period. At the end of
the trial period, blood sugar levels are measured for each of the individual participants. Then for
each group, the mean blood sugar level is calculated. ANOVA helps to compare these group
means to find out if they are statistically different or if they are similar.

The outcome of ANOVA is the ‘F statistic’. This is the ratio of the between-group variance to the within-group variance, and it produces a figure which allows a conclusion that the null hypothesis is supported or rejected. If there is a significant difference between the groups, the null hypothesis is not supported, and the F-ratio will be larger.
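
As a minimal sketch of how such a comparison might be run in R, using made-up blood sugar data for three hypothetical medicines (the data frame, group labels, and values below are assumptions for illustration only):

set.seed(1)
trial <- data.frame(
  medicine = rep(c("A", "B", "C"), each = 20),               # three treatment groups (made up)
  sugar    = c(rnorm(20, mean = 150, sd = 15),               # blood sugar readings (made up)
               rnorm(20, mean = 140, sd = 15),
               rnorm(20, mean = 130, sd = 15))
)
fit <- aov(sugar ~ medicine, data = trial)   # one-way ANOVA: one factor, three levels
summary(fit)                                 # F value and p-value for the medicine effect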

Dependent variable: This is the item being measured that is theorized to be affected by the
independent variables.

Independent variable/s: These are the items being measured that may have an effect on the
dependent variable.

A null hypothesis (H0): This is when there is no difference between the groups or means.
Depending on the result of the ANOVA test, the null hypothesis will either be accepted or
rejected.

An alternative hypothesis (H1): When it is theorized that there is a difference between groups
and means.

Factors and levels: In ANOVA terminology, an independent variable is called a factor which
affects the dependent variable. Level denotes the different values of the independent variable that
are used in an experiment.

Fixed-factor model: Some experiments use only a discrete set of levels for factors. For example,
a fixed-factor test would be testing three different dosages of a drug and not looking at any other
dosages.

Random-factor model: This model draws a random value of level from all the possible values of
the independent variable.

The ANOVA test is the initial step in analyzing the factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the systematic factors that measurably contribute to the data set's variability. The analyst utilizes the ANOVA test results in an F-test to generate additional data
that aligns with the proposed regression models. The ANOVA test allows a comparison of more
than two groups at the same time to determine whether a relationship exists between them. The
result of the ANOVA formula, the F statistic (also called the F-ratio), allows for the analysis of
multiple groups of data to determine the variability between samples and within samples.

If no real difference exists between the tested groups, which is called the null hypothesis, the
result of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible values
of the F statistic is the F-distribution. This is actually a group of distribution functions, with two
characteristic numbers, called the numerator degrees of freedom and the denominator degrees of
freedom.
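
As a rough illustration of how the two degrees-of-freedom numbers are used in R (the group and sample counts below are arbitrary):

df1 <- 3 - 1           # numerator degrees of freedom: number of groups minus 1
df2 <- 60 - 3          # denominator degrees of freedom: total observations minus number of groups
qf(0.95, df1, df2)     # critical F value at the 5% significance level
1 - pf(3.2, df1, df2)  # p-value for an observed F statistic of 3.2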

A researcher might, for example, test students from
multiple colleges to see if students from one of the colleges consistently outperform students
from the other colleges. In a business application, an R&D researcher might test two different
processes of creating a product to see if one process is better than the other in terms of cost
efficiency.

The type of ANOVA test used depends on a number of factors. It is applied when data are experimental. Analysis of variance can be employed even when there is no access to statistical software, in which case ANOVA is computed by hand. It is simple to use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the various factor level combinations.

ANOVA is helpful for testing three or more groups. It is similar to running multiple two-sample t-tests. However, it results in fewer type I errors and is appropriate for a range of issues. ANOVA assesses differences by comparing the means of each group, and it involves spreading out the variance into diverse sources. It is employed with subjects, test groups, between groups and within groups.

There are two types of ANOVA.

One-Way ANOVA: The one-way analysis of variance is also known as single-factor ANOVA or simple ANOVA. As the name suggests, the one-way ANOVA is suitable for experiments with only one independent variable (factor) with two or more levels. For instance, the factor may be the month of the year, with twelve levels, and the dependent variable the number of flowers in the garden. A one-way ANOVA assumes:

 Independence: The value of the dependent variable for one observation is independent of
the value of any other observations.
 Normalcy: The value of the dependent variable is normally distributed
 Variance: The variance is comparable in different experiment groups.
 Continuous: The dependent variable (number of flowers) is continuous and can be
measured on a scale which can be subdivided.
Full Factorial ANOVA (also called two-way ANOVA): Full factorial ANOVA is used when there are two or more independent variables. Each of these factors can have multiple levels. Full-factorial ANOVA can only be used in the case of a full factorial experiment, where every possible permutation of factors and their levels is used. The factors might be the month of the year when there are more flowers in the garden and the number of sunshine hours. A two-way ANOVA not only measures the effect of each independent variable on the dependent variable, but also whether the two factors affect each other. A two-way ANOVA assumes the following (a short R sketch follows the list):

 Continuous: The same as a one-way ANOVA, the dependent variable should be
continuous.
 Independence: Each sample is independent of other samples, with no crossover.
 Variance: The variance in data across the different groups is the same.
 Normalcy: The samples are representative of a normal population.
 Categories: The independent variables should be in separate categories or groups.
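
A minimal sketch of a full factorial (two-way) ANOVA in R, assuming made-up garden data with month and sunshine as factors (all names and values below are illustrative assumptions):

set.seed(2)
garden <- expand.grid(
  month    = factor(month.abb),            # factor 1: twelve levels
  sunshine = factor(c("low", "high")),     # factor 2: two levels
  rep      = 1:5                           # five observations per combination
)
garden$flowers <- rpois(nrow(garden), lambda = 10)       # dependent variable (made up counts)
fit2 <- aov(flowers ~ month * sunshine, data = garden)   # main effects plus their interaction
summary(fit2)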

Some people question the need for ANOVA; after all, mean
values can be assessed just by looking at them. But ANOVA does more than just compare means.

Even though the mean values of various groups appear to be different, this could be due to a
sampling error rather than the effect of the independent variable on the dependent variable. If it
is due to sampling error, the difference between the group means is meaningless. ANOVA helps
to find out if the difference in the mean values is statistically significant.

ANOVA also indirectly reveals if an independent variable is influencing the dependent variable.
For example, in the above blood sugar level experiment, suppose ANOVA finds that the differences between group means are not statistically significant and are only due to sampling error. This result implies that the type of medication (the independent variable) is not a significant factor influencing the blood sugar level.

ANOVA can only tell if there is a significant difference between the means of at least two groups, but it can't explain which pair differs in their means. If more granular detail is required, deploying further follow-up statistical procedures (post-hoc tests) will assist in finding out which groups differ in mean value. Typically, ANOVA is used in combination with other statistical methods.

ANOVA also assumes that the dataset is normally distributed, as it compares means only. If the data is not distributed across a normal curve and there are outliers, then ANOVA is not the right process to interpret the data.

Similarly, ANOVA assumes the standard deviations are the same or similar across groups. If
there is a big difference in standard deviations, the conclusion of the test may be inaccurate.

One of the biggest challenges in machine learning is the selection of the most reliable and useful features that are used in order to train a model. ANOVA helps in selecting the best features to train a model. ANOVA minimizes the number of input variables to reduce the complexity of the model. ANOVA helps to determine if an independent variable is influencing a target variable.

An example of ANOVA use in data science is in email spam detection. Because of the massive
number of emails and email features, it has become very difficult and resource-intensive to
identify and reject all spam emails. ANOVA and f-tests are deployed to identify features that
were important to correctly identify which emails were spam and which were not.

Areas of application: Even though ANOVA involves complex statistical steps, it is a beneficial technique for businesses via the use of AI. Organizations use ANOVA to make decisions about which alternative to choose among many possible options. For example, ANOVA can help to:

 Compare the yield of two different wheat varieties under three different fertilizer brands.
 Compare the effectiveness of various social media advertisements on the sales of a
particular product.
 Compare the effectiveness of different lubricants in different types of vehicles.

The term variance refers to a statistical measurement of the spread between numbers in a data
set. More specifically, variance measures how far each number in the set is from
the mean (average), and thus from every other number in the set. Variance is often depicted by
this symbol: σ2. It is used by both analysts and traders to determine volatility and market
security. The square root of the variance is the standard deviation (SD or σ), which helps
determine the consistency of an investment’s returns over a period of time.

In statistics, variance measures variability from the average or mean. It is calculated by taking
the differences between each number in the data set and the mean, then squaring the differences
to make them positive, and finally dividing the sum of the squares by the number of values in the
data set.

Variance is calculated by using the following formula:

σ² = ∑(xi - x̄)² / N

Where: xi = each value in the data set

x̄ = mean of all values in the data set

N = number of values in the data set

Statisticians use variance to see how individual numbers relate to each other within a data set,
rather than using broader mathematical techniques such as arranging numbers into quartiles. The
advantage of variance is that it treats all deviations from the mean the same regardless of their direction, and because the deviations are squared they cannot sum to zero and give the appearance of no variability in the data. One drawback to variance, though, is that it gives added weight to outliers, the numbers far from the mean; squaring these numbers can skew the data. Another pitfall of
using variance is that it is not easily interpreted. Users often employ it primarily to take the
square root of its value, which indicates the standard deviation of the data. As noted above,
investors can use standard deviation to assess how consistent returns are over time.

Here’s a hypothetical example to demonstrate how variance works. Let’s say returns for stock in
Company ABC are 10% in Year 1, 20% in Year 2, and −15% in Year 3. The average of these
three returns is 5%. The differences between each return and the average are 5%, 15%, and
−20% for each consecutive year.

Squaring these deviations yields 0.25%, 2.25%, and 4.00%, respectively. If we add these squared deviations, we get a total of 6.5%. When you divide the sum of 6.5% by one less than the number of returns in the data set, as this is a sample (2 = 3 - 1), it gives us a variance of 3.25% (0.0325).
Taking the square root of the variance yields a standard deviation of 18% (√0.0325 = 0.180) for
the returns.
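
A quick check of this arithmetic in R; var() and sd() divide by n - 1 by default, matching the sample treatment above:

returns <- c(0.10, 0.20, -0.15)   # the three annual returns from the example
mean(returns)                     # 0.05   -> the 5% average return
var(returns)                      # 0.0325 -> sample variance (divides by n - 1)
sd(returns)                       # 0.1803 -> standard deviation, about 18%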

I. What is linear correlation?
Linear correlation is a measure of dependence between two random variables.

It has the following characteristics:

 it ranges between -1 and 1;
 it is proportional to covariance;
 its interpretation is very similar to that of covariance.

Let X and Y be two random variables.

The linear correlation coefficient (or Pearson's correlation coefficient) between X and Y is

Corr[X, Y] = Cov[X, Y] / (σX σY)

Where:

 Cov[X, Y] is the covariance between X and Y;
 σX and σY are the standard deviations of X and Y.

The linear correlation coefficient is well-defined only as long as Cov[X, Y], σX and σY exist and are well-defined.

It is often denoted by ρXY or Corr[X, Y].

In principle, the ratio is well-defined only if σX and σY are strictly greater than zero.

However, it is often assumed that Corr[X, Y] = 0 when one of the two standard deviations is zero.

This is equivalent to assuming that 0/0 = 0, because Cov[X, Y] = 0 when one of the two standard deviations is zero.

The interpretation is similar to the interpretation of covariance: the correlation between X and Y provides a measure of how similar their deviations from the respective means are.

Linear correlation ranges between -1 and 1:

Thanks to this property, correlation allows us to easily understand the intensity of the linear
dependence between two random variables:

 The closer correlation is to 1, the stronger the positive linear dependence between X and Y is;
 The closer it is to -1, the stronger the negative linear dependence between X and Y is.

The following terminology is often used:

 if the correlation coefficient is greater than zero, X and Y are said to be positively linearly correlated;
 if it is less than zero, they are said to be negatively linearly correlated;
 if it is equal to zero, they are said to be uncorrelated.
In this example we show how to compute the coefficient of linear correlation
between two discrete random variables.
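
The worked example itself is not reproduced above, so the following R sketch computes the coefficient on an assumed joint probability mass function (all values are illustrative):

x_vals <- c(0, 1)
y_vals <- c(0, 1)
p <- matrix(c(0.4, 0.1,
              0.1, 0.4), nrow = 2, byrow = TRUE)   # assumed joint pmf, rows = x, columns = y

px <- rowSums(p); py <- colSums(p)                 # marginal distributions
ex <- sum(x_vals * px); ey <- sum(y_vals * py)     # E[X], E[Y]
varx <- sum((x_vals - ex)^2 * px)                  # Var[X]
vary <- sum((y_vals - ey)^2 * py)                  # Var[Y]
covxy <- sum(outer(x_vals, y_vals) * p) - ex * ey  # Cov[X, Y] = E[XY] - E[X]E[Y]

covxy / (sqrt(varx) * sqrt(vary))                  # linear correlation, 0.6 for this pmf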

II. Discuss the coefficient of correlation
A correlation coefficient is a number between -1 and 1 that tells you the strength and direction of
a relationship between variables. In other words, it reflects how similar the measurements of two
or more variables are across a dataset.

Correlation coefficient value | Correlation type | Meaning
1 | Perfect positive correlation | When one variable changes, the other variables change in the same direction.
0 | Zero correlation | There is no relationship between the variables.
-1 | Perfect negative correlation | When one variable changes, the other variables change in the opposite direction.

Correlation coefficients summarize data and help you compare results between studies.

Summarizing data: A correlation coefficient is a descriptive statistic. That means that it summarizes sample data without letting you infer anything about the population. A correlation coefficient is a bivariate statistic when it summarizes the relationship between two variables, and it's a multivariate statistic when you have more than two variables.

If your correlation coefficient is based on sample data, you’ll need an inferential statistic if you
want to generalize your results to the population. You can use an F test or a t test to calculate
a test statistic that tells you the statistical significance of your finding.

Comparing studies: A correlation coefficient is also an effect size measure, which tells you the
practical significance of a result. Correlation coefficients are unit-free, which makes it possible
to directly compare coefficients between studies.

In correlation research, you investigate whether changes in one variable are associated with changes in other variables.

Correlation research example: You investigate whether standardized scores from high school are
related to academic grades in college. You predict that there’s a positive correlation: higher SAT
scores are associated with higher college GPAs while lower SAT scores are associated with
lower college GPAs. After data collection, you can visualize your data with a scatter plot by
plotting one variable on the x-axis and the other on the y-axis. It doesn’t matter which variable
you place on either axis.

Visually inspect your plot for a pattern and decide whether there is a linear or non-linear pattern
between variables. A linear pattern means you can fit a straight line of best fit between the data
points, while a non-linear or curvilinear pattern can take all sorts of different shapes, such as a U-
shape or a line with a curve.

Visual inspection example: You gather a sample of 5,000 college graduates and survey them on
their high school SAT scores and college GPAs. You visualize the data in a scatter plot to check
for a linear pattern:

There are many different correlation coefficients that you can calculate. After removing any
outliers, select a correlation coefficient that’s appropriate based on the general shape of the
scatter plot pattern. Then you can perform a correlation analysis to find the correlation
coefficient for your data. You calculate a correlation coefficient to summarize the relationship
between variables without drawing any conclusions about causation.

Correlation analysis example: You check whether the data meet all of the assumptions for the
Pearson’s r correlation test. Both variables are quantitative and normally distributed with no
outliers, so you calculate a Pearson’s r correlation coefficient.

The correlation coefficient is strong at .58.

The value of the correlation coefficient
always ranges between 1 and -1, and you treat it as a general indicator of the strength of the
relationship between variables. The sign of the coefficient reflects whether the variables change
in the same or opposite directions: a positive value means the variables change together in the
same direction, while a negative value means they change together in opposite directions.
The absolute value of a number is equal to the number without its sign. The absolute value of a
correlation coefficient tells you the magnitude of the correlation: the greater the absolute value,
the stronger the correlation. There are many different guidelines for interpreting the correlation
coefficient because findings can vary a lot between study fields. You can use the table below as
a general guideline for interpreting correlation strength from the value of the correlation
coefficient. While this guideline is helpful in a pinch, it’s much more important to take your
research context and purpose into account when forming conclusions. For example, if most
studies in your field have correlation coefficients nearing .9, a correlation coefficient of .58 may
be low in that context.

Correlation coefficient Correlation strength Correlation type

-.7 to -1 Very strong Negative

-.5 to -.7 Strong Negative

-.3 to -.5 Moderate Negative

0 to -.3 Weak Negative

0 None Zero

0 to .3 Weak Positive

.3 to .5 Moderate Positive

.5 to .7 Strong Positive

.7 to 1 Very strong Positive

The correlation coefficient tells you how closely your data fit on a line. If you have a linear relationship, you'll draw a straight line of best fit that
takes all of your data points into account on a scatter plot. The closer your points are to this line,
the higher the absolute value of the correlation coefficient and the stronger your linear
correlation.

If all points are perfectly on this line, you have a perfect correlation.

If all points are close to this line, the absolute value of your correlation coefficient is high.

If these points are spread far from this line, the absolute value of your correlation coefficient
is low.

Note that the steepness or slope of the line isn’t related to the correlation coefficient value. The
correlation coefficient doesn’t help you predict how much one variable will change based on a
given change in the other, because two datasets with the same correlation coefficient value can
have lines with very different slopes.

You can choose from many different correlation
coefficients based on the linearity of the relationship, the level of measurement of your
variables, and the distribution of your data. For high statistical power and accuracy, it’s best to
use the correlation coefficient that’s most appropriate for your data. The most commonly used
correlation coefficient is Pearson’s r because it allows for strong inferences. It’s parametric and
measures linear relationships. But if your data do not meet all assumptions for this test, you’ll
need to use a non-parametric test instead.

Non-parametric tests of rank correlation coefficients summarize non-linear relationships between variables. The Spearman's rho and Kendall's tau have the same conditions for use, but Kendall's tau is generally preferred for smaller samples whereas Spearman's rho is more widely used.

The table below is a selection of commonly used correlation coefficients, and we’ll cover the two
most widely used coefficients in detail in this article.

Correlation coefficient | Type of relationship | Levels of measurement | Data distribution
Pearson's r | Linear | Two quantitative (interval or ratio) variables | Normal distribution
Spearman's rho | Non-linear | Two ordinal, interval or ratio variables | Any distribution
Point-biserial | Linear | One dichotomous (binary) variable and one quantitative (interval or ratio) variable | Normal distribution
Cramér's V (Cramér's φ) | Non-linear | Two nominal variables | Any distribution
Kendall's tau | Non-linear | Two ordinal, interval or ratio variables | Any distribution

The Pearson’s product-moment correlation coefficient, also known as Pearson’s r, describes the linear relationship between two quantitative variables.

These are the assumptions your data must meet if you want to use Pearson’s r:

 Both variables are on an interval or ratio level of measurement

 Data from both variables follow normal distributions
 Your data have no outliers
 Your data is from a random or representative sample
 You expect a linear relationship between the two variables
The Pearson’s r is a parametric test, so it has high power. But it’s not a good measure of
correlation if your variables have a nonlinear relationship, or if your data have outliers, skewed
distributions, or come from categorical variables. If any of these assumptions are violated, you
should consider a rank correlation measure. The formula for the Pearson’s r is complicated, but
most computer programs can quickly churn out the correlation coefficient from your data. In a
simpler form, the formula divides the covariance between the variables by the product of
their standard deviations.

The formula for Pearson's r is:

rxy = [ n∑xy - (∑x)(∑y) ] / sqrt{ [ n∑x² - (∑x)² ][ n∑y² - (∑y)² ] }

Where:
rxy = strength of the correlation between variables x and y
n = sample size
∑ = sum of what follows…
x = every x-variable value
y = every y-variable value
xy = the product of each x-variable score and the corresponding y-variable score
Pearson sample vs. population correlation coefficient formula: When using the Pearson
correlation coefficient formula, you’ll need to consider whether you’re dealing with data from a
sample or the whole population. The sample and population formulas differ in their symbols and
inputs. A sample correlation coefficient is called r, while a population correlation coefficient is
called rho, the Greek letter ρ. The sample correlation coefficient uses the sample covariance
between variables and their sample standard deviations.

Sample correlation coefficient formula:

rxy = cov(x, y) / (sx sy)

Where:
rxy = strength of the correlation between variables x and y
cov(x, y) = covariance of x and y
sx = sample standard deviation of x
sy = sample standard deviation of y
The population correlation coefficient uses the population covariance between variables and their
population standard deviations.

Population correlation coefficient formula:

ρXY = cov(X, Y) / (σX σY)

Where:
 ρXY = strength of the correlation between variables X and Y
 cov(X, Y) = covariance of X and Y
 σX = population standard deviation of X
 σY = population standard deviation of Y
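
A small R sketch showing that the sample formula (covariance divided by the product of sample standard deviations) agrees with the built-in cor() function; the data vectors are assumed for illustration:

x <- c(2, 4, 6, 8, 10)        # assumed sample values
y <- c(1, 3, 2, 7, 9)

cov(x, y) / (sd(x) * sd(y))   # sample covariance over the product of sample standard deviations
cor(x, y)                     # built-in Pearson's r (the default method) gives the same value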

Spearman’s rho, or Spearman’s rank correlation coefficient, is the most
common alternative to Pearson’s r. It’s a rank correlation coefficient because it uses the
rankings of data from each variable (e.g., from lowest to highest) rather than the raw data itself.

You should use Spearman’s rho when your data fail to meet the assumptions of Pearson’s r. This
happens when at least one of your variables is on an ordinal level of measurement or when the
data from one or both variables do not follow normal distributions. While the Pearson correlation
coefficient measures the linearity of relationships, the Spearman correlation coefficient measures
the monotonicity of relationships. In a linear relationship, each variable changes in one direction
at the same rate throughout the data range. In a monotonic relationship, each variable also always
changes in only one direction but not necessarily at the same rate.

 Positive monotonic: when one variable increases, the other also increases.
 Negative monotonic: when one variable increases, the other decreases.
Monotonic relationships are less restrictive than linear relationships.

Spearman’s rank correlation coefficient formula: The symbols for Spearman’s rho are ρ for
the population coefficient and rs for the sample coefficient. The formula calculates the
Pearson’s r correlation coefficient between the rankings of the variable data. To use this
formula, you’ll first rank the data from each variable separately from low to high: every data
point gets a rank from first, second, or third, etc. Then, you’ll find the differences (di) between
the ranks of your variables for each data pair and take that as the main input for the formula.

Spearman’s rank correlation coefficient formula:

rs = 1 - [ 6∑di² ] / [ n(n² - 1) ]

Where:
rs = strength of the rank correlation between variables
di = the difference between the x-variable rank and the y-variable rank for each pair of data
∑di² = sum of the squared differences between x- and y-variable ranks
n = sample size

If you have a correlation coefficient of 1, all of the rankings for each variable match up for every
data pair. If you have a correlation coefficient of -1, the rankings for one variable are the exact
opposite of the ranking of the other variable. A correlation coefficient near zero means that
there’s no monotonic relationship between the variable rankings.
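
A brief R sketch of the ranking approach, using assumed scores with no ties; applying the formula to the ranks gives the same value as the built-in Spearman option:

x <- c(35, 23, 47, 17, 10)     # assumed raw scores
y <- c(30, 33, 45, 23, 8)

rx <- rank(x); ry <- rank(y)   # rank each variable separately
d  <- rx - ry                  # difference between the ranks for each pair
n  <- length(x)

1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman's formula on the ranks: 0.9 here
cor(x, y, method = "spearman")       # same value from the built-in function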

The correlation coefficient is related to two other coefficients, and these give you more information about the relationship between variables.

Coefficient of determination: When you square the correlation coefficient, you end up with the coefficient of determination (r2). This is the proportion of common variance between the variables. The coefficient of determination is always between 0 and 1, and it's often expressed as a percentage.

Coefficient of determination Explanation


r2 The correlation coefficient multiplied by itself
The coefficient of determination is used in regression models to measure how much of the
variance of one variable is explained by the variance of the other variable. A regression analysis
helps you find the equation for the line of best fit, and you can use it to predict the value of one
variable given the value for the other variable. A high r2 means that a large amount
of variability in one variable is determined by its relationship to the other variable. A
low r2 means that only a small portion of the variability of one variable is explained by its
relationship to the other variable; relationships with other variables are more likely to account for
the variance in the variable.

The correlation coefficient can often overestimate the relationship between variables, especially
in small samples, so the coefficient of determination is often a better indicator of the relationship.

Coefficient of alienation: When you subtract the coefficient of determination from one, you get the coefficient of alienation. This is the proportion of common variance not shared between the variables, i.e. the unexplained variance between the variables.

Coefficient of alienation Explanation


1 – r2 One minus the coefficient of determination
A high coefficient of alienation indicates that the two variables share very little variance in
common. A low coefficient of alienation means that a large amount of variance is accounted for
by the relationship between the variables.
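
Using the correlation of .58 from the earlier example, the two quantities can be computed directly:

r <- 0.58
r^2       # coefficient of determination: about 0.34 of the variance is shared
1 - r^2   # coefficient of alienation: about 0.66 of the variance is unexplained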

III. Discuss Rank correlation coefficient and Simple linear regression
The Spearman’s Rank Correlation
Coefficient is the non-parametric statistical measure used to study the strength of association
between the two ranked variables. This method is applied to the ordinal set of numbers, which
can be arranged in order, i.e. one after the other so that ranks can be given to each. In the rank
correlation coefficient method, ranks are given to each individual on the basis of quality or quantity: ranking starts from position 1 and goes to position N for the one ranked last in the group.

The formula to calculate the rank correlation coefficient is:

R = 1 - [ 6∑D² ] / [ N(N² - 1) ]

Where, R = Rank coefficient of correlation

D = Difference of ranks

N = Number of Observations

The value of R lies between ±1, such that:

R = +1: there is complete agreement in the order of ranks, and the ranks move in the same direction.
R = -1: there is complete agreement in the order of ranks, but they move in opposite directions.
R = 0: there is no association between the ranks.

While solving for the rank correlation coefficient one may come across the following problems:

 Where actual Ranks are given


 Where ranks are not given
 Equal Ranks or Tie in Ranks
Where actual ranks are given: An individual must follow the following steps to calculate the
correlation coefficient:

 First, the difference between the ranks (R1-R2) must be calculated, denoted by D.
 Then, square these differences to remove the negative sign and obtain its sum ∑D2.
 Apply the formula as shown above.

Where ranks are not given: In case the ranks are not given, then the individual may assign the
rank by taking either the highest value or the lowest value as 1. Whichever criterion is chosen, the same method should be applied to all the variables.

Equal Ranks or Tie in Ranks: In case the same ranks are assigned to two or more entities, then
the ranks are assigned on an average basis. Such as if two individuals are ranked equal at third
position, then the ranks shall be calculated as: (3+4)/2 = 3.5

The formula to calculate the rank correlation coefficient when there is a tie in the ranks is:

R = 1 - 6[ ∑D² + ∑(m³ - m)/12 ] / [ N(N² - 1) ]

Where m = number of items whose ranks are common.

Note: The Spearman’s rank correlation coefficient method is applied only when the initial data
are in the form of ranks, and N (number of observations) is fairly small, i.e. not greater than 25
or 30.
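
A short R sketch of tied ranks, using assumed scores; note that the textbook tie-corrected formula and R's built-in Spearman option (Pearson's r on the averaged ranks) give close but not identical values:

x <- c(85, 70, 60, 60, 40)   # assumed scores; two observations tie for third place
y <- c(90, 65, 70, 50, 45)

rx <- rank(-x); ry <- rank(-y)   # descending ranks; the tie gets (3 + 4) / 2 = 3.5
d  <- rx - ry
n  <- length(x)
m  <- 2                          # one group of two tied ranks in x

1 - 6 * (sum(d^2) + (m^3 - m) / 12) / (n * (n^2 - 1))   # tie-corrected formula: 0.8
cor(x, y, method = "spearman")                          # about 0.82 from the built-in function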

Simple linear regression is used to estimate the relationship between two quantitative variables. You can use simple linear regression when you want to know:

 How strong the relationship is between two variables (e.g., the relationship between
rainfall and soil erosion).
 The value of the dependent variable at a certain value of the independent variable (e.g.,
the amount of soil erosion at a certain level of rainfall).
Regression models describe the relationship between variables by fitting a line to the observed
data. Linear regression models use a straight line, while logistic and nonlinear regression models
use a curved line. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change.

Simple linear regression example; You are a social researcher interested in the relationship
between income and happiness. You survey 500 people whose incomes range from 15k to 75k
and ask them to rank their happiness on a scale from 1 to 10.

Your independent variable (income) and dependent variable (happiness) are both quantitative, so
you can do a regression analysis to see if there is a linear relationship between them.

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the
data. These assumptions are:

1. Homogeneity of variance (homoscedasticity): the size of the error in our prediction


doesn’t change significantly across the values of the independent variable.
2. Independence of observations: the observations in the dataset were collected using
statistically valid sampling methods, and there are no hidden relationships among
observations.
3. Normality: The data follows a normal distribution.
Linear regression makes one additional assumption:

4. The relationship between the independent and dependent variable is linear: the line of
best fit through the data points is a straight line (rather than a curve or some sort of grouping
factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to
use a nonparametric test instead, such as the Spearman rank test.

Example: Data that doesn’t meet the assumptions; you think there is a linear relationship
between cured meat consumption and the incidence of colorectal cancer in the U.S. However,
you find that much more data has been collected at high rates of meat consumption than at low
rates of meat consumption, with the result that there is much more variation in the estimate of
cancer rates at the low range than at the high range. Because the data violate the assumption of
homoscedasticity, it doesn’t work for regression, but you perform a Spearman rank test instead.

If your data violate the assumption of independence of observations (e.g., if observations are
repeated over time), you may be able to perform a linear mixed-effects model that accounts for
the additional structure in the data.

How to perform a simple linear regression: Simple linear regression formula

The formula for a simple linear regression is:

y = B0 + B1x + e

 y is the predicted value of the dependent variable (y) for any given value of the independent variable (x).
 B0 is the intercept, the predicted value of y when x is 0.
 B1 is the regression coefficient – how much we expect y to change as x increases.
 x is the independent variable (the variable we expect is influencing y).
 e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit through your data by searching for the regression
coefficient (B1) that minimizes the total error (e) of the model. While you can perform a linear
regression by hand, this is a tedious process, so most people use statistical programs to help them
quickly analyze the data.

Simple linear regression in R: R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example. Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:

R code for simple linear regression

income.happiness.lm <- lm(happiness ~ income, data = income.data)

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable income has on the dependent variable happiness, using the R linear model function lm().
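
Assuming income.data has been loaded as in the example above, the fitted model can then be inspected; summary() reports the estimates of B0 and B1 together with their p-values:

summary(income.happiness.lm)                   # estimates of B0 and B1, standard errors, p-values
plot(happiness ~ income, data = income.data)   # scatter plot of the raw data
abline(income.happiness.lm)                    # add the fitted regression line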

The mean of the sampling distribution of the difference between two independent proportions (p1 - p2) is:

μ(p1 - p2) = π1 - π2

The standard error of p1 - p2 is:

σ(p1 - p2) = sqrt{ [ π1(1 - π1) / n1 ] + [ π2(1 - π2) / n2 ] }

The sampling distribution of p1- p2 is approximately normal as long as the proportions are not
too close to 1 or 0 and the sample sizes are not too small. As a rule of thumb, if n1 and n2 are
both at least 10 and neither is within 0.10 of 0 or 1 then the approximation is satisfactory for
most purposes. An alternative rule of thumb is that the approximation is good if both Nπ and N
(1 - π) are greater than 10 for both π1 and π2.

To see the application of this sampling distribution, assume that 0.8 of high school graduates but
only 0.4 of high school drop outs are able to pass a basic literacy test. If 20 students are sampled
from the population of high school graduates and 25 students are sampled from the population of
high school drop outs, what is the probability that the proportion of drop outs that pass will be as
high as the proportion of graduates?

For this example, the mean of the sampling distribution of p1 - p2 is:

μ= π1 - π2 = 0.8 - 0.4 = 0.4.

The standard error is:

σ(p1 - p2) = sqrt{ [ (0.8)(0.2) / 20 ] + [ (0.4)(0.6) / 25 ] } = sqrt(0.0176) = 0.133.

The solution to the problem is the probability that p1 - p2 is less than or equal to 0. The number of standard deviations above the mean associated with a difference in proportions of 0 is:

z = (0 - 0.4) / 0.133 = -3.01

From the z table it can be determined that only 0.0013 of the time would p1 - p2 be 3.01 or more standard deviations below the mean.
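
A quick numerical check of this example in R, where pnorm() returns the standard normal cumulative probability:

p1 <- 0.8; n1 <- 20   # high school graduates
p2 <- 0.4; n2 <- 25   # high school dropouts

se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # 0.133
z  <- (0 - (p1 - p2)) / se                            # about -3.01
pnorm(z)                                              # about 0.0013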

Suppose we have two populations with proportions equal to P1 and P2. Suppose further that we
take all possible samples of size n1 and n2. And finally, suppose that the following assumptions
are valid.

 The size of each population is large relative to the sample drawn from the population.
That is, N1 is large relative to n1, and N2 is large relative to n2. (In this context,
populations are considered to be large if they are at least 20 times bigger than their
sample.)
 The samples from each population are big enough to justify using a normal distribution to
model differences between proportions. The sample sizes will be big enough when the
following conditions are met: n1P1 > 10, n1(1 -P1) > 10, n2P2 > 10, and n2(1 - P2) > 10.
(This criterion requires that more than 20 observations be sampled from each population.
When P1 or P2 is more extreme than 0.5, even more observations are required.)
 The samples are independent; that is, observations in population 1 are not affected by
observations in population 2, and vice versa.
Given these assumptions, we know the following.

 The set of differences between sample proportions will be normally distributed. We know
this from the central limit theorem.
 The expected value of the difference between all possible sample proportions is equal to
the difference between population proportions. Thus, E (p1 - p2) = P1 - P2.
 The standard deviation of the difference between sample proportions (σd) is
approximately equal to:
σd = sqrt{ [P1(1 - P1) / n1] + [P2(1 - P2) / n2] }

It is straightforward to derive the last bullet point, based on material covered in previous lessons.
The derivation starts with a recognition that the variance of the difference between independent
random variables is equal to the sum of the individual variances. Thus,

σ²d = σ²(p1 - p2) = σ²p1 + σ²p2

If the populations N1 and N2 are both large relative to n1 and n2, respectively, then

σ²p1 = P1(1 - P1) / n1 and σ²p2 = P2(1 - P2) / n2

Therefore,

σ²d = [ P1(1 - P1) / n1 ] + [ P2(1 - P2) / n2 ]

And

σd = sqrt{ [ P1(1 - P1) / n1 ] + [ P2(1 - P2) / n2 ] }

Example: In one state, 52% of the voters are Republicans, and 48% are Democrats. In a second
state, 47% of the voters are Republicans, and 53% are Democrats. Suppose 100 voters are
surveyed from each state. Assume the survey uses simple random sampling.

What is the probability that the survey will show a greater percentage of Republican voters in the
second state than in the first state?

(A) 0.04 (B) 0.05 (C) 0.24 (D) 0.71 (E) 0.76

Solution: The correct answer is C. For this analysis, let P1 = the proportion of Republican voters
in the first state, P2 = the proportion of Republican voters in the second state, p1 = the proportion
of Republican voters in the sample from the first state, and p2 = the proportion of Republican
voters in the sample from the second state. The number of voters sampled from the first state (n1)
= 100, and the number of voters sampled from the second state (n2) = 100.

The solution involves four steps.

 Make sure the samples from each population are big enough to model differences with a
normal distribution. Because n1P1 = 100 * 0.52 = 52, n1(1 - P1) = 100 * 0.48 = 48, n2P2 =
100 * 0.47 = 47, and n2(1 - P2) = 100 * 0.53 = 53 are each greater than 10, the sample
size is large enough.
 Find the mean of the difference in sample proportions: E(p1 - p2) = P1 - P2 = 0.52 - 0.47 =
0.05.

 Find the standard deviation of the difference.
σd = sqrt{ [ P1(1 - P1) / n1 ] + [ P2(1 - P2) / n2 ] }

σd = sqrt{[(0.52)(0.48) / 100] + [(0.47)(0.53) / 100]}

σd = sqrt (0.002496 + 0.002491)

σd = sqrt(0.004987) = 0.0706

 Find the probability. This problem requires us to find the probability that p1 is less than
p2. This is equivalent to finding the probability that p1 - p2 is less than zero. To find this
probability, we need to transform the random variable (p1 - p2) into a z-score. That
transformation appears below.
z(p1 - p2) = [ x - E(p1 - p2) ] / σd

z(p1 - p2) = (0 - 0.05) / 0.0706 = -0.7082

From the standard normal table, the probability that z is less than -0.7082 is about 0.24, which is the probability that the survey shows a greater percentage of Republican voters in the second state (answer C).
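
The same steps can be checked numerically in R; the result of about 0.24 corresponds to choice (C):

P1 <- 0.52; P2 <- 0.47
n1 <- 100;  n2 <- 100

sd_d <- sqrt(P1 * (1 - P1) / n1 + P2 * (1 - P2) / n2)   # 0.0706
z    <- (0 - (P1 - P2)) / sd_d                          # about -0.71
pnorm(z)                                                # about 0.24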

Reference

 https://www.statlect.com/fundamentals-of-probability/linear-correlation#hid2
 https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/
 https://www.scribbr.com/statistics/correlation-coefficient/
 https://www.scribbr.com/statistics/simple-linear-regression/
 https://stattrek.com/sampling/difference-in-proportion
