Hypothesis Testing in Statistics

The document discusses hypothesis testing in statistics, explaining the concepts of null and alternate hypotheses, the significance of p-values, and the relationship between hypothesis tests and confidence intervals. It outlines the steps for conducting hypothesis tests, including the p-value approach and rejection region approach, along with examples to illustrate these concepts. Additionally, it covers different types of hypothesis tests based on sample sizes and known population standard deviations.

Uploaded by

Vishnu Anand

Mathematics for Computer Science Engineers

HYPOTHESIS and INFERENCE


Dr. Deepa Nair
Department of Science and Humanities
PES University, Bangalore
Introduction

Textbook: Page 396

[Figure: water after filtration; samples of size n = 100]
Is it plausible that this sample, with its mean of 550, could have come from a population whose mean is 650 or more?

• Hypothesis tests are closely related to confidence intervals.

• In general, whenever a confidence interval can be computed, a hypothesis test can also be performed, and vice versa.

• A hypothesis is an assumption about a population parameter.

• A hypothesis test produces a number between 0 and 1 that measures the degree of certainty we may have in the truth of a hypothesis about a quantity such as a population mean or proportion.

Null Hypothesis

• Represented by H0.

• The null hypothesis says that the effect indicated by the sample is due only to random variation between the sample and the population.

• In performing a hypothesis test, we essentially put the null hypothesis on trial.

• In scientific research, the null hypothesis is the claim that the effect being studied does not exist.

Alternate Hypothesis

• Represented by H1.

• The alternate hypothesis says that the effect indicated by the sample is real, in that it accurately represents the whole population.

• In statistical hypothesis testing, the alternate hypothesis is one of the two competing propositions in the test.

Hypothesis Testing

Null hypothesis (H0):
• States the hypothesized value of the parameter before sampling.
• The assumption we wish to test.

Alternate hypothesis (H1):
• All possible alternatives other than the null hypothesis.

Example:
Do men and women have different average salaries after graduating from university?

Example:
A coin was flipped 50 times, resulting in 40 heads and 10 tails.

Example:
A new type of battery will be installed in heart pacemakers if it can be shown to have a mean lifetime greater than eight years.

• A parameter is a characteristic of the population, such as its mean or variance.

• The parameter must be identified before the analysis.

Example: "I assume the average weight of this class is 58 kg."

• The best way to determine whether a statistical hypothesis is true would be to examine the entire population.

• Since that is often impractical, researchers typically examine a random sample from the population.

• If the sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.

Textbook: Page 400

Note:
In any hypothesis test, we calculate conditional probabilities based on the assumption that the null hypothesis is true.

P-value: measures the plausibility of H0. The smaller the P-value, the stronger the evidence against H0.

If the P-value is sufficiently small, we may be willing to abandon our assumption that H0 is true and believe H1 instead. This is referred to as rejecting the null hypothesis.

Note:
It is natural to ask how small the P-value should be in order to reject H0. Some people use the “5% rule”; they reject H0 if P ≤ 0.05.

Conclusions from Hypothesis Tests

α is the level of significance.


Types of Hypothesis Tests:

Sl. No.   Type of hypothesis test   Alternate hypothesis (H1)   Null hypothesis (H0)
1         Left-tailed test          <                           ≥
2         Right-tailed test         >                           ≤
3         Two-tailed test           ≠                           =


Note: choice of hypothesis test

Sample size                                            Population S.D. known   Population S.D. unknown
n < 30 (small sample drawn from a normal population)   Use z-test              Use t-test
n ≥ 30 (large sample)                                  Use z-test              Use z-test or t-test

Topics Covered

❖ 6.1 - Large-Sample Tests for a Population Mean

Textbook: Chapter 6, section 6.1


Large-Sample Tests for a Population Mean

• Compute the P-value.
• The P-value is an area under the normal curve, which depends on the alternate hypothesis as in the table:

Alternate hypothesis   P-value
H1: μ > μ0             Area to the right of z
H1: μ < μ0             Area to the left of z
H1: μ ≠ μ0             Sum of the areas in the tails cut off by z and −z
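
The table above can be sketched as a small Python helper. This is only an illustration: the function name `p_value_from_z` and the labels `"greater"`, `"less"`, and `"two-sided"` are assumptions made here, not notation from the textbook. It uses the standard normal distribution from the Python standard library.

```python
from statistics import NormalDist


def p_value_from_z(z, alternative):
    """Return the P-value for a z statistic (hypothetical helper).

    alternative is one of:
      "greater"   for H1: mu > mu0  (area to the right of z)
      "less"      for H1: mu < mu0  (area to the left of z)
      "two-sided" for H1: mu != mu0 (both tails cut off by z and -z)
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":
        return 1.0 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    if alternative == "two-sided":
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
```

For a two-sided alternative the sign of z does not matter, which is why the sketch takes the absolute value before looking up the tail area.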

Example
The article “Wear in Boundary Lubrication” (S. Hsu, R. Munro, and M. Shen, Journal of Engineering Tribology, 2002: 427–441) discusses several experiments involving various lubricants. In one experiment, 45 steel balls lubricated with purified paraffin were subjected to a 40 kg load at 600 rpm for 60 minutes. The average wear, measured by the reduction in diameter, was 673.2 μm, and the standard deviation was 14.9 μm. Assume that the specification for a lubricant is that the mean wear be less than 675 μm. Find the P-value for testing H0: μ ≥ 675 versus H1: μ < 675.

Textbook: Page 400



Example - Continued

Solution:
• The test statistic is $z = \frac{673.2 - 675}{14.9/\sqrt{45}} = -0.81$, and the P-value (the area to the left of z) is 0.209.

• Therefore, if H0 is true, there is a 20.9% chance of observing a sample whose disagreement with H0 is at least as great as that which was actually observed.

• Since 0.209 is not a very small probability, we do not reject H0.

• Instead, we conclude that H0 is plausible.

Textbook: Page 400,401



Rejection based on the critical value

Example

A scale is to be calibrated by weighing a 1000 g test weight 60 times. The 60 scale readings have mean 1000.6 g and standard deviation 2 g.

• Find the P-value for testing H0: μ = 1000 versus H1: μ ≠ 1000.

Textbook: Page 402



Example - Continued

Solution:
H0: μ = 1000 versus H1: μ ≠ 1000.
We assume H0 is true. The standard deviation of the sample mean is 2/√60 = 0.258, so
$z = \frac{1000.6 - 1000}{0.258} = 2.32$

Textbook: Page 402



Solution continued:

• The P-value is the sum of the areas in both tails, which is 0.0204.

• Therefore, if H0 is true, the probability of a result as extreme as or more extreme than that observed is only 0.0204.

• The evidence against H0 is pretty strong. It would be prudent to reject H0 and to recalibrate the scale.
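
The scale example can be checked numerically. The sketch below is a minimal illustration with assumed names (`large_sample_z` is not from the textbook); it computes z from the summary statistics and then the two-tailed P-value. Because it does not round z to 2.32 before looking up the area, the P-value it produces differs slightly from the table-based 0.0204.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_z(sample_mean, mu0, s, n):
    """z statistic for a large-sample test of a population mean.

    The sample standard deviation s stands in for the population
    standard deviation, which is reasonable only for large n.
    """
    return (sample_mean - mu0) / (s / sqrt(n))


# Scale calibration example: n = 60 readings, mean 1000.6 g, s = 2 g.
z = large_sample_z(1000.6, 1000, 2, 60)

# Two-tailed P-value for H1: mu != 1000.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```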

Textbook: Page 402



Large-Sample Tests for a Population Mean

Example:

• A sample of 400 male students is found to have a mean height of 67.47 inches. Can it reasonably be regarded as a sample from a large population with mean height 67.39 inches and standard deviation 1.30 inches?

Solution:
H0: μ = 67.39 inches
H1: μ ≠ 67.39 inches
$z = \frac{67.47 - 67.39}{1.30/\sqrt{400}} = 1.231$

Example:

• A coin was tossed 400 times and heads turned up 216 times. Test the hypothesis that the coin is unbiased.

Solution:
H0: p = 1/2
H1: p ≠ 1/2
With n = 400, x = 216, and q = 1 − p = 1/2,
$z = \frac{x - np}{\sqrt{npq}} = \frac{216 - 200}{\sqrt{400 \cdot \frac{1}{2} \cdot \frac{1}{2}}} = 1.6$

Since P > 0.05, we fail to reject H0 (loosely, "accept H0"). The data are consistent with the coin being unbiased.
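
A short check of the coin example, as a sketch only (the variable names are chosen here for illustration). It reproduces z = 1.6 from the counts and computes the two-tailed P-value that the conclusion P > 0.05 relies on.

```python
from math import sqrt
from statistics import NormalDist

n = 400        # number of tosses
x = 216        # number of heads observed
p0 = 0.5       # value of p under H0 (unbiased coin)
q0 = 1 - p0

# z = (x - n p) / sqrt(n p q), the normal approximation to the binomial count.
z = (x - n * p0) / sqrt(n * p0 * q0)

# H1: p != 1/2, so both tails count.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```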

Example:

• A trucking firm is suspicious of the claim that the average lifetime of certain tires is at least 28,000 miles.

• To check the claim, the firm puts 40 of these tires on its trucks and gets a mean lifetime of 27,463 miles with a standard deviation of 1,348 miles.

Find the P-value for testing
H0: μ ≥ 28,000 miles
H1: μ < 28,000 miles

Solution:
$z = \frac{27{,}463 - 28{,}000}{1{,}348/\sqrt{40}} = -2.52 < -2.33$
(−2.33 is the critical value for a left-tailed test at α = 0.01.)

The P-value is 0.0059.
Since the P-value is a very small probability, we reject H0.


Textbook: Page 403


References:

• “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.

UNIT 2: HYPOTHESIS and INFERENCE
Drawing Conclusions from the Results of Hypothesis Tests
(P-value approach, rejection region approach)

Topics Covered

❖ Drawing conclusions from the results of hypothesis tests
❖ P-value approach and rejection region approach
❖ Statistical significance

Textbook: Chapter 6, section 6.2


Drawing Conclusions from the Results of Hypothesis Tests

Textbook: Page 407

P-value Approach

P-value = probability of obtaining results at least as extreme as those observed, assuming H0 is true.
• Steps:
1. Compute the test statistic (z, t, etc.).
2. Find the P-value.
3. Compare it with α.
• Rule:
- If P ≤ α → Reject H0
- If P > α → Fail to reject H0
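
The decision rule above is simple enough to state as code. The function below is a hypothetical helper written for illustration; it only encodes the comparison of the P-value with α.

```python
def p_value_decision(p_value, alpha):
    """Apply the P-value rule: reject H0 exactly when P <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# The same P-value can lead to different decisions at different levels.
print(p_value_decision(0.02, 0.05))
print(p_value_decision(0.02, 0.01))
```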

Rejection Region Approach for Hypothesis Tests

Critical Point & Rejection Region

• A critical point is a value of the test statistic that produces a P-value exactly equal to α.

• The region on the side of the critical point that leads to rejection is called the rejection region.

• The critical point itself is also in the rejection region.



Rejection Region Approach

• Based on rejection and acceptance regions of the test statistic.
• Steps:
1. Choose the significance level (α).
2. Find the critical value(s) from statistical tables.
3. Define the acceptance and rejection regions.
4. Compare the test statistic with the critical value(s).
• Rule:
- If the test statistic ∈ rejection region → Reject H0
- Else → Fail to reject H0
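
For a z test, the steps above can be sketched as follows. This is an illustration under stated assumptions: the function name and the `alternative` labels are invented here, and the critical values come from the inverse normal CDF in the Python standard library rather than from a printed table.

```python
from statistics import NormalDist


def in_rejection_region(z, alpha, alternative):
    """Return True if z falls in the rejection region of a z test.

    alternative: "less" (left-tailed), "greater" (right-tailed),
    or "two-sided". The critical point itself counts as rejection.
    """
    std_normal = NormalDist()
    if alternative == "less":
        return z <= std_normal.inv_cdf(alpha)            # e.g. about -2.33 for alpha = 0.01
    if alternative == "greater":
        return z >= std_normal.inv_cdf(1 - alpha)
    if alternative == "two-sided":
        return abs(z) >= std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# Tire-lifetime example from the large-sample section: z = -2.52, left-tailed.
print(in_rejection_region(-2.52, 0.01, "less"))
```

The tire example rejects H0 at α = 0.01 under this rule, which agrees with the P-value of 0.0059 being below 0.01.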

Comparison of Approaches

P-value approach:
• Gives the strength of the evidence against H0, not only a decision.
• Common in modern software.

Rejection region approach:
• Relies on cutoff values of the test statistic.
• Traditional/manual method.

Note: For a given α, both approaches lead to the same decision.

Note:
The P-value is not the probability that H0 is true.
The P-value is the probability (assuming H0 is true) of obtaining data as extreme as, or more extreme than, what was observed.

The null hypothesis, on the other hand, either is true or is not true. The truth or falsehood of H0 cannot be changed by repeating the experiment. It is therefore not correct to discuss the “probability” that H0 is true.

Textbook: Page 406

Statistical Significance Is Not the Same as Practical Significance:

• A small P-value does not by itself show that an effect is important. What the P-value does measure is the degree of confidence we can have that the true value is really different from the value specified by the null hypothesis.

• When the P-value is small, we can be confident that the true value is really different.

• This does not necessarily imply that the difference is large enough to be of practical importance.

MCQS / True or False

1) True or false: If P = 0.02, then
a. The result is statistically significant at the 5% level.
b. The result is statistically significant at the 1% level.
c. The null hypothesis is rejected at the 5% level.
d. The null hypothesis is rejected at the 1% level.

Ans:
(a) True. The result is statistically significant at any level greater than or equal to 2%.
(b) False. P > 0.01, so the result is not statistically significant at the 1% level.
(c) True. The null hypothesis is rejected at any level greater than or equal to 2%.
(d) False. P > 0.01, so the null hypothesis is not rejected at the 1% level.

2) George performed a hypothesis test. Luis checked George’s work by redoing the calculations. Both George and Luis agree that the result was statistically significant at the 5% level, but they got different P-values. George got a P-value of 0.20 and Luis got a P-value of 0.02.
a. Is it possible that George’s work is correct? Explain.
b. Is it possible that Luis’s work is correct? Explain.

Ans:
(a) No. If the P-value is 0.20, then the result is not statistically significant at the 5% level.
(b) Yes. If the P-value is 0.02, then the result is statistically significant at the 5% level.

References

• “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.

Relationship between Hypothesis Tests and Confidence Intervals
Tests for a Population Proportion

Textbook: Page 410


Mathematics for Computer Science Engineers
Topics Covered

❖6.3 - Large-Sample Tests for a Population Proportion

Textbook: Chapter 6 , section 6.3, 6.4, 6.5



Tests for a Population Proportion

Claim: more than 85% of the students across the campuses participated in the revision session held on 7/10/2020.

• A sample of 120 students was taken.
• 75 of them were found to have attended.

Can we accept the claim?

Alternate hypothesis   P-value
H1: p > p0             Area to the right of z
H1: p < p0             Area to the left of z
H1: p ≠ p0             Sum of the areas in the tails cut off by z and −z
Mathematics for Computer Science Engineers
Tests for a Population Proportion

Example:

• The article “Refinement of Gravimetric Geoid Using GPS and Leveling Data” (W.
Thurston, Journal of Surveying Engineering, 2000:27–56) presents a method for
measuring orthometric heights above sea level.

• For a sample of 1225 baselines, 926 gave results that were within the class C spirit
leveling tolerance limits.

• Can we conclude that this method produces results within the tolerance limits more
than 75% of the time?
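
One way to set up this example, sketched here as an assumption since the worked solution slide is not reproduced in the text, is H0: p ≤ 0.75 versus H1: p > 0.75, with the large-sample z statistic for a proportion. The helper name `proportion_z` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(x, n, p0):
    """z statistic for a large-sample test of a population proportion."""
    p_hat = x / n                      # sample proportion
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


# 926 of 1225 baselines were within tolerance; null value p0 = 0.75.
z = proportion_z(926, 1225, 0.75)

# H1: p > 0.75, so the P-value is the area to the right of z.
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P-value = {p_value:.4f}")
```

Under this setup the P-value is far above 0.05, so the data do not allow the conclusion that the method is within tolerance more than 75% of the time.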


Example:

A commonly prescribed drug for relieving nervous tension is believed to be only 60%
effective. Experimental results with a new drug administered to a random sample of
100 adults who were suffering from nervous tension show that 70 received relief.

Is this sufficient evidence to conclude that the new drug is superior to the one
commonly prescribed? Use a 0.05 level of significance.
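
A worked computation for this example, offered as a sketch. It assumes the natural one-sided setup $H_0: p \le 0.6$ versus $H_1: p > 0.6$, which the text does not state explicitly, and uses the large-sample z statistic for a proportion with $\hat{p} = 70/100 = 0.70$.

```latex
z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)/n}}
  = \frac{0.70 - 0.60}{\sqrt{(0.60)(0.40)/100}}
  = \frac{0.10}{0.0490}
  \approx 2.04,
\qquad
P = P(Z > 2.04) \approx 0.0207 .
```

Since $0.0207 < 0.05$, H0 is rejected at the 0.05 level under this setup, which supports the claim that the new drug is superior.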
MCQS

A company claims that 60% of its customers are satisfied. You take a sample of 200
customers and find that 120 are satisfied. Which hypothesis test setup is correct for
testing the claim?

Answer) a

References

• “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.

Small-Sample Tests for a Population Mean

Topics Covered

❖6.4 - Small-Sample Tests for a Population mean

Textbook: Chapter 6 , section 6.3, 6.4, 6.5



Small-Sample Tests for a Population Mean

A small-sample test for a population mean is typically conducted when the sample size is small (n < 30), the population is approximately normal, and the population standard deviation is unknown. In such cases, a t-test is used instead of a z-test.
Mathematics for Computer Science Engineers
Small-Sample Tests for a Population Mean

Examples
1. Spacer collars for a transmission countershaft have a thickness specification of
38.98–39.02 mm. The process that manufactures the collars is supposed to be
calibrated so that the mean thickness is 39.00 mm, which is in the center of the
specification window.
A sample of six collars is drawn and measured for thickness. The six thicknesses are
39.030, 38.997, 39.012, 39.008, 39.019, and 39.002. Assume that the population of
thicknesses of the collars is approximately normal. Can we conclude that the process
needs recalibration?
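
A sketch of the computation for the collar data, assuming the hypotheses H0: μ = 39.00 versus H1: μ ≠ 39.00 (the worked solution is not reproduced in the text). The Python standard library has no t distribution, so the statistic is compared with a critical value copied from a t table: 2.571 for 5 degrees of freedom and upper-tail area 0.025.

```python
from math import sqrt
from statistics import mean, stdev

thickness = [39.030, 38.997, 39.012, 39.008, 39.019, 39.002]  # mm
mu0 = 39.00                      # target mean thickness under H0
n = len(thickness)

# t statistic with n - 1 = 5 degrees of freedom.
t = (mean(thickness) - mu0) / (stdev(thickness) / sqrt(n))

# Table value (assumption: two-tailed test at the 5% level, 5 df).
T_CRITICAL = 2.571
reject_h0 = abs(t) >= T_CRITICAL

print(f"t = {t:.3f}, reject H0 at the 5% level: {reject_h0}")
```

At the 5% level the statistic does not reach the critical value, so under this setup the data do not conclusively show that recalibration is needed.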

Example:
1. Before a substance can be deemed safe for landfilling, its chemical properties must
be characterized. The article “Landfilling Ash/Sludge Mixtures” (J. Benoit, T. Eighmy,
and B. Crannell, Journal of Geotechnical and Geoenvironmental Engineering, 1999:
877–888) reports that in a sample of six replicates of sludge from a New Hampshire
wastewater treatment plant, the mean pH was 6.68 with a standard deviation of
0.20. Can we conclude that the mean pH is less than 7.0?
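
A worked sketch for the pH example, assuming $H_0: \mu \ge 7.0$ versus $H_1: \mu < 7.0$ and an approximately normal population. The critical values quoted are standard t-table entries for 5 degrees of freedom.

```latex
t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
  = \frac{6.68 - 7.0}{0.20/\sqrt{6}}
  \approx -3.919,
\qquad \text{with } n - 1 = 5 \text{ degrees of freedom}.
```

Because $t_{5,\,0.01} = 3.365$ and $t_{5,\,0.005} = 4.032$, the left-tail area satisfies $0.005 < P < 0.01$. This is strong evidence that the mean pH is less than 7.0.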

More Examples
2. A certain manufactured product is supposed to contain 23% potassium by weight. A
sample of 10 specimens of this product had an average percentage of 23.2 with a
standard deviation of 0.2. If the mean percentage is found to differ from 23, the
manufacturing process will be recalibrated.
a. State the appropriate null and alternate hypotheses.
b. Compute the P-value.
c. Should the process be recalibrated? Explain.
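
One possible working for this exercise, given as a sketch under the assumption that the population is approximately normal. The hypotheses are $H_0: \mu = 23$ versus $H_1: \mu \ne 23$, since any difference from 23 triggers recalibration.

```latex
t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
  = \frac{23.2 - 23}{0.2/\sqrt{10}}
  \approx 3.162,
\qquad \text{with } n - 1 = 9 \text{ degrees of freedom}.
```

From a t table, $t_{9,\,0.01} = 2.821$ and $t_{9,\,0.005} = 3.250$, so each tail area lies between 0.005 and 0.01 and the two-tailed P-value satisfies $0.01 < P < 0.02$. With a P-value this small, recalibrating the process would be reasonable.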

MCQS

You sample 12 items from a population with unknown standard deviation. Which test should
you use to test the population mean?
a) Z-test
b) T-test for mean (small sample, σ unknown)
c) Chi-square test
d) F-test
Answer b)

References

• “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition, 2015.

Distribution Free Tests.

Nonparametric methods:
▪ Rank-based methods are used when nothing is known about the population distribution from which the data are sampled.
▪ They are used for small sample sizes.
▪ They are used when the data are measured on an ordinal scale and only their ranks are meaningful.
▪ They are used when the data are skewed or heavy-tailed, or when outliers are detected.
▪ They are used when variances between groups are not equal and a transformation is either not possible or undesirable.

Non-Parametric Test Procedure

• Does not involve population parameters
  - Example: probability distributions, independence
• Data may be measured on any scale
  - Ratio or interval
  - Ordinal
    ▪ Example: good-better-best
  - Nominal
    ▪ Example: male-female
• Example: Wilcoxon rank-sum test

Nonnormal Distributions - t-Statistic Is Invalid

Distribution Free Tests

• The samples are not required to come from any specific distribution.

• While distribution-free tests do require assumptions for their validity, these assumptions are somewhat less restrictive than the assumptions needed for the t test.

• Distribution-free tests are sometimes called nonparametric tests.

• We discuss two distribution-free tests in this section. The first, called the Wilcoxon signed-rank test, is a test for a population mean. The second is the Wilcoxon rank-sum test, also called the Mann–Whitney test.

Advantages and Disadvantages of Distribution-Free Tests

Advantages:
• Used with all scales
• Easier to compute
  - Developed originally before wide computer use
• Make fewer assumptions
• Need not involve population parameters
• Results may be as exact as parametric procedures

Disadvantages:
• May waste information
  - If data permit using parametric procedures
  - Example: converting data from ratio to ordinal scale
• Difficult to compute by hand for large samples
• Tables not widely available

Distribution Free Tests: Wilcoxon Signed-Rank Test

The nickel content (in parts per thousand by weight) is measured for six welds, giving the following results:
9.3, 0.9, 9.0, 21.7, 11.5, and 13.9.
Let μ represent the mean nickel content for this type of weld.
It is desired to test whether the average nickel content is less than 12.

The Wilcoxon Signed-Rank Test:

• To compute the signed-rank statistic, we begin by subtracting 12 from each sample observation to obtain differences. The difference closest to 0, ignoring sign, is assigned a rank of 1.

• The difference next closest to 0, again ignoring sign, is assigned a rank of 2, and so on.

• Finally, the ranks corresponding to negative differences are given negative signs. The following table shows the results.

• Denote the sum of the positive ranks S+ and the sum of the absolute values of the negative ranks S−.

• Either S+ or S− may be used as a test statistic; we shall use S+.

The Wilcoxon Signed-Rank Test:

x       x − 12    Signed rank
11.5     −0.5     −1
13.9      1.9      2
 9.3     −2.7     −3
 9.0     −3.0     −4
21.7      9.7      5
 0.9    −11.1     −6

In this example, S+ = 2 + 5 = 7 and S− = 1 + 3 + 4 + 6 = 14.

Note that since the sample size is 6,
(S+) + (S−) = 1 + 2 + 3 + 4 + 5 + 6 = 21.

For any sample, it is the case that
(S+) + (S−) = 1 + 2 + ··· + n = n(n + 1)/2.

In some cases, where there are many more positive ranks than negative ranks, it is easiest to first compute S− by summing the negative ranks and then compute
S+ = n(n + 1)/2 − (S−).
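
The ranking procedure can be sketched in a few lines of Python. The function name `signed_rank_sums` is hypothetical, and the sketch assumes there are no tied absolute differences; observations equal to μ0 are dropped before ranking.

```python
def signed_rank_sums(data, mu0):
    """Return (S+, S-) for the Wilcoxon signed-rank test.

    Assumes no two differences share the same absolute value.
    Observations exactly equal to mu0 are not ranked.
    """
    diffs = [x - mu0 for x in data if x != mu0]
    # Indices of the differences, ordered from closest to 0 to farthest.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    s_plus = 0
    s_minus = 0
    for rank, i in enumerate(order, start=1):
        if diffs[i] > 0:
            s_plus += rank
        else:
            s_minus += rank
    return s_plus, s_minus


nickel = [9.3, 0.9, 9.0, 21.7, 11.5, 13.9]
s_plus, s_minus = signed_rank_sums(nickel, 12)
n = len(nickel)
# The two sums always add up to n(n + 1)/2.
print(s_plus, s_minus, s_plus + s_minus == n * (n + 1) // 2)
```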

How S+ can be used as a test statistic:

In Figure 6.17, μ > 12. For this distribution, positive differences are more probable than negative differences and tend to be larger in magnitude as well. Therefore it is likely that the positive ranks will be greater both in number and in magnitude than the negative ranks, so S+ is likely to be large.

In Figure 6.18, μ < 12, and the situation is reversed. Positive ranks are likely to be fewer in number and smaller in magnitude, so S+ is likely to be small.

In general, large values of S+ provide evidence against a null hypothesis of the form H0: μ ≤ μ0, while small values of S+ provide evidence against a null hypothesis of the form H0: μ ≥ μ0.

Ties:
• Sometimes two or more of the quantities to be ranked have exactly the same
value. Such quantities are said to be tied. The standard method for dealing with
ties is to assign to each tied observation the average of the ranks they would
have received if they had differed slightly.

• For example, the quantities 3, 4, 4, 5, 7 would receive the ranks 1, 2.5, 2.5, 4, 5

• The quantities 12, 15, 16, 16, 16, 20 would receive the ranks 1, 2, 4, 4, 4, 6.
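
The tie rule can be sketched as a small function (the name `average_ranks` is an assumption for illustration). Each group of equal values receives the mean of the rank positions the group occupies in sorted order.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values
    the average of the ranks they would otherwise occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend 'end' over the run of values tied with values[order[start]].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start..end are 0-based; ranks are 1-based.
        shared_rank = ((start + 1) + (end + 1)) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


print(average_ranks([3, 4, 4, 5, 7]))
print(average_ranks([12, 15, 16, 16, 16, 20]))
```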

The nickel content for six welds was measured to be 9.3, 0.9, 9.0, 21.7, 11.5, and 13.9. Use these data to test H0: μ ≤ 5 versus H1: μ > 5.

Solution

The observed value of the test statistic is S+ = 19.

Since the null hypothesis is of the form μ ≤ μ0, large values of S+ provide evidence against H0. Therefore the P-value is the area in the right-hand tail of the null distribution, corresponding to values greater than or equal to 19.

Consulting Table A.5 shows that the P-value is 0.0469.



The nickel content for six welds was measured to be 9.3, 0.9, 9.0, 21.7, 11.5, and 13.9. Use these data to test H0: μ = 16 versus H1: μ ≠ 16.

Solution

• Since the null hypothesis is of the form H0: μ = μ0, this is a two-tailed test.
• The observed value of the test statistic is S+ = 3.
• Consulting Table A.5, we find that the area in the left-hand tail, corresponding to values less than or equal to 3, is 0.0781.
• The P-value is twice this amount, since it is the sum of the areas in two equal tails.
• Thus the P-value is 2(0.0781) = 0.1562.

Use the data in the previous example to test H0: μ = 9 versus H1: μ ≠ 9.

Solution

The value of the test statistic is S+ = 11.

The sample size for the purposes of the test is 5, since the value 9.0 is not ranked (its difference from 9 is zero).
Entering Table A.5 with sample size 5, we find that if S+ = 12, the P-value would be 2(0.1562) = 0.3124.
We conclude that for S+ = 11, P > 0.3124.
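
For small n, the table values quoted in the three nickel examples can be reproduced by brute force. Under H0 each rank 1, ..., n is equally likely to carry a positive or a negative sign, so every subset of ranks is an equally likely set of positive ranks. The sketch below enumerates all 2^n subsets; the function name `s_plus_tail` is an assumption, and the approach is only practical for small samples.

```python
from itertools import combinations


def s_plus_tail(n, s, tail):
    """Exact tail probability of S+ under H0 for sample size n.

    tail == "upper" gives P(S+ >= s); tail == "lower" gives P(S+ <= s).
    """
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    ranks = range(1, n + 1)
    count = 0
    for k in range(n + 1):
        for positive_ranks in combinations(ranks, k):
            total = sum(positive_ranks)
            if (tail == "upper" and total >= s) or (tail == "lower" and total <= s):
                count += 1
    return count / 2 ** n


print(s_plus_tail(6, 19, "upper"))   # right-hand tail used for H1: mu > 5
print(s_plus_tail(6, 3, "lower"))    # left-hand tail used for H1: mu != 16
print(s_plus_tail(5, 12, "upper"))   # bound used for the test of mu = 9
```

These evaluate to 3/64, 5/64, and 5/32, which round to the values 0.0469, 0.0781, and 0.1562 quoted above.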

Large-Sample Approximation

When the sample size n is large, the test statistic S+ is approximately normally distributed. A rule of thumb is that the normal approximation is good if n > 20.
It can be shown by advanced methods that under H0, S+ has

mean = n(n + 1)/4
and variance = n(n + 1)(2n + 1)/24.

The Wilcoxon signed-rank test is performed by computing the z-score of S+, and then using the normal table to find the P-value.
The z-score is
$z = \frac{S_+ - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$
Mathematics for Computer Science Engineers
Distribution Free Tests

The article “Exact Evaluation of Batch-Ordering Inventory Policies in Two-Echelon Supply Chains with Periodic Review” (G. Cachon, Operations Research, 2001: 79–98) presents an evaluation of a reorder point policy, which is a rule for determining when to restock an inventory. Costs for 32 scenarios are estimated. Let μ represent the mean cost. Test H0: μ ≥ 70 versus H1: μ < 70. The data are presented in the table below.
Mathematics for Computer Science Engineers
Distribution Free Tests

Given data

X        X        X
79.26    22.39    10.08
80.79    118.39   7.28
82.07    118.46   6.87
82.14    20.32    6.23
57.19    16.69    4.57
55.86    16.50    4.09
42.08    15.95    140.09
41.78    15.16    140.77
100.01   14.22
100.36   11.64
30.46    11.48
30.27    11.28
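A minimal sketch of the large-sample calculation for these 32 costs. It assumes Python's standard-library `statistics.NormalDist` for the normal CDF and that there are no ties among the values of |x − 70|; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

costs = [
    79.26, 80.79, 82.07, 82.14, 57.19, 55.86, 42.08, 41.78,
    100.01, 100.36, 30.46, 30.27, 22.39, 118.39, 118.46, 20.32,
    16.69, 16.50, 15.95, 15.16, 14.22, 11.64, 11.48, 11.28,
    10.08, 7.28, 6.87, 6.23, 4.57, 4.09, 140.09, 140.77,
]
mu0 = 70

# Rank the nonzero differences by absolute value (no ties in these data).
diffs = [x - mu0 for x in costs if x != mu0]
ordered = sorted(diffs, key=abs)
n = len(ordered)
s_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)

# Null mean and variance of S+, then the z-score.
mean = n * (n + 1) / 4
variance = n * (n + 1) * (2 * n + 1) / 24
z = (s_plus - mean) / sqrt(variance)

# H1 is mu < 70, so small values of S+ are evidence against H0: left tail.
p_value = NormalDist().cdf(z)

print(n, s_plus, mean, variance)
print(round(z, 2), round(p_value, 4))
```

A small P-value here would count as evidence against H0: μ ≥ 70.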
Mathematics for Computer Science Engineers
MCQS
Which of the following is a key assumption of the Wilcoxon Signed-Rank
Test?
a) Data must be normally distributed
b) Data must be paired and at least ordinal
c) Samples must be independent
d) Population variance must be known
Answer b)

In the Wilcoxon Signed-Rank Test, what do you do with zero differences (ties)?
a) Count them as positive differences
b) Count them as negative differences
c) Exclude them from ranking
d) Add them to the smallest rank
Answer c)
Mathematics for Computer Science Engineers
References

• “Statistics for Engineers and Scientists”, William Navidi, McGraw Hill Education, India, 4th Edition,
2015.
Dr. Deepa Nair
Department of Science and Humanities
PES University, Bangalore
Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:


• The Wilcoxon Rank–Sum Test (also known as the Mann–Whitney U Test) is a
non-parametric test.
• Compares two independent samples to check if they come from the same
population or if one tends to have larger values than the other.
• Two assumptions are necessary.
• First, the populations must be continuous.
• Second, their probability density functions must be identical in shape and size; the only possible difference between them is their location.
Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:

• Let 𝑋1, ..., 𝑋𝑚 be a random sample from one population and let 𝑌1, ..., 𝑌𝑛 be a random sample from the other.

• We adopt the notational convention that when the sample sizes are unequal, the smaller sample will be denoted 𝑋1, ..., 𝑋𝑚.

• Thus the sample sizes are 𝑚 and 𝑛, with 𝑚 ≤ 𝑛.

• Denote the population means by 𝜇𝑋 and 𝜇𝑌, respectively.


Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:

• The test is performed by ordering the 𝑚 + 𝑛 values obtained by combining the two samples, and assigning ranks 1, 2, ..., 𝑚 + 𝑛 to them.

• The test statistic, denoted by 𝑊, is the sum of the ranks corresponding to 𝑋1, ..., 𝑋𝑚.
Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:

• Since the populations are identical with the possible exception of location,
it follows that if 𝜇𝑋 < 𝜇𝑌 , the values in the 𝑋 sample will tend to be
smaller than those in the 𝑌 sample.

• So the rank sum W will tend to be smaller as well.

• By similar reasoning, if 𝜇𝑋 > 𝜇𝑌 , 𝑊 will tend to be larger.


Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:


Example:

• Resistances, in mΩ, are measured for five wires of one type and six wires of another type. The results are as follows:
X: 36 28 29 20 38
Y: 34 41 35 47 49 46

• Use the Wilcoxon rank-sum test to test H0: μX ≥ μY versus H1: μX < μY.
Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:


Solution:

We order the 11 values and assign the ranks.

Value   Rank   Sample
20      1      X
28      2      X
29      3      X
34      4      Y
35      5      Y
36      6      X
38      7      X
41      8      Y
46      9      Y
47      10     Y
49      11     Y
Mathematics for Computer Science Engineers
Distribution Free Tests

The Wilcoxon Rank-Sum Test:


Solution:

𝑊 = 1 + 2 + 3 + 6 + 7 = 19.

• To determine the P-value, we consult Table A.6 (in Appendix A).

• We note that small values of 𝑊 provide evidence against 𝐻0: 𝜇𝑋 ≥ 𝜇𝑌, so the P-value is the area in the left-hand tail of the null distribution.

• Entering the table with 𝑚 = 5 and 𝑛 = 6, we find that the area to the left of 𝑊 = 19 is 0.0260. This is the P-value.
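The table value can be checked by enumeration. The sketch below is illustrative and assumes there are no tied observations. Under H0 every set of m ranks chosen from 1, ..., m + n is equally likely to belong to the X sample, so the left-tail area is the fraction of such sets whose rank sum is at most the observed W.

```python
from itertools import combinations

x = [36, 28, 29, 20, 38]          # smaller sample (m = 5)
y = [34, 41, 35, 47, 49, 46]      # larger sample (n = 6)

# Rank the combined sample; rank 1 goes to the smallest value (no ties here).
combined = sorted(x + y)
rank_of = {value: rank for rank, value in enumerate(combined, start=1)}
w = sum(rank_of[value] for value in x)

# Null distribution of W: all ways of assigning m of the m + n ranks to X.
m, n = len(x), len(y)
rank_sums = [sum(c) for c in combinations(range(1, m + n + 1), m)]
p_left = sum(1 for s in rank_sums if s <= w) / len(rank_sums)

print(w, len(rank_sums), round(p_left, 4))
```

The number of rank sets grows quickly with m and n, which is why tables or the normal approximation are used for larger samples.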
Mathematics for Computer Science Engineers
Distribution Free Tests

Large-Sample Approximation:

• When both sample sizes 𝑚 and 𝑛 are greater than 8, it can be shown by advanced methods that the null distribution of the test statistic 𝑊 is approximately normal with mean 𝑚(𝑚 + 𝑛 + 1)/2 and variance 𝑚𝑛(𝑚 + 𝑛 + 1)/12.

• The z-score is
\[ z = \frac{W - m(m+n+1)/2}{\sqrt{mn(m+n+1)/12}} \]
Mathematics for Computer Science Engineers
Distribution Free Tests

The article “Cost Analysis Between SABER and Design Bid Build Contracting Methods”
(E. Henry and H. Brothers, Journal of Construction Engineering and Management,
2001:359–366) presents data on construction costs for 10 jobs bid by the traditional
method (denoted X) and 19 jobs bid by an experimental system (denoted Y). The data, in units of dollars per square meter, and their ranks, are presented in the table. Test H0: μX ≤ μY versus H1: μX > μY.
Mathematics for Computer Science Engineers
MCQS

The Wilcoxon Rank-Sum Test is used to:


a) Compare means of two paired samples
b) Compare medians of two independent samples when data are not normally
distributed.
c) Compare variances of two samples
d) Test correlation between two variables
Answer b)

How are ranks assigned in the Wilcoxon Rank-Sum Test?


a) Rank values within each group separately
b) Rank values after combining both groups
c) Only rank the larger group
d) Only rank the smaller group
Answer b)
UNIT-3 HYPOTHESIS and INFERENCE


Chi-squared Test

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

• The chi-square test is used to test the independence of two categorical variables, i.e., whether one variable has any effect on another.
• Example: Does a person's gender determine which chocolate they like?

• It is also used to measure goodness of fit, i.e., whether the observed and expected values match.

• For example, suppose your model predicts that women win the lottery 0.05 more often than men. Is that really the case? Do the real data match your prediction?

• Chi-square is used when the data are categorical, i.e., can be classified into groups (yes/no, red/blue/green).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

Tender for collecting toll for a newly opened bridge

Day   Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
No.   50       20        90          130        200      170        220

Day   Monday   Tuesday   Wednesday   Thursday   Friday   Saturday   Sunday
No.   50       50        100         130        200      150        200

You want to know whether the day of the week affects the number of people sending tenders for the bridge.
You accordingly set up your H0 and H1.
Example: H0: the probability is the same for all days.
If H0 is rejected, then the day affects the number of tenders; otherwise it does not.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test for best fit of model
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

Example:
Consider a die that is thrown 600 times with the following results.

Number turned up   1     2    3    4     5     6
Frequency          115   97   91   101   110   86

Is the die unbiased at the 10% significance level?


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

Example:
Expected value = 600/6 = 100 (an unbiased die means all outcomes are equally likely)
Hypotheses
H0: Die is unbiased
H1: Die is biased

Category   Observed   Expected
1          115        100
2          97         100
3          91         100
4          101        100
5          110        100
6          86         100
Total      600        600
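A short sketch of the goodness-of-fit calculation for this table, using the usual statistic, the sum of (observed − expected)²/expected over the categories. The critical value 9.236 is the upper 10% point of the chi-square distribution with 5 degrees of freedom, taken from a standard chi-square table; hard-coding it is an assumption made here to keep the example dependency-free.

```python
observed = [115, 97, 91, 101, 110, 86]

# Under H0 (unbiased die) each face is expected 600/6 = 100 times.
total = sum(observed)
expected = [total / len(observed)] * len(observed)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Upper 10% point of chi-square with 5 df, from a standard table (assumed constant).
critical_10pct = 9.236
reject_h0 = chi2 > critical_10pct

print(round(chi2, 2), df, reject_h0)
```

If the statistic falls below the critical value, the data give no evidence at the 10% level that the die is biased.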
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

Chi-squared Test
The Chi-Square test is used to:
a) Compare means of two groups
b) Compare medians of two groups
c) Test the association between categorical variables or goodness-of-fit
d) Test correlation between two quantitative variables
Answer c)

In a Chi-Square test, what happens to the test statistic if the difference between observed
and expected frequencies increases?
a) It decreases
b) Remains the same
c) It increases
d) Becomes negative
Answer c)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

The Chi-Square Test for Independence:

• In some cases, both row and column totals are random. In either case,
we can test the null hypothesis that the probabilities of the column
outcomes are the same for each row outcome, and the test is exactly
the same in both cases.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

• Then calculate the overall chi-square statistic by applying the formula over all cells (observed and expected).

• Then calculate the degrees of freedom using the formula
df = (r − 1) × (c − 1)
where r is the number of rows and c is the number of columns.
• Then look up the critical χ² value and compare it with the calculated value to decide whether or not to reject the null hypothesis.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

The Chi-Square Test for Homogeneity:

Example:

Pins manufactured by four machines are subject to a diameter specification. A pin may meet the specification, or it may be too thin or too thick. Pins are sampled from each machine, and the number of pins in each category is counted. The table below presents the results. Use the data to test the null hypothesis that the proportions of pins that are too thin, OK, or too thick are the same for all the machines.

            Too thin   OK    Too thick
Machine 1   10         102   8
Machine 2   34         161   5
Machine 3   12         79    9
Machine 4   10         60    10
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

Null Hypothesis (H₀): The proportions of pins that are too thin, OK, or
too thick are the same across all machines.
Alternative Hypothesis (H₁): The proportions of pins that are too thin,
OK, or too thick differ for at least one machine.

Expected Value Formula

For each cell in the table, the expected value is calculated as:
E = (Row total) × (Column total) / (Grand total)

Note: Calculate the expected value for each cell, then apply the chi-square formula.
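The expected-value formula and the chi-square sum can be sketched for the four-machine table as follows. This is an illustration, not the slides' own computation. The helper `chi2_upper_tail_even_df` is a hypothetical name; it uses the closed form for the chi-square upper tail that holds only when the degrees of freedom are even, which happens to be the case here (df = 6).

```python
from math import exp, factorial

observed = [
    [10, 102, 8],    # Machine 1: too thin, OK, too thick
    [34, 161, 5],    # Machine 2
    [12, 79, 9],     # Machine 3
    [10, 60, 10],    # Machine 4
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E = (row total) * (column total) / (grand total) for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

def chi2_upper_tail_even_df(x, dof):
    """P(chi-square > x) for even dof, via exp(-x/2) * sum_{j<dof/2} (x/2)^j / j!."""
    half = x / 2
    return exp(-half) * sum(half ** j / factorial(j) for j in range(dof // 2))

p_value = chi2_upper_tail_even_df(chi2, df)

print([round(e, 2) for e in expected[0]])
print(round(chi2, 4), df, round(p_value, 4))
```

For odd degrees of freedom the closed form above does not apply, and a table or a statistics library would be needed instead.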
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test
The Chi-Square Test for Independence:
Example:
The cylindrical steel pins in Example 6.21 are subject to a length specification as well as a diameter
specification. With respect to the length, a pin may meet the specification, or it may be too short or too
long. A total of 1021 pins are sampled and categorized with respect to both length and diameter
specification. The results are presented in the following table. Test the null hypothesis that the
proportions of pins that are too thin, OK, or too thick with respect to the diameter specification do not
depend on the classification with respect to the length specification.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Chi-squared Test

Example:
Formulate the Hypotheses:
• Null Hypothesis (H₀): The proportions of pins that are too thin, OK, or too thick with respect to the
diameter specification do not depend on the length specification.
• Alternative Hypothesis (H₁): The proportions of pins that are too thin, OK, or too thick with respect to
the diameter specification depend on the length specification.
Calculate the expected values
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Chi-squared Test-Practice Problems (independence test)

• At an assembly plant for light trucks, routine monitoring of the quality of welds yields the following data:
Can you conclude that the quality varies among shifts?
a. State the appropriate null hypothesis.
b. Compute the expected values under the null hypothesis.
c. Compute the value of the chi-square statistic.
d. Find the P-value. What do you conclude?
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Chi-squared Test-Practice Problems (homogeneity test)

• In a survey of adults with diabetes, each respondent was categorized by gender and income level.

Can you conclude that the proportions in the various income categories differ between men and women?
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Mcqs

The Chi-Square Test of Independence is used to:


a) Compare means of two groups
b) Test whether two categorical variables are associated
c) Compare medians of two groups
d) Test whether variances are equal
Answer b)

The Chi-Square Test of Homogeneity is used to:

a) Compare means of independent samples
b) Test whether the distribution of a categorical variable is the same across different populations
c) Test the relationship between two quantitative variables
d) Compare paired observations
Answer b)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

Fixed Level Testing


Type I and Type II Errors

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing
Why we need fixed-level hypothesis testing
• Controls the probability of a Type I error
A Type I error occurs when we wrongly reject a true null hypothesis.
By fixing a level (say, α = 0.05), we cap the risk of making this false rejection.
This makes hypothesis testing reliable and fair: everyone knows what risk of a false alarm we are willing to tolerate.
• Ensures consistent decision-making
Without a fixed α, two people analyzing the same data could draw different conclusions.
Setting α (e.g., 0.01, 0.05, or 0.10) provides a standard threshold for decision-making.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing

Balances evidence and risk

• A smaller α (like 0.01) demands stronger evidence to reject 𝐻0, reducing false positives but increasing false negatives (Type II errors).
• A larger α (like 0.10) allows more flexibility but increases the risk of false positives. So, α defines how cautious or liberal we are in making decisions.

Supports comparison across studies

When multiple studies use the same significance level (usually α = 0.05), their results can be compared meaningfully, which is essential in research, meta-analysis, and policy-making.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing

Type I error (False Positive): Healthy person diagnosed as sick (false alarm).
The test says the person has COVID-19, but in reality, they don’t.

Type II error (False Negative): Sick person diagnosed as healthy (missed detection).
The test says the person does NOT have COVID-19, but they actually do.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing

Example:
• A new concrete mix is being evaluated. 100 concrete blocks made with the new mix are sampled, and X̄ denotes the sample mean compressive strength. The following hypotheses are tested:
• H0: μ ≤ 1350 MPa
• H1: μ > 1350 MPa
• The population standard deviation is 70 MPa. Find the critical point and rejection region if the significance level of the test is 5%.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing

Example:
In a hypothesis test to determine whether a scale is in calibration, the
null hypothesis is
H0 :μ = 1000
and the null distribution of X̄ is N(1000, 0.26²).
Find the rejection region if the test will be conducted at a significance
level of 5%.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Fixed Level Testing

Here 0.26² is the variance of the sample mean, s²/n, so the standard deviation of X̄ is 0.26.
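The critical points for the concrete-mix and scale-calibration examples can be sketched with the standard-library `statistics.NormalDist`. Two assumptions are made here: the concrete test is upper-tailed, as its H1: μ > 1350 indicates, and the scale test is two-tailed, since its H0 is μ = 1000 and no one-sided alternative is stated. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()   # standard normal, mean 0 and sd 1

# Concrete mix: H0: mu <= 1350, sigma = 70, n = 100, alpha = 0.05 (upper-tailed).
z_upper = std_normal.inv_cdf(1 - 0.05)
concrete_critical = 1350 + z_upper * 70 / sqrt(100)
# Rejection region: X-bar >= concrete_critical

# Scale calibration: H0: mu = 1000, sd of X-bar = 0.26, alpha = 0.05 (two-tailed).
z_two_sided = std_normal.inv_cdf(1 - 0.05 / 2)
scale_lower = 1000 - z_two_sided * 0.26
scale_upper = 1000 + z_two_sided * 0.26
# Rejection region: X-bar <= scale_lower or X-bar >= scale_upper

print(round(concrete_critical, 2))
print(round(scale_lower, 2), round(scale_upper, 2))
```

Splitting α equally between the two tails is the conventional choice for a two-tailed test; other splits are possible but rarely used.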
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Type I and Type II Errors

Statistical Errors

                              The truth
Your research     H0 true                       H0 false
Accept H0         Correct decision (1 − α)      Type II error (β)
Reject H0         Type I error (α)              Correct decision (1 − β)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Type I and Type II Errors

When conducting a fixed-level test at significance level α, there are two types of errors that can be made. These are

• Type I error: Reject H0 when it is true/plausible.

• Type II error: Fail to reject H0 when it is false.

The probability of a type I error is never greater than α.



MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
Effect of bio-fertilizer ‘x’ on plant growth

If the power is 0.80 (or 80%), there is an 80% chance that the study will reject the null hypothesis when it is false.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
Why is Power Important?
• The power of a test is important because it measures the probability of correctly
rejecting a false null hypothesis, meaning it's the test's ability to detect a real effect
when one exists.

• A high-powered test reduces the risk of a Type II error (a false negative), preventing
researchers from mistakenly concluding there is no effect when one actually exists.

• This ensures that research is more likely to find significant results that are actually
there, avoiding wasted resources and the failure to identify important findings.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
How large must the power of a test be?

• As with P-values, there is no scientifically valid dividing line between sufficient and insufficient power.

• In general, tests with power greater than 0.80 or perhaps 0.90 are considered acceptable, but there are no well-established rules of thumb.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
Analysis of power is performed:

1) Before gathering data
To determine the minimal sample size needed to have the desired power in statistical testing (to detect a particular effect size).

2) After gathering data
To determine the magnitude of power that your statistical test will have, given the sample parameters (n and s) and the magnitude of the effect that you want to detect.

Note: Statistical power has relevance only when the null is false.
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
POWER OF TEST

SIVASANKARI V
Department of Science & Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

Power of test (Computation)
Factors Affecting the Power of the Test

MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test

Computing the power involves two steps:

1. Compute the rejection region.

2. Compute the probability that the test statistic falls in the rejection region
if the alternate hypothesis is true.
This is the power.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example of a power calculation
Assume that a new chemical process has been developed that may increase the
yield over that of the current process. The current process is known to have a
mean yield of 80 and a standard deviation of 5, where the units are the
percentage of a theoretical maximum. If the mean yield of the new process is
shown to be greater than 80, the new process will be put into production.

Let μ denote the mean yield of the new process. It is proposed to run the new
process 50 times and then to test the hypothesis
𝐻0 : μ ≤ 80 versus 𝐻1 : μ > 80 at a significance level of 5%.
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Calculation of Power

Problem 1 :
Find the power of the 5% level test of
𝑯𝟎 : μ ≤ 80 versus 𝑯𝟏 : μ > 80
for the mean yield of the new process under the
alternative μ = 81, assuming n = 50 and σ = 5.
Solution:
Null distribution of X̄:
\[ \bar{X} \sim N(\mu, \sigma_{\bar{X}}^2), \quad \text{where } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{50}} = 0.707 \]
so under H0, X̄ ~ N(80, 0.707²).
The critical point has a z-score of 1.645, so its value is
X̄ = 80 + (1.645)(0.707) = 81.16.
The rejection region consists of all values of X̄ ≥ 81.16.

Alternate distribution of X̄:
X̄ ~ N(81, 0.707²)
(The alternate distribution is obtained by shifting the null distribution to the chosen value of μ.)
Power is the probability that X̄ will fall into the rejection region if the alternate hypothesis μ = 81 is true.

Computing the power
The z-score under H1 for the critical point 81.16 is
\[ z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{81.16 - 81}{0.707} = 0.23 \]
The area to the right of z = 0.23 is 0.4090.
This is the power of the test.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
Conclusion:
A power of 0.4090 is very low.
➢ It means that if the mean yield of the new process is actually equal to 81, there is only a 41% chance that the proposed experiment will detect the improvement over the old process and allow the new process to be put into production.
➢ It would be unwise to invest time and money to run this experiment, since it has a large chance of failing.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Calculation of Power

Problem 2 :
Find the power of the 5% level test of
𝑯𝟎 : μ ≤ 80 versus 𝑯𝟏 : μ > 80
for the mean yield of the new process under the alternative μ = 82, assuming
n = 50 and σ = 5.
Solution:
Null distribution of X̄:
\[ \bar{X} \sim N(\mu, \sigma_{\bar{X}}^2), \quad \text{where } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{50}} = 0.707 \]
so under H0, X̄ ~ N(80, 0.707²).
The critical point has a z-score of 1.645, so its value is
X̄ = 80 + (1.645)(0.707) = 81.16.
The rejection region consists of all values of X̄ ≥ 81.16.

Alternate distribution of X̄:
X̄ ~ N(82, 0.707²)
(The alternate distribution is obtained by shifting the null distribution to the chosen value of μ.)
Power is the probability that X̄ will fall into the rejection region if the alternate hypothesis μ = 82 is true.

Computing the power
The z-score under H1 for the critical point 81.16 is
\[ z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} = \frac{81.16 - 82}{0.707} = -1.19 \]
The area to the right of z = −1.19 is 0.8830.
This is the power of the test.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
Conclusion:
A power of 0.8830 is high.
➢ It means that if the mean yield of the new process is actually equal to 82, there is an 88.30% chance that the proposed experiment will detect the improvement over the old process and allow the new process to be put into production.
➢ It would be a wise decision to invest time and money to run this experiment, since it has a large chance of succeeding.
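The two power calculations for μ = 81 and μ = 82 follow the same two steps, so they can be sketched with one helper. The function name `power_upper_tailed` is hypothetical, and the sketch assumes an upper-tailed z test with known σ, using the standard-library `statistics.NormalDist`. It works with unrounded quantiles, so its results differ slightly in the third decimal place from the table-based values 0.4090 and 0.8830.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tailed(mu0, mu_alt, sigma, n, alpha):
    """Power of the level-alpha test of H0: mu <= mu0 vs H1: mu > mu0 at mu = mu_alt."""
    se = sigma / sqrt(n)
    # Step 1: rejection region X-bar >= critical, from the null distribution.
    critical = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # Step 2: probability of landing in that region under the alternate distribution.
    return 1 - NormalDist(mu_alt, se).cdf(critical)

power_81 = power_upper_tailed(mu0=80, mu_alt=81, sigma=5, n=50, alpha=0.05)
power_82 = power_upper_tailed(mu0=80, mu_alt=82, sigma=5, n=50, alpha=0.05)

print(round(power_81, 3), round(power_82, 3))
```

Calling the helper with a larger `n` or a larger `mu_alt` shows directly how the power responds to sample size and to the distance of the alternative from 80.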
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Computing Power

In order to compute the power, it is necessary to specify a particular value of μ, because the power is different for different values of μ:

❑ If μ is close to H0, the power will be small.

❑ If μ is far from H0, the power will be large.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test

• When power is not large enough, it can be increased by increasing the sample size.

• When planning an experiment, one can determine the sample size necessary to achieve a desired power.

• Knowing the significance level and the required power allows a researcher to determine the minimum sample size needed for the study.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test

Problem 3 :
In testing the hypothesis 𝑯𝟎 : μ ≤ 80 versus 𝑯𝟏 : μ > 80
regarding the mean yield of the new process, how many times must the new
process be run so that a test conducted at a significance level of 5% will have
power 0.90 against the alternative μ = 81, if it is assumed that σ = 5?

Solution:
Let n represent the necessary sample size.

Null distribution of X̄:
\[ \bar{X} \sim N(\mu, \sigma_{\bar{X}}^2), \quad \text{where } \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \]
Critical point:
\[ 80 + 1.645\,\frac{5}{\sqrt{n}} \]
Now consider the alternate distribution of X̄.
The required power is 0.90. The power of the test is the area of the rejection region under the alternate curve, so this area must be 0.90.
Therefore, the z-score of the critical point under the alternate distribution is −1.28.
Critical point:
\[ 81 - 1.28\,\frac{5}{\sqrt{n}} \]
We now have two different expressions for the critical point. Since there is only one critical point, these two expressions are equal.

Set them equal and solve for n:
\[ 80 + 1.645\,\frac{5}{\sqrt{n}} = 81 - 1.28\,\frac{5}{\sqrt{n}} \quad\Longrightarrow\quad n \approx 214 \]

The critical point is 80.56 (it can be computed by substituting this value of n into either side of the equation).
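Solving the equation above for n gives n = ((z_α + z_power) σ / (μ_A − μ_0))², which can be sketched as follows. The names are illustrative. The table-based line uses the rounded z-values 1.645 and 1.28 from the solution, while the other line uses unrounded quantiles from `statistics.NormalDist` and comes out slightly above 214, so a strict round-up would give 215.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_alt, sigma = 80, 81, 5
alpha, target_power = 0.05, 0.90

# Sample size from the rounded table values used in the worked solution.
n_table = ((1.645 + 1.28) * sigma / (mu_alt - mu0)) ** 2

# Same formula with unrounded standard normal quantiles.
z_alpha = NormalDist().inv_cdf(1 - alpha)
z_power = NormalDist().inv_cdf(target_power)
n_exact = ((z_alpha + z_power) * sigma / (mu_alt - mu0)) ** 2

# Critical point for n = 214, from the null-distribution side of the equation.
critical = mu0 + 1.645 * sigma / sqrt(214)

print(round(n_table, 2), round(n_exact, 2), round(critical, 2))
```

The formula makes the trade-off visible: halving the difference μ_A − μ_0 that the test must detect multiplies the required sample size by four.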
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test

Statistical power depends on:
• Sample size (n)
• Significance level (𝛼)
• Type of statistical test
• Effect size
• Standard deviation (𝜎)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test – Sample size
Example:
The weights of a random sample of n people have a mean of 168 lb and a standard deviation of 7.2 lb. Can we conclude that the mean of the population is 165 lb?

𝐻0: 𝜇 = 165
𝐻1: 𝜇 ≠ 165

\[ z = \frac{168 - 165}{7.2/\sqrt{n}} = \frac{(168 - 165)\sqrt{n}}{7.2} \]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Sample size

Sample size (n) and power: the larger the sample size, the higher the power.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test – Sample size

The above figure shows that the larger the sample size, the higher
the power. Since sample size is typically under an experimenter's
control, increasing sample size is one way to increase power.
However, it is sometimes difficult and/or expensive to use a large
sample size.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test – Significance level

Source: Houghton Mifflin Company


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Significance level
Significance level (𝛼) and power: the larger the significance level, the higher the power.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Effect Size

Effect size = True mean − Hypothesized mean = 𝝁𝑨 − 𝝁𝟎
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example of a power calculation
Recall the chemical process example: the current process has a mean yield of 80 and a standard deviation of 5, the new process is run 50 times, and 𝐻0: μ ≤ 80 is tested against 𝐻1: μ > 80 at a significance level of 5%.
If μ is close to 𝝁𝟎, the power will be small (when μ = 81, power = 0.4090).
If μ is far from 𝝁𝟎, the power will be large (when μ = 82, power = 0.8830).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Effect Size
Effect size and power: the greater the effect size, the greater the power of the test.

Effect size = True value − Hypothesized value


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test –Standard deviation
Example:
The weights of a random sample of 200 people have a mean of 168 lb. Can we conclude that the mean of the population is 165 lb?

𝐻0: 𝜇 = 165
𝐻1: 𝜇 ≠ 165

\[ z = \frac{168 - 165}{\sigma/\sqrt{200}} = \frac{(168 - 165)\sqrt{200}}{\sigma} \]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Standard deviation

Standard deviation and power: the smaller the standard deviation, the greater the power of the test.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power of test
STANDARD DEVIATION

The figure also shows that power is higher when the standard deviation is small than when it is large. For all values of N, power is higher for the standard deviation of 10 than for the standard deviation of 15 (except, of course, when N = 0). Experimenters can sometimes control the standard deviation by sampling from a homogeneous population of subjects, by reducing random measurement error, and/or by making sure the experimental procedures are applied very consistently.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test – Standard deviation

In each picture, the area under the green curve to the right of the red line is the power of the test against the alternate depicted. Note that this area is larger in the second picture (the one with smaller standard deviation) than in the first picture.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test – Type of Statistical test
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test- Type of Statistical test

Type of statistical test: One-tailed test power > Two-tailed test power

ONE- VERSUS TWO-TAILED TESTS


Power is higher with a one-tailed test than with a
two-tailed test as long as the hypothesized
direction is correct. A one-tailed test at the 0.05
level has the same power as a two-tailed test at the
0.10 level. A one-tailed test, in effect, raises the
significance level.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Factors affecting Statistical Power of test

The greater the sample size, the higher the power.

The larger the significance level, the higher the power.

The greater the effect size, the greater the power of the test.

The smaller the standard deviation, the greater the power of the test.

One-tailed test power > Two-tailed test power (when the hypothesized direction is correct).
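The five factors listed above can be checked numerically with a z-test. This is a minimal sketch, assuming a known standard deviation and a baseline of μ0 = 80, true mean 81, σ = 5, n = 50, α = 0.05 (the values of the earlier yield example); the helper `power` and its parameter names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power(mu_true, mu0, sigma, n, alpha, two_tailed=False):
    """Power of a z-test of H0: mu = mu0 (upper-tailed unless two_tailed is True)."""
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # effect size in standard-error units
    if two_tailed:
        z = Z.inv_cdf(1 - alpha / 2)
        # Rejection happens in either tail
        return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)
    z = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(z - shift)


base = power(81, 80, 5, 50, 0.05)
larger_n = power(81, 80, 5, 100, 0.05)
larger_alpha = power(81, 80, 5, 50, 0.10)
larger_effect = power(82, 80, 5, 50, 0.05)
smaller_sigma = power(81, 80, 3, 50, 0.05)
two_tailed = power(81, 80, 5, 50, 0.05, two_tailed=True)

for name, value in [("baseline", base), ("larger n", larger_n),
                    ("larger alpha", larger_alpha), ("larger effect", larger_effect),
                    ("smaller sigma", smaller_sigma), ("two-tailed", two_tailed)]:
    print(f"{name:14s} {value:.4f}")
```

Each single change (larger n, larger α, larger effect size, smaller σ) raises the power above the baseline, while switching to a two-tailed test at the same α lowers it.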


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
True or False/ Mcqs
1) A test has power 0.90 when μ = 15.
True or false:

a. The probability of rejecting H0 when μ = 15 is 0.90.


b. The probability of making a correct decision when μ =15 is 0.90.
c. The probability of making a correct decision when μ =15 is 0.10.
d. The probability that H0 is true when μ = 15 is 0.10.

2) If the sample size remains the same, and the level α increases, then the power will
______________
a) Increase b) decrease c) remains constant d) both increase and decreases
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
True or False/ Mcqs
1) A test has power 0.90 when μ = 15.
True or false:

a. The probability of rejecting H0 when μ = 15 is 0.90.


b. The probability of making a correct decision when μ =15 is 0.90.
c. The probability of making a correct decision when μ =15 is 0.10.
d. The probability that H0 is true when μ = 15 is 0.10.

(a) True. This is the definition of power.


(b) True. When H0 is false, making a correct decision means
rejecting H0.
(c) False. The power is 0.90, not 0.10.
(d) False. The power is not the probability that H0 is true.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
True or False/ Mcqs

2) If the sample size remains the same, and the level α increases, then the power will
______________
a) Increase b) decrease c) remains constant d) both increase and decreases

Answer a) increase. If the level increases, the probability of rejecting H0 increases, so in


particular, the probability of rejecting H0 when it is false increases.
THANK YOU

SIVASANKARI V
Department of Science & Humanities
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Simple Linear Regression

Dr. Karthiyayini
Department of Science and Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

Correlation, Simple Linear Regression:

Dr. Karthiyayini
Department of Science & Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Introduction

So far we have done statistics on one variable at a time.

We are now interested in relationships between two variables and how to use one variable to predict another variable.
• Does weight depend on height?
• Does blood pressure level predict life expectancy?
• Does screen time correlate with eye strain?
• Is temperature related to electricity consumption (AC usage)?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

❖ Classification of Data

❖ What is Correlation ?

❖ Pearson’s Correlation Coefficient


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Types of Data

Univariate Data
Bivariate Data
Multivariate Data
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Univariate & Bi - Variate Data
The analysis of Univariate data can be done using:
Analytical techniques:
▪ Central tendency measures (mean, median and mode)
▪ Dispersion or spread of data (range, minimum, maximum, quartiles, variance and standard deviation)
▪ Frequency distribution tables
Visualization techniques:
▪ Histograms
▪ Pie Charts
▪ Frequency Polygon
▪ Bar Charts

The analysis of Bivariate data can be done using:
Analytical techniques:
▪ Correlation Coefficient
▪ Regression Analysis
Visualization technique:
▪ Scatter Plot
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Bi - Variate Analysis

❖ Bivariate analysis means the analysis of bivariate data; used


to find out if there is a relationship between two sets of
values.

❖ It usually involves the variables 𝑋 and 𝑌 and is represented


as an ordered pair (𝑋, 𝑌).

❖ 𝑋 represents the independent variable and 𝑌 represents the


dependent variable.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Scatter Plots
❖Scatter Plot is a mathematical diagram that plots pairs
of data on an X-Y graph in order to reveal the
relationship between the data sets.
❖ Scatter plots give you a visual idea of the pattern that
your variables follow.

❖Scatter plots can show you visually the strength of the


relationship between the variables, the direction of the
relationship between the variables and whether any
outliers exist.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example for Physics Lab :
❖ Variation of resistance with change in
temperature of a Semiconductor/ Conductor.

Temperature   Resistance
55            5.00
45            4.94
35            4.84
65            5.11
75            5.19
70            5.14
60            5.09
50            4.99
40            4.92
30            4.82

The resistance decreases with an increase in temperature in a Semiconductor, whereas in a Conductor the resistance increases with an increase in temperature.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation
❖ Does height have an impact on the performance of a player in a basketball match?

❖ Is there a relationship between internet bandwidth and the time taken for data transfer?

❖ Are the height and weight of an individual related?

❖ Does the number of hours of effort have an impact on the CGPA scored?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Brief history of Correlation

❖Sir Francis Galton, (16 February 1822 –


17 January 1911).

❖He was an English Victorian


era statistician and a Fellow of the Royal
Society.

❖Galton produced over 340 papers and


books.

❖In 1892, he published the book “Finger


Prints” and proposed the use of
fingerprints as a means of personal
identification.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Galton’s case study!!

Is there any relation between


the height of an individual and
the length of his forearm???

Sir Francis Galton introduced the concept of ‘Correlation' in 1888 with a


paper discussing how to measure the relationship between two variables.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Galton’s Case Study
❖ The data set that he considered consisted of the heights and forearm lengths of 348 adult men.
(He measured the distance from the elbow to the tip of the middle finger, which is called a cubit.)

❖ Let the height of the 𝑖𝑡ℎ man be = 𝑥𝑖

❖ Let the length of the forearm of


the 𝑖𝑡ℎ man be = 𝑦𝑖

❖ Then Galton’s data consists of


348 ordered pairs (𝑥𝑖 , 𝑦𝑖 )
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation Coefficient
❖ Let $(x_i, y_i)$ = ordered pairs that represent points on a scatter plot.
❖ $\bar{x}$ = mean of the 𝑥 values
❖ $\bar{y}$ = mean of the 𝑦 values
❖ $S_x$ = standard deviation of the 𝑥 values
❖ $S_y$ = standard deviation of the 𝑦 values
❖ The Correlation Coefficient is the average of the product of z-scores:

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{S_x}\right) \left(\dfrac{y_i - \bar{y}}{S_y}\right)$

Pearson’s Correlation Coefficient:

$\Rightarrow r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem :

The Psychological tests of intelligence and of engineering ability


were applied to 10 students. Here is a record of ungrouped data
showing intelligence ratio (I.R) and engineering ratio (E.R).
Calculate the Correlation Coefficient?
Student A B C D E F G H I J

I.R 105 104 102 101 100 99 98 96 93 92

E.R 101 103 100 98 95 96 104 92 97 94


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Solution :

Student   IR (𝑥)   ER (𝑦)   𝑋 = 𝑥 − x̄   𝑌 = 𝑦 − ȳ   𝑋²    𝑌²    𝑋𝑌
1         105      101        6           3         36     9    18
2         104      103        5           5         25    25    25
3         102      100        3           2          9     4     6
4         101       98        2           0          4     0     0
5         100       95        1          −3          1     9    −3
6          99       96        0          −2          0     4     0
7          98      104       −1           6          1    36    −6
8          96       92       −3          −6          9    36    18
9          93       97       −6          −1         36     1     6
10         92       94       −7          −4         49    16    28
Total     990      980        0           0        170   140    92

x̄ = 99, ȳ = 98

$r = \dfrac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}} = \dfrac{92}{\sqrt{170 \times 140}} \approx 0.596$
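The tabular calculation above can be cross-checked in a few lines. This is a minimal sketch using only the standard library; the function name `pearson_r` is illustrative, and the data are the I.R and E.R values from the problem statement.

```python
from math import sqrt

ir = [105, 104, 102, 101, 100, 99, 98, 96, 93, 92]   # intelligence ratio (x)
er = [101, 103, 100, 98, 95, 96, 104, 92, 97, 94]    # engineering ratio (y)


def pearson_r(xs, ys):
    """Pearson correlation: sum(XY) / sqrt(sum(X^2) * sum(Y^2)) on mean-centred data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(ir, er)
print(f"r = {r:.3f}")
```

The intermediate sums inside the function correspond to the 𝑋𝑌, 𝑋² and 𝑌² column totals of the table (92, 170 and 140).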
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Scatter Plot

[Scatter plot: Engineering Ratio (ER) on the vertical axis against Intelligence Ratio (IR) on the horizontal axis]

Correlation coefficient 𝑟 ≈ 0.596



MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Classification of Correlation

Correlation is classified as:
▪ Positive Correlation
▪ Negative Correlation
▪ No Correlation / Poor Correlation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Interpretation of Correlation Coefficient

𝑟 = +1 ⟹ Perfect Positive Correlation
𝑟 = −1 ⟹ Perfect Negative Correlation
𝑟 = 0 ⟹ No Correlation
0 < 𝑟 < 1 ⟹ Positive Correlation
−1 < 𝑟 < 0 ⟹ Negative Correlation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Examples of various levels of correlation
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How the Correlation coefficient works!!
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
More about Correlation Coefficient

Correlation Coefficient

Sample Correlation ‘r’

Population Correlation ′𝜌′

Note that Sample Correlation is not only used to measure the strength of a relationship
but is also used to construct Confidence intervals and perform Hypothesis testing on the
population correlation.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Some more properties of the Correlation Coefficient

The Correlation coefficient remains unaltered in the following


cases :

❖ Interchanging the values of 𝒙 and 𝒚.

❖ Adding a constant to each value of a variable

❖ Multiplying each value of a variable by a positive constant.
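The three invariance properties above can be verified on a small data set. The sketch below reuses the I.R / E.R values from the earlier example; `pearson_r`, the shift of 10 and the scale factor 2.54 are illustrative choices, not values from the slides. The last check illustrates why the constant must be positive: a negative multiplier flips the sign of r.

```python
from math import sqrt, isclose


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [105, 104, 102, 101, 100, 99, 98, 96, 93, 92]
y = [101, 103, 100, 98, 95, 96, 104, 92, 97, 94]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                               # interchange x and y
r_shifted = pearson_r([v + 10 for v in x], y)        # add a constant to x
r_scaled = pearson_r([2.54 * v for v in x], y)       # multiply x by a positive constant
r_negated = pearson_r([-2.54 * v for v in x], y)     # multiply x by a negative constant

print(r_xy, r_yx, r_shifted, r_scaled, r_negated)
```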


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Correlation Coefficient is unitless!!
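❖ Consider the Correlation Coefficient given by,

$r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left(\dfrac{x_i - \bar{x}}{S_x}\right) \left(\dfrac{y_i - \bar{y}}{S_y}\right)$

❖ Each factor is a ratio of two quantities measured in the same units, so the units cancel and 𝑟 has no units.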

𝑟 = 0.80
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Correlation Coefficient Is Unit less
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation Coefficient measures only Linear Association

❖ Initial velocity = 64 ft/s

❖ Equation : $y = 64x - 16x^2$
❖ Correlation coefficient 𝑟 = 0
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Some more properties of the Correlation Coefficient
• The correlation between x and y is equal to 0.
• Is something wrong?
• No. The value of 0 for the correlation indicates that there is no linear relationship between x and y, which is true.
• The relationship is purely quadratic.
• The correlation coefficient should only be used when the relationship between x and y is linear.
• Otherwise the results can be misleading.
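The zero correlation for the quadratic relationship can be reproduced directly. This is a minimal sketch, assuming that x is sampled at evenly spaced values from 0 to 4 (the interval on which y = 64x − 16x² is non-negative); the sampling grid and the name `pearson_r` are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Assumed sampling grid: x = 0, 0.5, ..., 4
x = [0.5 * k for k in range(9)]
y = [64 * v - 16 * v ** 2 for v in x]   # y = 64x - 16x^2

r_projectile = pearson_r(x, y)
print(f"r = {r_projectile:.6f}")
```

Although y is completely determined by x, the rising and falling halves of the parabola cancel in the sum of products, so r comes out as zero up to floating-point rounding.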
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation Coefficient - Misleading when outliers are present
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Anscombe’s Quartet
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Anscombe’s Quartet Summary Statistics

❖ The summary statistics (i.e., mean, variance, Pearson’s correlation coefficient and the linear regression line) of all four datasets are nearly identical.
❖ But the datasets are significantly different and visually distinct. This can be observed in their respective scatter plots.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Anscombe’s Quartet Summary Statistics

Identical statistics
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Visual representation of the Anscombe’s Quartet

❖ The first scatter plot appears to show a simple linear relationship between the two variables.

❖ The second graph is not distributed normally.
❖ Though a relationship between the two variables is obvious, it is not linear.
❖ The Pearson correlation coefficient is not relevant here.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Visual representation of the Anscombe’s Quartet

❖ In the third graph the linear relationship is perfect except for one outlier.
❖ That one outlier exerts enough influence to lower the correlation coefficient from 1 to 0.816.

❖ The fourth graph shows a case where one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Conclusions from the Anscombe’s Quartet

Takeaways from Anscombe's Quartet:


Visualize your data: Graphs reveal patterns that summary statistics can miss,
such as nonlinear relationships or the influence of outliers.
Summary statistics aren't enough: The quartet shows that datasets with
identical statistics can have very different distributions and relationships.
Be cautious with outliers: A single outlier can strongly influence the results of
statistical analyses, especially correlation and regression.
Anscombe's Quartet is a classic example used in statistics to highlight the
importance of data visualization alongside numerical analysis.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Remark :

A correlation between X & Y can arise in any of the following ways:

❖ X causes Y

❖ Y causes X

❖ X causes Y & Y causes X

❖ Some third variable Z causes both X and Y

❖ It is just a coincidence and there is no causal relationship between X and Y
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation is not causation!!
Example : Relationship between drug related tweets and crime
rates (Strong Positive Correlation).

❖ A strong positive relationship between tweets and crime has been found.
❖ But there is no evidence to suggest that tweets are causing more crime, and tweets about crime do not necessarily reflect the crime rate.
[Figure: scatter plot of Crimes versus Tweets]

Reference : Yan Wang, Wenchao Yu, Sam Liu, and Sean D. Y, "The Relationship Between Social Media Data and Crime Rates in the United States", Social Media + Society, January-March 2019: 1–9.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Correlation is not causation!!

Example 2. : Relationship between wearing seat belts and astronaut deaths (Strong Negative Correlation).
Use your seatbelt and save an astronaut's life!
❖ The graph shows that an increase in seat belt use in cars is associated with a lower number of astronaut deaths.
❖ Obviously there isn't a real causal relationship here: putting your seat belt on in a car has nothing to do with the odds of an accident in space.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Confounding Variable

❖ A confounding variable is a variable that influences both the independent variable and the dependent variable, causing a spurious correlation.
❖ This may interfere with your analysis and ruin your experiment by giving useless results.
❖ Confounding variables can cause two major problems:
▪ Increase variance
▪ Introduce bias.
❖ Confounding variables are like extra independent variables that have a hidden effect on your dependent variables.
❖ A confounding variable can be the actual cause of a correlation; hence any study must take these into account and find ways of dealing with them.
[Figure: scatter plot of Crimes versus Tweets]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Confounding Variable!!

Example 1. : Relationship between reading ability and shoe


size. (Strong Positive Correlation).
❖ You collect data on reading ability and shoe size.
❖ You find that the bigger the shoe size, the better the reading ability.
❖ Does that mean a bigger shoe size leads to better reading ability?
❖ Should children therefore be fed growth hormones so that their reading ability improves?
❖ Should children start focusing on their reading ability to increase their shoe size?
Confounding Variable : There is a third variable, a confounding variable, which causes the increase in both reading ability and shoe size.
Age : As the child’s age increases, the foot size increases and the reading ability also increases since the child goes to higher classes.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Confounding Variable!!

Example 2. : Relationship between sun burns and ice – cream


consumption. (Strong Positive Correlation).
❖ You collect data on sunburns and ice cream
consumption.
❖ You find that higher ice cream consumption is
associated with a higher probability of sunburn.
❖ Does that mean,
Possibility #1: Sun burns cause consumption of ice
cream.
Possibility #2: Eating ice cream causes sun burns.
Possibility #3: There is a third variable—a confounding
variable—which causes the increase in both ice cream
sales and sun burn.
Confounding Variable : Hot temperatures
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Confounding Variable!!

Example 3. : Relationship between the force you apply to a


ball and the distance the ball travels.
❖ Naturally, you predict that the
more force you apply, the
further the ball will travel.
❖ After you run your experiment,
you observe that the ball
travels further in Condition 2
than it does in Condition 1.
❖ In other words, you find that the
less force you apply, the further
the ball travels.
❖ Confounding variable : the angle
of the slope.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem 1.

• An environmental scientist is studying the rate of absorption of a certain chemical into skin.
• She places differing volumes of the chemical on different pieces of skin and allows the skin to remain in contact with the chemical for varying lengths of time.
• She then measures the volume of chemical absorbed into each piece of skin.
• She obtains the results shown in the following table.

Volume (ml)   Time (h)   Percent Absorbed
0.05           2         48.3
0.05           2         51.0
0.05           2         54.7
2.00          10         63.2
2.00          10         67.8
2.00          10         66.2
5.00          24         83.6
5.00          24         85.1
5.00          24         87.8
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem
Correlation between Volume & Percent Absorbed
❖ [Scatter plot: Percent Absorbed versus Volume]
❖ Correlation, 𝑟 = 0.988
❖ Positive Correlation
❖ Increasing the volume causes the percentage absorbed to increase.

Correlation between Time & Percent Absorbed
❖ [Scatter plot: Percent Absorbed versus Time]
❖ Correlation, 𝑟 = 0.987
❖ Positive Correlation
❖ Increasing the time that the skin is in contact with the chemical causes the percentage absorbed to increase.

Are these conclusions justified???
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem
No! The conclusions are not justified!
Suggested Solution :
❖ The correlation between time & volume has to be explored.
❖ The Scatter plot :

❖ The correlation, 𝑟 = 0.999


❖ Conclusion : These 2 variables are completely confounded.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem 2.
• The scientist in Example Problem 1 has repeated the experiment, this time with a new design.
• The results are presented in the table.

Volume (ml)   Time (h)   Percent Absorbed
0.05           2         49.2
0.05          10         51.0
0.05          24         84.3
2.00           2         54.1
2.00          10         68.7
2.00          24         87.2
5.00           2         47.7
5.00          10         65.1
5.00          24         88.4
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem 2.
Correlation between Volume & Percent Absorbed
❖ [Scatter plot: Percent Absorbed versus Volume]
❖ Correlation, 𝑟 = 0.121
❖ Weak Positive Correlation
❖ Hence an increase in volume has little or no effect on the percentage absorbed.

Correlation between Time & Percent Absorbed
❖ [Scatter plot: Percent Absorbed versus Time]
❖ Correlation, 𝑟 = 0.952
❖ Strong Positive Correlation
❖ Increasing the time that the skin is in contact with the chemical will cause the percentage absorbed to increase.

Are these conclusions justified???
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example Problem 2.
Correlation between Volume & Time
❖ Scatter Plot :

❖ Time & Volume are not correlated in this case


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Controlled Experiments reduce the risk of Confounding
[Diagram: a confounding variable influences both the independent variable and the dependent variable, producing a spurious correlation between them.]

❖ One way to avoid confounding in controlled experiments is to choose the values of the factors in such a way that there is no correlation between those factors.
❖ For instance, in Example Problem 2 the environmental scientist removed the confounding seen in Example Problem 1 by assigning values to volume and time such that they were uncorrelated.
❖ But this is not possible in all cases.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Controlled Experiments reduce the risk of Confounding
❖ The values of factors cannot be chosen by the observer in the case of observational studies.
❖ For instance, in studies involving public health issues like the impact of environmental pollutants on human health, the observer cannot assign values to any of the factors.
❖ Hence it becomes difficult to avoid confounding.
❖ Example : People who live in areas with higher levels of pollutants may tend to have lower socio-economic status, which may affect their health.

Then how can confounding be avoided in such cases ???
❖ In observational studies, to avoid or reduce confounding, the study must be repeated a number of times under a variety of conditions before drawing reliable conclusions !!!
THANK YOU

Dr. Karthiyayini
Department of Science & Humanities
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Simple Linear Regression

[Link] H R
Department of Computer Science and Engineering
Dr. Karthiyayini
Department of Science and Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Linear Regression: Correlation &
Regression Analysis
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Regression Analysis

❖ Regression Analysis is basically the study of a set of data to make the best guess or some kind of prediction.

▪ For Example : By studying data on how much you eat and how much you weigh, you can conclude that there exists a relationship between the two.

▪ Regression analysis can help you to quantify that relationship and to predict how much you will weigh in 10 years' time if you continue to put on weight at the same rate.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Prediction of Floods / Droughts

Impact of Global warming :

❖ Increase in rainfall resulting in Floods

❖ Increase in amount of dry land leading


to droughts
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Impact of Global Warming

Global Warming in Wet areas: Evaporation of water from land and sea → More rainfall → Increase in Floods

Global Warming in Dry areas: Increase in evaporation of water from land, water surfaces and plants → Dry areas become drier → Increase in droughts



MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Other factors influencing Floods !!!

Causes of Floods!!!
▪ Global Warming
▪ Geography of the area
▪ Urbanisation
▪ Deforestation
▪ Other Human Factors
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Regression Analysis

❖ In statistical modeling, regression analysis is a set of statistical processes


for estimating the relationships between a dependent variable and one or
more independent variables.

❖It is a way of mathematically sorting out which of those variables indeed have
an impact

❖Which factors matter most ?

❖Which can we ignore ?

❖How do the factors interact with each other?

❖And most importantly, how certain are we about all these factors?
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Regression Analysis – A Broad Classification

Regression Models
▪ Simple Regression Models (One Independent Variable): Linear or Non Linear
▪ Multiple Regression Models (Several Independent Variables): Linear or Non Linear
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Some Inputs !!!

❖Regression analysis is widely used for prediction and forecasting,


where its use has substantial overlap with the field of machine
learning.
❖In some situations regression analysis can be used to infer causal
relationships between the independent and dependent variables.
❖The term "regression" was coined by Francis Galton in the
nineteenth century to describe a biological phenomenon.
❖The earliest form of regression analysis is linear regression, in which
a researcher finds the line that most closely fits the data according to
a specific mathematical criterion.
❖ This line is referred to as the line of least squares; the method of least squares was published by Legendre in 1805, and by Gauss in 1809.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Least – Squares Line

❖When two variables have a


linear relationship, the
scatter plot tends to be
clustered around a straight
line.

❖This line is referred to as


the Least Squares Line.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The least squares line

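❖ Consider the least squares line given by,

$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

where,

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$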
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example :
❖ The details pertaining to the no. of hours spent by students in preparing for an entrance exam and the marks scored (on a scale of 0 – 100) are provided in the following table.
Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours.
ii. Predict the marks that a student can score if he/she invests 20 hours.

SL No.   No. of hours spent   Marks Scored
1         6                   82
2        10                   88
3         2                   56
4         4                   64
5         6                   77
6         7                   92
7         0                   23
8         1                   41
9         8                   80
10        5                   59
11        3                   47
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Computing the least squares line

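❖ We need to first obtain the least squares line, which is given by,

$y = \hat{\beta}_0 + \hat{\beta}_1 x$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$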
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example :
SL No.   Hours spent (𝑥)   Marks Scored (𝑦)   𝑥 − x̄    (𝑥 − x̄)²   𝑦 − ȳ     (𝑥 − x̄)(𝑦 − ȳ)
1         6                82                 1.27     1.6129     17.55     22.33
2        10                88                 5.27    27.7729     23.55    124.15
3         2                56                −2.73     7.4529     −8.45     23.06
4         4                64                −0.73     0.5329     −0.45      0.33
5         6                77                 1.27     1.6129     12.55     15.97
6         7                92                 2.27     5.1529     27.55     62.60
7         0                23                −4.73    22.3729    −41.45    195.97
8         1                41                −3.73    13.9129    −23.45     87.42
9         8                80                 3.27    10.6929     15.55     50.88
10        5                59                 0.27     0.0729     −5.45     −1.49
11        3                47                −1.73     2.9929    −17.45     30.15
Mean      4.73             64.45
Sum                                                    94.1819              611.37
(The products in the last column are computed from the unrounded deviations.)
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example :
From the table we have,
$\bar{x} = 4.73$ ; $\bar{y} = 64.45$

▪ $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 611.37$

▪ $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 94.1819$

▪ $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \dfrac{611.37}{94.1819} = 6.49$

▪ $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 64.45 - (6.49 \times 4.73) = 33.7523$

▪ The equation of the least squares line is given by,

$y = \hat{\beta}_0 + \hat{\beta}_1 x \Rightarrow y = 33.7523 + 6.49x$
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Example :
▪ The equation of the least squares line is given by,
𝑦 = 33.7523 + 6.49𝑥

i. To estimate the marks scored by a student who has spent 2.35 hours.

𝑦 = 33.7523 + (6.49 × 2.35) = 49.0038
ii. To predict the marks that a student can score if he/she invests 20 hours.
𝑦 = 33.7523 + (6.49 × 20) = 163.5523
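The hand calculation above can be cross-checked with a short script. This is a minimal sketch using only the standard library; the helper name `least_squares_line` is illustrative. Because the script does not round the intermediate values, its slope, intercept and predictions differ slightly from the rounded hand-computed figures.

```python
hours = [6, 10, 2, 4, 6, 7, 0, 1, 8, 5, 3]
marks = [82, 88, 56, 64, 77, 92, 23, 41, 80, 59, 47]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1


b0, b1 = least_squares_line(hours, marks)
print(f"least squares line: y = {b0:.4f} + {b1:.4f} x")

for h in (2.35, 20):
    print(f"predicted marks for {h} hours: {b0 + b1 * h:.2f}")
```

The prediction for 20 hours exceeds the maximum possible mark of 100, because 20 hours lies well outside the observed range of 0 to 10 hours.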
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS

❖ How to compute the Least Squares Line

❖ Residuals and Errors

❖ Measuring Goodness of fit


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How to compute the Least – Squares Line ???

𝒍𝟏
𝒍𝟐
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How to compute the Least – Squares Line ???

𝒍𝟏 • 𝑙𝑖 → true length
𝒍𝟐

Linear Model
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Scenario # 1 : No Errors!!

Weight (𝑙𝑏) (𝑥)   Length (𝑖𝑛.) (𝑦)
0.0               5.02
0.2               5.04
0.4               5.06
0.6               5.08
0.8               5.10
1.0               5.12
1.2               5.14
1.4               5.16
1.6               5.18
1.8               5.20
2.0               5.22

[Figure: plot of Length (in.) (y) against Weight (lb) (x); the points lie exactly on a straight line.]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Scenario #2 : Measurement has Errors!!

Weight (𝑙𝑏) (𝑥)   Length (𝑖𝑛.) (𝑦)   Weight (𝑙𝑏) (𝑥)   Length (𝑖𝑛.) (𝑦)
0.0               5.06               2.0               5.40
0.2               5.01               2.2               5.57
0.4               5.12               2.4               5.47
0.6               5.13               2.6               5.53
0.8               5.14               2.8               5.61
1.0               5.16               3.0               5.59
1.2               5.25               3.2               5.61
1.4               5.19               3.4               5.75
1.6               5.24               3.6               5.68
1.8               5.46               3.8               5.80

[Figure: scatter plot of Length (in.) (y) against Weight (lb) (x); the points cluster around a straight line but do not lie exactly on it.]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Residual :
❖ $e_i = y_{\text{observed}} - y_{\text{predicted}}$

[Figure: scatter plot of Length (in.) (y) against Weight (lb) (x) with the fitted line; each residual is the vertical distance from a point to the line.]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Least Square Line :
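NOTE : The least squares line is defined to be the line for which the sum of squared residuals is minimum.
❖ That is, it is the line for which $\sum_{i=1}^{n} e_i^2$ is minimum.

❖ Using some mathematical computations it can be shown that the minimizing values are

$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

[Figure: scatter plot of Length (in.) (y) against Weight (lb) (x) with the least squares line.]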
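The "mathematical computations" mentioned in the note can be sketched as follows. This is the standard calculus argument, given here as an outline rather than as the derivation used on the slides; $S$ is a symbol introduced only for this sketch.

```latex
% Sum of squared residuals as a function of the two coefficients
S(\beta_0, \beta_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

% Setting both partial derivatives to zero
\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\quad\Longrightarrow\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

% Substituting \hat{\beta}_0 into the second equation
\sum_{i=1}^{n} x_i \bigl[(y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x})\bigr] = 0
\quad\Longrightarrow\quad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
             = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

The last equality holds because $\sum_{i=1}^{n} \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} \bar{x}(x_i - \bar{x}) = 0$.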
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Least Squares Line : Summary
Scenario #1 : If there is no measurement error then the data points lie on the straight line $y = \beta_0 + \beta_1 x$, and the values of $\beta_0$ and $\beta_1$ can be obtained easily by calculating the slope and the intercept.
Scenario #2 : If there is a measurement error $\varepsilon_i$, then
❖ the exact values of $\beta_0$ and $\beta_1$ cannot be determined
❖ the values of $\beta_0$ and $\beta_1$ are estimated by calculating the least squares line.
❖ The least squares line is given by $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$

where
▪ $\hat{\beta}_0$ → the 𝑦-intercept of the least squares line
→ gives an estimate of $\beta_0$, the initial length of the spring.
▪ $\hat{\beta}_1$ → the slope of the least squares line
→ gives an estimate of the actual value of the spring constant $\beta_1$.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Computing formulas
Remark :

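❖ $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$

❖ $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$

❖ $\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n \bar{y}^2$

For computational purposes we use the equivalent formula that is specified on the RHS.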
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Try This !!!
Using the Hooke’s law data given in the table,
i. Compute the least squares estimates of the spring constant and the unloaded length of the spring.
ii. Write the equation of the least squares line.
iii. Estimate the length of the spring under a load of 1.3 lb.
iv. Estimate the length of the spring under a load of 1.4 lb.
v. Obtain the residuals corresponding to all the points $(x_i, y_i)$.

Weight (𝑙𝑏) (𝑥)   Length (𝑖𝑛.) (𝑦)   Weight (𝑙𝑏) (𝑥)   Length (𝑖𝑛.) (𝑦)
0.0               5.06               2.0               5.40
0.2               5.01               2.2               5.57
0.4               5.12               2.4               5.47
0.6               5.13               2.6               5.53
0.8               5.14               2.8               5.61
1.0               5.16               3.0               5.59
1.2               5.25               3.2               5.61
1.4               5.19               3.4               5.75
1.6               5.24               3.6               5.68
1.8               5.46               3.8               5.80
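One way to check answers to this exercise is the sketch below, which applies the computing formulas Σxy − n·x̄·ȳ and Σx² − n·x̄² to the Hooke's law table. The variable names are illustrative, and the printed values are not rounded in the same way as a hand calculation would be.

```python
weights = [round(0.2 * k, 1) for k in range(20)]          # 0.0, 0.2, ..., 3.8 lb
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n

# Computing formulas (right-hand sides of the remark on computing formulas)
sxy = sum(x * y for x, y in zip(weights, lengths)) - n * x_bar * y_bar
sxx = sum(x * x for x in weights) - n * x_bar ** 2

spring_constant = sxy / sxx                         # slope estimate
unloaded_length = y_bar - spring_constant * x_bar   # intercept estimate

print(f"y = {unloaded_length:.4f} + {spring_constant:.4f} x")
for load in (1.3, 1.4):
    print(f"estimated length at {load} lb: {unloaded_length + spring_constant * load:.4f} in.")

# Residual = observed length - length predicted by the fitted line
residuals = [y - (unloaded_length + spring_constant * x)
             for x, y in zip(weights, lengths)]
print([round(e, 4) for e in residuals])
```

The residuals of a least squares line with an intercept always sum to zero (up to rounding), which gives a quick sanity check on the fit.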
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Some Observations :

❖ The Estimates are not the same as true values

❖ The Residuals are not the same as the Errors.

❖ Don’t extrapolate outside the range of the data.

❖ Don’t use the Least Squares line when the data aren’t linear.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Estimates are not the same as true values

Weight (𝑙𝑏) (𝑥)   True length (𝑖𝑛.)   Observed length (𝑖𝑛.) (𝑦)
0.0               5.02                5.06
0.2               5.04                5.01
0.4               5.06                5.12
0.6               5.08                5.13
0.8               5.10                5.14
1.0               5.12                5.16
1.2               5.14                5.25
1.4               5.16                5.19
1.6               5.18                5.24
1.8               5.20                5.46
2.0               5.22                5.40

Least squares line: y = 0.1859x + 5.0105
True line: y = 0.1x + 5.02

[Figure: true lengths and observed lengths plotted against Weight (x), with the true line and the least squares line.]
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
The Residuals are not the same as Errors
[Figure: the true and observed lengths for weights 0.0 to 2.0 lb (same data as on the previous slide) plotted against weight, with the least-squares line y = 0.1859x + 5.0105 and the true line y = 0.1x + 5.02. A residual is marked on the plot.]

The residual of a point is its vertical distance from the least-squares line. The error is its vertical distance from the true line.
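The distinction between residuals and errors can be made concrete with the 11 points tabulated for this example. The sketch below assumes the true lengths listed on the slide; in practice the true line is unknown, so the errors cannot be computed.

```python
# Data from the slide: weight (lb), true length (in.), observed length (in.).
weights = [round(0.2 * i, 1) for i in range(11)]
true_len = [5.02, 5.04, 5.06, 5.08, 5.10, 5.12, 5.14, 5.16, 5.18, 5.20, 5.22]
observed = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46, 5.40]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(observed) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, observed))
sxx = sum((x - x_bar) ** 2 for x in weights)
b1 = sxy / sxx               # least-squares slope
b0 = y_bar - b1 * x_bar      # least-squares intercept

# Residual: observed minus fitted (distance from the least-squares line).
residuals = [y - (b0 + b1 * x) for x, y in zip(weights, observed)]
# Error: observed minus true (distance from the true line).
errors = [y - t for y, t in zip(observed, true_len)]

print(f"least-squares line: y = {b1:.4f}x + {b0:.4f}")
for x, e, r in zip(weights, errors, residuals):
    print(f"x = {x:.1f}  error = {e:+.4f}  residual = {r:+.4f}")
```

The fitted line reproduces y = 0.1859x + 5.0105, and the two columns differ at every point because the fitted line is not the true line.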
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Don’t Extrapolate outside the range of the data!!
Eg. 1: The number of hours spent by students preparing for an entrance exam and the marks scored (on a scale of 0 – 100) are given in the following table.

Sl. No.   No. of hours spent   Marks scored
   1              6                 82
   2             10                 88
   3              2                 56
   4              4                 64
   5              6                 77
   6              7                 92
   7              0                 23
   8              1                 41
   9              8                 80
  10              5                 59
  11              3                 47

Using these values,
i. Estimate the marks scored by a student who has spent 2.35 hours. (Answer: 48.6748)
ii. Predict the marks that a student can score if he/she invests 20 hours. (Answer: 163.5523)

The second value exceeds the maximum of 100 marks, because 20 hours lies far outside the observed range of 0 to 10 hours.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Don’t Extrapolate outside the range of the data!!
Eg. 2: For the Hooke’s law data given earlier (weights from 0.0 lb to 3.8 lb), the least-squares line is y = 0.2046x + 4.9997.

For a weight of x = 100 lb,
Length y = 0.2046 × 100 + 4.9997 = 25.46 in.

This estimate cannot be trusted, because 100 lb lies far outside the range of the data.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Don’t use the Least Squares Line when the data aren’t linear

[Figure: scatter plot of projectile motion]

Note: In some cases the least-squares line can be used for nonlinear data, but only after a variable transformation is applied.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Measuring goodness of fit
❖ A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
❖ A linear model fits well if there is a strong linear relationship between the variables involved.
❖ The strength of a linear relationship can be measured by the quantity
$\sum_{i=1}^{n}(y_i-\bar{y})^2 - \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$.
❖ This quantity is itself a goodness-of-fit statistic.
❖ Its drawback is that it carries units (the squared units of y), so it cannot be used to compare the goodness of fit of models fitted to data sets that have different units.
❖ Hence we use the unitless ratio
$r^2 = \dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2 - \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$,
which is the square of the correlation coefficient r.
❖ This is also referred to as the coefficient of determination.
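The sketch below computes $r^2$ from its definition for the Hooke's law data and compares it with the square of the correlation coefficient. Variable names are illustrative assumptions.

```python
import math

# Hooke's law data: weight x (lb) and length y (in.).
weights = [round(0.2 * i, 1) for i in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
sxx = sum((x - x_bar) ** 2 for x in weights)
syy = sum((y - y_bar) ** 2 for y in lengths)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in weights]

# Sum of squared residuals around the least-squares line.
sse = sum((y - f) ** 2 for y, f in zip(lengths, fitted))

# r^2 from the goodness-of-fit definition.
r_squared = (syy - sse) / syy
# Correlation coefficient r, for comparison.
r = sxy / math.sqrt(sxx * syy)

print(f"r^2 (definition)  = {r_squared:.4f}")
print(f"r^2 (correlation) = {r ** 2:.4f}")
```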
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Visualisation of 𝒓𝟐
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Some special terminologies!

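❖ $r^2 = \dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2 - \sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$

❖ $\sum_{i=1}^{n}(y_i-\bar{y})^2$ : Total sum of squares

❖ $\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ : Error sum of squares

❖ $\sum_{i=1}^{n}(y_i-\bar{y})^2 - \sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ : Regression sum of squares

❖ Therefore, Total sum of squares = Regression sum of squares + Error sum of squares

❖ And, $r^2 = \dfrac{\text{Regression sum of squares}}{\text{Total sum of squares}}$

❖ $r^2$ is also referred to as the proportion of the variance in y explained by regression.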
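The decomposition can be checked numerically. In the sketch below the regression sum of squares is computed independently as $\sum(\hat{y}_i-\bar{y})^2$, which for a least-squares line equals the difference given above. The four data points are hypothetical.

```python
# Hypothetical sample data, used only to illustrate the decomposition.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 7.0, 8.0]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) # error sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)          # regression sum of squares

r_squared = ssr / sst
print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}, r^2 = {r_squared:.4f}")
```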
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
More about 𝒓𝟐

❖ $r^2$ is a quantity that indicates how well a statistical model fits a data set. In other words, it is a statistical measure of how close the observed data are to the fitted regression line.

❖ It gives the proportion of the variation in the dependent variable y that is explained by the variation in the independent variable x.

❖ It indicates how well the fitted line can be expected to forecast or predict outcomes.

❖ Its value lies between 0 and 1.

❖ The higher the value of $r^2$, the better the fit, and hence the prediction.


THANK YOU

Dr. Mamatha H R
Department of Computer Science and Engineering
MATHEMATICS FOR COMPUTER SCIENCE
ENGINEERS
Checking Assumptions and Transforming Data

Dr. Mamatha H R
Department of Computer Science and Engineering
Dr. Karthiyayini
Department of Science and Humanities
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Simple Linear Regression: Correlation &
Regression Analysis
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Uncertainties in the Least-Squares Coefficients

The linear model is given as: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

• The errors $\varepsilon_i$ create uncertainty in the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$.

• It is intuitively clear that if the $\varepsilon_i$ tend to be small in magnitude, the points will be tightly clustered around the line, and the uncertainty in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be small.

• On the other hand, if the $\varepsilon_i$ tend to be large in magnitude, the points will be widely scattered around the line, and the uncertainties (standard deviations) in the least-squares estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ will be larger.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Uncertainties in the Least-Squares Coefficients

• Assume we have n data points (x1, y1), . . . , (xn, yn), and we plan
to fit the least squares line.

• In order for the estimates $\hat{\beta}_1$ and $\hat{\beta}_0$ to be useful, we need to estimate just how large their uncertainties are. In order to do this, we need to know something about the nature of the errors $\varepsilon_i$.

• We will begin by studying the simplest situation, in which four important assumptions are satisfied.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Uncertainties in the Least-Squares Coefficients
Assumptions for Errors in Linear Models:
In the simplest situation, the following assumptions are satisfied:
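1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.

2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.

3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$.

4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.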


• When the sample size is large, the normality assumption (4) becomes less
important.
• Mild violations of the assumption of constant variance (3) do not matter too
much, but severe violations should be corrected.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Uncertainties in the Least-Squares Coefficients
• Under these assumptions, the effect of the $\varepsilon_i$ is largely governed by the magnitude of the variance $\sigma^2$, since it is this variance that determines how large the errors are likely to be.
• Therefore, in order to estimate the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$, we must first estimate the error variance $\sigma^2$.
• Since the magnitude of the variance is reflected in the degree of spread of the points around the least-squares line, it follows that by measuring this spread, we can estimate the variance.
Specifically, the vertical distance from each data point $(x_i, y_i)$ to the least-squares line is given by the residual $e_i$.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Uncertainties in the Least-Squares Coefficients

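• The spread of the points around the line can be measured by the sum of the squared residuals, $\sum_{i=1}^{n} e_i^2$.
• The estimate of the error variance $\sigma^2$ is the quantity $s^2$ given by
$s^2 = \dfrac{\sum_{i=1}^{n} e_i^2}{n-2} = \dfrac{(1-r^2)\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-2}$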
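A minimal sketch of this estimate for the Hooke's law data, computing $s^2$ as the sum of squared residuals divided by n – 2 and also through the equivalent form $(1-r^2)\sum(y_i-\bar{y})^2/(n-2)$. Variable names are illustrative assumptions.

```python
import math

# Hooke's law data: weight x (lb) and length y (in.).
weights = [round(0.2 * i, 1) for i in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
sxx = sum((x - x_bar) ** 2 for x in weights)
syy = sum((y - y_bar) ** 2 for y in lengths)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(weights, lengths)]

# Estimate of the error variance from the residuals.
s2 = sum(e ** 2 for e in residuals) / (n - 2)

# Equivalent form using r^2.
r2 = sxy ** 2 / (sxx * syy)
s2_alt = (1 - r2) * syy / (n - 2)

s = math.sqrt(s2)
print(f"s^2 = {s2:.6f}, s = {s:.4f}")
print(f"s^2 via r^2 = {s2_alt:.6f}")
```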
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Distribution

In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by

$\mu_{y_i} = \beta_0 + \beta_1 x_i$

$\sigma^2_{y_i} = \sigma^2$

The slope represents the change in the mean of y associated with an increase of one unit in the value of x.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
More Distributions
Under assumptions 1 – 4:
• The quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.

• The means of $\hat{\beta}_0$ and $\hat{\beta}_1$ are the true values $\beta_0$ and $\beta_1$, respectively.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
More Distributions (cont.)
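• The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated with

$s_{\hat{\beta}_0} = s\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$ and $s_{\hat{\beta}_1} = \dfrac{s}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$

where $s = \sqrt{\dfrac{(1-r^2)\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-2}}$ is an estimate of the error standard deviation $\sigma$.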
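These formulas can be applied directly to the Hooke's law data. The sketch below is one possible arrangement of the computation; the names `s_b0` and `s_b1` are illustrative assumptions.

```python
import math

# Hooke's law data: weight x (lb) and length y (in.).
weights = [round(0.2 * i, 1) for i in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
sxx = sum((x - x_bar) ** 2 for x in weights)
syy = sum((y - y_bar) ** 2 for y in lengths)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))

# Estimate of the error standard deviation.
r2 = sxy ** 2 / (sxx * syy)
s = math.sqrt((1 - r2) * syy / (n - 2))

# Estimated standard deviations of the intercept and slope estimates.
s_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)
s_b1 = s / math.sqrt(sxx)

print(f"s = {s:.4f}")
print(f"s_b0 (intercept) = {s_b0:.4f}")
print(f"s_b1 (slope)     = {s_b1:.4f}")
```

Because `sxx` appears in the denominator of both expressions, spreading the x values further apart would reduce both uncertainties.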


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Notes
1. Since the quantity $\sum_{i=1}^{n}(x_i-\bar{x})^2$ appears in the denominators of $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, it follows that the more spread out the x’s are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$ will be.

2. Use caution: if the range of x values extends beyond the range where the linear model holds, the results will not be valid.

3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student’s t distribution with n – 2 degrees of freedom.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Checking Assumptions and Transforming Data

• We stated some assumptions for the errors. Here we want to see if any of
those assumptions are violated.

• The single best diagnostic for least-squares regression is a plot of residuals versus the fitted values, sometimes called a residual plot.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
More of the Residual Plot

• When the linear model is valid, and assumptions 1 – 4 are satisfied, the
plot will show no substantial pattern. There should be no curve to the
plot, and the vertical spread of the points should not vary too much over
the horizontal range of the data.

• A good-looking residual plot does not by itself prove that the linear
model is appropriate. However, a residual plot with a serious defect does
clearly indicate that the linear model is inappropriate.

• When the vertical spread in a scatterplot doesn’t vary too much, the
scatterplot is said to be homoscedastic. The opposite of homoscedastic is
heteroscedastic.
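The points of a residual plot are the pairs (fitted value, residual). The sketch below builds them for the Hooke's law data and adds a rough numerical comparison of the vertical spread in the lower and upper halves of the fitted values. That comparison is an assumption made here for illustration and is no substitute for looking at the plot.

```python
# Hooke's law data: weight x (lb) and length y (in.).
weights = [round(0.2 * i, 1) for i in range(20)]
lengths = [5.06, 5.01, 5.12, 5.13, 5.14, 5.16, 5.25, 5.19, 5.24, 5.46,
           5.40, 5.57, 5.47, 5.53, 5.61, 5.59, 5.61, 5.75, 5.68, 5.80]

n = len(weights)
x_bar = sum(weights) / n
y_bar = sum(lengths) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weights, lengths))
         / sum((x - x_bar) ** 2 for x in weights))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * x for x in weights]
residuals = [y - f for y, f in zip(lengths, fitted)]

# Points of the residual plot: (fitted value, residual).
plot_points = list(zip(fitted, residuals))

# For a least-squares fit with an intercept, the residuals sum to zero
# and are uncorrelated with the fitted values.
sum_resid = sum(residuals)
sum_cross = sum(f * e for f, e in plot_points)

# Rough spread check: mean absolute residual in each half of the fitted range.
ordered = [e for _, e in sorted(plot_points)]
half = n // 2
spread_low = sum(abs(e) for e in ordered[:half]) / half
spread_high = sum(abs(e) for e in ordered[half:]) / (n - half)

for f, e in plot_points:
    print(f"fitted = {f:.4f}  residual = {e:+.4f}")
print(f"mean |residual|: lower half = {spread_low:.4f}, upper half = {spread_high:.4f}")
```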
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Residual Plots

A: No noticeable pattern
B: Heteroscedastic
C: Trend
D: Outlier
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Checking Assumptions to form a Linear Model
• Example of a residual plot: on the left is the scatterplot of y versus x; on the right is the plot of the residuals versus the fitted values of y.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Checking Assumptions to form a Linear Model
• Below on the left the plot is homoscedastic, while on the
right the spread increases with the fitted value and is thus
heteroscedastic.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Homoscedasticity or Heteroscedasticity? The way forward...

• If the residual plot is homoscedastic and shows no substantial trend or curve, then the linear model is appropriate for the data plotted.
• If the residual plot is heteroscedastic, or shows a substantial trend or curve, then the assumptions for a linear model do NOT hold! In such cases we need to transform the data or pursue other methods.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Transforming the Variables
• If we fit the linear model $y = \beta_0 + \beta_1 x + \varepsilon$ and find that the residual plot exhibits a trend or pattern, we can sometimes fix the problem by raising x, y, or both to a power.

• It may be the case that a model of the form $y^a = \beta_0 + \beta_1 x^b + \varepsilon$ fits the data well.

• Replacing a variable with a function of itself is called transforming the variable. Specifically, raising a variable to a power is called a power transformation.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Which transformation to apply?
It is possible with experience to look at a scatterplot, or a residual plot, and
make an educated guess as to how to transform the variables.
Mathematical methods are also available to determine a good
transformation.
Trial and Error is fine – Try various powers on both x and y (including
ln x and ln y), look at the residual plots, and hope to find a homoscedastic
one with no discernible pattern.
More advanced discussion in Draper and Smith (1998).
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Which transformation to apply?

Recall the earlier example of a scatter plot (O3 concentration vs NOx concentration) whose residual plot on the right is heteroscedastic, as shown below. Linear model NOT GOOD! Uh oh!
Also notice the outlier with ozone concentration nearly 100.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Logarithm Transformation on One Axis

Applying the logarithm to the y-axis variable (O3 concentration), we obtain the following scatter plot and its residual plot on the right. Linear model looks GOOD! YAY! The outlier is less prominent too!
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Logarithm Transformation on One Axis
Now consider the example below. The plot on the left is Production (ft3/ft) vs Fracture fluid (gal/ft), and the residual plot is largely heteroscedastic! Not good for a linear model.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Logarithm Transformation on Both Axes
Below is a plot of ln (production) vs ln (fracture fluid) for the same
data. This time the residual plot is homoscedastic, good for linear
model!
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power Transformations – The reciprocal
Below (left side) is a plot of Rockwell (B scale) hardness of welds
versus their Ogden-Jaffe number. The residual plot (right side)
shows a pattern where negative residual is observed for the
extreme fitted values and positive residual for the ones in the
middle. Linear model NOT OK.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power Transformations – The reciprocal
We plot the graph of Rockwell hardness vs (Ogden-Jaffe number)$^{-1}$ for the same data (below, left side) and find that the residual plot (below, right side) is homoscedastic, having no discernible pattern.
Linear model is OK.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power Transformations – Positive Powers
Plot of y vs x and its residual plot, which exhibits a discernible pattern.
Linear model is NOT OK.

x y x y
1 2.2 11 31.5
2 9 12 32.7
3 13.5 13 34.9
4 17 14 36.3
5 20.5 15 37.7
6 23.3 16 38.7
7 25.2 17 40
8 26.4 18 41.3
9 27.6 19 42.5
10 30.2 20 43.7
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Power Transformations – Positive Powers
Plot of $y^2$ vs x and its homoscedastic residual plot, which exhibits no discernible pattern.
Linear model is OK.
x $y^2$ x $y^2$
1 4.84 11 992.25
2 81 12 1069.29
3 182.25 13 1218.01
4 289 14 1317.69
5 420.25 15 1421.29
6 542.89 16 1497.69
7 635.04 17 1600
8 696.96 18 1705.69
9 761.76 19 1806.25
10 912.04 20 1909.69
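The effect of the transformation can be examined numerically with the tabulated data. The sketch below fits a line to y vs x and to $y^2$ vs x and prints the sign of each residual in order of x, so that long runs of one sign hint at a pattern. Using $r^2$ and sign runs as stand-ins for inspecting the residual plots is an assumption made here; the helper names are illustrative.

```python
# Data from the table: x = 1..20 and the corresponding y values.
x = list(range(1, 21))
y = [2.2, 9, 13.5, 17, 20.5, 23.3, 25.2, 26.4, 27.6, 30.2,
     31.5, 32.7, 34.9, 36.3, 37.7, 38.7, 40, 41.3, 42.5, 43.7]

def fit(xs, ys):
    """Return (residuals, r_squared) of the least-squares line of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [b - (intercept + slope * a) for a, b in zip(xs, ys)]
    return residuals, sxy ** 2 / (sxx * syy)

def signs(residuals):
    """Residual signs in order of x, e.g. '--+++-'."""
    return "".join("+" if e >= 0 else "-" for e in residuals)

res_raw, r2_raw = fit(x, y)                    # y vs x
res_sq, r2_sq = fit(x, [v ** 2 for v in y])    # y^2 vs x

print(f"y   vs x: r^2 = {r2_raw:.4f}, residual signs = {signs(res_raw)}")
print(f"y^2 vs x: r^2 = {r2_sq:.4f}, residual signs = {signs(res_sq)}")
```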
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Transformations – Do they always work?

It is important to remember that power transformations don’t always work.

Sometimes, none of the residual plots looks good, no matter what transformations are tried. In these cases, other methods should be used. One of these is multiple regression, which is not covered here.

Some other methods are briefly mentioned in the next slide.


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Alternatives to Transformations
The popular methods other than transformation are:

• Weighted Least Squares


➔ We assign greater weights to points in regions where the

vertical spread is smaller and vice versa.

• Multiple Regression
➔ We add more independent variables in order to explain the

variation in the dependent variable.
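A minimal sketch of weighted least squares for a straight line, which minimises $\sum w_i (y_i - \beta_0 - \beta_1 x_i)^2$. The data and weights below are hypothetical; in practice the weights would be chosen larger where the vertical spread is smaller.

```python
def weighted_least_squares(x, y, w):
    """Return (intercept, slope) minimising sum of w_i * (y_i - b0 - b1*x_i)^2."""
    sw = sum(w)
    # Weighted means of x and y.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data whose scatter grows with x, and hypothetical weights
# that shrink accordingly.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.6, 10.9]
w = [4.0, 4.0, 2.0, 1.0, 1.0]

b0_w, b1_w = weighted_least_squares(x, y, w)
b0_u, b1_u = weighted_least_squares(x, y, [1.0] * len(x))  # equal weights

print(f"weighted:   y = {b1_w:.4f}x + {b0_w:.4f}")
print(f"unweighted: y = {b1_u:.4f}x + {b0_u:.4f}")
```

With equal weights the function reduces to the ordinary least-squares line.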


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How Many Points Make a Reliable Residual Plot?

When there are too few points on the residual plot, then…

➢ … it may appear to have a pattern or to be heteroscedastic, even though this is just a visual effect created by one or two points.

➢ … detecting outliers may become difficult

What to do if you can’t interpret a residual plot reliably?

You can start by fitting a linear model but declare your result
tentative; wait for more data and then a reliable decision can be
made.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
How Many Points Make a Reliable Residual Plot?

NOT all residual plots with few points turn out to be hard to interpret.

Some of these show a pattern which cannot be changed by relocating just one
or two points.

In such a case a linear model should NOT be used!


MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Outliers

• Outliers are points that are detached from the bulk of the data.

• Both the scatter plot and the residual plot should be examined for
outliers.

• The first thing to do with an outlier is to determine why it is different from the rest of the points.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Outliers

• Sometimes outliers are caused by data-recording errors or equipment malfunction. In this case, the outlier can be deleted from the data set, and you may present results that do not include the outlier.

• If it cannot be determined why there is an outlier, then it is not wise to delete it. Here the results presented should be the ones from the analysis with the outlier included in the data set.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Influential Point

• If there are outliers that cannot be removed from the data set, then the best thing to do is to fit the line twice: once to the whole data set, and once with the outlier removed.

• If none of the outliers, upon removal, makes a noticeable difference to the least-squares line or to the estimated standard deviations of the slope and intercept, then use the fit with the outliers included.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Influential Point

• If one or more outliers do make a difference, then the range of values for the least-squares coefficients should be reported. Avoid computing confidence and prediction intervals and performing hypothesis tests.

• An outlier that makes a considerable difference to the least-squares line when removed is called an influential point.
• In general, outliers with unusual x values are more likely to be influential than those with unusual y values, but every outlier should be checked.

• Some authors restrict the definition of outliers to points that have unusually large residuals.
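The check for an influential point amounts to fitting the line with and without the suspect point and comparing the coefficients. The sketch below uses hypothetical data in which the last point has an unusual x value.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: five points on the line y = x, plus one outlier
# with an unusual x value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 5.0]

# Fit to the whole data set.
intercept_all, slope_all = least_squares(x, y)
# Fit with the suspect point (the last one) removed.
intercept_without, slope_without = least_squares(x[:-1], y[:-1])

print(f"with outlier:    y = {slope_all:.4f}x + {intercept_all:.4f}")
print(f"without outlier: y = {slope_without:.4f}x + {intercept_without:.4f}")
```

Here the slope changes substantially when the point is removed, so the point is influential and both sets of coefficients should be reported.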
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Comments

• Transforming the variables is not the only method for analyzing data when the residual plot indicates a problem.

• There is a technique called weighted least squares regression. The effect is to make the points whose error variance is smaller have greater influence in the computation of the least-squares line.

• When the residual plot shows a trend, this sometimes indicates that more than one independent variable is needed to explain the variation in the dependent variable.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Comments

• If the relationship is nonlinear, then a method called nonlinear regression can be applied.

• If the plot of residuals versus fitted values looks good, it may be advisable to perform additional diagnostics to further check the fit of the linear model. A time series plot is used to see if time should be included in the model. A normal probability plot can be used to check the normality assumption.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Independence of Observations

• If the plot of residuals versus fitted values looks good, then further diagnostics may be used to check the fit of the linear model.

• One such diagnostic is a time order plot: a plot of the residuals versus the order in which the observations were made.

• If there are trends in this plot, then x and y may be varying with time. In this case, a time term can be added to the model as an additional independent variable.
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
Normality Assumption

• To check that the errors are normally distributed, a normal probability plot of the residuals can be made.

• If the plot looks like it follows a rough straight line, then we can
conclude that the residuals are approximately normally distributed.
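The coordinates of a normal probability plot can be computed without plotting software. The sketch below pairs each sorted residual with a standard normal quantile and reports the correlation between the two as a rough measure of straightness. The plotting positions (i – 0.5)/n are one common convention and are an assumption here, as are the residual values and function names.

```python
import math
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n for i = 1..n (an assumed convention).
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def correlation(pairs):
    """Pearson correlation of the (quantile, residual) pairs."""
    n = len(pairs)
    q_bar = sum(q for q, _ in pairs) / n
    e_bar = sum(e for _, e in pairs) / n
    sqe = sum((q - q_bar) * (e - e_bar) for q, e in pairs)
    sqq = sum((q - q_bar) ** 2 for q, _ in pairs)
    see = sum((e - e_bar) ** 2 for _, e in pairs)
    return sqe / math.sqrt(sqq * see)

# Hypothetical residuals, for illustration only.
residuals = [0.3, -0.1, 0.0, 0.2, -0.4]
points = normal_plot_points(residuals)

for q, e in points:
    print(f"normal quantile = {q:+.4f}  residual = {e:+.4f}")
print(f"straightness (correlation) = {correlation(points):.4f}")
```

A correlation close to 1 corresponds to points lying near a straight line on the plot.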
MATHEMATICS FOR COMPUTER SCIENCE ENGINEERS
MCQ

Solution D
THANK YOU
