
Mathematics for Science and Technology

Week 22
Inferential Statistics
Inferential Statistics
• A branch of statistics concerned with drawing conclusions and making inferences about a population based on a sample of data.
• Allows us to make predictions, test hypotheses, and generalize findings based on the specific data we have collected.
Builds on concepts…
 Sampling
 Descriptive Statistics
 Probability Distributions
 Sampling Distribution
 Central Limit Theorem
 Confidence Intervals
 Hypothesis Testing
• In the past few sessions we have already discussed some basic concepts such as sampling, types of data, data summarising techniques, descriptive statistics (mean, standard deviation), probability distributions, etc.
• In this session, we will take an overview of the Central Limit Theorem, confidence intervals, hypothesis testing, correlation and regression.
Sampling rules of thumb
• Minimum of 30 observations per
sample or subsample
• Aim for 10 – 20% response rate
• Find balance between sufficiently
large for representativeness and
small enough to manage
Point estimate (i.e. sample mean) for an unknown population mean
Multiple Sample means for n=30
[Dotplots of ten samples (Sample_30_1 … Sample_30_10), each of size n = 30, drawn from the same population; data range roughly −4.8 to 4.8]
Sample mean for n=100
[Dotplot of Sample_100]

Variable     Mean    StDev   Minimum  Maximum
Sample_100   -0.051  2.110   -5.943   4.234
Sample mean for n=500
[Dotplot of Sample_500]

Variable     Mean     StDev    Minimum  Maximum
Sample_500   -0.1096  1.9362   -5.8696  6.1619
Sample mean for n=1000
[Dotplot of Sample_1000; each symbol represents up to 2 observations]

Variable      Mean     StDev    Minimum  Maximum
Sample_1000   -0.0229  1.9641   -6.2236  6.3909
Sample mean for n=10000
[Dotplot of Sample_10000; each symbol represents up to 26 observations]

Variable       Mean     StDev    Minimum  Maximum
Sample_10000   -0.0092  1.9843   -7.4340  7.0002
Sampling distribution of x̄
• A distribution of sample means is called a sampling distribution.
• If we take lots of small samples, x̄ will vary.
• As sample size increases, we expect x̄ to approach μ.
Sampling distribution of x̄
[Dotplot of the sample means; each symbol represents up to 2 observations]

Variable  Mean     StDev    Minimum  Maximum
Mean      -0.0143  0.3601   -1.1190  1.1060
Sampling distribution of x̄
• Example – normal distribution with mean = 0 and s.d. = 2
• 1000 samples with n = 30
[Dotplot of the 1000 sample means; each symbol represents up to 2 observations]
Sampling distribution of x̄
• Example – normal distribution with mean = 0 and s.d. = 2
• 1000 samples with n = 30
• Mean of the sample means ≈ 0
• S.D. of the sample means is 0.3601 – definitely doesn’t equal 2

Variable  Mean     StDev    Minimum  Maximum
Mean      -0.0143  0.3601   -1.1190  1.1060
The standard error
• The standard deviation of the sample means, i.e. the s.d. of the sampling distribution
• Called the standard error (SE)
• Calculated as SE = σ / √n
• In theory: SE = σ / √n = 2 / √30 ≈ 2 / 5.5 ≈ 0.364
• In practice, we use the sample standard deviation s instead of σ, since σ is usually unknown
Central limit theorem
• The sampling distribution of sample means (x̄) is approximately a normal distribution.
• The mean of the sampling distribution approaches the population mean (i.e. μ) as n increases.
• The standard deviation of the sampling distribution is the standard error, SE = σ / √n.
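As a quick check of these ideas, the following sketch (in Python with NumPy – an assumption, since the module itself uses Minitab) draws 1000 samples of size 30 from a normal population with mean 0 and s.d. 2, mirroring the example above, and compares the s.d. of the sample means with the theoretical standard error σ/√n ≈ 0.364.

```python
import numpy as np

rng = np.random.default_rng(42)

mu, sigma = 0, 2          # population parameters from the example
n_samples, n = 1000, 30   # 1000 samples, each of size 30

# Draw the samples and compute one mean per sample
sample_means = rng.normal(mu, sigma, size=(n_samples, n)).mean(axis=1)

print("Mean of sample means:", sample_means.mean())       # close to mu = 0
print("S.D. of sample means:", sample_means.std(ddof=1))  # close to sigma / sqrt(n)
print("Theoretical SE:      ", sigma / np.sqrt(n))        # about 0.365
```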
Confidence Intervals
 A range of values within which we infer, with a stated level of confidence, that the population parameter (e.g. the population mean) lies.
Interpreting confidence intervals
 If we repeated the sampling and constructed a 95% confidence interval each time, then in the long run about 95 out of every 100 intervals would contain the true population parameter.
Example 1
Suppose we want to know the mean income of all single males in the U.K.
 To answer this question, we decide to take a random sample of 20 single males and ask them their income.
 This produces the following results (in £):
12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000
17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

x̄ = Σx / n = £492,000 / 20 = £24,600
Example 1
12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000
17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

x̄ = Σx / n = £492,000 / 20 = £24,600

Standard Error = s / √n = £10,133 / √20 ≈ £2,265.81
Example 1
12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000
17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

Assume we want a 99.7% Confidence Interval (CI), i.e. x̄ ± 3 × SE

Lower limit of interval = £24,600 − 3 × £2,265.81 ≈ £17,802.57
Upper limit of interval = £24,600 + 3 × £2,265.81 ≈ £31,397.43
Example 1
12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000
17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

Assume we want a 99.7% CI:
£17,802.57 ≤ μ ≤ £31,397.43
Example 1
12,000 26,000 35,000 24,500 18,000 15,500 28,500 18,000 54,000 43,000
17,500 22,000 21,500 26,000 16,000 16,500 24,500 27,500 29,000 17,000

Confidence level   Alpha   z value   Lower       Upper
99%                0.01    2.576     18,763.78   30,436.22
95%                0.05    1.960     20,159.19   29,040.81
90%                0.10    1.645     20,873.20   28,326.85

Note: the z value is the number of standard errors the sample mean lies from the population mean.
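A minimal sketch of the same calculation in Python (an assumption – the slides themselves use Minitab), using the salary data and the normal (z) critical values from the table above:

```python
import numpy as np
from scipy import stats

salaries = np.array([12000, 26000, 35000, 24500, 18000, 15500, 28500, 18000, 54000, 43000,
                     17500, 22000, 21500, 26000, 16000, 16500, 24500, 27500, 29000, 17000])

x_bar = salaries.mean()                              # £24,600
se = salaries.std(ddof=1) / np.sqrt(len(salaries))   # about £2,266

for conf in (0.99, 0.95, 0.90):
    z = stats.norm.ppf(1 - (1 - conf) / 2)           # two-tailed critical z value
    print(f"{conf:.0%} CI: ({x_bar - z * se:,.2f}, {x_bar + z * se:,.2f})")
```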
Inference
• Test a sample
• Assume the results are applicable to the
population
– Relies on sufficiently large sample
– Requires estimate of sample variability
Inference
 A statistical result on a sample does not
imply the result is also true for the
population from which the sample was
drawn
 If we have an acceptable level of
confidence in the representativeness of
the sample, we may infer that it is true
for the population
Statistical tests - Hypothesis testing
• A method used to draw conclusions about a population based on sample data.
• Involves:
 setting up null and alternative hypotheses,
 selecting a significance level,
 conducting a statistical test, and
 interpreting the results.
Statistical tests - Hypothesis testing

• Built upon an assumed hypothesis


• We look for evidence to support rejecting
this in favour of a stated alternative
hypothesis
• We never claim to prove the initial
hypothesis
Hypothesis testing
• Null hypothesis (starting assumption)
E.g.
– The value of the population mean is equal to the sample mean.
– There is no difference between the means of two populations.
– There is no relationship between the variables.
• Denoted by H₀
Hypothesis testing
• Alternative hypothesis
• Denoted by H₁ or Hₐ
Hypothesis testing
Single sample
Null Hypothesis
H₀: the mean = 0, or more generally H₀: μ = μ₀
Alternative Hypothesis
One-tailed alternative hypothesis
H₁: μ > μ₀ (or H₁: μ < μ₀)
Two-tailed alternative hypothesis
H₁: μ ≠ μ₀
Hypothesis testing
Comparing two samples
Null Hypothesis
H₀: the two means are equal, i.e. μ₁ = μ₂ or μ₁ − μ₂ = 0
Alternative Hypothesis
One-tailed alternative
H₁: μ₁ > μ₂ (or H₁: μ₁ < μ₂)
Two-tailed alternative
H₁: μ₁ ≠ μ₂
Hypothesis testing -Which test to choose?
1.Parametric Tests:
 Assume certain characteristics about the
population distribution, such as normality and
homogeneity of variance.
 Examples include t-tests, analysis of variance
(ANOVA), and linear regression.
2.Non-parametric Tests:
 Do not make assumptions about the shape of the population distribution.
 Used when the data does not meet the
assumptions of parametric tests.
 Examples include the Wilcoxon rank-sum test,
Mann-Whitney U test, and Kruskal-Wallis test.
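As an illustration of the parametric / non-parametric choice, here is a sketch in Python with SciPy (an assumption – the slides themselves use Minitab) that runs a parametric two-sample t-test and the non-parametric Mann-Whitney U test on the same two made-up samples, one of which is clearly non-normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two illustrative samples; group_b is drawn from a skewed (non-normal) distribution
group_a = rng.normal(loc=50, scale=10, size=30)
group_b = rng.exponential(scale=45, size=30)

# Parametric: independent two-sample t-test (assumes roughly normal data)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Mann-Whitney U test (no normality assumption)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"t-test:        t = {t_stat:.2f}, p = {t_p:.3f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.3f}")
```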
Hypothesis testing

One tail or two?


 Decision must be made before analysis
 Based on assumptions made when
collecting data
 Decision should not be based on
convenience of results
Null hypothesis significance testing (NHST)
NHST
• Assume the null hypothesis is true
• Choose a critical probability value, known as the significance level (usually denoted by α)
• The critical probability value / significance level is frequently set at 0.05 (5%), but this is not a rule
• Find the probability value, i.e. the p-value, for the test statistic
• If p-value < α, we reject the null hypothesis in favour of the alternative
Statistical tests
• Assumes null hypothesis is true
• Looks for evidence that it isn’t
• If we find sufficient evidence we reject the null
in favour of the alternative
• If we don’t find sufficient evidence we fail to
reject the null
• We never prove the null hypothesis to be true
Statistical tests
• British legal system
• Assumption of innocence
H₀: Accused didn’t commit the crime
H₁: Accused did commit the crime
• If we fail to find sufficient evidence of guilt we fail to reject H₀
• But we don’t find the defendant innocent, because we didn’t test for that
Types of errors
• Errors caused when sample doesn’t properly
represent the population
• Type I error:
– Incorrectly reject the null hypothesis when it is
true
• Type II error:
– Incorrectly fail to reject the null hypothesis when
the alternative is true
Errors
• Reducing probability of Type I errors increases
probability of making Type II errors
• Reducing probability of Type II errors increases
probability of making Type I errors
• Type II errors are more difficult to calculate –
beyond the scope of this module
Errors
                               Null Hypothesis is True   Null Hypothesis is False
Reject Null Hypothesis         Type I Error              Correct Outcome
Do not Reject Null Hypothesis  Correct Outcome           Type II Error

Decision making
Consider the consequences of making an incorrect decision
Using the t distribution
• Preferred for testing small samples
• Robust against deviations from normality
• Doesn’t require the population s.d. to be known (uses the sample s.d. instead)
• Easy to test using software
Types of t-test
• One-sample t-test
– Equivalent to the z test which uses the standardised
normal distribution
• Paired two-sample test
– Essentially a one-sample test of differences
• Independent two-sample test
• We will only consider the one-sample test –
two-sample tests are discussed in the
supplementary material for this week
One-sample 2-tailed t-test
Based on the salaries data shown above (sample mean = £24,600):
• Tests the null hypothesis that the population mean (μ) is equal to a hypothesised value, here £26,400
• With no prior assumptions we use a 2-tailed test:
One-sample 2-tailed t-test

Descriptive Statistics
N    Mean    StDev   SE Mean  95% CI for μ
20   24600   10133   2266     (19858, 29342)
μ: mean of Salaries

Test
Null hypothesis         H₀: μ = 26400
Alternative hypothesis  H₁: μ ≠ 26400

T-Value  P-Value
-0.79    0.437

P-Value > 0.05, hence we cannot reject the null hypothesis.
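The same test can be reproduced outside Minitab; a minimal sketch in Python with SciPy (an assumption, not part of the module's toolset), using the salary data from Example 1:

```python
import numpy as np
from scipy import stats

salaries = np.array([12000, 26000, 35000, 24500, 18000, 15500, 28500, 18000, 54000, 43000,
                     17500, 22000, 21500, 26000, 16000, 16500, 24500, 27500, 29000, 17000])

# Two-tailed one-sample t-test: H0: mu = 26400 vs H1: mu != 26400
t_stat, p_value = stats.ttest_1samp(salaries, popmean=26400)
print(f"T-Value = {t_stat:.2f}, P-Value = {p_value:.3f}")  # about -0.79 and 0.437
```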
One-sample 1-tailed t-test
• We can test against the alternative hypothesis
that the population mean is either less than or
more than £29,000
• If we choose the wrong alternative, the results
will not tell us anything useful
• We begin by testing against the hypothesis that
the population mean is greater than £29,000,
although there is no clear evidence to support
this in the data
One-sample 1-tailed t-test
(Alternative hypothesis that the population mean is greater than £29,000)

Descriptive Statistics
N    Mean    StDev   SE Mean  95% Lower Bound for μ
20   24600   10133   2266     20682
μ: mean of Salaries

Test
Null hypothesis         H₀: μ = 29000
Alternative hypothesis  H₁: μ > 29000

T-Value  P-Value
-1.94    0.966

P-Value > 0.05, hence we cannot reject the null hypothesis.
One-sample 1-tailed t-test
• We completely fail to find evidence in support of the alternative hypothesis, but that isn’t surprising
• The hypothesised population mean is well above the sample mean, and only three of the 20 observations exceed it
One-sample 1-tailed t-test
• Given this information, a more sensible test would be against the alternative hypothesis that the population mean is lower than £29,000
One-sample 1-tailed t-test

Descriptive Statistics
N    Mean    StDev   SE Mean  95% Upper Bound for μ
20   24600   10133   2266     28518
μ: mean of Salaries

Test
Null hypothesis         H₀: μ = 29000
Alternative hypothesis  H₁: μ < 29000

T-Value  P-Value
-1.94    0.034

P-Value < 0.05, hence we now reject the null hypothesis.
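For the one-tailed versions, the direction of the alternative is passed explicitly; again a sketch in Python with SciPy (an assumption) on the same salary data:

```python
import numpy as np
from scipy import stats

salaries = np.array([12000, 26000, 35000, 24500, 18000, 15500, 28500, 18000, 54000, 43000,
                     17500, 22000, 21500, 26000, 16000, 16500, 24500, 27500, 29000, 17000])

# H1: mu > 29000 (upper-tailed) -- expected to give a large p-value here
t_gt, p_gt = stats.ttest_1samp(salaries, popmean=29000, alternative='greater')

# H1: mu < 29000 (lower-tailed) -- expected to give a small p-value here
t_lt, p_lt = stats.ttest_1samp(salaries, popmean=29000, alternative='less')

print(f"H1: mu > 29000 -> T = {t_gt:.2f}, P = {p_gt:.3f}")  # about -1.94 and 0.966
print(f"H1: mu < 29000 -> T = {t_lt:.2f}, P = {p_lt:.3f}")  # about -1.94 and 0.034
```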
One-sample 1-tailed t-test
• Now we do have evidence against the null hypothesis and in favour of the alternative
• The p-value has halved, now at 0.034, and the test statistic now clearly falls in the rejection region
Correlation
A statistical technique used to
measure the strength and nature of
the relationship between two
variables.
Types of Correlation
1.Positive Correlation
2.Negative Correlation
3.No Correlation
Techniques to find correlation
Scatterplots:
 Graphical representation of the relationship between two variables.
 The x-axis represents the independent variable, and
 the y-axis represents the dependent variable.
Correlation Coefficient (r):
 Measures the linear relationship between two numerical variables (and in certain cases for ordinal variables).
 Its value ranges from -1 to +1, where:
+1 indicates a perfect positive correlation,
-1 indicates a perfect negative correlation,
0 indicates no correlation.
Regression Analysis
A statistical method to model the relationship between a dependent variable and one or more independent variables.
Types of Regression
 Simple Linear Regression: When there is only one independent variable.
 Multiple Linear Regression: When there are multiple independent variables.
Simple Linear Regression
 The simple linear regression equation is represented as:
Y = a + bX
where Y is the dependent variable, X is the independent variable, a is the intercept, and b is the slope of the regression line.
 Software like Minitab or SPSS can be used to find the regression equation.
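As an illustration of fitting Y = a + bX without Minitab or SPSS, here is a small sketch in Python using scipy.stats.linregress (the data values are made up purely for the example):

```python
from scipy import stats

# Hypothetical data: hours studied (X) and exam score (Y)
hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 72, 79, 83]

fit = stats.linregress(hours, scores)

# Y = a + bX, with a = intercept and b = slope
print(f"Regression equation: Y = {fit.intercept:.2f} + {fit.slope:.2f} X")
print(f"Correlation coefficient r = {fit.rvalue:.3f}")
```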
Hypothesis testing and Correlation/Regression Analysis
• Assume the null hypothesis that there is no correlation
• Find the p-value for the test statistic (the test statistic is calculated from the correlation coefficient r)
• Compare the p-value with α
• Reject, or do not reject, the null hypothesis
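In Python (an assumption, as above), scipy.stats.pearsonr returns both r and the p-value for exactly this test of the null hypothesis of no correlation; a sketch using the same hypothetical study-hours data as in the regression example:

```python
from scipy import stats

hours  = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 72, 79, 83]

r, p_value = stats.pearsonr(hours, scores)

alpha = 0.05
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: there is evidence of a correlation.")
else:
    print("Do not reject H0: no evidence of a correlation.")
```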
In this week’s Workshop
• We will look at some hypothesis testing tools
in Minitab.
