Session 2 Inferential Statistics Slides
📧 Email: [email protected]
🕒 Office Hours:
Overview
Central Tendency
• Mean: the arithmetic average, x̄ = (Σ xᵢ)/n
• Median: the midpoint of the ranked values
• Mode: the most frequently observed value (if one exists)
Example
• Data: 3, 7, 8, 5, 12, 7, 7, 8
• Mean (Average) = (3 + 7 + 8 + 5 + 12 + 7 + 7 + 8)/8 = 57/8 = 7.125
Arithmetic Mean
• For a population of N values:
  μ = Σᵢ₌₁ᴺ xᵢ / N = (x₁ + x₂ + … + x_N)/N, where N = population size
• For a sample of size n:
  x̄ = Σᵢ₌₁ⁿ xᵢ / n = (x₁ + x₂ + … + xₙ)/n, where n = sample size
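As a quick check of the example above, the sample mean can be computed with Python's standard library (an illustrative sketch, not part of the original slides):

```python
import statistics

# Example data from the slide
data = [3, 7, 8, 5, 12, 7, 7, 8]

# Sample mean: sum of the observed values divided by the sample size n
mean = sum(data) / len(data)
print(mean)                              # 7.125
assert mean == statistics.mean(data)     # same result from the stdlib
```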
Arithmetic Mean
(continued)
• The mean is affected by extreme values (outliers):
  (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
  (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Median
• In an ordered list, the median is the middle value; it is not affected by extreme values:
  1, 2, 3, 4, 5 → Median = 3
  1, 2, 3, 4, 10 → Median = 3
• The position of the median in the ranked data is (n + 1)/2. Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data.
Mode
• The most frequently observed value; there may be no mode, or there may be several modes
• Example: one data set has Mode = 9; the data set 0, 1, 2, 3, 4, 5, 6 has no mode
Review Example
• Five house prices:
  $2,000,000
  $500,000
  $300,000
  $100,000
  $100,000
Review Example: Summary Statistics
House prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum = $3,000,000)
• Mean: $3,000,000/5 = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
Which measure of location is the “best”?
• The mean is generally used, unless extreme values (outliers) exist; then the median is often preferred, since it is less sensitive to outliers (e.g., the $600,000 mean vs. the $300,000 median above)
Geometric mean rate of return:
  r_g = (x₁ × x₂ × … × xₙ)^(1/n) − 1
Example: r_g = [(50)(20)]^(1/2) − 1 = (1000)^(1/2) − 1 = 31.623 − 1 = 30.623% (accurate result)
Percentiles and Quartiles
• Quartiles split the ranked data into four segments with an equal number of values per segment (Q1, Q2, Q3)
• The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
• Q2 is the same as the median (50% are smaller, 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
Quartile Formulas
• Position of a quartile in the ranked data: Q1 at 0.25(n + 1), Q2 at 0.50(n + 1), Q3 at 0.75(n + 1)
• Example (n = 9): Q1 is in the 0.25(9 + 1) = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd ranked values: Q1 = 12.5
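The 0.25q(n + 1) position rule can be sketched in Python. The nine ranked values below are illustrative only (the slide does not list them); they are chosen so the 2nd and 3rd values are 12 and 13, matching the Q1 = 12.5 example. Note this rule differs from some library defaults (e.g., NumPy's percentile interpolation).

```python
def quartile(ranked, q):
    """Locate quartile q (1, 2, or 3) using the 0.25*q*(n+1) position rule;
    interpolate when the position falls between two ranks."""
    n = len(ranked)
    pos = 0.25 * q * (n + 1)      # 1-based position in the ranked data
    lo = int(pos)                 # rank just below the position
    frac = pos - lo               # fractional part -> interpolation weight
    if frac == 0:
        return ranked[lo - 1]
    return ranked[lo - 1] + frac * (ranked[lo] - ranked[lo - 1])

# Hypothetical ranked sample of n = 9 (not given on the slide)
data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
print(quartile(data, 1))          # position 2.5 -> halfway between 12 and 13 = 12.5
```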
Five-Number Summary
• Minimum, Q1, Median (Q2), Q3, Maximum
2.2 Measures of Variability
Variation
• Measures of variation describe the spread of the data: two data sets can have the same center but different variation
Range
• Range = x_maximum − x_minimum
• Example: data ranging from 1 to 14 → Range = 14 − 1 = 13
Disadvantages of the Range
• Ignores the way in which the data are distributed: two differently distributed data sets over the values 7 to 12 both give Range = 12 − 7 = 5
• Sensitive to outliers:
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 → Range = 5 − 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 → Range = 120 − 1 = 119
Interquartile Range
• IQR = Q3 − Q1: the spread of the middle 50% of the data, which eliminates the influence of outliers
Box-and-Whisker Plot
Example (each interval contains 25% of the data):
  minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, maximum = 70
Population Variance
• Average of squared deviations of values from the mean:
  σ² = Σᵢ₌₁ᴺ (xᵢ − μ)² / N
Where μ = population mean
      N = population size
      xᵢ = iᵗʰ value of the variable x
Sample Variance
• s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
Where x̄ = arithmetic mean
      n = sample size
      xᵢ = iᵗʰ value of the variable x
Example: we have the following sample data points: 5, 7, 10, 12, 15

Step 1: Find the sample mean (x̄)
The mean is the sum of the values divided by the number of data points:
  x̄ = (5 + 7 + 10 + 12 + 15)/5 = 49/5 = 9.8

Step 2: Find the differences from the mean
Subtract the mean (9.8) from each data point:
• 5 − 9.8 = −4.8
• 7 − 9.8 = −2.8
• 10 − 9.8 = 0.2
• 12 − 9.8 = 2.2
• 15 − 9.8 = 5.2

Step 3: Square the differences
• (−4.8)² = 23.04
• (−2.8)² = 7.84
• (0.2)² = 0.04
• (2.2)² = 4.84
• (5.2)² = 27.04

Step 4: Divide the sum of squared differences by n − 1
Sum the squared differences: 23.04 + 7.84 + 0.04 + 4.84 + 27.04 = 62.8
Divide by n − 1 = 5 − 1 = 4:
  Sample variance = 62.8/4 = 15.7
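The four steps above map directly onto a few lines of Python (a sketch for checking the arithmetic):

```python
import statistics

data = [5, 7, 10, 12, 15]

mean = sum(data) / len(data)                     # Step 1: 49/5 = 9.8
sq_diffs = [(x - mean) ** 2 for x in data]       # Steps 2-3: squared deviations
variance = sum(sq_diffs) / (len(data) - 1)       # Step 4: divide by n - 1 = 4

print(round(variance, 4))                        # ≈ 15.7
assert abs(variance - statistics.variance(data)) < 1e-9
```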
Population Standard Deviation
• σ = √[ Σᵢ₌₁ᴺ (xᵢ − μ)² / N ]
Sample Standard Deviation
• s = √[ Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ]
Calculation Example:
Sample Standard Deviation
Sample data (xᵢ): 10, 12, 14, 15, 17, 18, 18, 24
n = 8, mean x̄ = 16
s = √[ ((10 − 16)² + (12 − 16)² + … + (24 − 16)²) / (8 − 1) ] = √(130/7) ≈ 4.3095
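The same calculation in Python (a sketch; `statistics.stdev` applies the n − 1 formula directly):

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]

mean = sum(data) / len(data)                     # 128/8 = 16
ss = sum((x - mean) ** 2 for x in data)          # sum of squared deviations = 130
s = math.sqrt(ss / (len(data) - 1))              # sqrt(130/7) ≈ 4.3095

print(round(s, 4))
assert abs(s - statistics.stdev(data)) < 1e-12   # matches the stdlib result
```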
Comparing Standard Deviations (same mean, different spread)
• Data A: s = 3.338 (compare to the two cases below)
• Data B: s = 0.926 (values are concentrated near the mean)
• Data C: s = 4.570 (values are dispersed far from the mean)
Advantages of Variance and Standard Deviation
• Each value in the data set is used in the calculation
• Values far from the mean are given extra weight, because deviations from the mean are squared
Coefficient of Variation
• Measures variation relative to the mean: CV = (s/x̄) × 100%
• Stock A: average price last year = $50, standard deviation = $5
  CV_A = ($5/$50) × 100% = 10%
• Stock B: average price last year = $100, standard deviation = $5
  CV_B = ($5/$100) × 100% = 5%
• Both stocks have the same standard deviation, but stock B is less variable relative to its price
2.3 Weighted Mean and Measures of Grouped Data
• The weighted mean of a set of data is
  x̄ = Σᵢ₌₁ⁿ wᵢxᵢ / n = (w₁x₁ + w₂x₂ + … + wₙxₙ)/n
• Where wᵢ is the weight of the iᵗʰ observation and n = Σ wᵢ
• Use when data is already grouped into n classes, with wᵢ values in the iᵗʰ class
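A minimal sketch of the weighted mean formula; the scores and integer weights below are illustrative, not from the slides:

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i), per the formula above."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# e.g. three class scores 80, 90, 70 with weights 5, 3, 2
print(weighted_mean([80, 90, 70], [5, 3, 2]))    # (400 + 270 + 140) / 10 = 81.0
```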
Approximations for Grouped Data
• Suppose data are grouped into K classes, with frequency fᵢ and midpoint mᵢ for the iᵗʰ class. The approximate sample mean is
  x̄ = Σᵢ₌₁ᴷ fᵢmᵢ / n, where n = Σᵢ₌₁ᴷ fᵢ
Approximations for Grouped Data
(continued)
• The approximate sample variance is
  s² = Σᵢ₌₁ᴷ fᵢ(mᵢ − x̄)² / (n − 1)
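Both grouped-data approximations in one sketch; the class midpoints and frequencies below are illustrative, not from the slides:

```python
# f_i: number of observations in class i; m_i: midpoint of class i
frequencies = [2, 4, 3, 1]
midpoints = [5, 15, 25, 35]

n = sum(frequencies)                                             # n = sum of f_i
mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n    # approximate sample mean
var = sum(f * (m - mean) ** 2
          for f, m in zip(frequencies, midpoints)) / (n - 1)     # approximate sample variance

print(mean, var)                                                 # 18.0 90.0
```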
2.4 Measures of Relationships Between Variables
• Covariance
  – a measure of the direction of a linear relationship between two variables
• Correlation Coefficient
  – a measure of both the direction and the strength of a linear relationship between two variables
Covariance
• The population covariance:
  Cov(x, y) = σ_xy = Σᵢ₌₁ᴺ (xᵢ − μ_x)(yᵢ − μ_y) / N
• The sample covariance:
  Cov(x, y) = s_xy = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
• Only concerned with the direction of the relationship, not its strength
• No causal effect is implied
Interpreting Covariance
• Cov(x, y) > 0 → x and y tend to move in the same direction
• Cov(x, y) < 0 → x and y tend to move in opposite directions
• Cov(x, y) = 0 → x and y are linearly unrelated
Coefficient of Correlation
• The population correlation coefficient:
  ρ = Cov(x, y) / (σ_X σ_Y)
• Sample correlation coefficient:
  r = Cov(x, y) / (s_X s_Y)
Features of Correlation Coefficient, r
• Unit free
• Ranges between −1 and 1
• The closer to −1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker any linear relationship
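The sample covariance and correlation formulas as a Python sketch; the data below (y = 2x exactly) is illustrative and should give r = 1:

```python
import math

def sample_cov(xs, ys):
    """Sample covariance: sum((x - xbar)(y - ybar)) / (n - 1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Sample correlation r = Cov(x, y) / (s_x * s_y), using Cov(x, x) = s_x^2."""
    sx = math.sqrt(sample_cov(xs, xs))
    sy = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (sx * sy)

# Illustrative data (not from the slides): a perfect positive linear relationship
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
print(round(sample_corr(xs, ys), 6))   # ≈ 1.0
```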
Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots of Y vs. X: r = −1, r = −0.6, r = 0 (top row); r = +1, r = +0.3, r = 0 (bottom row)]
11.1 Overview of Linear Models
• Population linear model: Y = β₀ + β₁X
• Estimated regression line: ŷ = b₀ + b₁x
• Where b₁ is the slope of the line and b₀ is the y-intercept:
  b₁ = Cov(x, y)/s²ₓ = r(s_y/s_x)
  b₀ = ȳ − b₁x̄
Introduction to Regression Analysis
• yᵢ = β₀ + β₁xᵢ + εᵢ
• Where β₀ and β₁ are the population model coefficients and εᵢ is a random error term.
Simple Linear Regression Model
• yᵢ = β₀ + β₁xᵢ + εᵢ
  Linear component: β₀ + β₁xᵢ
  Random error component: εᵢ
Simple Linear Regression Model
(continued)
[Graph: the observed value of y for xᵢ is yᵢ = β₀ + β₁xᵢ + εᵢ; the predicted value of y for xᵢ lies on the line with intercept β₀ and slope β₁; εᵢ is the random error for this xᵢ value]
Least Squares Coefficient Estimators
• The slope coefficient estimator is
  b₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = Cov(x, y)/s²ₓ = r(s_y/s_x)
• And the constant (y-intercept) is
  b₀ = ȳ − b₁x̄
• The regression line always goes through the mean (x̄, ȳ)
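The estimators above can be checked against the Excel output shown later in the deck, using the house-price data from the t-test example slide (a sketch in plain Python):

```python
# House-price data from the example later in the deck:
# y = house price in $1000s, x = square feet
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
xbar, ybar = sum(sqft) / n, sum(price) / n       # 1715.0, 286.5

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(sqft, price))
sxx = sum((x - xbar) ** 2 for x in sqft)

b1 = sxy / sxx                                   # slope ≈ 0.10977
b0 = ybar - b1 * xbar                            # intercept ≈ 98.248

# Coefficient of determination R^2 = SSR/SST, with SSR = b1 * Sxy
sst = sum((y - ybar) ** 2 for y in price)        # 32600.5 (ANOVA "Total")
r2 = (b1 * sxy) / sst                            # ≈ 0.58082

print(round(b1, 5), round(b0, 3), round(r2, 5))
```

The values agree with the Excel regression output (b₀ = 98.24833, b₁ = 0.10977, R² = 0.58082).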
Computer Computation of Regression
Coefficients
Interpretation of the Slope and the Intercept
• b₀ is the estimated mean value of y when the value of x is zero
• b₁ is the estimated change in the mean value of y as a result of a one-unit change in x
Simple Linear Regression Example
Sample Data for House Price Model
Graphical Presentation
[Scatter plot: house price in $1000s (y-axis, 0–350) vs. square feet (x-axis, 0–3000)]
Regression Using Excel
• Data / Data Analysis / Regression
Excel Output
The regression equation is:
  house price = 98.24833 + 0.10977 (square feet)

Regression Statistics: Multiple R = 0.76211; R Square = 0.58082; Adjusted R Square = 0.52842; Standard Error = 41.33032; Observations = 10

ANOVA        df    SS           MS           F        Significance F
Regression    1    18934.9348   18934.9348   11.0848  0.01039
Residual      8    13665.5652   1708.1957
Total         9    32600.5000
Interpretation of the Intercept, b₀
• b₀ = 98.24833 estimates the mean house price (in $1000s) when square feet = 0
• Since no house in the sample had zero square feet, b₀ has no practical interpretation here
Coefficient of Determination, R²
• R² = SSR/SST: the portion of the total variation in y that is explained by variation in x
• r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X
Examples of Approximate r² Values
• 0 < r² < 1: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X
Examples of Approximate r² Values
• r² = 0: no linear relationship between X and Y; the value of Y does not depend linearly on X
Excel Output
R² = SSR/SST = 18934.9348/32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet
(Regression Statistics and ANOVA table as shown above)
Correlation and R²
• In simple linear regression, the coefficient of determination equals the square of the sample correlation coefficient: R² = r²
Estimation of Model Error Variance
• An estimator for the variance of the population model error is
  σ̂² = s²ₑ = SSE/(n − 2) = Σᵢ₌₁ⁿ eᵢ² / (n − 2)
• Division by n − 2 instead of n − 1 is because the simple regression model uses two estimated parameters, b₀ and b₁, instead of one
Excel Output
sₑ = 41.33032 (reported as “Standard Error” in the Regression Statistics)
(Regression Statistics and ANOVA table as shown above)
Comparing Standard Errors
[Two scatter plots: small sₑ (points tightly clustered around the regression line) vs. large sₑ (points widely scattered around the line)]
Estimated regression equation: house price = 98.25 + 0.1098 (sq.ft.)

House Price in $1000s (y) | Square Feet (x)
245 | 1400
312 | 1600
279 | 1700
308 | 1875
199 | 1100
219 | 1550
405 | 2350
324 | 2450
319 | 1425
255 | 1700

The slope of this model is 0.1098. Does square footage of the house significantly affect its sales price?
Inferences about the Slope: t Test Example
H₀: β₁ = 0    H₁: β₁ ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

t = (b₁ − β₁)/s_b₁ = (0.10977 − 0)/0.03297 = 3.32938
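The t statistic can be reproduced from the raw data (a sketch; s_b₁ = sₑ/√Sxx is the standard formula for the slope's standard error, not shown explicitly on the slide):

```python
import math

# Same house-price data as the example above
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
xbar, ybar = sum(sqft) / n, sum(price) / n
sxx = sum((x - xbar) ** 2 for x in sqft)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(sqft, price)) / sxx
b0 = ybar - b1 * xbar

# Residual standard error s_e = sqrt(SSE / (n - 2))
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, price))
se = math.sqrt(sse / (n - 2))                    # ≈ 41.330 (Excel "Standard Error")

sb1 = se / math.sqrt(sxx)                        # ≈ 0.03297
t = (b1 - 0) / sb1                               # ≈ 3.32938, as in the Excel output

print(round(t, 4))
```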
11.8 Beta Measure of Financial Risk
• A stock's beta is the slope coefficient from regressing the stock's rate of return on the rate of return of the overall market; it measures the stock's sensitivity to market movements
Beta Coefficient Example
[Regression output; the summary statistics provide information about the quality of the regression model that provides the estimate of beta]
Questions and Answers