Part 2 Exploring Relationships Among Variables

This document discusses scatter plots and correlation analysis. It defines scatter plots and their use in analyzing relationships between two quantitative variables. Characteristics of scatter plots like linearity, slope, and strength are described. Correlation coefficients are introduced as a measure of both the direction and strength of association between two variables. The ranges and interpretations of correlation coefficients are explained. Examples are provided to illustrate positive and negative correlation as well as the effect of outliers on correlation.


6/1/2020

Ho Chi Minh City University of Technology – Bach Khoa

Applied Statistics in Construction Management
Lecturer: Nguyen Hoai Nghia (Jack), Ph.D.
Email: [email protected],vn
[email protected]

Content
 Part 2: Exploring relationships between variables
 2.1 Scatter plots
 2.2 Correlation
 2.3 Linear regression
 2.4 Multiple regression

2.1 Scatter plots
2.1.1 Definition
 A scatterplot is a graphic tool used to display the relationship between two quantitative variables.
 A scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots.
 Each dot on the scatterplot represents one observation from a data set.
2.1 Scatter plots
2.1.2 Characteristics
 Scatterplots are used to analyze patterns in bivariate data. These patterns are described in terms of linearity, slope, and strength.
 Linearity refers to whether a data pattern is linear (straight) or nonlinear (curved).
 Slope refers to the direction of change in variable Y as variable X gets bigger  positive or negative.
 Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the relationship between the variables is weak. If the dots are concentrated around a line, the relationship is strong.

2.2 Correlation
2.2.1 Relationships
 Consider data collected on students' Height (in inches) and Weight (in pounds)  there is a positive association between the two.
 The form of the scatterplot is fairly straight as well.

2.2.2 Strength of correlation
 The green points in the upper right and lower left quadrants are consistent with a positive association, while the red points in the other two quadrants are consistent with a negative association.
2.2 Correlation
2.2.3 Correlation coefficients
 Correlation coefficients measure both the direction and strength of association between two variables.

 Population:

R = (1/N) · ∑ [ (X − µX) / σX ] · [ (Y − µY) / σY ]

 Sample:

r = ∑(x − x̄)(y − ȳ) / √[ ∑(x − x̄)² · ∑(y − ȳ)² ]
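The sample formula can be checked with a short computation. Below is a minimal sketch in Python; the height and weight values are made up for illustration and are not from the lecture's data set.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation from the definition:
    r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical height (inches) and weight (pounds) data for illustration.
heights = [61, 64, 66, 68, 70, 72]
weights = [105, 120, 128, 140, 152, 165]
print(round(sample_correlation(heights, weights), 3))
```

The value comes out close to 1, matching the fairly straight, positive scatterplot described in 2.2.1.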

2.2 Correlation
2.2.3 Correlation coefficients (cont.)
 A correlation coefficient ranges from -1 to 1.
 The greater the absolute value of a correlation coefficient, the stronger the linear relationship.
 The strongest linear relationship is indicated by a correlation coefficient of -1 or 1, when the data points fall exactly on a straight line.
 The weakest linear relationship is indicated by a correlation coefficient equal to 0.
 A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
 A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.
 The correlation becomes weaker as the data points become more scattered.
 If the data points fall in a random pattern, the correlation is equal to zero.
 Correlation is sensitive to outliers. Compare the first scatterplot with the last scatterplot: the single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).
 Correlation has no units.
 Correlation is not affected by changes in the center or scale of either variable. Changing the units or baseline of either variable has no effect on the correlation coefficient.
 Correlation depends only on the z-scores.
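Two of these properties are easy to verify numerically: correlation is unchanged by shifts or positive rescalings of either variable, and a single outlier can sharply reduce it. A small Python sketch with made-up data:

```python
import math

def corr(xs, ys):
    """Sample correlation computed directly from the definition."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = corr(x, y)
# Changing units (e.g. inches -> centimeters) or baseline leaves r unchanged.
r_rescaled = corr([2.54 * xi + 100 for xi in x], y)

# A single outlier, here the point (6, 0.0), can sharply reduce the correlation.
r_outlier = corr(x + [6], y + [0.0])

print(r, r_rescaled, r_outlier)
```

Here r and r_rescaled agree to floating-point precision, while r_outlier drops far below the original value.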

2.2 Correlation
2.2.3 Correlation coefficients (cont.)
Question:
 Is it OK to say that a correlation of 0 means there is no relationship between two variables?

Question: A national consumer magazine reported the following correlations.
 The correlation between car weight and car reliability is -0.30.
 The correlation between car weight and annual maintenance cost is 0.20.
 Which of the following statements are true?
I. Heavier cars tend to be less reliable.
II. Heavier cars tend to cost more to maintain.
III. Car weight is related more strongly to reliability than to maintenance cost.
2.2 Correlation
2.2.4 Conditions for Correlation
 Quantitative Variables Condition  correlation applies only to quantitative variables.
 Straight Enough Condition  look at the scatterplot to see whether it looks reasonably straight. That's a judgment call, but not a difficult one.
 No Outliers Condition  outliers can distort the correlation dramatically, making a weak association look strong or a strong one look weak. Outliers can even change the sign of the correlation. But it is easy to see outliers in the scatterplot.

2.3 Linear regression
2.3.1 Definition
 In a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.
 Least squares linear regression is a method for predicting the value of a dependent variable Y based on the value of an independent variable X.

Y = Β0 + Β1X

2.3 Linear regression
2.3.2 Conditions to apply simple linear regression
 The dependent variable Y has a linear relationship to the independent variable X  make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
 For each value of X, the probability distribution of Y has the same standard deviation σ. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
 For any given value of X:
 The Y values are independent, as indicated by a random pattern on the residual plot.
 The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is OK if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
2.3 Linear regression
2.3.3 Least Squares Regression Line
 Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents the observations in a bivariate data set.
 Population: Y = Β0 + Β1X
 Sample: ŷ = b0 + b1x

2.3.3 Least Squares Regression Line (cont.)
 Normally, you will use a computational tool - a software package or a graphing calculator - to find b0 and b1. You enter the X and Y values into your program or calculator, and the tool solves for each parameter.
 Or you can calculate the values of b0 and b1 manually using the equations:

b1 = r · (sy / sx)
b0 = ȳ − b1 · x̄
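The manual equations can be applied directly. The sketch below, in Python with made-up data lying exactly on the line y = 1 + 2x, recovers b1 = 2 and b0 = 1:

```python
import math

def lsrl(xs, ys):
    """Fit yhat = b0 + b1*x using b1 = r * (sy / sx) and b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    sx = math.sqrt(sxx / (n - 1))  # sample standard deviation of X
    sy = math.sqrt(syy / (n - 1))  # sample standard deviation of Y
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x.
b0, b1 = lsrl([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```

Note that b1 = r·(sy/sx) is algebraically the same as ∑(x − x̄)(y − ȳ) / ∑(x − x̄)², so either form gives the same slope.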

2.3 Linear regression
2.3.4 Properties of the Regression Line
 The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
 The regression line passes through the mean of the X values (x̄) and through the mean of the Y values (ȳ).
 The regression constant (b0) is equal to the y-intercept of the regression line.
 The difference between an observed value and its predicted value is called its residual. The residual tells how well the model predicted the observed value at that point.
2.3 Linear regression
2.3.5 Coefficient of determination
 The coefficient of determination (denoted by R2) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
 The coefficient of determination ranges from 0 to 1.
 An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
 An R2 of 1 means the dependent variable can be predicted without error from the independent variable.
 An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so on.
 If you know the linear correlation (r) between two variables, then the coefficient of determination (R2) is easily computed using the formula R2 = r2.
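The identity R2 = r2 can be checked by computing R2 a second way, as 1 − SSE/SST from the fitted least squares line. A Python sketch with made-up, roughly linear data:

```python
import math

def r_squared(xs, ys):
    """Compute R-squared two ways for simple linear regression:
    as r**2, and as 1 - SSE/SST from the fitted least squares line."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    b1 = sxy / sxx              # least squares slope
    b0 = ybar - b1 * xbar       # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return r ** 2, 1 - sse / syy

# Made-up data: mostly linear with a little noise.
rsq_from_r, rsq_from_sse = r_squared([1, 2, 3, 4, 5], [2.0, 4.1, 5.8, 8.2, 9.9])
print(rsq_from_r, rsq_from_sse)
```

Both computations agree to floating-point precision, which is exactly the R2 = r2 identity on the slide.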

2.3 Linear regression
2.3.6 Standard error
 The standard error about the regression line (often denoted by SE) is a measure of the average amount by which the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error, and the more accurate the predictions are likely to be.
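The slide does not give a formula for SE; a common definition (an assumption here, not stated in the lecture) is SE = √(SSE / (n − 2)), i.e. the typical size of a residual. A Python sketch with made-up data:

```python
import math

def regression_se(xs, ys):
    """Standard error about the regression line, using the common
    definition SE = sqrt(SSE / (n - 2)). The slide gives no formula,
    so this particular definition is an assumption."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # least squares slope
    b0 = ybar - b1 * xbar       # least squares intercept
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))

# Points exactly on a line give SE = 0; noisier data give a larger SE.
print(regression_se([1, 2, 3, 4], [3, 5, 7, 9]))
print(regression_se([1, 2, 3, 4], [3.2, 4.6, 7.5, 8.9]))
```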
2.3 Linear regression
 All statistics packages make a table of results for a regression.

2.4 Multiple regression
2.4.1 Multiple regression
 A regression with two or more predictor variables is called a multiple regression:

ŷ = b0 + b1x1 + ... + bkxk

 We then find the residuals as

e = y − ŷ
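With more than one predictor, the coefficients are usually found by a least squares solve over a design matrix. A minimal sketch, assuming NumPy is available; the data are made up and generated without noise, so the known coefficients are recovered exactly:

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise),
# so the fitted coefficients should recover b0 = 1, b1 = 2, b2 = 3.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1 + 2 * x1 + 3 * x2

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solve for [b0, b1, b2] in yhat = b0 + b1*x1 + b2*x2.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals e = y - yhat.
residuals = y - X @ coef
print(coef, residuals)
```

In practice a statistics package reports these coefficients (plus standard errors and R2) in its regression results table, as noted above.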
