Session 19&20

The document provides an overview of regression analysis, focusing on its objectives to understand relationships between variables and make predictions. It discusses types of data, including cross-sectional and time series, and explains concepts such as dependent and independent variables, scatterplots, outliers, and the importance of linear and nonlinear relationships. Additionally, it covers multiple regression, dummy variables, interaction variables, and the use of transformations to improve model fits.


Regression

Introduction
• The study of relationships between variables.
• There are two potential objectives:
  ◦ to understand how the world operates (present), and
  ◦ to make predictions (future).
• Two basic types of data are analyzed:
  ◦ Cross-sectional data are usually gathered from approximately the same period of time from a population.
  ◦ Time series data involve one or more variables that are observed at several, usually equally spaced, points in time.
• Time series variables are usually related to their own past values (a property called autocorrelation), which adds complications to the analysis.
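As a quick illustration, here is a minimal Python sketch of measuring lag-1 autocorrelation; the series is simulated, not taken from any data set in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a series whose values depend on their own past (an AR(1)-style process).
y = np.zeros(100)
for t in range(1, 100):
    y[t] = 0.8 * y[t - 1] + rng.normal()

# Lag-1 autocorrelation: correlate the series with itself shifted by one period.
lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print(f"lag-1 autocorrelation: {lag1:.3f}")  # close to 0.8 for this process
```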
Introduction
• The single variable that we are trying to explain or predict is called the dependent variable (DV).
  ◦ It is also called the response variable or the target variable.
• To help explain or predict the DV, we use one or more explanatory variables.
  ◦ They are also called independent or predictor variables (IVs).
• If there is a single IV, the analysis is called simple regression.
• If there are several IVs, it is called multiple regression.
• Regression can be linear (straight-line relationships) or nonlinear (curved relationships).
  ◦ Many nonlinear relationships can be linearized mathematically (transformation).
From Descriptive Stats.
• Drawing scatterplots is a good way to begin regression analysis.
• A scatterplot is a graphical plot of two variables.
• If there is any relationship between the two variables, it is usually apparent from the scatterplot.
Example: Sales versus Promotions at Pharmex
• Objective: To use a scatterplot to examine the relationship between promotional expenditures and sales at Pharmex.
• Solution: Pharmex has collected data from 50 randomly selected metropolitan regions.
• There are two variables: Pharmex's promotional expenditures as a percentage of those of the leading competitor ("Promote") and Pharmex's sales as a percentage of those of the leading competitor ("Sales").
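Below is a minimal Python sketch of this first step; the Promote and Sales values are simulated stand-ins for the 50 regions, not the actual Pharmex data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Simulated stand-ins for the Pharmex variables (50 metropolitan regions).
promote = rng.uniform(70, 120, 50)                  # promotional spending index
sales = 25 + 0.76 * promote + rng.normal(0, 7, 50)  # sales index with noise

plt.scatter(promote, sales)
plt.xlabel("Promote")
plt.ylabel("Sales")
plt.title("Sales versus Promote")
plt.show()
```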
Example 2: Explaining Overhead Costs at Bendrix
• Objective: To use scatterplots to examine the relationships among overhead, machine hours, and production runs at Bendrix.
• Solution: The data file contains observations of overhead costs, machine hours, and number of production runs at Bendrix.
• Each observation (row) corresponds to a single month.
Example 2: Explaining Overhead Costs at Bendrix
• Examine scatterplots between each explanatory variable (Machine Hours and Production Runs) and the dependent variable (Overhead).
Example 2: Explaining Overhead Costs at Bendrix
• Check for possible time series patterns by creating a time series graph for any of the variables.
• Check for relationships among the multiple explanatory variables (Machine Hours versus Production Runs).
Linear versus Nonlinear Relationships
• Scatterplots are useful for detecting relationships that are not obvious otherwise.
• What we hope to see is a linear relationship.
• This doesn't mean that all points lie on a straight line, but that the points tend to cluster around a straight line.
• This scatterplot illustrates a nonlinear relationship.
Outliers
• Scatterplots are especially useful for identifying outliers.
• If an outlier is clearly not a member of the population of interest, then delete it from the analysis.
• If it isn't clear whether outliers are members of the relevant population, run the regression analysis with them and again without them.
  ◦ If the results are practically the same in both cases, then it is probably best to report the results with the outliers included.
  ◦ Otherwise, you can report both sets of results with an explanation of the outliers.
Outliers
• In the figure below, the outlier (the point at the top right) is the company CEO, whose salary is well above that of all the other employees.
Unequal Variance
• Occasionally, the variance of the DV depends on the value of the explanatory variable (IV).
• Unequal variance is a violation of the linear regression assumptions, but there are ways to deal with it:
  ◦ robust regression,
  ◦ weighted least squares regression, or
  ◦ a log or square root transformation of Y.
No Relationship
Another concept from Descriptive Stat.
• Correlations are numerical summary measures that indicate the strength of linear relationships between pairs of variables.
• A correlation summarizes the information in a scatterplot.
• It applies to linear relationships only.
• The usual notation for the correlation between variables X and Y is r_xy.
Correlations
• By looking at the sign of the covariance or correlation (plus or minus), you can tell whether the two variables are positively or negatively related.
• Unlike covariances, correlations are completely unaffected by the units of measurement.
• A correlation equal to 0 or near 0 indicates practically no linear relationship.
• A correlation with magnitude close to 1 indicates a strong linear relationship.
• A correlation equal to −1 (negative correlation) or +1 (positive correlation) occurs only when the linear relationship between the two variables is perfect.
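A minimal Python sketch of these properties, using made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# r_xy close to +1 indicates a strong positive linear relationship.
r_xy = np.corrcoef(x, y)[0, 1]
print(f"r_xy = {r_xy:.3f}")

# Correlations are unaffected by units: rescaling x leaves r_xy unchanged.
print(np.corrcoef(x * 1000.0, y)[0, 1])
```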
Simple Linear Regression
• Scatterplots and correlations indicate linear relationships and the strengths of these relationships, but they do not quantify them.
• Simple linear regression quantifies the relationship when there is a single explanatory variable.
• A straight line is fitted through the points in the scatterplot.
Simple Linear Regression
Least Squares Estimation
• The residual is the actual (observed) value minus the fitted (predicted) value.
• Fundamental Equation for Regression:
Observed Value = Fitted Value + Residual
• The best-fitting line through the points of a scatterplot is the line with the smallest sum of squared residuals.
• This is called the least squares line.
Least Squares Estimation
• The least squares line is specified by its slope and intercept.
• Equation for Slope in Simple Linear Regression:
b = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²
• Equation for Intercept in Simple Linear Regression:
a = Ȳ − b X̄
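These formulas translate directly into code; here is a minimal Python sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared X-deviations.
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: the least squares line passes through the point of means.
a = y_bar - b * x_bar
print(f"Predicted Y = {a:.3f} + {b:.3f} X")
```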


Example 1: Sales versus Promotions at Pharmex (Continued)
• The equation for the least squares line is:
Predicted Sales = 25.1264 + 0.7623 Promote
Example 2: Explaining Overhead Costs at Bendrix (contd…)
• The regression output for Overhead with Machine Hours as the single explanatory variable is shown below.
Example 2: Explaining Overhead Costs at Bendrix (contd…)
• The output when Production Runs is the only explanatory variable is shown below.
• The two least squares lines are therefore:
Predicted Overhead = 48621 + 34.7 MachineHours
Predicted Overhead = 75606 + … ProductionRuns
Standard Error of Estimate
• The magnitude of the residuals provides a good indication of how useful the regression line is for predicting Y values from X values.
• Because there are numerous residuals, it is useful to summarize them with a single numerical measure.
• This measure is called the standard error of estimate and is denoted se.
• It is essentially the standard deviation of the residuals, and is given by this equation:
se = √( Σei² / (n − 2) )
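A minimal Python sketch of this calculation, continuing the least squares example above (made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)                      # observed minus fitted
n = len(y)
s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # n - 2 degrees of freedom
print(f"standard error of estimate: {s_e:.3f}")
```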
Standard Error of Estimate
• The usual empirical rules can be applied to the standard error of estimate.
• About two-thirds of the fitted Ŷ values are typically within one standard error of the actual Y values. About 95% are within two standard errors.
• In general, the standard error of estimate indicates the level of accuracy of predictions made from the regression equation.
• The smaller it is, the more accurate predictions tend to be.
• One measure of comparison is the standard deviation of the dependent variable itself.
The Percentage of Variation Explained: R-Square
• R² is an important measure of the goodness of fit of the least squares line.
• It is the percentage of variation of the dependent variable explained by the regression (the IVs).
• It always ranges between 0 and 1.
• The better the linear fit is, the closer R² is to 1.
• In simple linear regression, R² is the square of the correlation between the dependent variable and the explanatory variable.
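A minimal Python sketch showing both views of R-square on made-up data: the percentage of variation explained, and the squared correlation.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
fitted = a + b * x

sse = np.sum((y - fitted) ** 2)    # unexplained variation
sst = np.sum((y - y.mean()) ** 2)  # total variation of Y
r_square = 1 - sse / sst

# In simple regression, R-square equals the squared correlation.
print(r_square, np.corrcoef(x, y)[0, 1] ** 2)
```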
Regression results for Sales

Regression summary
Multiple R    R-square    Adj R-square    Std Error
0.673         0.453       0.442           7.395

ANOVA table
              df    SS            MS            F        p-value
Regression     1    2172.880392   2172.880392   39.737   0.000
Error         48    2624.739608     54.68207516
Total         49    4797.62

Regression equation (95% confidence intervals)
            Coefficient   Std Error   t-stat   p-value   Lower    Upper
Intercept   25.126        11.883      2.115    0.040     1.235    49.018
Promote      0.762         0.121      6.304    0.000     0.519     1.005
Multiple Regression
• To obtain improved fits in regression, several explanatory variables can be included. This is the realm of multiple regression.
• Graphically, you are no longer fitting a line to a set of points. If there are two explanatory variables, you are fitting a plane to the data in three-dimensional space.
• The regression equation is still estimated by the least squares method, but it is not practical to do this by hand.
• There is a slope term for each explanatory variable in the equation, but the interpretation of these terms is different.
• The standard error of estimate and R² summary measures are almost exactly as in simple regression.
Interpretation of Regression Coefficients
• If Y is the DV, and X1 through Xk are the IVs, then a typical multiple regression equation has the form shown below, where a is the Y-intercept, and b1 through bk are the slopes.
• General Multiple Regression Equation:
Y = a + b1X1 + b2X2 + … + bkXk
• Collectively, a and the bs in the equation are called the regression coefficients.
• Each slope coefficient is the expected change in Y when this particular X increases by one unit and the other Xs in the equation remain constant.
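A minimal Python sketch of least squares estimation with two IVs; the data are simulated in the spirit of the Bendrix example, not the actual figures.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 36
machine_hours = rng.uniform(1000, 2000, n)
production_runs = rng.uniform(20, 60, n)
overhead = 4000 + 43.5 * machine_hours + 880.0 * production_runs + rng.normal(0, 5000, n)

# Design matrix: a column of ones for the intercept, then one column per IV.
X = np.column_stack([np.ones(n), machine_hours, production_runs])
a, b1, b2 = np.linalg.lstsq(X, overhead, rcond=None)[0]
print(f"Predicted Overhead = {a:.0f} + {b1:.2f} MachineHours + {b2:.2f} ProductionRuns")
```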
Example 2: Explaining Overhead Costs at Bendrix
• The coefficients in the output below indicate that the estimated regression equation is:
Predicted Overhead = 3997 + 43.54 MachineHours + 883.62 ProductionRuns
Interpretation of Standard Error of Estimate and R-Square
• The standard error of estimate is essentially the standard deviation of the residuals, but it is now given by the equation below, where n is the number of observations and k is the number of IVs:
se = √( Σei² / (n − k − 1) )
Interpretation of Standard Error of Estimate and R-Square
• The R² value is again the percentage of variation of the DV explained by the combined set of IVs.
• But it has a serious drawback: it can only increase when extra explanatory variables are added to an equation.
• Adjusted R² is an alternative measure that adjusts R² for the number of explanatory variables in the equation.
• It is used primarily to monitor whether extra explanatory variables really belong in the equation: it penalizes IVs that add little explanatory power, whether or not they are significant.
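A minimal Python sketch of the adjusted R-square formula; the values of n, k, and R-square here are purely illustrative.

```python
# n observations, k explanatory variables, and an unadjusted R-square.
n, k = 36, 2
r_square = 0.93

# Adjusted R-square penalizes extra IVs via the degrees of freedom.
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
print(f"adjusted R-square: {adj_r_square:.3f}")
```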
Modeling Possibilities
• Several types of explanatory variables can be included in regression equations:
  ◦ dummy variables,
  ◦ interaction variables, and
  ◦ nonlinear transformations.
• These techniques can produce much better fits than you could obtain without them.
Dummy Variables
• Some potential explanatory variables are categorical and cannot be measured on a quantitative scale.
• However, these categorical variables are often related to the dependent variable, so they need to be included.
• The trick is to use dummy variables.
• A dummy variable is a variable with possible values of 0 and 1.
• It equals 1 if the observation is in a particular category, and 0 if it is not.
• Two situations arise (see the sketch after this list):
  ◦ when there are only two categories (example: gender), and
  ◦ when there are more than two categories (example: quarters).
• In the latter case, multiple dummy variables must be created.
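As promised above, a minimal Python sketch of dummy coding a four-category variable (quarters) with pandas; the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q3"]})

# drop_first=True keeps Q2, Q3, Q4 dummies and treats Q1 as the reference
# category: one fewer dummy than the number of categories.
dummies = pd.get_dummies(df["Quarter"], prefix="Quarter", drop_first=True, dtype=int)
print(pd.concat([df, dummies], axis=1))
```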
Example 3: Possible Gender Discrimination in Bank Salaries
• Objective: To analyze whether the bank discriminates against females in terms of salary.
• Solution: The data set includes the following variables for each of the 208 employees: Education (categorical), Grade (categorical), Years1 (years with this bank), Years2 (years of previous work experience), Age, Gender (categorical with two values), PCJob (categorical yes/no), and Salary.
Example 3: Possible Gender Discrimination in Bank Salaries
• Create dummy variables for the various categorical variables, using IF functions.
• Then run a regression analysis with Salary as the dependent variable, using any combination of numerical and dummy explanatory variables.
• Always use one fewer dummy than the number of categories for any categorical variable.
Example 3: Possible Gender Discrimination in Bank Salaries
• The regression output with all variables included is shown below.
Interaction Variables
• An interaction variable is the product of two explanatory variables.
• Include an interaction variable in a regression equation if you believe the effect of one explanatory variable on Y depends on the value of another explanatory variable.
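A minimal Python sketch of building an interaction variable from a dummy (Female) and a numeric IV (YearsExp); both columns are made up for illustration.

```python
import numpy as np

years_exp = np.array([1.0, 3.0, 5.0, 2.0, 8.0])
female = np.array([1, 0, 1, 0, 1])  # dummy: 1 = female, 0 = male

# The interaction variable is simply the product of the two IVs; its
# coefficient lets the slope on YearsExp differ between the two groups.
interaction = female * years_exp

X = np.column_stack([np.ones(len(female)), years_exp, female, interaction])
print(X)
```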
Example 3: Possible Gender Discrimination in Bank Salaries
• The multiple regression output with the interaction variable included appears below.
Example 3: Possible Gender Discrimination in Bank Salaries
• The regression equations for Female and Male are shown graphically below.
Nonlinear Transformations
• The general linear regression equation has the form:
Y = a + b1X1 + b2X2 + … + bkXk
• It is linear in the sense that the right side of the equation is a constant plus a sum of products of constants and variables.
• The variables can be transformations of the original variables.
• Nonlinear transformations of variables are often used because of curvature detected in scatterplots.
• You can transform the DV Y, any of the IVs (the Xs), or both.
• Typical nonlinear transformations include the natural logarithm, the square root, the reciprocal, and the square.
Example 4: Demand versus Cost for Electricity
• Solution: The data set lists the number of units of electricity produced (Units) and the total cost of producing them (Cost) for a 36-month period.
• Start with a scatterplot of Cost versus Units.
Regression summary
Multiple R    R-square    Adj R-square    Std Error
0.858         0.736       0.728           2733.742

ANOVA table
              df    SS             MS             F        p-value
Regression     1    708085273.8    708085273.8    94.748   0.000
Error         34    254093815.2      7473347.506
Total         35    962179089

Regression equation (95% confidence intervals)
            Coefficient   Std Error   t-stat   p-value   Lower       Upper
Intercept   23651.489     1917.137    12.337   0.000     19755.398   27547.580
Units          30.533        3.137     9.734   0.000        24.158      36.908
Example 4: Demand versus Cost for Electricity
• Next, request a scatterplot of the residuals versus the fitted values.
• The negative-positive-negative behavior of the residuals suggests a parabola, that is, a quadratic relationship with the square of Units included in the equation.
• Create a new variable, Units2 (the square of Units), in the data set and then use multiple regression to estimate the equation for Cost with both explanatory variables, Units and Units2.
Example 4: Demand versus Cost for Electricity
• Use the Trendline option in Excel® to superimpose a quadratic curve on the scatterplot. This curve is shown below, on the left.
• Finally, try a logarithmic fit by creating a new variable, Log(Units), and then regressing Cost against this variable. This curve is shown below, on the right.
• Logarithmic transformations of variables are used widely in regression analysis because they have a meaningful interpretation.
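A minimal Python sketch of the two fits described above, on simulated cost data rather than the actual electricity figures:

```python
import numpy as np

rng = np.random.default_rng(3)
units = rng.uniform(200, 800, 36)
cost = 23000 + 35 * units - 0.01 * units ** 2 + rng.normal(0, 2500, 36)

# Quadratic fit: Cost on Units and Units squared.
quad_coeffs = np.polyfit(units, cost, deg=2)

# Logarithmic fit: Cost on the natural log of Units.
log_coeffs = np.polyfit(np.log(units), cost, deg=1)
print(quad_coeffs)
print(log_coeffs)
```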
Q&A
