Evans - Analytics2e - PPT - 07 and 08

Uploaded by

MY LE HO TIEU
Chapter 7: Statistical Inference
Chapter 8: Trendlines and Regression Analysis
Statistical Inference
• Statistical inference focuses on drawing conclusions about populations from samples.
◦ Statistical inference includes estimation of population parameters and hypothesis testing, which involves drawing conclusions about the value of the parameters of one or more populations.
Hypothesis Testing
 Hypothesis testing involves drawing inferences about
two contrasting propositions (each called a hypothesis)
relating to the value of one or more population
parameters.
 H0: Null hypothesis: describes an existing theory
 H1: Alternative hypothesis: the complement of H0
 Using sample data, we either:
- reject H0 and conclude the sample data provides
sufficient evidence to support H1, or
- fail to reject H0 and conclude the sample data
does not support H1.
Example 7.1: A Legal Analogy for Hypothesis
Testing
 In the U.S. legal system, a defendant is innocent
until proven guilty.
◦ H0: Innocent
◦ H1: Guilty
 If evidence (sample data) strongly indicates the
defendant is guilty, then we reject H0.
 Note that we have not proven guilt or innocence!
Hypothesis Testing Procedure
Steps in conducting a hypothesis test:
1. Identify the population parameter and formulate the hypotheses to test.
2. Select a level of significance, α (the risk of rejecting H0 when it is actually true).
3. Determine the decision rule on which to base a conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule and draw a conclusion.
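The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a two-sided one-sample z-test with a known population standard deviation, and the sample values, `mu0`, and `sigma` are hypothetical numbers, not data from the slides.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known."""
    n = len(sample)
    # Step 3: decision rule -> reject H0 if |z| exceeds the critical value.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 4: compute the test statistic from the sample data.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 5: apply the decision rule.
    return z, z_crit, abs(z) > z_crit


# Steps 1 and 2: H0: mu = 50 versus H1: mu != 50, with alpha = 0.05.
sample = [52.1, 49.8, 53.4, 51.2, 50.9, 52.7, 51.5, 53.0]  # hypothetical data
z, z_crit, reject = one_sample_z_test(sample, mu0=50.0, sigma=2.0)
print(f"z = {z:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

With these hypothetical numbers the test statistic exceeds the critical value, so H0 would be rejected at the 5% level.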
p-Values
 A p-value (observed significance level) is the
probability of obtaining a test statistic value
equal to or more extreme than that obtained
from the sample data when the null hypothesis
is true.
 If p < .10 → “marginally significant”
 If p < .05 → “significant”
 If p < .01 → “highly significant.”

Reject H0 if the p-value < α
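As a rough sketch of how the p-value rule works, the Python below converts a z statistic into a two-sided p-value and applies the thresholds listed above. The function names are hypothetical, the normal distribution is an assumption (a t-test would use the t distribution instead), and the label "not significant" for p ≥ .10 is added here for completeness.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a statistic at least as extreme as z when H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def describe(p):
    """Map a p-value to the verbal labels used on the slide."""
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "significant"
    if p < 0.10:
        return "marginally significant"
    return "not significant"  # label assumed, not from the slide


def reject_h0(p, alpha=0.05):
    """Decision rule: reject H0 if the p-value is below alpha."""
    return p < alpha


p = two_sided_p_value(2.2)  # hypothetical test statistic
print(f"p = {p:.4f} ({describe(p)}), reject H0: {reject_h0(p)}")
```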


Two independent samples t-test
 Testing the relationship between two independent samples is
very common in market research.
 Some common research questions are:
◦ Do heavy and light users’ satisfaction with a product differ?
◦ Do male customers spend more money online than female customers?
◦ Do US teenagers spend more time on Facebook than Australian teenagers?
• Each of these hypotheses aims at evaluating whether two populations (e.g., heavy and light users), represented by samples, are significantly different in terms of certain key variables (e.g., satisfaction ratings).
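To make the comparison concrete, here is a minimal Python sketch of the test statistic for two independent samples. It assumes Welch's version of the t-test (unequal variances), which may differ from the variant used in the course materials, and the satisfaction ratings are hypothetical. The resulting t and degrees of freedom would then be compared with a t table or the SPSS output to obtain a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


heavy_users = [6, 7, 5, 6, 7, 6]  # hypothetical satisfaction ratings
light_users = [4, 5, 5, 3, 4, 5]  # hypothetical satisfaction ratings
t, df = welch_t(heavy_users, light_users)
print(f"t = {t:.3f}, df = {df:.2f}")
```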
 To understand the principles of a two independent samples t-test
Two independent samples t-test
 Steps and results: see the attached pdf files
Analysis of Variance (ANOVA)
• Used to compare the means of two or more population groups.
• ANOVA derives its name from the fact that we are analyzing variances in the data.
• ANOVA measures variation between groups relative to variation within groups.
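The ratio of between-group to within-group variation can be sketched as follows. This is a minimal one-way ANOVA F statistic in plain Python; the three groups are hypothetical values chosen so the arithmetic is easy to follow, and the p-value would still need to be read from an F table or statistical software.

```python
from statistics import mean


def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means versus the grand mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations versus their group mean.
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ssb / df_between) / (ssw / df_within)
    return f, df_between, df_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
f, df_b, df_w = one_way_f(groups)
print(f"F({df_b}, {df_w}) = {f:.3f}")
```

A large F means the group means differ by more than the spread inside each group would suggest.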
Analysis of Variance (ANOVA)
 In probability theory and statistics, variance is the
expectation of the squared deviation of a random
variable from its population mean or sample mean.
 Variance is a measure of dispersion, meaning it is a
measure of how far a set of numbers is spread out from
their average value.
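A small numeric illustration may help here. The two lists below are hypothetical; they share the same mean but are spread out very differently, which the population variance picks up.

```python
from statistics import mean, pvariance

narrow = [90, 100, 110]  # hypothetical values close to the mean
wide = [50, 100, 150]    # hypothetical values far from the mean

# Both sets have the same mean, but the second is far more dispersed.
print(mean(narrow), pvariance(narrow))
print(mean(wide), pvariance(wide))
```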

Figure: Example of samples from two populations with the same mean but different variances. The red population has mean 100 and variance 100 (SD = 10), while the blue population has mean 100 and variance 2500 (SD = 50).
Assumptions of ANOVA
• The m groups or factor levels being studied represent populations whose outcome measures
1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.
• If these assumptions are violated, then the level of significance and the power of the test can be affected.
Example: Difference in Oddjob Data
 Examine whether customers’ membership status (i.e., status)
relates to their overall price/performance satisfaction (i.e.,
overall_sat) with Oddjob Airways
Example: Difference in Oddjob Data
The model has an F-value of 9.963, which yields a p-value that rounds to 0.00 (less than 0.05), suggesting that at least two of the three groups differ significantly with regard to overall price/performance satisfaction.
Modeling Relationships and Trends in Data
• Create charts to better understand data sets.
• For cross-sectional data, use a scatter chart.
• For time series data, use a line chart.
Scatter chart
Graphs > Chart Builder >
Scatter/Dot > Simple Scatter with Fit Line >
Input X Axis (with Price)
Input Y Axis (with Demand) > Ok
Line chart of historical crude oil prices
Graphs > Chart Builder >
Line > Simple Line >
Input X Axis (with Date)
Input Y Axis (with Price) > Ok
Common Mathematical Functions Used in Predictive Analytical Models
Linear: y = a + bx
• Linear functions show steady increases or decreases over the range of x.
• This is the simplest type of function used in predictive models.
• It is easy to understand and, over small ranges of values, can approximate behavior rather well.
R²
• R² (R-squared) is a measure of the “fit” of the line to the data.
◦ The value of R² will be between 0 and 1.
◦ A value of 1.0 indicates a perfect fit, with all data points lying on the line; the larger the value of R², the better the fit.
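One common way to compute this measure is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition; the function name and the data are hypothetical.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total for observed y and fitted y_hat."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0]  # hypothetical data
print(r_squared(observed, [1.0, 2.0, 3.0]))  # every point lies on the line
print(r_squared(observed, [2.0, 2.0, 2.0]))  # line is just the mean of y
```

The first call illustrates a perfect fit and the second a line that explains none of the variation.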
Regression Analysis
• Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory, variables (ratio or categorical), all of which are expressed numerically.
• Simple linear regression involves a single independent variable.
• Multiple regression involves two or more independent variables.
Simple Linear Regression
• Finds a linear relationship between:
- one independent variable X and
- one dependent variable Y
• First prepare a scatter plot to verify the data has a linear trend.
• Use alternative approaches if the data is not linear.
Example 8.3: Home Market Value Data
Size of a house is typically
related to its market value.
X = square footage
Y = market value ($)
The scatter plot of the full
data set (42 homes)
indicates a linear trend.
Least-Squares Regression
• Simple linear regression model: Y = β0 + β1X + ε
• We estimate the parameters from the sample data: Ŷ = b0 + b1X
• Let Xi be the value of the independent variable of the ith observation. When the value of the independent variable is Xi, then Ŷi = b0 + b1Xi is the estimated value of Y for Xi.
Residuals
• Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line: ei = Yi − Ŷi, where Ŷi = b0 + b1Xi is the estimated value of Y for the ith observation.
Least Squares Regression
 The best-fitting line minimizes the sum of squares of the
residuals.
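For a single independent variable, the line that minimizes the sum of squared residuals has a closed-form solution. The sketch below uses the standard formulas for the slope and intercept; the data are hypothetical and are not the home market value data set.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1


x = [1, 2, 3, 4, 5]  # hypothetical independent variable
y = [2, 4, 5, 4, 5]  # hypothetical dependent variable
b0, b1 = least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
print("sum of squared residuals:", round(sum(e ** 2 for e in residuals), 4))
```

A useful check is that the residuals of a least-squares line with an intercept sum to zero, up to rounding.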
Simple Linear Regression With SPSS
Analyze > Regression > Linear
Input Dependent variable (with Market value)
Input Independent variable (with Square feet)

52% of the variation in home market values can be explained by home size.
Home Market Value Regression Results
Regression Statistics
• Multiple R: |r|, where r is the sample correlation coefficient. The value of r varies from -1 to +1 (r is negative if the slope is negative).
• R Square: the coefficient of determination, R², which varies from 0 (no fit) to 1 (perfect fit).
• Adjusted R Square: adjusts R² for sample size and number of X variables.
• Standard Error: variability between observed and predicted Y values. This is formally called the standard error of the estimate, SYX.
Example 8.7: Interpreting Significance of Regression
H0: slope β1 = 0 (home size is not a significant variable)
H1: slope β1 ≠ 0 (home size is a significant variable)
• p-value = 3.798 × 10^-8
◦ Reject H0. The slope is not equal to zero. Using a linear relationship, home size is a significant variable in explaining variation in market value.
◦ Home size has a positive association with market value.
Multiple Linear Regression
 A linear regression model with more than one
independent variable is called a multiple linear
regression model.
Estimated Multiple Regression Equation
• We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, then use the model: Ŷ = b0 + b1X1 + b2X2 + … + bkXk
• The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.
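The "held constant" interpretation can be checked numerically. In this sketch the intercept and coefficients are hypothetical values, not estimates from any data set in the slides; raising one variable by one unit while leaving the other unchanged moves the prediction by exactly that variable's coefficient.

```python
def predict(intercept, coeffs, x):
    """Evaluate Y-hat = b0 + b1*X1 + ... + bk*Xk."""
    return intercept + sum(b * xi for b, xi in zip(coeffs, x))


b0 = 10.0            # hypothetical intercept
coeffs = [2.0, -0.5]  # hypothetical partial regression coefficients b1, b2

base = predict(b0, coeffs, [3.0, 4.0])
bumped = predict(b0, coeffs, [4.0, 4.0])  # X1 up by one unit, X2 unchanged
print(base, bumped, bumped - base)        # the difference equals b1
```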
Example 8.12: Interpreting Regression Results for the Colleges and Universities Data
• Predict student graduation rates using several indicators:
The value of R² indicates that 53% of the variation in the dependent variable is explained by these independent variables.
Example 8.12 Continued
• Regression model
Higher SAT scores and lower acceptance rates suggest higher graduation rates.
• Looking at the p-values for the independent variables in the last section, we see that all are less than 0.05; therefore, for each independent variable we reject the null hypothesis that its partial regression coefficient is zero and conclude that each of them is statistically significant.
Interpretation
• The coefficient of Median SAT is statistically significant and positive. This indicates that the SAT score has a positive association with (influence on) the graduation rate.
• If the median SAT increases by 1 point, the graduation rate increases by 0.072 percentage points, holding all other independent variables constant.

Copyright © 2013 Pearson Education, Inc.


publishing as Prentice Hall 7-33
Multicollinearity
• Multicollinearity occurs when there are strong correlations among the independent variables, and they can predict each other better than the dependent variable.
◦ When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable. The signs of coefficients may be the opposite of what they should be, which makes the regression coefficients difficult to interpret, and p-values can be inflated.
• Correlations exceeding ±0.7 (that is, an absolute value above 0.7) may indicate multicollinearity.
Example 8.14: Identifying Potential Multicollinearity
• Colleges and Universities correlation matrix; none of the correlations exceeds the recommended threshold of ±0.7.
Analyze > Correlate > Bivariate
There is no problem of multicollinearity.
Example 8.14: Identifying Potential Multicollinearity
• Banking Data correlation matrix; large correlations exist.
• We check for multicollinearity before we run the regression.
◦ If there is no problem of multicollinearity, we continue with the regression estimation.
◦ If there is a problem of multicollinearity, we estimate separate regressions, entering the highly correlated independent variables in different models.
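The screening step can be sketched as a pairwise correlation check against the ±0.7 rule of thumb. The column names and values below are hypothetical and are not taken from the Banking Data or the Colleges and Universities data.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two lists."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_pairs(columns, threshold=0.7):
    """List pairs of independent variables whose |r| exceeds the threshold."""
    flagged_pairs = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > threshold:
            flagged_pairs.append((a, b, r))
    return flagged_pairs


columns = {  # hypothetical independent variables
    "age": [25, 30, 35, 40, 45],
    "income": [30, 38, 44, 52, 60],
    "education": [16, 12, 18, 14, 16],
}
flagged = flag_pairs(columns)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.3f} -> consider separate regressions")
```

Any flagged pair would be a candidate for entering into separate regression models, as described above.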

Regression with Categorical Variables
• Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
• For variables with 2 categories, code as 0 and 1.
Example 8.15: A Model with Categorical Variables
• Employee Salaries provides data for 35 employees.
• Predict Salary using Age and MBA (coded as yes = 1, no = 0).
Example 8.15 Continued
• Salary = 893.59 + 1044.15 × Age + 14767.23 × MBA
◦ If MBA = 0, Salary = 893.59 + 1044.15 × Age
◦ If MBA = 1, Salary = 893.59 + 1044.15 × Age + 14767.23 = 15660.82 + 1044.15 × Age
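The effect of the dummy variable can be seen by evaluating the estimated equation for the same age with and without an MBA. The coefficients are the ones reported above; the function name and the age of 30 are illustrative choices.

```python
def predicted_salary(age, mba):
    """Estimated salary from Example 8.15; mba is 1 for yes, 0 for no."""
    return 893.59 + 1044.15 * age + 14767.23 * mba


without_mba = predicted_salary(30, 0)
with_mba = predicted_salary(30, 1)
# The gap between the two predictions equals the MBA coefficient.
print(round(without_mba, 2), round(with_mba, 2), round(with_mba - without_mba, 2))
```

The dummy variable shifts the intercept, so the two lines are parallel in Age.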
Interpretation
• The coefficient of MBA is statistically significant and positive. This indicates that having an MBA degree has a positive association with (influence on) salary.
• An employee with an MBA degree has a salary that is $14,767.23 higher than that of an employee without an MBA degree, holding all other independent variables constant.

Interactions
• An interaction occurs when the effect of one variable is dependent on another variable.
• We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
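Building the product variable is a one-line transformation. The sketch below adds an interaction column to a list of records; the helper name and the two rows are hypothetical, and the regression itself would still be re-run in the statistics package with the new column included.

```python
def add_interaction(rows, x1, x2, name):
    """Return copies of rows with a new column name = row[x1] * row[x2]."""
    return [{**row, name: row[x1] * row[x2]} for row in rows]


rows = [  # hypothetical observations
    {"Age": 30, "MBA": 1},
    {"Age": 45, "MBA": 0},
]
with_interaction = add_interaction(rows, "Age", "MBA", "Interaction")
print(with_interaction)
```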
Example 8.16: Incorporating Interaction Terms in a Regression Model
• Define an interaction between Age and MBA and re-run the regression.
Transform > Compute variable > Interaction = Age*MBA
The result shows that the positive impact of Age on Salary is higher when the employee has an MBA degree.
Categorical Variables with More Than Two Levels
• When a categorical variable has k > 2 levels, we need to add k - 1 additional variables to the model.
Example 8.17: A Regression Model with Multiple Levels of Categorical Variables
• The file Surface Finish provides measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and one of four types of cutting tools used.
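Example 8.17 Continued
• Because we have k = 4 levels of tool type, we will define a regression model of the form
Surface finish = β0 + β1 RPM + β2 Type B + β3 Type C + β4 Type D + ε
where Type B, Type C, and Type D are 0/1 dummy variables and Type A is the baseline level.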
Example 8.17 Continued
• Add 3 columns to the data, one for each of the tool type variables.
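Creating the three columns amounts to coding each non-baseline tool type as a 0/1 variable. The sketch below assumes Type A is the baseline (all three dummies equal to zero); the helper name and the short list of tool types are hypothetical.

```python
def dummy_code(values, baseline):
    """Code a categorical variable with k levels as k - 1 dummy columns."""
    levels = sorted(set(values) - {baseline})
    return [{f"Type {lvl}": int(v == lvl) for lvl in levels} for v in values]


tools = ["A", "B", "C", "D", "B"]  # hypothetical tool types for five parts
coded = dummy_code(tools, baseline="A")
for tool, row in zip(tools, coded):
    print(tool, row)
```

Leaving out the baseline level avoids a redundant column, since a row of all zeros already identifies Type A.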
Example 8.17 Continued
• Regression results
Surface finish = 24.49 + 0.098 RPM - 13.31 type B - 20.49 type C - 26.04 type D
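The estimated equation can be evaluated for each tool type at a fixed spindle speed to see how the dummy coefficients shift the prediction. The coefficients are the ones reported above, with Type A treated as the baseline; the function name and the RPM value of 200 are illustrative assumptions.

```python
def surface_finish(rpm, tool):
    """Predicted surface finish from the Example 8.17 regression results."""
    # Type A is the baseline, so its dummy effect is zero.
    tool_effect = {"A": 0.0, "B": -13.31, "C": -20.49, "D": -26.04}
    return 24.49 + 0.098 * rpm + tool_effect[tool]


predictions = {tool: surface_finish(200, tool) for tool in "ABCD"}
for tool, value in predictions.items():
    print(f"Type {tool}: {value:.2f}")
```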
Interpretation
• The coefficient of Type B is statistically significant and negative. This indicates that using a Type B cutting tool has a negative association with (influence on) Surface Finish, compared with Type A.
