Evans - Analytics2e - PPT - 07 and 08

Uploaded by

MY LE HO TIEU

Chapter 7: Statistical Inference
Chapter 8: Trendlines and Regression Analysis
Statistical Inference
• Statistical inference focuses on drawing conclusions about populations from samples.
◦ Statistical inference includes estimation of population parameters and hypothesis testing, which involves drawing conclusions about the value of the parameters of one or more populations.
Hypothesis Testing
• Hypothesis testing involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.
• H0: Null hypothesis: describes an existing theory
• H1: Alternative hypothesis: the complement of H0
• Using sample data, we either:
- reject H0 and conclude the sample data provides sufficient evidence to support H1, or
- fail to reject H0 and conclude the sample data does not provide sufficient evidence to support H1.
Example 7.1: A Legal Analogy for Hypothesis Testing
• In the U.S. legal system, a defendant is presumed innocent until proven guilty.
◦ H0: Innocent
◦ H1: Guilty
• If evidence (sample data) strongly indicates the defendant is guilty, then we reject H0.
• Note that we have not proven guilt or innocence!
Hypothesis Testing Procedure

Steps in conducting a hypothesis test:

1. Identify the population parameter and formulate
the hypotheses to test.
2. Select a level of significance (the risk of drawing an incorrect conclusion by rejecting a null hypothesis that is actually true).
3. Determine the decision rule on which to base a
conclusion.
4. Collect data and calculate a test statistic.
5. Apply the decision rule and draw a conclusion.
p-Values
• A p-value (observed significance level) is the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true.
• If p < .10 → “marginally significant”
• If p < .05 → “significant”
• If p < .01 → “highly significant”

Reject H0 if the p-value < α
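The decision rule can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: it assumes a two-tailed z-test, so the p-value comes from the standard normal distribution, and the function names and the test statistic value 2.2 are hypothetical.

```python
from statistics import NormalDist


def two_tailed_p_value(z: float) -> float:
    """Probability of a standard normal value at least as extreme as z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def significance_label(p: float) -> str:
    """Map a p-value to the verbal labels used on the slide."""
    if p < 0.01:
        return "highly significant"
    if p < 0.05:
        return "significant"
    if p < 0.10:
        return "marginally significant"
    return "not significant"


def reject_null(p: float, alpha: float = 0.05) -> bool:
    """Decision rule: reject H0 if the p-value is below alpha."""
    return p < alpha


z_statistic = 2.2  # hypothetical test statistic, not from the slides
p_value = two_tailed_p_value(z_statistic)
print(f"p = {p_value:.4f} ({significance_label(p_value)})")
print("Reject H0 at alpha = 0.05:", reject_null(p_value))
print("Reject H0 at alpha = 0.01:", reject_null(p_value, alpha=0.01))
```

The same p-value can lead to different decisions at different levels of significance, which is why α is chosen before the data are examined.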

Two independent samples t-test
• Testing the relationship between two independent samples is very common in market research.
• Some common research questions are:
◦ Does satisfaction with a product differ between heavy and light users?
◦ Do male customers spend more money online than female customers?
◦ Do US teenagers spend more time on Facebook than Australian teenagers?
• Each of these hypotheses aims at evaluating whether two populations (e.g., heavy and light users), represented by samples, are significantly different in terms of certain key variables (e.g., satisfaction ratings).
• To understand the principles of a two independent samples t-test, see the steps and results in the attached PDF files.
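As a rough companion to the SPSS steps, the test statistic for two independent samples can be computed by hand. The sketch below uses Welch's version of the t-test, which does not assume equal variances. That choice, the function name, and the satisfaction ratings are assumptions made for illustration only. Turning the statistic into a p-value requires the t distribution, which is not in the Python standard library, so that step is left to statistical software.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical satisfaction ratings, not taken from the slides
heavy_users = [7, 8, 6, 9, 7]
light_users = [5, 6, 4, 6, 5]

t_stat, df = welch_t(heavy_users, light_users)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

A large absolute t value relative to the t distribution with the computed degrees of freedom is evidence that the two population means differ.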
Analysis of Variance (ANOVA)
• Used to compare the means of two or more population groups.
• ANOVA derives its name from the fact that we are analyzing variances in the data.
• ANOVA measures variation between groups relative to variation within groups.
Analysis of Variance (ANOVA)
• In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean.
• Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers is spread out from their average value.
Figure: Example of samples from two populations with the same mean but different variances. The red population has mean 100 and variance 100 (SD = 10), while the blue population has mean 100 and variance 2500 (SD = 50).
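The relationship between mean, variance, and standard deviation can be checked numerically. The two small data sets below are invented so that they reproduce the red and blue summary values in the figure. They are not the data behind the figure.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical values chosen to match the means and variances quoted above
red = [90, 110, 90, 110]
blue = [50, 150, 50, 150]

for name, values in (("red", red), ("blue", blue)):
    # pvariance and pstdev treat the list as a complete population
    print(name, "mean:", mean(values),
          "variance:", pvariance(values),
          "SD:", pstdev(values))
```

Both sets share the same mean, yet the blue values sit five times farther from it on average, which is what the larger variance expresses.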
Assumptions of ANOVA
• The m groups or factor levels being studied represent populations whose outcome measures
1. are randomly and independently obtained,
2. are normally distributed, and
3. have equal variances.
• If these assumptions are violated, then the level of significance and the power of the test can be affected.
Example: Difference in Oddjob Data
• Examine whether customers’ membership status (i.e., status) relates to their overall price/performance satisfaction (i.e., overall_sat) with Oddjob Airways.
Example: Difference in Oddjob Data
The model has an F-value of 9.963, which yields a p-value reported as 0.00 after rounding (less than 0.05), suggesting that at least two of the three groups differ significantly with regard to overall price/performance satisfaction.
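The F-value reported by SPSS is the ratio of between-group to within-group variation. A minimal sketch of that calculation follows. The three groups of ratings are made up for illustration and are not the Oddjob Airways data, and the function name is hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic: mean square between groups over mean square within groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical satisfaction ratings for three membership groups
status_1 = [1, 2, 3]
status_2 = [2, 3, 4]
status_3 = [5, 6, 7]

f_value = one_way_anova_f([status_1, status_2, status_3])
print(f"F = {f_value:.2f}")
```

A large F means the group means are far apart relative to the spread inside each group. The p-value is then read from the F distribution with the two degrees of freedom computed in the function.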
Modeling Relationships and Trends in Data
• Create charts to better understand data sets.
• For cross-sectional data, use a scatter chart.
• For time series data, use a line chart.
Scatter chart
Graphs > Chart Builder > Scatter/Dot > Simple Scatter with Fit Line > Input X Axis (with Price) > Input Y Axis (with Demand) > OK
Line chart of historical crude oil prices
Graphs > Chart Builder > Line > Simple Line > Input X Axis (with Date) > Input Y Axis (with Price) > OK
Common Mathematical Functions Used in Predictive Analytical Models
Linear: y = a + bx
• Linear functions show steady increases or decreases over the range of x.
• This is the simplest type of function used in predictive models.
• It is easy to understand and, over small ranges of values, can approximate behavior rather well.
R²
• R² (R-squared) is a measure of the “fit” of the line to the data.
◦ The value of R² will be between 0 and 1.
◦ A value of 1.0 indicates a perfect fit, with all data points lying on the line; the larger the value of R², the better the fit.
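One common way to compute R-squared is as one minus the ratio of unexplained to total variation. The sketch below follows that definition. The function name and the numbers are illustrative assumptions, and the 0-to-1 range holds for a least-squares line that includes an intercept.

```python
def r_squared(observed, predicted):
    """R-squared = 1 - (sum of squared residuals / total sum of squares)."""
    mean_y = sum(observed) / len(observed)
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical observed values and two sets of predictions
observed = [2, 4, 6, 8]
perfect_fit = [2, 4, 6, 8]   # every point lies on the line
rough_fit = [3, 3, 7, 7]     # predictions miss each point by 1

print(r_squared(observed, perfect_fit))
print(r_squared(observed, rough_fit))
```

The closer the predictions track the observed values, the smaller the residual sum of squares and the closer R-squared moves toward 1.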
Regression Analysis
• Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent variable (ratio-scaled) and one or more independent, or explanatory, variables (ratio-scaled or categorical), all of which must be coded numerically.
• Simple linear regression involves a single independent variable.
• Multiple regression involves two or more independent variables.
Simple Linear Regression
• Finds a linear relationship between:
- one independent variable X and
- one dependent variable Y
• First prepare a scatter plot to verify the data has a linear trend.
• Use alternative approaches if the data is not linear.
Example 8.3: Home Market Value Data
Size of a house is typically related to its market value.
X = square footage
Y = market value ($)
The scatter plot of the full data set (42 homes) indicates a linear trend.
Least-Squares Regression
• Simple linear regression model: $Y = \beta_0 + \beta_1 X + \varepsilon$
• We estimate the parameters from the sample data: $\hat{Y} = b_0 + b_1 X$
• Let $X_i$ be the value of the independent variable of the ith observation. When the value of the independent variable is $X_i$, then $\hat{Y}_i = b_0 + b_1 X_i$ is the estimated value of Y for $X_i$.
Residuals
• Residuals are the observed errors associated with estimating the value of the dependent variable using the regression line: $e_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ is the estimated value of Y for the ith observation.
Least Squares Regression
• The best-fitting line minimizes the sum of squares of the residuals.
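For a single independent variable, the line that minimizes the sum of squared residuals has a closed-form solution. The sketch below applies it to a tiny invented data set (not the Home Market Value data). The function and variable names are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx            # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1


# Hypothetical data: size (thousands of square feet) and value (thousands of dollars)
size = [1.0, 2.0, 3.0, 4.0]
value = [30.0, 50.0, 70.0, 100.0]

b0, b1 = fit_line(size, value)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(size, value)]
sum_squared_residuals = sum(e ** 2 for e in residuals)

print(f"Estimated line: value = {b0:.1f} + {b1:.1f} * size")
print("Residuals:", [round(e, 2) for e in residuals])
print("Sum of squared residuals:", round(sum_squared_residuals, 2))
```

Any other choice of intercept and slope would give a larger sum of squared residuals for these data. With an intercept in the model, the residuals also sum to zero.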
Simple Linear Regression With SPSS
Analyze > Regression > Linear
Input Dependent variable (with Market value)
Input Independent variable (with Square feet)

52% of the variation in home market values can be explained by home size.
Home Market Value Regression Results
Regression Statistics
• Multiple R - |r|, where r is the sample correlation coefficient. The value of r varies from -1 to +1 (r is negative if the slope is negative).
• R Square - coefficient of determination, R², which varies from 0 (no fit) to 1 (perfect fit).
• Adjusted R Square - adjusts R² for sample size and number of X variables.
• Standard Error - variability between observed and predicted Y values. This is formally called the standard error of the estimate, $S_{YX}$.
Example 8.7: Interpreting Significance of Regression
H0: Home size is not a significant variable
H1: Home size is a significant variable
• p-value = $3.798 \times 10^{-8}$
◦ Reject H0 and conclude that the slope is not equal to zero. Using a linear relationship, home size is a significant variable in explaining variation in market value.
◦ Home size has a positive association with market value.
Multiple Linear Regression
• A linear regression model with more than one independent variable is called a multiple linear regression model.
Estimated Multiple Regression Equation
• We estimate the regression coefficients, called partial regression coefficients, b0, b1, b2, …, bk, then use the model: $\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
• The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.
Example 8.12: Interpreting Regression Results for the Colleges and Universities Data
• Predict student graduation rates using several indicators:
The value of R² indicates that 53% of the variation in the dependent variable is explained by these independent variables.
Example 8.12 Continued
• Regression model
Higher SAT scores and lower acceptance rates suggest higher graduation rates.
• Looking at the p-values for the independent variables in the last section of the output, we see that all are less than 0.05; therefore, we reject the null hypothesis that each partial regression coefficient is zero and conclude that each of them is statistically significant.
Interpretation
• The coefficient of Median SAT is statistically significant and positive. This indicates that the SAT score has a positive association/influence on the graduation rate.
• If the median SAT increases by 1 point, the graduation rate increases by 0.072 percentage points, keeping all other independent variables constant.

publishing as Prentice Hall 7-33
Multicollinearity
• Multicollinearity occurs when there are strong correlations among the independent variables, and they can predict each other better than they predict the dependent variable.
◦ When significant multicollinearity is present, it becomes difficult to isolate the effect of one independent variable on the dependent variable. The signs of coefficients may be the opposite of what they should be, making the regression coefficients difficult to interpret, and p-values can be inflated.
• Correlations with an absolute value exceeding 0.7 may indicate multicollinearity.
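The 0.7 rule of thumb can be applied by computing the correlation for every pair of independent variables and flagging the large ones. The sketch below does this with plain Python. The variable names `x1`, `x2`, `x3`, their values, and the helper names are hypothetical and do not come from either data set in the slides.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Sample correlation coefficient between two equally long lists."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def flag_high_correlations(columns, threshold=0.7):
    """Return (name_a, name_b, r) for each pair with |r| above the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical independent variables
independent_vars = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],   # nearly a multiple of x1
    "x3": [3, 1, 4, 1, 5],
}

for name_a, name_b, r in flag_high_correlations(independent_vars):
    print(f"{name_a} and {name_b}: r = {r:.3f} (possible multicollinearity)")
```

Only the `x1` and `x2` pair is flagged here, which would suggest not placing both in the same regression without further checks.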
Example 8.14: Identifying Potential Multicollinearity
Analyze > Correlate > Bivariate
• Colleges and Universities correlation matrix; none of the correlations exceeds the recommended threshold of ±0.7.
There is no problem of multicollinearity.
Example 8.14: Identifying Potential Multicollinearity
• Banking Data correlation matrix; large correlations exist
• We will run the multicollinearity test before we run the regression.
◦ If there is no problem of multicollinearity, we continue with the regression estimation.
◦ If there is a problem of multicollinearity, we estimate separate regressions, each containing only one of the highly correlated independent variables.

Regression with Categorical Variables
• Categorical data can be included as independent variables, but must be coded numerically using dummy variables.
• For variables with 2 categories, code as 0 and 1.
Example 8.15: A Model with Categorical Variables
• Employee Salaries provides data for 35 employees
• Predict Salary using Age and MBA (coded as yes = 1, no = 0)
Example 8.15 Continued
• Salary = 893.59 + 1044.15 × Age + 14767.23 × MBA
◦ If MBA = 0, Salary = 893.59 + 1044.15 × Age
◦ If MBA = 1, Salary = 893.59 + 1044.15 × Age + 14767.23 = 15660.82 + 1044.15 × Age
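The fitted equation can be evaluated directly to see how the dummy variable shifts the prediction. The coefficients below are the ones quoted in the example. The function name and the age of 30 are illustrative assumptions.

```python
def predicted_salary(age: float, has_mba: bool) -> float:
    """Evaluate Salary = 893.59 + 1044.15 * Age + 14767.23 * MBA."""
    mba = 1 if has_mba else 0  # dummy coding: yes = 1, no = 0
    return 893.59 + 1044.15 * age + 14767.23 * mba


age = 30  # hypothetical employee age
without_mba = predicted_salary(age, has_mba=False)
with_mba = predicted_salary(age, has_mba=True)

print(f"Age {age}, no MBA: {without_mba:.2f}")
print(f"Age {age}, MBA:    {with_mba:.2f}")
print(f"Difference:        {with_mba - without_mba:.2f}")
```

The difference between the two predictions equals the MBA coefficient at every age, because the dummy variable only moves the intercept and leaves the slope on Age unchanged.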
Interpretation
• The coefficient of MBA is statistically significant and positive. This indicates that having an MBA degree has a positive association/influence on salary.
• An employee with an MBA degree has a predicted salary that is 14767.23 USD higher than that of an employee without an MBA degree, keeping all other independent variables constant.

Interactions
• An interaction occurs when the effect of one variable is dependent on another variable.
• We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
Example 8.16: Incorporating Interaction Terms in a Regression Model
• Define an interaction between Age and MBA and re-run the regression.
Transform > Compute variable > Interaction = Age*MBA
The result shows that the positive impact of Age on Salary is higher when the employee has an MBA degree.
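Outside SPSS, the same interaction column is simply the row-by-row product of Age and MBA. The sketch below builds it and shows why a positive interaction coefficient means a steeper Age effect for MBA holders. The employee records and the two coefficient values are placeholders invented for illustration and are not the estimates from this example.

```python
# Hypothetical employee records (MBA coded as yes = 1, no = 0)
employees = [
    {"age": 25, "mba": 0},
    {"age": 30, "mba": 1},
    {"age": 45, "mba": 1},
]

# Interaction = Age * MBA, computed for each row
for employee in employees:
    employee["interaction"] = employee["age"] * employee["mba"]


def age_slope(b_age: float, b_interaction: float, mba: int) -> float:
    """Change in predicted salary per extra year of age, given MBA status."""
    return b_age + b_interaction * mba


# Placeholder coefficients, for illustration only
b_age, b_interaction = 1000.0, 500.0

print([e["interaction"] for e in employees])
print("Age slope without MBA:", age_slope(b_age, b_interaction, mba=0))
print("Age slope with MBA:   ", age_slope(b_age, b_interaction, mba=1))
```

The interaction column is zero for employees without an MBA, so its coefficient only adds to the Age slope for those with one.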
Categorical Variables with More Than Two Levels
• When a categorical variable has k > 2 levels, we need to add k - 1 additional variables to the model.
Example 8.17: A Regression Model with Multiple Levels of Categorical Variables
• The file Surface Finish provides measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and one of four types of cutting tools used.
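Example 8.17 Continued
• Because we have k = 4 levels of tool type, we will define a regression model of the form $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$, where Y is the surface finish, $X_1$ is RPM, and $X_2$, $X_3$, and $X_4$ are 0/1 dummy variables for tool types B, C, and D (type A is the baseline).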
Example 8.17 Continued
• Add 3 columns to the data, one for each of the tool type variables
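Creating the three extra columns amounts to dummy coding the tool type against a baseline level. A minimal sketch follows, assuming the four levels are labeled A to D with type A as the baseline. The function name and column names are hypothetical.

```python
def dummy_code(tool_type, levels=("A", "B", "C", "D"), baseline="A"):
    """Return k - 1 indicator columns (0 or 1) for a categorical value."""
    return {
        f"type_{level}": int(tool_type == level)
        for level in levels
        if level != baseline
    }


# Hypothetical rows: each part was cut with one of the four tool types
for tool in ["A", "B", "C", "D"]:
    print(tool, dummy_code(tool))
```

The baseline level has no column of its own. It is represented by all three indicators being zero, which is why only k - 1 variables are needed.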
Example 8.17 Continued
• Regression results
Surface finish = 24.49 + 0.098 RPM - 13.31 type B - 20.49 type C - 26.04 type D
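Plugging values into the estimated equation shows how each tool type shifts the prediction relative to type A. The coefficients are those reported in the regression results. The function name and the spindle speed of 200 RPM are assumptions for illustration.

```python
# Shift in predicted surface finish for each tool type, relative to type A
TOOL_EFFECT = {"A": 0.0, "B": -13.31, "C": -20.49, "D": -26.04}


def predicted_surface_finish(rpm: float, tool_type: str) -> float:
    """Evaluate 24.49 + 0.098 * RPM plus the tool-type shift."""
    return 24.49 + 0.098 * rpm + TOOL_EFFECT[tool_type]


rpm = 200  # hypothetical spindle speed
for tool in ("A", "B", "C", "D"):
    print(tool, round(predicted_surface_finish(rpm, tool), 2))
```

At any fixed RPM, the gap between a tool type and type A equals that type's coefficient, so type D gives the lowest predicted value.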
Interpretation
• The coefficient of Type B is statistically significant and negative. This indicates that using the Type B cutting tool has a negative association/influence on Surface Finish, compared with Type A.


Copyright © 2013 Pearson Education, Inc.