Laboratory Exercise 7
Linear Model:
Simple Regression and Correlation
Siswanto Agus Wilopo
Professor of Faculty of Medicine, Public Health and Nursing,
Universitas Gadjah Mada, Yogyakarta
and
Adjunct Full Professor of College of Health and Agricultural Sciences
University College Dublin, Belfield, Ireland
Department of Biostatistics, Epidemiology and Population Health
Prof. Siswanto Agus Wilopo (UGM) Biostatistics I July 23, 2024 1 / 27
Table of Contents
1 Learning Objectives
2 Activities
3 Class Exercise
4 Homework
5 Output
6 Required Reading
Learning Objectives
Upon completion of the course unit, students should be able to:
a. demonstrate the process of estimation and inference for simple linear regression
and correlation
b. evaluate the assumptions of simple linear regression and correlation
c. apply the concepts of simple regression to data analysis in public health research
d. demonstrate how to apply simple linear regression and correlation methods to
health data
e. appraise a published article that uses simple linear regression
Activities
1 Discussion: correlation and simple linear regression
2 Laboratory session:
   1 Estimating the correlation coefficient and a simple linear regression
   2 Inference for correlation and linear regression
   3 Calculating and reading computer output from correlation and
     simple regression analyses
   4 How to estimate and interpret the coefficients of a simple linear regression
   5 How to assess the assumptions of a simple linear regression
   6 Analyzing data using a simple linear regression analysis
Class Exercise
Instructions
• Every student should read the journal article before the class exercise
and discuss the following questions with classmates under the
guidance of the tutor.
• In the group discussion, you are encouraged to discuss the questions
and possible answers with other students.
• During the group discussion, your tutor can help with concepts
that you have not been exposed to before.
Lung function data
The lung function data set includes information on nonsmoking families from the UCLA
study of chronic obstructive respiratory disease (CORD). In the CORD study persons 7
years old and older from four areas (Burbank, Lancaster, Long Beach, and Glendora) were
sampled, and information was obtained from them at two time periods. The data in this
exercise are a subset including 150 families with a mother and a father, and one, two, or
three children between the ages of 7 and 17 who answered the questionnaire and took the
lung function tests at the first time period. The purpose of the CORD study was to
determine the effects of different types of air pollutants on respiratory function, but
numerous other types of studies have been performed on this data set. Data on age, sex,
height, weight, FVC, and FEV1 are included for the members of each family. Some families
have only one or two children and if there is only one child it is listed as the oldest child.
Since many families have only one child (considered the oldest), there are many missing
values in the data for the middle and youngest child.
Data analysis example
One of the major early indicators of reduced respiratory function is FEV1 or forced
expiratory volume in the first second (amount of air exhaled in 1 second). Since it is known
that taller males tend to have higher FEV1, we wish to determine the relationship between
height and FEV1. We exclude the data from the mothers as several studies have shown a
different relationship for women. The sample size is 150. These data belong to the
variable-X case, where X is height (in inches) and Y is FEV1 (in liters). Here we may be
concerned with describing the relationship between FEV1 and height, a descriptive
purpose. We may also use the resulting equation to determine expected or normal FEV1
for a given height, a predictive use.
Create a scatterplot of the data using Stata.
The Stata commands are:
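* Load the lung function data (assumes lung.dta is in the working directory)
use "lung.dta", clear
* Rescale the fathers' FEV1 to liters (ffev1 appears to be stored in hundredths of a liter)
generate ffev1a = ffev1/100
* Scatterplot of FEV1 against height, overlaid with the fitted regression line
graph twoway (scatter ffev1a fheight, msymbol(Oh)) ///
(lfit ffev1a fheight), ///
xtitle("A scatterplot between Male FEV1 and Height") ///
xlabel(58 62 to 78, grid) ylabel(2 2.5 to 6.5, angle(0)) ///
ytitle(FFEV1) graphregion(fcolor(white))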
[Figure: A scatterplot between Male FEV1 and Height. Vertical axis: FFEV1 (2 to 6.5); horizontal axis: height (58 to 78). Legend: ffev1a (circles), Fitted values (line).]
Data analysis example
• In your graph, height is given on the horizontal axis since it is the
independent or predictor variable and FEV1 is given on the vertical
axis since it is the dependent or outcome variable.
• Heights were rounded to the nearest inch in the original data, and the
program marks every four inches on the horizontal axis.
• The circles in the figure represent the locations of the data points.
• There does appear to be a tendency for taller men to have higher
FEV1.
• The program also draws the regression line in the graph.
• The line is tilted upwards, indicating that we expect larger values of
FEV1 with larger values of height.
A simple regression equation
The equation of the regression line is given by Stata as:
Y = −4.087 + 0.118X
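. regress ffev1a fheight

      Source |       SS           df       MS      Number of obs   =       150
-------------+----------------------------------   F(1, 148)       =     50.50
       Model |  16.0531702         1  16.0531702   Prob > F        =    0.0000
    Residual |  47.0451258       148  .317872472   R-squared       =    0.2544
-------------+----------------------------------   Adj R-squared   =    0.2494
       Total |   63.098296       149  .423478497   Root MSE        =     .5638

------------------------------------------------------------------------------
      ffev1a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     fheight |   .1181052   .0166194     7.11   0.000     .0852633    .1509472
       _cons |  -4.086702   1.151979    -3.55   0.001    -6.363155    -1.81025
------------------------------------------------------------------------------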
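The summary statistics in the regression output can be reproduced by hand from the sums of squares and the coefficient table. The following is a minimal Python sketch, used here only as a calculator and not part of the Stata session. The variable names are illustrative; the input numbers are copied from the regress output.

```python
import math

# Values copied from the Stata regress output for ffev1a on fheight
ss_model, df_model = 16.0531702, 1
ss_resid, df_resid = 47.0451258, 148
coef_height, se_height = 0.1181052, 0.0166194

ss_total = ss_model + ss_resid
df_total = df_model + df_resid

ms_resid = ss_resid / df_resid                      # residual mean square
r_squared = ss_model / ss_total                     # share of FEV1 variation explained by height
adj_r_squared = 1 - (1 - r_squared) * df_total / df_resid
f_stat = (ss_model / df_model) / ms_resid           # overall F test with (1, 148) df
root_mse = math.sqrt(ms_resid)                      # residual standard deviation, in liters
t_height = coef_height / se_height                  # t statistic for the slope

print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"F(1, 148)     = {f_stat:.2f}")
print(f"Root MSE      = {root_mse:.4f}")
print(f"t (fheight)   = {t_height:.2f}")
```

With a single predictor, the F statistic equals the square of the slope's t statistic, so the overall F test and the t test for fheight lead to the same conclusion.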
Interpretation
• The coefficient 0.118 in front of X is greater than zero, indicating that
Y increases as X increases: each additional inch of height is associated
with an expected increase of 0.118 liters in FEV1.
• For example, we would expect a father who is 70 inches tall to have
an FEV1 value of FEV1 = −4.087 + (0.118)(70) = 4.173 liters.
• Questions:
1 If the height were 66 inches, what value of FEV1 would you
expect?
2 Suppose a father were 2 feet tall (what is that in cm?). What would
you expect for the value of FEV1?
3 This example illustrates the danger of using the regression equation
outside the appropriate range.
4 A safe policy is to restrict the use of the equation to the range of X
observed in the sample.
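The prediction rule and the range restriction can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the coefficients are the rounded values from the fitted equation, and the range limits of 61 and 76 inches are the minimum and maximum of fheight reported in the descriptive statistics of this exercise.

```python
INTERCEPT = -4.087               # rounded intercept of the fitted line
SLOPE = 0.118                    # rounded slope: liters of FEV1 per inch of height
HEIGHT_MIN, HEIGHT_MAX = 61, 76  # observed range of fheight in the sample (inches)
INCH_TO_CM = 2.54


def predict_fev1(height_in):
    """Expected FEV1 in liters for a height given in inches."""
    return INTERCEPT + SLOPE * height_in


def in_observed_range(height_in):
    """True when the height lies inside the range seen in the sample."""
    return HEIGHT_MIN <= height_in <= HEIGHT_MAX


for height in (70, 24):          # 24 inches is the 2-foot father
    status = "inside" if in_observed_range(height) else "OUTSIDE"
    print(f"{height} in ({height * INCH_TO_CM:.2f} cm): "
          f"predicted FEV1 = {predict_fev1(height):.3f} L, {status} observed range")
```

For the 2-foot father the equation returns a negative FEV1, which is physically impossible. The arithmetic is correct; the problem is that the line is being used far outside the heights on which it was estimated.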
Interpretation
To get more information about these men, we requested
descriptive statistics.

. summarize ffev1a fheight

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      ffev1a |        150    4.093267    .6507523        2.5       5.85
     fheight |        150       69.26    2.779189         61         76

Note that the mean height is approximately in the middle of the observed heights and the
mean FEV1 is approximately in the middle of the FEV1 values in the scatterplot.
Interpretation
We can compute a correlation coefficient as follows:

. pwcorr fheight ffev1, sig star(.05)

             |  fheight    ffev1
-------------+------------------
     fheight |   1.0000
             |
       ffev1 |   0.5044*  1.0000
             |   0.0000

a. Is the correlation coefficient statistically significantly
different from 0?
b. How would you report this result?
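One way to check the significance by hand is the standard t statistic for the null hypothesis that the population correlation is zero, t = r·sqrt(n − 2)/sqrt(1 − r²), with n − 2 degrees of freedom. A minimal Python sketch, with r and n copied from the output in this exercise and illustrative variable names:

```python
import math

r = 0.5044   # correlation between fheight and ffev1 from pwcorr
n = 150      # number of fathers

df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)   # test of H0: rho = 0
r_squared = r ** 2                                   # coefficient of determination

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
print(f"r squared = {r_squared:.4f}")
```

The t statistic is about 7.11, the same as the t statistic for fheight in the regression output, and r squared reproduces the R-squared of 0.2544. A value this large is far beyond the two-sided 5% critical value of roughly 1.98 for 148 degrees of freedom, in agreement with the p-value printed under the correlation.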
Model Checking
1 The lecture presented methods for checking for outliers, normality,
homogeneity of variance, and independence, along with a brief
discussion of why such checks should be included in the
analysis.
2 Create a normal probability plot of the residuals from the regression of
FEV1 on height for fathers.
3 What is your conclusion?
4 The Stata commands are:
• regress ffev1a fheight
• predict resid, resid
• qnorm resid
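To see what a normal probability plot compares, the plotted points can be built by hand. The sketch below is an illustration in Python, not a reproduction of Stata's internals: it assumes plotting positions i/(n + 1), one common choice, and the residual values are made up purely to show the mechanics. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(residuals):
    """Pair each ordered residual with the matching quantile of a normal
    distribution that has the same mean and standard deviation."""
    ordered = sorted(residuals)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    return [(reference.inv_cdf(i / (n + 1)), value)
            for i, value in enumerate(ordered, start=1)]


# Made-up residuals for illustration only (not from lung.dta)
example_residuals = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.9]

for theoretical, observed in normal_qq_points(example_residuals):
    print(f"normal quantile {theoretical:7.3f}   observed residual {observed:5.2f}")
```

If the residuals are approximately normal, the pairs fall close to a straight line. Systematic curvature or isolated points far from the line suggest skewness, heavy tails, or outliers.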
Homework
The following research article is your reading assignment. Every
student needs to read it.
• Muhammad, I. N., Yasrul Izad, A. B., Akram, S., & Atif, A. B. (2021).
Correlation of anthropometric indices with lipid profile indices among
Malay obese and non-obese subjects in Malaysia. Nutrition and Food
Science, 51(2), 278-288.
doi: https://doi.org/10.1108/NFS-01-2020-0008
Question No. 1
Pay attention to Table 3, which presents simple linear correlations between
anthropometric indices and serum lipid profile among Malay subjects.
a. Can you read the regression coefficients of BMI (kg/m²) with HDL (mmol/L)
and LDL (mmol/L)?
b. Can you estimate the coefficient of determination from that regression equation? How
would you explain the coefficient of determination to the reader?
c. For the regression equation involving sex, can you judge whether the relationships
between sex and HDL (mmol/L) and LDL (mmol/L) are statistically significant? Can
you conclude that males have a higher regression coefficient than females? How would
you explain this to a reader who is not familiar with statistical language?
d. The p-value for BMI with HDL is equal to .01. Can you write the formal hypotheses
tested by this p-value? What is your conclusion about this association?
Question No. 2
Read the data from the Framingham study (framingham.dta). These data
comprise 3 examination periods, coded in the variable period.
a. For the first follow-up, is systolic blood pressure in men determined by BMI? What
about in women? What do you conclude when comparing men and women? Interpret
all statistical findings from your Stata output. What is the coefficient of determination?
b. For the third follow-up, do men have higher blood pressure than women?
c. You could use a t-test to compare mean systolic blood pressure between men and
women. Instead, use a simple regression to compare systolic blood
pressure between men and women. Try the comparison for the third examination.
d. Create a scatterplot of systolic blood pressure against BMI for
males and females in a single graph, including the fitted lines. Is it consistent with
your previous analysis?
e. Check the assumptions for those equations using a qnorm plot. Are the
assumptions met?
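As background for part c, a simple regression on a 0/1 group indicator carries the same information as a comparison of two means: the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The Python sketch below illustrates this with made-up numbers. The values, the variable names, and the helper function are hypothetical and are not taken from framingham.dta.

```python
from statistics import mean


def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data: indicator is 0 for one group and 1 for the other
indicator = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [118, 124, 130, 128, 126, 134, 140, 136]

intercept, slope = ols_simple(indicator, outcome)
mean_group0 = mean(y for g, y in zip(indicator, outcome) if g == 0)
mean_group1 = mean(y for g, y in zip(indicator, outcome) if g == 1)

print(f"intercept = {intercept:.1f}, mean of group 0 = {mean_group0:.1f}")
print(f"slope = {slope:.1f}, difference in means = {mean_group1 - mean_group0:.1f}")
```

In these made-up data the intercept equals the group-0 mean (125) and the slope equals the difference in means (9), so testing whether the slope is zero is a test of whether the two group means differ.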
Note
• During the laboratory exercise, teaching assistants will help students
use the computer program. Every student should turn in the
homework no later than 2 weeks after the laboratory exercise.
• Here is a link to the class website:
https://elok.ugm.ac.id/course/view.php?id=13296
Output
Output of this laboratory exercise
1 Analysis of continuous data with a simple regression
2 Create a graph that presents an association between two variables
3 Interpret the association between two variables
4 Critical appraisal of an article that uses a simple regression
Required Reading
1 Lecture Materials
2 Muhammad, I. N., Yasrul Izad, A. B., Akram, S., & Atif, A. B. (2021).
Correlation of anthropometric indices with lipid profile indices among
Malay obese and non-obese subjects in Malaysia. Nutrition and Food
Science, 51(2), 278-288.
doi: https://doi.org/10.1108/NFS-01-2020-0008
END OF LABORATORY
EXERCISE 07