BMI and Age Analysis in PREVEND Study
Statistics 104
Due November 12, 2019 at 11:59 pm
Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you
do collaborate with classmates on a problem, please list your collaborators on your solution.
Problem 1.
This problem uses data from the Prevention of REnal and Vascular END-stage Disease (PREVEND)
study, which took place between 2003 and 2006 in the Netherlands. Clinical and demographic
data for 4,095 individuals are stored in the prevend dataset in the oibiostat package.
Body mass index (BMI) is a measure of body fat that is based on both height and weight. The World
Health Organization and National Institutes of Health define a BMI of over 25.0 as overweight;
this guideline is typically applied to adults in all age groups. However, a recent study has reported
that individuals of ages 65 or older with the greatest mortality risk were those with BMI lower than
23.0, while those with BMI between 24.0 and 30.9 were at lower risk of mortality. These findings
suggest that the ideal weight-for-height in older adults may not be the same as in younger adults.
Explore the relationship between BMI (BMI) and age (age), using the data in [Link], a random
subset of 500 individuals from the larger prevend data.
a) Create a plot that shows the association between BMI and age. Based on the plot, comment
briefly on the nature of the association.
The association between BMI and age seems slightly positive; larger values of age generally
have higher values of BMI.
#load the data
library(oibiostat)
data("[Link]")
#create a plot
plot([Link]$BMI ~ [Link]$Age,
main = "BMI by Age in PREVEND (n = 500)",
xlab = "Age (years)", ylab = "BMI", cex = 0.75)
[Scatterplot: "BMI by Age in PREVEND (n = 500)"; BMI versus Age (years).]
b) Fit a linear regression model predicting BMI from age.
#fit a linear model
[Link] = lm(BMI ~ Age, data = [Link])
coef([Link])
## (Intercept) Age
## 23.62709808 0.05968762
i. Write the equation of the linear model.
The linear model equation can be written as either ŷ = 23.63 + 0.060x or
predicted BMI = 23.63 + 0.060(Age).
ii. Interpret the slope and intercept values in the context of the data. Comment on whether
the intercept value has any interpretive meaning in this setting.
According to the model slope, an increase in age of 1 year is associated with an average
increase in BMI of 0.060. The model intercept suggests that the average BMI for an
individual of age 0 is 23.63. The intercept value in this setting does not have any
interpretive value, because it is not valid to assess BMI for a newborn; in fact, BMI is a
measure specific to adults.
iii. Is it valid to use the linear model to estimate BMI for an individual who is 30 years old?
Explain your answer.
No, it is not valid to use the linear model to estimate BMI for an individual who is 30
years old. The data in the sample only extend from ages 36 to 81 and should not be
used to predict scores for individuals outside that age range. It may well be that the
relationship between BMI and age is different for individuals in a different age group.
iv. According to the linear model, estimate the average BMI for an individual who is 60
years old.
According to the linear model, the average BMI for an individual who is 60 years old is
23.627 + 0.0597(60) ≈ 27.21 (using the unrounded coefficients).
predict([Link], newdata = [Link](Age = 60))
## 1
## 27.20836
v. Based on the linear model, how much does BMI differ, on average, between an individual
who is 70 years old versus an individual who is 50 years old?
Based on the linear model, BMI differs, on average, by 0.060(20) = 1.20 for individuals
that have an age difference of 20 years; the older individual is expected to have a higher
BMI.
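As a quick numerical check, the difference can be computed directly from the slope reported in the model output above:

```r
# quick check of the 20-year difference in predicted BMI
b1 = 0.05968762      # slope from the model output above
b1 * (70 - 50)       # about 1.19
```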
c) Create residual plots to assess the model assumptions of linearity, constant variability, and
normally distributed residuals. In your assessment of whether an assumption is reasonable,
be sure to clearly reference and interpret relevant features of the appropriate plot.
par(mfrow = c(1, 2))
#create residual plot
plot(resid([Link]) ~ fitted([Link]), main = "Residual Plot for BMI vs Age",
xlab = "Predicted BMI", ylab = "Residual", cex = 0.75)
abline(h = 0, lty = 2, col = "red")
#create QQ Plot
qqnorm(resid([Link]), cex = 0.75)
qqline(resid([Link]))
[Two-panel figure: "Residual Plot for BMI vs Age" (residuals versus predicted BMI) and "Normal Q−Q Plot" (sample quantiles versus theoretical quantiles).]
i. Assess linearity.
The data show an approximately linear trend. The points scatter evenly above and below
the horizontal line and there is no indication of curvature.
ii. Assess constant variance.
Variance is roughly constant across the range of predicted BMI values, although the
variance may be slightly lower at the low end of the predicted range.
iii. Assess normality of residuals.
Most of the distribution is approximately normal, but there is evidence of departure
from normality in the tails, particularly in the upper tail. There are more large residuals
than would be expected for a normal distribution (and fewer small residuals).
iv. Suppose that a point is located in the uppermost right corner on a Q-Q plot of residuals
(from a linear model). In one sentence, describe where that point would necessarily be
located on a scatterplot of the data.
If a point is located in the uppermost right-corner on a Q-Q plot of residuals, the residual
is large and positive; thus, this indicates a point that is located well above where the line
of best fit would be on a scatterplot of the data.
d) Conduct a formal hypothesis test of no association between BMI and age, at the α = 0.05
significance level. Summarize your conclusions.
A test of no association between two variables is a test of the null H0 : β1 = 0 against the
alternative HA : β1 ≠ 0. Let α = 0.05. The t-statistic is 3.40, with associated p-value of 7.28 × 10−4 .
Since p < α, there is sufficient evidence to reject H0 in favor of HA . From the observed data,
there is statistically significant evidence of a positive association between BMI and age; an
increase in age of 1 year is significantly associated with a mean increase in BMI of 0.060.
#conduct hypothesis test
summary([Link])$coef
Problem 2.
For this problem, do not use the functions cor() or lm() (except for checking your calculations);
however, you are welcome to use R to make plots and compute values. This is the largest dataset
for which you will be asked to estimate a regression line ‘by hand’.
Consider the five ordered (x, y) pairs (1, 5), (2, 4), (2.5, 3.5), (3, 0.5), and (4, 0).
a) Plot the data.
#plot the data
x = c(1, 2, 2.5, 3, 4)
y = c(5, 4, 3.5, 0.5, 0)
plot(y ~ x)
[Scatterplot of the five (x, y) points, showing a negative trend.]
b) Calculate the correlation between x and y. Show your work, either in the form of algebraic
work or from using R as a calculator.
The correlation between x and y is -0.93.
r = (1/(n − 1)) Σᵢ₌₁ⁿ [(xᵢ − x̄)/sx][(yᵢ − ȳ)/sy]
  = (1/(5 − 1)) [((1 − 2.5)/1.12)((5 − 2.6)/2.22) + ((2 − 2.5)/1.12)((4 − 2.6)/2.22) + · · · + ((4 − 2.5)/1.12)((0 − 2.6)/2.22)]
  = −0.93
#use r as a calculator
[Link] = mean(x); s.x = sd(x)
[Link] = mean(y); s.y = sd(y)
n = length(x)
r = sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * s.x * s.y); r
## [1] -0.9320165
#confirm answer
cor(x, y)
## [1] -0.9320165
c) Estimate the slope and y-intercept of the least-squares regression line predicting y from x for
these data.
The slope of the least-squares line is b1 = r(sy/sx) = (−0.932)(2.219/1.118) = −1.85.
The y-intercept of the least-squares line is b0 = ȳ − b1 x̄ = 2.6 − (−1.85)(2.5) = 7.225.
#use r as a calculator
b1 = (s.y*r)/s.x; b1
## [1] -1.85
b0 = [Link] - b1*[Link]; b0
## [1] 7.225
#confirm answer
coef(lm(y ~ x))
## (Intercept) x
## 7.225 -1.850
d) Plot the data with the regression line calculated in part c).
#plot the data with the regression line
x = c(1, 2, 2.5, 3, 4); y = c(5, 4, 3.5, 0.5, 0)
plot(y ~ x)
abline(b0, b1, col = "red")
[Scatterplot of the five points with the fitted regression line.]
Problem 3.
The utility dataset contains the average utility bills (in USD) for homes of a particular size and
the average monthly outdoor temperature in Fahrenheit.
a) Plot the data. From a visual inspection, does there seem to be a linear relationship between
the variables? Why or why not?
There does not seem to be a linear relationship; instead, the relationship appears quadratic.
#load the data
utility = [Link]("[Link]")
#create a plot
plot(utility$bill ~ utility$temp,
main = "Utility Bill versus Average Outdoor Monthly Temp",
xlab = "Temperature (F)", ylab = "Bill Amount (USD)")
[Scatterplot: "Utility Bill versus Average Outdoor Monthly Temp"; Bill Amount (USD) versus Temperature (F), showing a U-shaped pattern.]
b) Fit a linear model to the data. From the linear model, estimate the average utility bill if the
average monthly temperature is 120 degrees. Explain whether the estimate is reasonable.
The prediction from the linear model for average utility bill when monthly outdoor temper-
ature is 120 F is $86.04. This is not a reasonable answer based on the observed quadratic
relationship; a very high bill should be expected for a high temperature, while a value around
$80 is at the low end of the observed bill values. It is also not advisable to extrapolate; we
have only observed temperatures as high as 90 degrees.
#fit a model
model = lm(bill ~ temp, data = utility)
#calculate estimate
predict(model, newdata = [Link](temp = 120))
## 1
## 86.03667
Problem 4.
Suppose that a class of 360 students has just taken an exam. The exam consisted of 40 true-false
questions, each of which was worth one point. A diligent teaching fellow has recorded the number
of correct answers (Y ) and the number of incorrect answers (X) for each student. Suppose that the
teaching fellow then fits a linear model to estimate Y from X. Note: no data are necessary to solve
this problem.
a) What will be the values of b0 , b1 , and R2 ?
In a model with the number of correct answers as the dependent variable and number of
incorrect answers as the independent variable, b0 is the number of correct answers when the
number of incorrect answers is 0. Thus, b0 = 40; if no questions are answered incorrectly out
of 40 questions, then 40 must have been answered correctly.
The model slope, b1 , represents the change in the number of correct answers for a one
unit change in the number of incorrect answers. A one question increase in the number of
incorrect answers corresponds to a one question decrease in the number of correct answers;
thus, b1 = −1.
The R2 can be calculated using the formula R2 = 1 − Var(ei)/Var(yi). Since there is no error
around the least squares line (the line perfectly predicts the number of correct answers), the
variance of the residuals is 0 and R2 = 1.
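This can be checked numerically; the sketch below uses simulated (not real) exam scores, so the variable names are invented:

```r
# hypothetical check: regressing correct on incorrect answers for a
# 40-question exam recovers b0 = 40, b1 = -1, and R^2 = 1
set.seed(104)
correct = sample(0:40, size = 360, replace = TRUE)
incorrect = 40 - correct
fit = lm(correct ~ incorrect)
coef(fit)               # (Intercept) 40, incorrect -1
summary(fit)$r.squared  # 1
```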
b) Is this a useful model to fit? Explain your answer.
This is not a useful model to fit to the data in the sense that it is unnecessary—the number of
correct answers completely determines the number of incorrect answers.
9
Problem 5.
The international bank UBS regularly produces a report on prices and earnings in major cities
throughout the world. Three of the measures they include are the prices of basic commodities: 1
kg of rice, 1 kg of bread, and the price of a Big Mac hamburger at McDonald’s.
An interesting feature of the prices they report is that prices are measured in the minutes of labor
required for a “typical” worker in that location to earn enough money to purchase the commodity.
Using minutes of labor corrects at least in part for currency fluctuations, prevailing wage rates,
and local prices.
The data file [Link] includes measurements for rice, bread, and Big Mac prices from the
2003 and 2009 reports. The year 2003 was before the major recession hit much of the world around
2006, and the year 2009 may reflect changes in price due to the recession.
In this problem, you will model rice prices in 2009 as the dependent variable and rice prices in
2003 as the independent variable.
a) Plot the data and the y = x line.
#load the data
prices = [Link]("datasets/[Link]")
#plot the data and the y = x line
plot(rice2009 ~ rice2003, data = prices); abline(0, 1)
[Scatterplot of rice2009 versus rice2003 with the y = x line.]
b) Briefly explain the key difference between points above versus points below the y = x line.
The points above the line represent countries for which the price in 2009 was higher than the
price in 2003, while the points below the line represent countries for which there was a price
decrease.
c) Fit a linear model to the data and interpret the slope coefficient. For these data, what is the
key difference between a slope larger than 1 versus smaller than 1?
The slope coefficient is 0.50. On average, a 1-minute increase in the 2003 price is associated
with a 0.50-minute increase in the 2009 price (after accounting for the intercept). A slope
smaller than 1 suggests that prices are typically lower in 2009 than in 2003, while a slope
larger than 1 suggests that prices are typically higher in 2009 than in 2003. The intercept for
this model can be thought of as a measure of inflation, in that it predicts a "baseline" increase
from 2003 to 2009 separate from the percent increase expressed by the slope.
#fit model
[Link] = lm(rice2009 ~ rice2003, data = prices)
[Link]
##
## Call:
## lm(formula = rice2009 ~ rice2003, data = prices)
##
## Coefficients:
## (Intercept) rice2003
## 12.5842 0.5014
d) From a visual inspection of the plot, identify two potentially influential points and explain
your reasoning.
The two points furthest to the right on the plot (i.e., with high x values) are potentially
influential. Not only do they have high leverage, but their y-values do not follow the general
trend exhibited by most of the other observations of a price increase; instead, prices dropped
substantially in these countries.
e) Fit a new linear model to the data excluding the two points identified in part d). Based on
the results, does it seem these points are influential? Explain your answer.
Yes, it seems these points are influential. The model excluding these points has a quite differ-
ent slope: 1.41. Most notably, it is larger than 1, which leads to the opposite interpretation
that the previous model did; when excluding Mumbai and Nairobi, the overall trend is that
the price for 1 kg of rice is on average higher in 2009 than in 2003.
#identify influential points
prices$city[[Link](prices$rice2003)]
## [1] Mumbai
## 54 Levels: Amsterdam Athens Auckland Bangkok Barcelona Berlin ... Warsaw
prices$city[prices$rice2003 > 60 & prices$rice2003 < 80]
## [1] Nairobi
## 54 Levels: Amsterdam Athens Auckland Bangkok Barcelona Berlin ... Warsaw
11
#fit new model
exclude = c([Link](prices$rice2003), which(prices$city == "Nairobi"))
[Link].2 = lm(rice2009 ~ rice2003, data = prices,
subset = -exclude)
[Link].2
##
## Call:
## lm(formula = rice2009 ~ rice2003, data = prices, subset = -exclude)
##
## Coefficients:
## (Intercept) rice2003
## 1.411 1.183
#create new plot
plot(prices$rice2009 ~ prices$rice2003,
main = "Price for 1 kg Rice in 2009 vs 2003",
xlab = "Price in 2003 (min)", ylab = "Price in 2009 (min)")
abline([Link], col = "blue")
abline([Link].2, col = "red")
[Scatterplot: "Price for 1 kg Rice in 2009 vs 2003" with the original fit (blue) and the refit excluding Mumbai and Nairobi (red).]
Problem 6.
A study was conducted on children in cities from the Flanders region in Belgium to assess whether
a relationship exists between the fluoride content in a public water supply and the dental caries
experience of children with access to the supply.
The file [Link] contains some observations from the study. The fluoride content of the public
water supply in each city, measured in parts per million (ppm), is saved as the variable fluoride;
the number of dental caries per 100 children examined is saved as the variable caries. The number
of dental caries is calculated by summing the numbers of filled teeth, teeth with untreated dental
caries, teeth requiring extraction, and missing teeth at the time of the study.
a) Create a plot that shows the relationship between fluoride content and caries experience.
Add the least squares regression line to the scatterplot.
#load the data
load("datasets/[Link]")
#plot the data with the least squares line
plot(caries ~ fluoride, data = water); abline(lm(caries ~ fluoride, data = water))
[Scatterplot of caries per 100 children versus fluoride (ppm) with the least squares line; the points follow a curved pattern around the line.]
b) Based on the plot from part a), comment on whether the model assumptions of linearity and
constant variability seem reasonable for these data.
Linearity is not valid; the data follow a curvilinear pattern, with points above the line at
low and high values of x and below the line for moderate values of x. It is difficult to assess
constant variability with so few observations; even though the data appear more variable
from the line for moderate values of x, there are too few observations across most of the plot
to judge variability about the line.
c) Use a residual plot to assess the model assumptions of linearity and constant variability.
Comment on whether the residual plot reveals any information that was not evident from
the plot from part b).
The previous assessments still appear valid. It is clearer in this plot that constant variability
is not satisfied, and easier to see that there is certainly not the even scatter around the line
that would suggest a linear relationship.
#store model fit as [Link]
[Link] = lm(water$caries ~ water$fluoride)
plot(resid([Link]) ~ fitted([Link]),
main = "Residual Plot for Caries versus Fluoride",
xlab = "Predicted Number of Caries",
ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
[Figure: "Residual Plot for Caries versus Fluoride"; residuals versus predicted number of caries.]
The file water_new.Rdata contains data from a more recent study conducted across 175 cities in
Belgium. Repeat the analyses from parts a) - c) with the new data.
d) Create a plot that shows the relationship between fluoride content and caries experience in
the new data. Add the least squares regression line to the scatterplot.
#load the data
load("datasets/water_new.Rdata")
#plot the new data with the least squares line
plot([Link]$caries ~ [Link]$fluoride); abline(lm([Link]$caries ~ [Link]$fluoride))
[Scatterplot of caries versus fluoride in the new data, with the least squares line.]
e) Based on the plot from part d), comment on whether the model assumptions of linearity and
constant variability seem reasonable for these data.
It seems that linearity and constant variability are reasonable for these data. The line seems
to go through the center of the cloud of points, and the spread of points around the line
appears constant across the range of x-values.
f) Use a residual plot to assess the model assumptions of linearity and constant variability.
Comment on whether the residual plot reveals any information that was not evident from
the plot from part e).
The residual plot makes it clearer that linearity and constant variability are not reasonable
assumptions. In particular, the slight curvature of the data is more exaggerated in the residual
plot. It is also possible to see that the variability is somewhat lower for moderate predicted
y-values than for high or low ones.
#store model fit as [Link]
[Link] = lm([Link]$caries ~ [Link]$fluoride)
plot(resid([Link]) ~ fitted([Link]),
main = "Residual Plot for Caries versus Fluoride",
xlab = "Predicted Number of Caries",
ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
[Figure: "Residual Plot for Caries versus Fluoride" for the new data.]
Problem 7.
The data file [Link] contains data for the proportion of male births (annually) in four
countries: Denmark, the Netherlands, Canada, and the United States. This problem explores the
relationship between proportion of male births and time (as measured by year).
a) Create a scatterplot for each country that includes the regression line. Be sure the plots are
clearly labeled and that each plot has the same bounds on both axes.
#load the data
malebirths = [Link]("datasets/[Link]")
#fit a model for each country
fit1 = lm(denmark ~ year, data = malebirths)
fit2 = lm(netherlands ~ year, data = malebirths)
fit3 = lm(canada ~ year, data = malebirths)
fit4 = lm(usa ~ year, data = malebirths)
#set shared axis limits
xlims = range(malebirths$year)
ylims = range(malebirths[, c("denmark", "netherlands", "canada", "usa")])
#generate plots
par(mfrow = c(2,2))
plot(denmark ~ year, data = malebirths,
ylim = ylims, xlim = xlims, main = "Denmark")
abline(fit1, col="red")
plot(netherlands ~ year, data = malebirths,
ylim = ylims, xlim = xlims, main = "Netherlands")
abline(fit2, col="red")
plot(canada~year,data = malebirths,
ylim = ylims, xlim = xlims, main = "Canada")
abline(fit3, col="red")
plot(usa~year,data = malebirths,
ylim = ylims, xlim = xlims, main = "USA")
abline(fit4, col="red")
[Four-panel figure: proportion of male births versus year (1950–1990) for Denmark, Netherlands, Canada, and USA, each with its fitted regression line.]
b) For each country, assess whether there is evidence of a significant association between
proportion of male births and year. Report b1 , the standard error of b1 , the related t-statistic,
and the related p-value for each model.
There is evidence of a significant association between proportion of male births and time in
all four countries; all p-values are lower than α = 0.05.
Country β̂1 S.E.(β̂1 ) t-statistic p-value
Denmark −4.29 × 10−5 2.07 × 10−5 −2.07 0.0442
Netherlands −8.08 × 10−5 1.42 × 10−5 −5.71 < 0.00001
Canada −1.11 × 10−4 2.77 × 10−5 −4.02 0.00074
USA −5.43 × 10−5 9.39 × 10−6 −5.78 0.00001
summary(fit1)$coef; summary(fit2)$coef
summary(fit3)$coef; summary(fit4)$coef
Problem 8.
This problem uses data from the National Health and Nutrition Examination Survey (NHANES),
a survey conducted annually by the US Centers for Disease Control (CDC). The data can be
treated as if it were a simple random sample from the American population. The dataset
[Link].500 in the oibiostat package contains data for 500 participants ages 21 years
or older that were randomly sampled from the complete NHANES dataset that contains 10,000
observations.
Regular physical activity is important for maintaining a healthy weight, boosting mood, and
reducing risk for diabetes, heart attack, and stroke. In this problem, you will be exploring
the relationship between weight (Weight) and physical activity (PhysActive) using the data in
[Link].500. Weight is measured in kilograms. The variable PhysActive is coded Yes
if the participant does moderate or vigorous-intensity sports, fitness, or recreational activities, and
No if otherwise.
a) Explore the data.
i. Identify how many individuals are physically active.
250 adults are physically active.
#load the data
data("[Link].500")
#tabulate physical activity
table([Link].500$PhysActive)
##
## No Yes
## 250 250
ii. Create a plot that shows the association between weight and physical activity. Describe
what you see.
Median weight is slightly higher in the group that is not physically active relative to the
group that is physically active. The spread appears roughly similar between groups; the
width of the interquartile range seems about equal. There are a few upper outliers in
both groups, but two especially high outliers in the group that is not physically active.
#load colorbrewer
library(RColorBrewer)
#create a plot
boxplot(Weight ~ PhysActive, data = [Link].500,
xlab = "Physically Active", ylab = "Weight (kg)", col = [Link](2, "Set1"),
main = "Weight by Physically Active Indicator", cex = 0.75)
[Boxplot: "Weight by Physically Active Indicator"; Weight (kg) by physical activity status (No/Yes).]
b) Fit a linear regression model to relate weight and physical activity. Report the estimated
coefficients from the model and interpret them in the context of the data.
The estimated coefficients from the model are b0 = 85.98 and b1 = −4.29. The intercept is the
sample mean weight in the group who are not physically active. The slope is the difference in
means between the two physical activity groups, where the not physically active group is the
baseline group; thus, -4.29 indicates that the mean weight in the physically active group is
4.29 kg lower than in the not physically active group.
#fit a linear model
[Link] = lm(Weight ~ PhysActive, data = [Link].500)
coef([Link])
## (Intercept) PhysActiveYes
## 85.976210 -4.287455
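The group-mean interpretation of the coefficients can be illustrated on simulated data (the variable names below are invented; this is not the NHANES sample):

```r
# sketch: with a binary predictor, the intercept equals the baseline group
# mean and the slope equals the difference in group means
set.seed(5)
g = rep(c("No", "Yes"), each = 30)
w = rnorm(60, mean = ifelse(g == "Yes", 82, 86), sd = 20)
fit = lm(w ~ g)
means = tapply(w, g, mean)
coef(fit)[["(Intercept)"]] - means[["No"]]              # 0
coef(fit)[["gYes"]] - (means[["Yes"]] - means[["No"]])  # 0
```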
c) Report a 95% confidence interval for the slope parameter and interpret the interval in the
context of the data. Based on the interval, is there sufficient evidence at α = 0.05 to reject the
null hypothesis of no association between weight and physical activity?
The 95% confidence interval for the slope parameter is (-7.98, -0.60) kg. With 95% confidence,
the difference in mean weight between adults who are not physically active and those who
are physically active lies within the interval (0.60, 7.98) kg; the observed evidence suggests
mean weight is higher in adults who are not physically active. Since the 95% interval does not
contain the null value (0), there is sufficient evidence at α = 0.05 to reject the null hypothesis
of no association between weight and physical activity.
#calculate a confidence interval
confint([Link], [Link] = 0.95)
## 2.5 % 97.5 %
## (Intercept) 83.362715 88.5897044
## PhysActiveYes -7.979782 -0.5951278
d) Report and interpret an approximate 95% prediction interval for the weight of an individual
who is physically active.
With 95% confidence, the interval (40.53, 122.85) kg contains the predicted weight for
an individual in the population who is physically active.
[Link] = predict([Link],
newdata = [Link](PhysActive = "Yes"))
[Link] = qt(0.975, df = nrow([Link].500) - 2)
se = summary([Link])$sigma
m = [Link] * se
[Link] - m; [Link] + m
## 1
## 40.53241
## 1
## 122.8451
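As a cross-check, R's predict() can compute an exact prediction interval. The sketch below uses simulated data (invented names, not the NHANES sample) to show that the exact interval is slightly wider than the approximate t* × s margin used above, since it also accounts for uncertainty in the estimated mean:

```r
# sketch: exact prediction interval from predict() versus approximate margin
set.seed(2)
d = data.frame(g = rep(c("No", "Yes"), each = 50))
d$w = rnorm(100, mean = ifelse(d$g == "Yes", 82, 86), sd = 20)
m = lm(w ~ g, data = d)
approx.margin = qt(0.975, df = 100 - 2) * summary(m)$sigma
exact = predict(m, newdata = data.frame(g = "Yes"), interval = "prediction")
(exact[, "upr"] - exact[, "lwr"]) / 2   # slightly larger than approx.margin
```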
e) Suppose that upon seeing the results from part c), your friend claims that these data represent
evidence that being physically active promotes weight loss. Do you agree with your friend?
Explain your answer.
No, it would be inappropriate to argue that these data represent evidence that being physi-
cally active promotes weight loss. The data are observational, so it is not possible to draw
causal conclusions; there may well be unobserved confounding variables that are behind the
observed trend.
f) In the context of these data, would you prefer to conduct inference using the linear regression
approach or the two-sample t-test approach? Explain your answer.
The linear regression approach assumes that the variance between groups is constant, while
the two-sample t-test does not. The observed variances are 530 in the not physically active
group and 348 in the physically active group. The variances are similar enough that for these
data, the p-value from either method is 0.23. A statistical argument in favor of using the
t-test is that assuming variances are not equal is the more conservative perspective; doing so
typically produces a larger p-value than assuming equal variances.
It is appropriate to assume variances are equal when there is a scientific basis for thinking that
the population variances are equal. Here, it may be reasonable to expect that the population
variances are different, with variance for individuals who are physically active being smaller,
because the upper bound on weight for individuals who are physically active may be lower
than for individuals who are not physically active.
tapply([Link].500$Weight, [Link].500$PhysActive, var,
[Link] = TRUE)
## No Yes
## 530.1573 347.8227
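The equivalence between the regression slope test and the pooled two-sample t-test can be verified on simulated data (the names below are invented):

```r
# sketch: the slope p-value from lm() is identical to the pooled
# (equal-variance) two-sample t-test p-value
set.seed(1)
active = rep(c("No", "Yes"), each = 50)
weight = rnorm(100, mean = ifelse(active == "Yes", 82, 86), sd = 20)
p.lm = summary(lm(weight ~ active))$coef["activeYes", "Pr(>|t|)"]
p.t  = t.test(weight ~ active, var.equal = TRUE)$p.value
all.equal(p.lm, p.t)   # TRUE
```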
g) Suppose that the estimated slope coefficient from the model were positive (and statistically
significant). Propose at least two possible explanations for such a trend.
A positive slope coefficient would indicate that mean weight in the physically active group is
higher than in the not physically active group.
One possible explanation is response bias. Since being physically active is generally regarded
as a desirable health behavior, participants may be motivated to respond untruthfully that
they are physically active when they are not; this motivation may be especially strong for
participants with high weight values who feel negatively about their weight, but less so for
those with low weight values.
It may be that people who are physically active are trying to lose weight or avoid gaining
weight, while those who are not physically active tend to have faster metabolisms and
maintain a low weight without being physically active.
Another possible explanation is that people who are physically active build more muscle
mass than those who are not. Having higher weight is not necessarily indicative of being
unhealthy.
Problem 9.
The file low_bwt.Rdata contains information for a random sample of 100 low birth weight infants
born in two teaching hospitals in Boston, Massachusetts.
The dataset contains the following variables:
– birthwt: the weight of the infant at birth, measured in grams
– gestage: the gestational age of the infant at birth, measured in weeks
– momage: the mother’s age at the birth of the child, measured in years
– toxemia: recorded as Yes if the mother was diagnosed with toxemia during pregnancy, and
No otherwise
– length: length of the infant at birth, measured in centimeters
– headcirc: head circumference of the infant at birth, measured in centimeters
The condition toxemia, also known as preeclampsia, is characterized by high blood pressure and
protein in urine by the 20th week of pregnancy; left untreated, toxemia can be life-threatening.
a) Fit a linear model estimating the association between birth weight and toxemia status.
i. Write the model equation.
The model equation is predicted birthwt = 1097.22 + 7.79(toxemiaYes), where
toxemiaYes = 1 if the mother was diagnosed with toxemia and 0 otherwise.
ii. Report a 95% confidence interval for the slope and interpret the interval.
The 95% confidence interval for the slope is (-124.4, 140.0) g. We are 95% confident that
the interval (-124.4, 140.0) g captures the population difference in mean birth weight
between infants born to mothers diagnosed with toxemia and those born to mothers not
diagnosed with toxemia.
#load the dataset
load("datasets/low_bwt.Rdata")
#fit a linear model
lm([Link]$birthwt ~ [Link]$toxemia)
##
## Call:
## lm(formula = [Link]$birthwt ~ [Link]$toxemia)
##
## Coefficients:
## (Intercept) [Link]$toxemiaYes
## 1097.215 7.785
#calculate confidence interval
confint(lm([Link]$birthwt ~ [Link]$toxemia), level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 1036.6312 1157.7992
## [Link]$toxemiaYes -124.4203 139.9899
b) Using graphical summaries, explore the relationship between birth weight and toxemia status,
birth weight and gestational age, and gestational age and toxemia. Summarize your findings.
Infants with toxemia have a higher median birth weight than infants without toxemia.
Gestational age and birth weight are positively correlated; infants with higher values of
gestational age typically have higher birth weights. Infants with toxemia have a higher
median gestational age than infants without toxemia.
par(mfrow = c(2, 2))
boxplot(birthwt ~ toxemia, data = [Link], main = "Birth Weight vs Toxemia")
plot(birthwt ~ gestage, data = [Link], main = "Birth Weight vs. Gestational Age")
boxplot(gestage ~ toxemia, data = [Link], main = "Gestational Age vs Toxemia")
[Three-panel figure: "Birth Weight vs Toxemia" (boxplot), "Birth Weight vs. Gestational Age" (scatterplot), and gestational age by toxemia (boxplot).]
c) Fit a multiple regression model with toxemia and gestational age as predictors of birth weight.
i. Evaluate whether the assumptions for linear regression are reasonably satisfied.
Linearity is satisfied; the residual plot against observed values of gestational age shows
that there is no non-linearity relative to gestational age once the model has been fit. The
variability of the residuals is roughly constant, although the residuals are less variable
at the lowest and highest predicted values of birth weight. It is reasonable to assume the
observations are independent; it is unlikely that infants from a random sample of 100
infants across two teaching hospitals would be related. The residuals are approximately
normally distributed, with some deviations in the upper tail and small deviation in the
center. It seems reasonable to fit the linear regression model to these data.
ii. Interpret the coefficients of the model, and comment on whether the intercept has a
meaningful interpretation.
The coefficient for toxemia indicates that children whose mothers experience toxemia
have an average predicted birth weight that is 206.6 grams lower than children whose
mothers do not experience toxemia (holding gestational age constant).
The coefficient for gestational age indicates that, on average, a one-week increase in
gestational age is associated with an 84-gram increase in birth weight (holding toxemia
status constant).
The intercept does not have a meaningful interpretation; it corresponds to an individual
born to a mother who did not experience toxemia, at gestational age of 0 weeks. It is not
possible for an infant to have gestational age of 0 weeks.
iii. Write the model equation and predict the average birth weight for an infant born to a
mother diagnosed with toxemia with gestational age 31 weeks.
The average birth weight predicted for an infant born to a mother diagnosed with tox-
emia with gestational age 31 weeks is 1,113 grams: −1286.2 − (206.59)(1) + (84.06)(31) =
1113 grams.
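The arithmetic can be confirmed directly from the reported coefficients:

```r
# arithmetic check using the coefficients from the model summary
-1286.200 - 206.591 * 1 + 84.058 * 31   # about 1113 grams
```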
iv. The simple regression model and multiple regression model disagree regarding the
nature of the association between birth weight and toxemia. Briefly explain the reason
behind the discrepancy. Which model do you prefer for understanding the relationship
between birth weight and toxemia, and why?
Gestational age is a confounder for the apparent positive association seen in the model
for predicting birth weight that only included toxemia experience. As seen from the plot
of gestational age by toxemia experience, infants who experience toxemia tend to have a
longer gestational period. Longer gestation is positively associated with birth weight.
The association between birth weight and toxemia can be estimated more accurately after
adjusting for gestational age as a confounder; thus, the multiple regression model is
preferable. The model indicates that after adjusting for gestational age, there is a nega-
tive association between infant birth weight and mother toxemia status. Additionally,
the multiple regression model explains 52% of the variability in birth weight and has a
much higher adjusted R2 than the original simple regression model (0.51 versus -0.01).
Note: Since toxemia typically begins after the 20th week of pregnancy, it is expected that the
subset of mothers who experience toxemia will have infants with higher values of gestational
age. Mothers who have longer pregnancies are more likely to be diagnosed with toxemia.
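The gestational-age-by-toxemia comparison referenced above can be sketched with side-by-side boxplots. Note that `birthwt.data` below is a placeholder name for the dataset used in this problem; substitute the actual data frame name.

```r
#side-by-side boxplots of gestational age by toxemia status
#'birthwt.data' is a placeholder for the dataset used in this problem
boxplot(gestage ~ toxemia, data = birthwt.data,
        xlab = "Toxemia", ylab = "Gestational Age (weeks)")
```

Infants in the toxemia group should appear shifted toward higher gestational ages, consistent with the confounding described above.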
#fit multiple regression model
model <- lm(birthwt ~ toxemia + gestage, data = [Link])
summary(model)
##
## Call:
## lm(formula = birthwt ~ toxemia + gestage, data = [Link])
##
## Residuals:
## Min 1Q Median 3Q Max
## -615.54 -133.84 16.49 157.67 372.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1286.200 234.918 -5.475 3.43e-07 ***
## toxemiaYes -206.591 51.078 -4.045 0.000105 ***
## gestage 84.058 8.251 10.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189.6 on 97 degrees of freedom
## Multiple R-squared: 0.517, Adjusted R-squared: 0.507
## F-statistic: 51.91 on 2 and 97 DF, p-value: 4.703e-16
predict(model, newdata = data.frame(toxemia = "Yes", gestage = 31))
## 1
## 1113.006
summary(lm(birthwt ~ toxemia, data = [Link]))$adj.r.squared
## [1] -0.01006334
#evaluate assumptions
par(mfrow = c(2, 2))
#residual plot against predicted values
plot(fitted(model), resid(model), xlab = "Predicted Birth Weight", ylab = "Residual",
     pch = 21, col = COL[6], bg = COL[6, 4])
abline(h = 0, lty = 2)
#normal probability plot
qqnorm(resid(model), pch = 21, col = COL[6], bg = COL[6, 4])
qqline(resid(model))
[Figure: residual plots against observed gestational age and predicted birth weight, and a normal probability plot of the residuals.]
Problem 10.
A survey of Harvard students was conducted to measure a number of variables of interest. The file
[Link] contains the self-reported number of hours of exercise per week for n = 225 students
split across the four class years, along with a number of binary variables, where 1 indicates “Yes”
and 0 indicates “No”.
In this problem, you will work with the following 10 predictor variables: class year (classyear),
sex (sex), concentration (conc), hair color (hair), vegetarian (vegetarian), wears glasses or contacts
(glasses), on an athletic team (athlete), regularly drinks coffee (coffee), height (height), and
exercise hours per week (exercise).
The dataset has been randomly split into “halves”, with exercise_half1.csv containing data for
112 students and exercise_half2.csv containing data for 113 students.
a) Using exercise_half1.csv, fit a regression model to predict resting heart rate (heartrate) in
beats per minute (bpm) from the 10 predictors listed above. Name this model1. Which slope
coefficients are significant at the α = 0.10 significance level?
The predictors hairbrown, exercise, sexmale, and glasses are significant at the α = 0.10
level.
#load the data
exercise1 = read.csv("datasets/exercise_half1.csv")
#fit model
model1 = lm(heartrate ~ classyear + sex + conc + hair + vegetarian +
glasses + athlete + coffee + height + exercise, data = exercise1)
summary(model1)
##
## Call:
## lm(formula = heartrate ~ classyear + sex + conc + hair + vegetarian +
## glasses + athlete + coffee + height + exercise, data = exercise1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.703 -4.924 -0.126 3.863 41.222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.8699 22.0516 2.806 0.00609 **
## classyearjunior 1.0047 2.7409 0.367 0.71477
## classyearsenior 2.9290 2.8710 1.020 0.31023
## classyearsophomore 0.1177 2.2283 0.053 0.95797
## sexmale -4.9989 2.8628 -1.746 0.08402 .
## conceconomics -0.8926 2.9191 -0.306 0.76043
## concnatural_sciences -1.2516 2.8689 -0.436 0.66364
## concsocial_sciences 1.8237 2.7575 0.661 0.50999
## hairblonde -2.7185 3.4357 -0.791 0.43076
## hairbrown -4.5015 1.9956 -2.256 0.02638 *
## hairother 2.8994 9.5813 0.303 0.76285
## vegetarian 1.1065 3.0354 0.365 0.71628
## glasses -3.7446 1.9348 -1.935 0.05591 .
## athlete -1.9608 4.2997 -0.456 0.64941
## coffee -1.0687 1.9591 -0.545 0.58670
## height 0.2292 0.3424 0.670 0.50478
## exercise -0.5370 0.2382 -2.254 0.02647 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.655 on 95 degrees of freedom
## Multiple R-squared: 0.2968, Adjusted R-squared: 0.1784
## F-statistic: 2.507 on 16 and 95 DF, p-value: 0.003071
b) It may be the case that a predictor that shows up as significant from a regression model is
not actually associated with the response on the population level; i.e., represents an instance
of Type I error. For each of the significant predictors from part a), briefly explain whether
you think a model fit using the second half of the dataset will also demonstrate that the
predictor is significantly associated with heart rate.
The observed association between resting heart rate and sex might be observed in the second
half of the data set, because biological differences in heart anatomy between sexes contribute
to females having on average higher resting heart rate than males. The observed association
between resting heart rate and glasses is probably spurious and not likely to be observed in
the second half, because there does not seem to be a logical explanation for the association.
The observed association between resting heart rate and brown hair is probably spurious, for
the same reason. The observed association between resting heart rate and exercise is most
likely to hold up in the second half of the data set, because exercising is known to have an
impact on resting heart rate.
c) Using exercise_half2.csv, fit a regression model to predict resting heart rate (heartrate) in
beats per minute (bpm) from the 10 predictors listed above. Name this model2. Are any of
the slope coefficients identified as significant in Model 1 also significant in Model 2? If
so, which one(s)?
The exercise predictor from the previous model is still significant.
#load the data
exercise2 = read.csv("datasets/exercise_half2.csv")
#fit model
model2 = lm(heartrate ~ classyear + sex + conc + hair + vegetarian +
glasses + athlete + coffee + height + exercise, data = exercise2)
summary(model2)
##
## Call:
## lm(formula = heartrate ~ classyear + sex + conc + hair + vegetarian +
## glasses + athlete + coffee + height + exercise, data = exercise2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.052 -5.369 0.392 5.915 35.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.7374 27.6145 1.874 0.064001 .
## classyearjunior 11.5120 3.1945 3.604 0.000497 ***
## classyearsenior 1.5930 3.6839 0.432 0.666389
## classyearsophomore 6.0452 2.3297 2.595 0.010931 *
## sexmale 3.0522 3.4365 0.888 0.376644
## conceconomics 1.9218 3.3787 0.569 0.570804
## concnatural_sciences 0.8544 3.3756 0.253 0.800724
## concsocial_sciences 0.1084 3.0958 0.035 0.972147
## hairblonde -2.7632 4.1315 -0.669 0.505201
## hairbrown -1.7374 2.1548 -0.806 0.422042
## hairother 9.4918 11.1990 0.848 0.398769
## vegetarian -4.5235 3.4048 -1.329 0.187114
## glasses 0.8748 2.0154 0.434 0.665217
## athlete 2.7822 3.7554 0.741 0.460575
## coffee 2.5431 2.4362 1.044 0.299135
## height 0.1971 0.4385 0.449 0.654110
## exercise -0.5555 0.2261 -2.457 0.015795 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.883 on 97 degrees of freedom
## Multiple R-squared: 0.237, Adjusted R-squared: 0.1112
## F-statistic: 1.883 on 16 and 97 DF, p-value: 0.03108
d) The code shown in the template creates a plot to visualize the results of the two models.
It plots a horizontal line connecting the t-statistics for each coefficient in the two models
and compares it to the critical t-value and 0 (plotted as dashed vertical lines). You are not
expected to know how to recreate a similar plot.
i. Briefly describe the most interesting features of the plot.
The three coefficients with the most consistent results are exercise, height, and
hairblonde. Of these, only exercise has a value outside the critical region in both
models. The majority of the coefficients have values that vary widely between the two
models, so it makes sense that a coefficient could, simply by chance, have a significant
statistic in one model and not the other.
ii. Explain what might have caused the differences between Model 1 and Model 2; i.e., why
were some predictors significant in one model and not the other?
The use of these two datasets demonstrates the multiple testing problem in a regression
context. When testing several predictors, it is possible that some are significantly
associated due to random chance; such signals may not be replicated in another sample.
Using a more stringent significance level can help with this problem but cannot solve
it completely; for example, the associations with class year in the second model were
"highly significant", yet were not supported by the first model.
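One way to account for the multiple testing described above is to apply a multiplicity adjustment to the coefficient p-values. The following is a sketch using base R's p.adjust with a Bonferroni correction, applied to model1 from part a):

```r
#extract p-values for the slope coefficients of model1 (dropping the intercept)
p.values = coef(summary(model1))[-1, 4]
#apply a Bonferroni correction for multiple testing
p.adjust(p.values, method = "bonferroni")
```

Under a Bonferroni correction at α = 0.10, none of the adjusted p-values for the Model 1 slope coefficients would remain significant, consistent with the failure of most predictors to replicate in Model 2.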
#extract t-statistics from each model
t1 = coef(summary(model1))[,3]
t2 = coef(summary(model2))[,3]
#define axes
plot(NA, xlim = c(min(t1, t2), max(t1, t2)), ylim = c(1, length(t1)),
xlab = "t-stats", ylab = "", yaxt = "n")
tstar = qt(0.950, model1$df.residual)
#add y-labels
mtext(names(coef(model1)), side = 2, at = 1:length(t1), cex = 0.5, las = 1)
[Figure: t-statistics for each model coefficient in Model 1 and Model 2, one row per coefficient (intercept, class year, sex, concentration, hair color, vegetarian, glasses, athlete, coffee, height, exercise), with dashed vertical lines at 0 and the critical t-value.]