BMI and Age Analysis in PREVEND Study
Statistics 104
Due November 12, 2019 at 11:59 pm
Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you
do collaborate with classmates on a problem, please list your collaborators on your solution.
Problem 1.
This problem uses data from the Prevention of REnal and Vascular END-stage Disease (PREVEND)
study, which took place between 2003 and 2006 in the Netherlands. Clinical and demographic
data for 4,095 individuals are stored in the prevend dataset in the oibiostat package.
Body mass index (BMI) is a measure of body fat that is based on both height and weight. The World
Health Organization and National Institutes of Health define a BMI of over 25.0 as overweight;
this guideline is typically applied to adults in all age groups. However, a recent study has reported
that individuals of ages 65 or older with the greatest mortality risk were those with BMI lower than
23.0, while those with BMI between 24.0 and 30.9 were at lower risk of mortality. These findings
suggest that the ideal weight-for-height in older adults may not be the same as in younger adults.
Explore the relationship between BMI (BMI) and age (age), using the data in [Link], a random
subset of 500 individuals from the larger prevend data.
a) Create a plot that shows the association between BMI and age. Based on the plot, comment
briefly on the nature of the association.
The association between BMI and age seems slightly positive; larger values of age generally
have higher values of BMI.
#load the data
library(oibiostat)
data("[Link]")
#create a plot
plot([Link]$BMI ~ [Link]$Age,
main = "BMI by Age in PREVEND (n = 500)",
xlab = "Age (years)", ylab = "BMI", cex = 0.75)
[Scatterplot: "BMI by Age in PREVEND (n = 500)"; BMI versus Age (years).]
b) Fit a linear regression model predicting BMI from age.
#fit a linear model
[Link] = lm(BMI ~ Age, data = [Link])
coef([Link])
## (Intercept) Age
## 23.62709808 0.05968762
i. Write the equation of the linear model.
The linear model equation can be written as either ŷ = 23.63 + 0.060x or
predicted BMI = 23.63 + 0.060(Age).
ii. Interpret the slope and intercept values in the context of the data. Comment on whether
the intercept value has any interpretive meaning in this setting.
According to the model slope, an increase in age of 1 year is associated with an average
increase in BMI of 0.060. The model intercept suggests that the average BMI for an
individual of age 0 is 23.63. The intercept value in this setting does not have any
interpretive value, because it is not valid to assess BMI for a newborn; in fact, BMI is a
measure specific to adults.
iii. Is it valid to use the linear model to estimate BMI for an individual who is 30 years old?
Explain your answer.
No, it is not valid to use the linear model to estimate BMI for an individual who is 30
years old. The data in the sample only extend from ages 36 to 81 and should not be
used to predict scores for individuals outside that age range. It may well be that the
relationship between BMI and age is different for individuals in a different age group.
iv. According to the linear model, estimate the average BMI for an individual who is 60
years old.
According to the linear model, the average BMI for an individual who is 60 years old is
23.627 + 0.0597(60) ≈ 27.21 (using the unrounded coefficients).
predict([Link], newdata = [Link](Age = 60))
## 1
## 27.20836
v. Based on the linear model, how much does BMI differ, on average, between an individual
who is 70 years old versus an individual who is 50 years old?
Based on the linear model, BMI differs, on average, by 0.060(20) = 1.20 for individuals
that have an age difference of 20 years; the older individual is expected to have a higher
BMI.
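As a quick numerical check, the difference can be computed directly from the slope reported in the model output above:

```r
# quick check of the 20-year difference in predicted BMI
b1 = 0.05968762      # slope from the model output above
b1 * (70 - 50)       # about 1.19
```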
c) Create residual plots to assess the model assumptions of linearity, constant variability, and
normally distributed residuals. In your assessment of whether an assumption is reasonable,
be sure to clearly reference and interpret relevant features of the appropriate plot.
par(mfrow = c(1, 2))
#create residual plot
plot(resid([Link]) ~ fitted([Link]), main = "Residual Plot for BMI vs Age",
xlab = "Predicted BMI", ylab = "Residual", cex = 0.75)
abline(h = 0, lty = 2, col = "red")
#create QQ Plot
qqnorm(resid([Link]), cex = 0.75)
qqline(resid([Link]))
[Two-panel figure: "Residual Plot for BMI vs Age" (residuals versus predicted BMI) and "Normal Q−Q Plot" (sample quantiles versus theoretical quantiles).]
i. Assess linearity.
The data show an approximately linear trend. The points scatter evenly above and below
the horizontal line and there is no indication of curvature.
ii. Assess constant variance.
Variance is roughly constant across the range of predicted BMI values, although the
variance may be slightly lower at the low end of the predicted range.
iii. Assess normality of residuals.
Most of the distribution is approximately normal, but there is evidence of departure
from normality in the tails, particularly in the upper tail. There are more large residuals
than would be expected for a normal distribution (and fewer small residuals).
iv. Suppose that a point is located in the uppermost right corner on a Q-Q plot of residuals
(from a linear model). In one sentence, describe where that point would necessarily be
located on a scatterplot of the data.
If a point is located in the uppermost right-corner on a Q-Q plot of residuals, the residual
is large and positive; thus, this indicates a point that is located well above where the line
of best fit would be on a scatterplot of the data.
d) Conduct a formal hypothesis test of no association between BMI and age, at the α = 0.05
significance level. Summarize your conclusions.
A test of no association between two variables is a test of the null H0 : β1 = 0 against the
alternative HA : β1 ≠ 0. Let α = 0.05. The t-statistic is 3.40, with associated p-value of 7.28 × 10−4 .
Since p < α, there is sufficient evidence to reject H0 in favor of HA . From the observed data,
there is statistically significant evidence of a positive association between BMI and age; an
increase in age of 1 year is significantly associated with a mean increase in BMI of 0.060.
#conduct hypothesis test
summary([Link])$coef
Problem 2.
For this problem, do not use the functions cor() or lm() (except for checking your calculations);
however, you are welcome to use R to make plots and compute values. This is the largest dataset
for which you will be asked to estimate a regression line ‘by hand’.
Consider the five ordered (x, y) pairs (1, 5), (2, 4), (2.5, 3.5), (3, 0.5), and (4, 0).
a) Plot the data.
#plot the data
x = c(1, 2, 2.5, 3, 4)
y = c(5, 4, 3.5, 0.5, 0)
plot(y ~ x)
[Scatterplot of the five (x, y) points, showing a negative trend.]
b) Calculate the correlation between x and y. Show your work, either in the form of algebraic
work or from using R as a calculator.
The correlation between x and y is -0.93.
r = (1/(n − 1)) Σᵢ₌₁ⁿ [(xᵢ − x̄)/sx][(yᵢ − ȳ)/sy]
  = (1/(5 − 1)) [((1 − 2.5)/1.12)((5 − 2.6)/2.22) + ((2 − 2.5)/1.12)((4 − 2.6)/2.22) + · · · + ((4 − 2.5)/1.12)((0 − 2.6)/2.22)]
  = −0.93
#use r as a calculator
[Link] = mean(x); s.x = sd(x)
[Link] = mean(y); s.y = sd(y)
n = length(x)
r = sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * s.x * s.y); r
## [1] -0.9320165
#confirm answer
cor(x, y)
## [1] -0.9320165
c) Estimate the slope and y-intercept of the least-squares regression line predicting y from x for
these data.
The slope of the least-squares line is b1 = r(sy/sx) = (−0.932)(2.219/1.118) = −1.85.
The y-intercept of the least-squares line is b0 = ȳ − b1 x̄ = 2.6 − (−1.85)(2.5) = 7.225.
#use r as a calculator
b1 = (s.y*r)/s.x; b1
## [1] -1.85
b0 = [Link] - b1*[Link]; b0
## [1] 7.225
#confirm answer
coef(lm(y ~ x))
## (Intercept) x
## 7.225 -1.850
d) Plot the data with the regression line calculated in part c).
#plot the data with the regression line
x = c(1, 2, 2.5, 3, 4); y = c(5, 4, 3.5, 0.5, 0)
plot(y ~ x)
abline(b0, b1, col = "red")
[Scatterplot of the five points with the fitted regression line.]
Problem 3.
The utility dataset contains the average utility bills (in USD) for homes of a particular size and
the average monthly outdoor temperature in Fahrenheit.
a) Plot the data. From a visual inspection, does there seem to be a linear relationship between
the variables? Why or why not?
There does not seem to be a linear relationship; instead, the relationship appears quadratic.
#load the data
utility = [Link]("[Link]")
#create a plot
plot(utility$bill ~ utility$temp,
main = "Utility Bill versus Average Outdoor Monthly Temp",
xlab = "Temperature (F)", ylab = "Bill Amount (USD)")
[Scatterplot: "Utility Bill versus Average Outdoor Monthly Temp"; Bill Amount (USD) versus Temperature (F), showing a U-shaped pattern.]
b) Fit a linear model to the data. From the linear model, estimate the average utility bill if the
average monthly temperature is 120 degrees. Explain whether the estimate is reasonable.
The prediction from the linear model for average utility bill when monthly outdoor temper-
ature is 120 F is $86.04. This is not a reasonable answer based on the observed quadratic
relationship; a very high bill should be expected for a high temperature, while a value around
$80 is at the low end of the observed bill values. It is also not advisable to extrapolate; we
have only observed temperatures as high as 90 degrees.
#fit a model
model = lm(bill ~ temp, data = utility)
#calculate estimate
predict(model, newdata = [Link](temp = 120))
## 1
## 86.03667
Problem 4.
Suppose that a class of 360 students has just taken an exam. The exam consisted of 40 true-false
questions, each of which was worth one point. A diligent teaching fellow has recorded the number
of correct answers (Y ) and the number of incorrect answers (X) for each student. Suppose that the
teaching fellow then fits a linear model to estimate Y from X. Note: no data are necessary to solve
this problem.
a) What will be the values of b0 , b1 , and R2 ?
In a model with the number of correct answers as the dependent variable and number of
incorrect answers as the independent variable, b0 is the number of correct answers when the
number of incorrect answers is 0. Thus, b0 = 40; if no questions are answered incorrectly out
of 40 questions, then 40 must have been answered correctly.
The model slope, b1 , represents the change in the number of correct answers for a one
unit change in the number of incorrect answers. A one question increase in the number of
incorrect answers corresponds to a one question decrease in the number of correct answers;
thus, b1 = −1.
The R2 can be calculated using the formula R2 = 1 − Var(ei)/Var(yi). Since there is no error
around the least squares line (the line perfectly predicts the number of correct answers), the
variance of the residuals is 0 and R2 = 1.
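This can be checked numerically; the sketch below uses simulated (not real) exam scores, so the variable names are invented:

```r
# hypothetical check: regressing correct on incorrect answers for a
# 40-question exam recovers b0 = 40, b1 = -1, and R^2 = 1
set.seed(104)
correct = sample(0:40, size = 360, replace = TRUE)
incorrect = 40 - correct
fit = lm(correct ~ incorrect)
coef(fit)               # (Intercept) 40, incorrect -1
summary(fit)$r.squared  # 1
```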
b) Is this a useful model to fit? Explain your answer.
This is not a useful model to fit to the data in the sense that it is unnecessary—the number of
correct answers completely determines the number of incorrect answers.
9
Problem 5.
The international bank UBS regularly produces a report on prices and earnings in major cities
throughout the world. Three of the measures they include are the prices of basic commodities: 1
kg of rice, 1 kg of bread, and the price of a Big Mac hamburger at McDonald’s.
An interesting feature of the prices they report is that prices are measured in the minutes of labor
required for a “typical” worker in that location to earn enough money to purchase the commodity.
Using minutes of labor corrects at least in part for currency fluctuations, prevailing wage rates,
and local prices.
The data file [Link] includes measurements for rice, bread, and Big Mac prices from the
2003 and 2009 reports. The year 2003 was before the major recession hit much of the world around
2006, and the year 2009 may reflect changes in price due to the recession.
In this problem, you will model rice prices in 2009 as the dependent variable and rice prices in
2003 as the independent variable.
a) Plot the data and the y = x line.
#load the data
prices = [Link]("datasets/[Link]")
#plot the data and the y = x line
plot(rice2009 ~ rice2003, data = prices); abline(0, 1)
[Scatterplot of rice2009 versus rice2003 with the y = x line.]
b) Briefly explain the key difference between points above versus points below the y = x line.
The points above the line represent countries for which the price in 2009 was higher than the
price in 2003, while the points below the line represent countries for which there was a price
decrease.
c) Fit a linear model to the data and interpret the slope coefficient. For these data, what is the
key difference between a slope larger than 1 versus smaller than 1?
The slope coefficient is 0.50. On average, a 1-minute increase in the 2003 price is associated
with a 0.50-minute increase in the 2009 price (after accounting for the intercept). A slope
smaller than 1 suggests that prices are typically lower in 2009 than in 2003, while a slope
larger than 1 suggests that prices are typically higher in 2009 than in 2003. The intercept for
this model can be thought of as a measure of inflation, in that it predicts a "baseline" increase
from 2003 to 2009 separate from the percent increase expressed by the slope.
#fit model
[Link] = lm(rice2009 ~ rice2003, data = prices)
[Link]
##
## Call:
## lm(formula = rice2009 ~ rice2003, data = prices)
##
## Coefficients:
## (Intercept) rice2003
## 12.5842 0.5014
d) From a visual inspection of the plot, identify two potentially influential points and explain
your reasoning.
The two points furthest to the right on the plot (i.e., with high x values) are potentially
influential. Not only do they have high leverage, but their y-values do not follow the general
trend exhibited by most of the other observations of a price increase; instead, prices dropped
substantially in these countries.
e) Fit a new linear model to the data excluding the two points identified in part d). Based on
the results, does it seem these points are influential? Explain your answer.
Yes, it seems these points are influential. The model excluding these points has a quite differ-
ent slope: 1.41. Most notably, it is larger than 1, which leads to the opposite interpretation
that the previous model did; when excluding Mumbai and Nairobi, the overall trend is that
the price for 1 kg of rice is on average higher in 2009 than in 2003.
#identify influential points
prices$city[[Link](prices$rice2003)]
## [1] Mumbai
## 54 Levels: Amsterdam Athens Auckland Bangkok Barcelona Berlin ... Warsaw
prices$city[prices$rice2003 > 60 & prices$rice2003 < 80]
## [1] Nairobi
## 54 Levels: Amsterdam Athens Auckland Bangkok Barcelona Berlin ... Warsaw
11
#fit new model
exclude = c([Link](prices$rice2003), which(prices$city == "Nairobi"))
[Link].2 = lm(rice2009 ~ rice2003, data = prices,
subset = -exclude)
[Link].2
##
## Call:
## lm(formula = rice2009 ~ rice2003, data = prices, subset = -exclude)
##
## Coefficients:
## (Intercept) rice2003
## 1.411 1.183
#create new plot
plot(prices$rice2009 ~ prices$rice2003,
main = "Price for 1 kg Rice in 2009 vs 2003",
xlab = "Price in 2003 (min)", ylab = "Price in 2009 (min)")
abline([Link], col = "blue")
abline([Link].2, col = "red")
[Scatterplot: "Price for 1 kg Rice in 2009 vs 2003" with the original fit (blue) and the refit excluding Mumbai and Nairobi (red).]
Problem 6.
A study was conducted on children in cities from the Flanders region in Belgium to assess whether
a relationship exists between the fluoride content in a public water supply and the dental caries
experience of children with access to the supply.
The file [Link] contains some observations from the study. The fluoride content of the public
water supply in each city, measured in parts per million (ppm), is saved as the variable fluoride;
the number of dental caries per 100 children examined is saved as the variable caries. The number
of dental caries is calculated by summing the numbers of filled teeth, teeth with untreated dental
caries, teeth requiring extraction, and missing teeth at the time of the study.
a) Create a plot that shows the relationship between fluoride content and caries experience.
Add the least squares regression line to the scatterplot.
#load the data
load("datasets/[Link]")
#plot the data with the least squares line
plot(caries ~ fluoride, data = water); abline(lm(caries ~ fluoride, data = water))
[Scatterplot of caries per 100 children versus fluoride (ppm) with the least squares line; the points follow a curved pattern around the line.]
b) Based on the plot from part a), comment on whether the model assumptions of linearity and
constant variability seem reasonable for these data.
Linearity is not valid; the data follow a curvilinear pattern, with points above the line at
low and high values of x and below the line for moderate values of x. It is difficult to assess
constant variability with so few observations; even though the data appear more variable
from the line for moderate values of x, there are too few observations across most of the plot
to judge variability about the line.
c) Use a residual plot to assess the model assumptions of linearity and constant variability.
Comment on whether the residual plot reveals any information that was not evident from
the plot from part b).
The previous assessments still appear valid. It is clearer in this plot that constant variability
is not satisfied, and easier to see that there is certainly not the even scatter around the line
that would suggest a linear relationship.
#store model fit as [Link]
[Link] = lm(water$caries ~ water$fluoride)
plot(resid([Link]) ~ fitted([Link]),
main = "Residual Plot for Caries versus Fluoride",
xlab = "Predicted Number of Caries",
ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
[Figure: "Residual Plot for Caries versus Fluoride"; residuals versus predicted number of caries.]
The file water_new.Rdata contains data from a more recent study conducted across 175 cities in
Belgium. Repeat the analyses from parts a) - c) with the new data.
d) Create a plot that shows the relationship between fluoride content and caries experience in
the new data. Add the least squares regression line to the scatterplot.
#load the data
load("datasets/water_new.Rdata")
#plot the new data with the least squares line
plot([Link]$caries ~ [Link]$fluoride); abline(lm([Link]$caries ~ [Link]$fluoride))
[Scatterplot of caries versus fluoride in the new data, with the least squares line.]
e) Based on the plot from part d), comment on whether the model assumptions of linearity and
constant variability seem reasonable for these data.
It seems that linearity and constant variability are reasonable for these data. The line seems
to go through the center of the cloud of points, and the spread of points around the line
appears constant across the range of x-values.
f) Use a residual plot to assess the model assumptions of linearity and constant variability.
Comment on whether the residual plot reveals any information that was not evident from
the plot from part e).
The residual plot makes it clearer that linearity and constant variability are not reasonable
assumptions. In particular, the slight curvature of the data is more exaggerated in the residual
plot. It is also possible to see that the variability is somewhat lower for moderate predicted
y-values than for high or low ones.
#store model fit as [Link]
[Link] = lm([Link]$caries ~ [Link]$fluoride)
plot(resid([Link]) ~ fitted([Link]),
main = "Residual Plot for Caries versus Fluoride",
xlab = "Predicted Number of Caries",
ylab = "Residual")
abline(h = 0, lty = 2, col = "red")
[Figure: "Residual Plot for Caries versus Fluoride" for the new data.]
Problem 7.
The data file [Link] contains data for the proportion of male births (annually) in four
countries: Denmark, the Netherlands, Canada, and the United States. This problem explores the
relationship between proportion of male births and time (as measured by year).
a) Create a scatterplot for each country that includes the regression line. Be sure the plots are
clearly labeled and that each plot has the same bounds on both axes.
#load the data
malebirths = [Link]("datasets/[Link]")
#fit a model for each country
fit1 = lm(denmark ~ year, data = malebirths)
fit2 = lm(netherlands ~ year, data = malebirths)
fit3 = lm(canada ~ year, data = malebirths)
fit4 = lm(usa ~ year, data = malebirths)
#set shared axis limits
xlims = range(malebirths$year)
ylims = range(malebirths[, c("denmark", "netherlands", "canada", "usa")])
#generate plots
par(mfrow = c(2,2))
plot(denmark ~ year, data = malebirths,
ylim = ylims, xlim = xlims, main = "Denmark")
abline(fit1, col="red")
plot(netherlands ~ year, data = malebirths,
ylim = ylims, xlim = xlims, main = "Netherlands")
abline(fit2, col="red")
plot(canada~year,data = malebirths,
ylim = ylims, xlim = xlims, main = "Canada")
abline(fit3, col="red")
plot(usa~year,data = malebirths,
ylim = ylims, xlim = xlims, main = "USA")
abline(fit4, col="red")
[Four-panel figure: proportion of male births versus year (1950–1990) for Denmark, Netherlands, Canada, and USA, each with its fitted regression line.]
b) For each country, assess whether there is evidence of a significant association between
proportion of male births and year. Report b1 , the standard error of b1 , the related t-statistic,
and the related p-value for each model.
There is evidence of a significant association between proportion of male births and time in
all four countries; all p-values are lower than α = 0.05.
Country β̂1 S.E.(β̂1 ) t-statistic p-value
Denmark −4.29 × 10−5 2.07 × 10−5 −2.07 0.0442
Netherlands −8.08 × 10−5 1.42 × 10−5 −5.71 < 0.00001
Canada −1.11 × 10−4 2.77 × 10−5 −4.02 0.00074
USA −5.43 × 10−5 9.39 × 10−6 −5.78 0.00001
summary(fit1)$coef; summary(fit2)$coef
summary(fit3)$coef; summary(fit4)$coef
Problem 8.
This problem uses data from the National Health and Nutrition Examination Survey (NHANES),
a survey conducted annually by the US Centers for Disease Control (CDC). The data can be
treated as if it were a simple random sample from the American population. The dataset
[Link].500 in the oibiostat package contains data for 500 participants ages 21 years
or older that were randomly sampled from the complete NHANES dataset that contains 10,000
observations.
Regular physical activity is important for maintaining a healthy weight, boosting mood, and
reducing risk for diabetes, heart attack, and stroke. In this problem, you will be exploring
the relationship between weight (Weight) and physical activity (PhysActive) using the data in
[Link].500. Weight is measured in kilograms. The variable PhysActive is coded Yes
if the participant does moderate or vigorous-intensity sports, fitness, or recreational activities, and
No if otherwise.
a) Explore the data.
i. Identify how many individuals are physically active.
250 adults are physically active.
#load the data
data("[Link].500")
#tabulate physical activity
table([Link].500$PhysActive)
##
## No Yes
## 250 250
ii. Create a plot that shows the association between weight and physical activity. Describe
what you see.
Median weight is slightly higher in the group that is not physically active relative to the
group that is physically active. The spread appears roughly similar between groups; the
width of the interquartile range seems about equal. There are a few upper outliers in
both groups, but two especially high outliers in the group that is not physically active.
#load colorbrewer
library(RColorBrewer)
#create a plot
boxplot(Weight ~ PhysActive, data = [Link].500,
xlab = "Physically Active", ylab = "Weight (kg)", col = [Link](2, "Set1"),
main = "Weight by Physically Active Indicator", cex = 0.75)
[Boxplot: "Weight by Physically Active Indicator"; Weight (kg) by physical activity status (No/Yes).]
b) Fit a linear regression model to relate weight and physical activity. Report the estimated
coefficients from the model and interpret them in the context of the data.
The estimated coefficients from the model are b0 = 85.98 and b1 = −4.29. The intercept is the
sample mean weight in the group who are not physically active. The slope is the difference in
means between the two physical activity groups, where the not physically active group is the
baseline group; thus, -4.29 indicates that the mean weight in the physically active group is
4.29 kg lower than in the not physically active group.
#fit a linear model
[Link] = lm(Weight ~ PhysActive, data = [Link].500)
coef([Link])
## (Intercept) PhysActiveYes
## 85.976210 -4.287455
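The group-mean interpretation of the coefficients can be illustrated on simulated data (the variable names below are invented; this is not the NHANES sample):

```r
# sketch: with a binary predictor, the intercept equals the baseline group
# mean and the slope equals the difference in group means
set.seed(5)
g = rep(c("No", "Yes"), each = 30)
w = rnorm(60, mean = ifelse(g == "Yes", 82, 86), sd = 20)
fit = lm(w ~ g)
means = tapply(w, g, mean)
coef(fit)[["(Intercept)"]] - means[["No"]]              # 0
coef(fit)[["gYes"]] - (means[["Yes"]] - means[["No"]])  # 0
```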
c) Report a 95% confidence interval for the slope parameter and interpret the interval in the
context of the data. Based on the interval, is there sufficient evidence at α = 0.05 to reject the
null hypothesis of no association between weight and physical activity?
The 95% confidence interval for the slope parameter is (-7.98, -0.60) kg. With 95% confidence,
the difference in mean weight between adults who are not physically active and those who
are physically active lies within the interval (0.60, 7.98) kg; the observed evidence suggests
mean weight is higher in adults who are not physically active. Since the 95% interval does not
contain the null value (0), there is sufficient evidence at α = 0.05 to reject the null hypothesis
of no association between weight and physical activity.
#calculate a confidence interval
confint([Link], [Link] = 0.95)
## 2.5 % 97.5 %
## (Intercept) 83.362715 88.5897044
## PhysActiveYes -7.979782 -0.5951278
d) Report and interpret an approximate 95% prediction interval for the weight of an individual
who is physically active.
With 95% confidence, the interval (40.53, 122.85) kg contains the predicted weight for
an individual in the population who is physically active.
[Link] = predict([Link],
newdata = [Link](PhysActive = "Yes"))
[Link] = qt(0.975, df = nrow([Link].500) - 2)
se = summary([Link])$sigma
m = [Link] * se
[Link] - m; [Link] + m
## 1
## 40.53241
## 1
## 122.8451
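As a cross-check, R's predict() can compute an exact prediction interval. The sketch below uses simulated data (invented names, not the NHANES sample) to show that the exact interval is slightly wider than the approximate t* × s margin used above, since it also accounts for uncertainty in the estimated mean:

```r
# sketch: exact prediction interval from predict() versus approximate margin
set.seed(2)
d = data.frame(g = rep(c("No", "Yes"), each = 50))
d$w = rnorm(100, mean = ifelse(d$g == "Yes", 82, 86), sd = 20)
m = lm(w ~ g, data = d)
approx.margin = qt(0.975, df = 100 - 2) * summary(m)$sigma
exact = predict(m, newdata = data.frame(g = "Yes"), interval = "prediction")
(exact[, "upr"] - exact[, "lwr"]) / 2   # slightly larger than approx.margin
```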
e) Suppose that upon seeing the results from part c), your friend claims that these data represent
evidence that being physically active promotes weight loss. Do you agree with your friend?
Explain your answer.
No, it would be inappropriate to argue that these data represent evidence that being physi-
cally active promotes weight loss. The data are observational, so it is not possible to draw
causal conclusions; there may well be unobserved confounding variables that are behind the
observed trend.
f) In the context of these data, would you prefer to conduct inference using the linear regression
approach or the two-sample t-test approach? Explain your answer.
The linear regression approach assumes that the variance between groups is constant, while
the two-sample t-test does not. The observed variances are 530 in the not physically active
group and 348 in the physically active group. The variances are similar enough that for these
data, the p-value from either method is 0.23. A statistical argument in favor of using the
t-test is that assuming variances are not equal is the more conservative perspective; doing so
typically produces a larger p-value than assuming equal variances.
It is appropriate to assume variances are equal when there is a scientific basis for thinking that
the population variances are equal. Here, it may be reasonable to expect that the population
variances are different, with variance for individuals who are physically active being smaller,
because the upper bound on weight for individuals who are physically active may be lower
than for individuals who are not physically active.
tapply([Link].500$Weight, [Link].500$PhysActive, var,
[Link] = TRUE)
## No Yes
## 530.1573 347.8227
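The equivalence between the regression slope test and the pooled two-sample t-test can be verified on simulated data (the names below are invented):

```r
# sketch: the slope p-value from lm() is identical to the pooled
# (equal-variance) two-sample t-test p-value
set.seed(1)
active = rep(c("No", "Yes"), each = 50)
weight = rnorm(100, mean = ifelse(active == "Yes", 82, 86), sd = 20)
p.lm = summary(lm(weight ~ active))$coef["activeYes", "Pr(>|t|)"]
p.t  = t.test(weight ~ active, var.equal = TRUE)$p.value
all.equal(p.lm, p.t)   # TRUE
```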
g) Suppose that the estimated slope coefficient from the model were positive (and statistically
significant). Propose at least two possible explanations for such a trend.
A positive slope coefficient would indicate that mean weight in the physically active group is
higher than in the not physically active group.
One possible explanation is response bias. Since being physically active is generally regarded
as a desirable health behavior, participants may be motivated to respond untruthfully that
they are physically active when they are not; this motivation may be especially strong for
participants with high weight values who feel negatively about their weight, but less so for
those with low weight values.
It may be that people who are physically active are trying to lose weight or avoid gaining
weight, while those who are not physically active tend to have faster metabolisms and
maintain a low weight without being physically active.
Another possible explanation is that people who are physically active build more muscle
mass than those who are not. Having higher weight is not necessarily indicative of being
unhealthy.
Problem 9.
The file low_bwt.Rdata contains information for a random sample of 100 low birth weight infants
born in two teaching hospitals in Boston, Massachusetts.
The dataset contains the following variables:
– birthwt: the weight of the infant at birth, measured in grams
– gestage: the gestational age of the infant at birth, measured in weeks
– momage: the mother’s age at the birth of the child, measured in years
– toxemia: recorded as Yes if the mother was diagnosed with toxemia during pregnancy, and
No otherwise
– length: length of the infant at birth, measured in centimeters
– headcirc: head circumference of the infant at birth, measured in centimeters
The condition toxemia, also known as preeclampsia, is characterized by high blood pressure and
protein in urine by the 20th week of pregnancy; left untreated, toxemia can be life-threatening.
a) Fit a linear model estimating the association between birth weight and toxemia status.
i. Write the model equation.
The model equation is predicted birthwt = 1097.22 + 7.79(toxemiaYes), where
toxemiaYes = 1 if the mother was diagnosed with toxemia and 0 otherwise.
ii. Report a 95% confidence interval for the slope and interpret the interval.
The 95% confidence interval for the slope is (-124.4, 140.0) g. We are 95% confident that
the interval (-124.4, 140.0) g captures the population difference in mean birth weight
between infants born to mothers diagnosed with toxemia and those born to mothers not
diagnosed with toxemia.
#load the dataset
load("datasets/low_bwt.Rdata")
#fit a linear model
lm([Link]$birthwt ~ [Link]$toxemia)
##
## Call:
## lm(formula = [Link]$birthwt ~ [Link]$toxemia)
##
## Coefficients:
## (Intercept) [Link]$toxemiaYes
## 1097.215 7.785
#calculate confidence interval
confint(lm([Link]$birthwt ~ [Link]$toxemia), level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 1036.6312 1157.7992
## [Link]$toxemiaYes -124.4203 139.9899
b) Using graphical summaries, explore the relationship between birth weight and toxemia status,
birth weight and gestational age, and gestational age and toxemia. Summarize your findings.
Infants with toxemia have a higher median birth weight than infants without toxemia.
Gestational age and birth weight are positively correlated; infants with higher values of
gestational age typically have higher birth weights. Infants with toxemia have a higher
median gestational age than infants without toxemia.
par(mfrow = c(2, 2))
boxplot(birthwt ~ toxemia, data = [Link], main = "Birth Weight vs Toxemia")
plot(birthwt ~ gestage, data = [Link], main = "Birth Weight vs. Gestational Age")
boxplot(gestage ~ toxemia, data = [Link], main = "Gestational Age vs Toxemia")
[Three-panel figure: "Birth Weight vs Toxemia" (boxplot), "Birth Weight vs. Gestational Age" (scatterplot), and gestational age by toxemia (boxplot).]
c) Fit a multiple regression model with toxemia and gestational age as predictors of birth weight.
i. Evaluate whether the assumptions for linear regression are reasonably satisfied.
Linearity is satisfied; the residual plot against observed values of gestational age shows
that there is no non-linearity relative to gestational age once the model has been fit. The
variability of the residuals is roughly constant, although the residuals are less variable
at the lowest and highest predicted values of birth weight. It is reasonable to assume the
observations are independent; it is unlikely that infants from a random sample of 100
infants across two teaching hospitals would be related. The residuals are approximately
normally distributed, with some deviations in the upper tail and small deviation in the
center. It seems reasonable to fit the linear regression model to these data.
ii. Interpret the coefficients of the model, and comment on whether the intercept has a
meaningful interpretation.
The coefficient for toxemia indicates that children whose mothers experience toxemia
have an average predicted birth weight that is 206.6 grams lower than children whose
mothers do not experience toxemia (holding gestational age constant).
The coefficient for gestational age indicates that, on average, a one-week increase in
gestational age is associated with an 84-gram increase in birth weight (holding toxemia
status constant).
The intercept does not have a meaningful interpretation; it corresponds to an individual
born to a mother who did not experience toxemia, at gestational age of 0 weeks. It is not
possible for an infant to have gestational age of 0 weeks.
iii. Write the model equation and predict the average birth weight for an infant born to a
mother diagnosed with toxemia with gestational age 31 weeks.
The average birth weight predicted for an infant born to a mother diagnosed with tox-
emia with gestational age 31 weeks is 1,113 grams: −1286.2 − (206.59)(1) + (84.06)(31) =
1113 grams.
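The arithmetic can be confirmed directly from the reported coefficients:

```r
# arithmetic check using the coefficients from the model summary
-1286.200 - 206.591 * 1 + 84.058 * 31   # about 1113 grams
```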
iv. The simple regression model and multiple regression model disagree regarding the
nature of the association between birth weight and toxemia. Briefly explain the reason
behind the discrepancy. Which model do you prefer for understanding the relationship
between birth weight and toxemia, and why?
Gestational age is a confounder for the apparent positive association seen in the model
for predicting birth weight that only included toxemia experience. As seen from the plot
of gestational age by toxemia experience, infants who experience toxemia tend to have a
longer gestational period. Longer gestation is positively associated with birth weight.
The association between birth weight and toxemia can be estimated more accurately after
adjusting for gestational age as a confounder; thus, the multiple regression model is
preferable. The model indicates that after adjusting for gestational age, there is a nega-
tive association between infant birth weight and mother toxemia status. Additionally,
the multiple regression model explains 52% of the variability in birth weight and has a
much higher adjusted R2 than the original simple regression model (0.51 versus -0.01).
Note: Since toxemia typically begins after the 20th week of pregnancy, it is expected that the
subset of mothers who experience toxemia will have infants with higher values of gestational
age. Mothers who have longer pregnancies are more likely to be diagnosed with toxemia.
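The gestational-age-by-toxemia comparison referenced above can be sketched with side-by-side boxplots. Note that `birthwt.data` below is a placeholder name for the dataset used in this problem; substitute the actual data frame name.

```r
#side-by-side boxplots of gestational age by toxemia status
#'birthwt.data' is a placeholder for the dataset used in this problem
boxplot(gestage ~ toxemia, data = birthwt.data,
        xlab = "Toxemia", ylab = "Gestational Age (weeks)")
```

Infants in the toxemia group should appear shifted toward higher gestational ages, consistent with the confounding described above.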
#fit multiple regression model
model <- lm(birthwt ~ toxemia + gestage, data = [Link])
summary(model)
##
## Call:
## lm(formula = birthwt ~ toxemia + gestage, data = [Link])
##
## Residuals:
## Min 1Q Median 3Q Max
## -615.54 -133.84 16.49 157.67 372.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1286.200 234.918 -5.475 3.43e-07 ***
## toxemiaYes -206.591 51.078 -4.045 0.000105 ***
## gestage 84.058 8.251 10.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189.6 on 97 degrees of freedom
## Multiple R-squared: 0.517, Adjusted R-squared: 0.507
## F-statistic: 51.91 on 2 and 97 DF, p-value: 4.703e-16
predict(model, newdata = data.frame(toxemia = "Yes", gestage = 31))
## 1
## 1113.006
summary(lm(birthwt ~ toxemia, data = [Link]))$adj.r.squared
## [1] -0.01006334
#evaluate assumptions
par(mfrow = c(2, 2))
#residual plot against predicted values
plot(fitted(model), resid(model), xlab = "Predicted Birth Weight", ylab = "Residual",
     pch = 21, col = COL[6], bg = COL[6, 4])
abline(h = 0, lty = 2)
#normal probability plot
qqnorm(resid(model), pch = 21, col = COL[6], bg = COL[6, 4])
qqline(resid(model))
[Figure: residual plots against observed gestational age and predicted birth weight, and a normal probability plot of the residuals.]
Problem 10.
A survey of Harvard students was conducted to measure a number of variables of interest. The file
[Link] contains the self-reported number of hours of exercise per week for n = 225 students
split across the four class years, along with a number of binary variables, where 1 indicates “Yes”
and 0 indicates “No”.
In this problem, you will work with the following 10 predictor variables: class year (classyear),
sex (sex), concentration (conc), hair color (hair), vegetarian (vegetarian), wears glasses or contacts
(glasses), on an athletic team (athlete), regularly drinks coffee (coffee), height (height), and
exercise hours per week (exercise).
The dataset has been randomly split into “halves”, with exercise_half1.csv containing data for
112 students and exercise_half2.csv containing data for 113 students.
a) Using exercise_half1.csv, fit a regression model to predict resting heart rate (heartrate) in
beats per minute (bpm) from the 10 predictors listed above. Name this model1. Which slope
coefficients are significant at the α = 0.10 significance level?
The predictors hairbrown, exercise, sexmale, and glasses are significant at the α = 0.10
level.
#load the data
exercise1 = read.csv("datasets/exercise_half1.csv")
#fit model
model1 = lm(heartrate ~ classyear + sex + conc + hair + vegetarian +
glasses + athlete + coffee + height + exercise, data = exercise1)
summary(model1)
##
## Call:
## lm(formula = heartrate ~ classyear + sex + conc + hair + vegetarian +
## glasses + athlete + coffee + height + exercise, data = exercise1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.703 -4.924 -0.126 3.863 41.222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.8699 22.0516 2.806 0.00609 **
## classyearjunior 1.0047 2.7409 0.367 0.71477
## classyearsenior 2.9290 2.8710 1.020 0.31023
## classyearsophomore 0.1177 2.2283 0.053 0.95797
## sexmale -4.9989 2.8628 -1.746 0.08402 .
## conceconomics -0.8926 2.9191 -0.306 0.76043
## concnatural_sciences -1.2516 2.8689 -0.436 0.66364
## concsocial_sciences 1.8237 2.7575 0.661 0.50999
## hairblonde -2.7185 3.4357 -0.791 0.43076
## hairbrown -4.5015 1.9956 -2.256 0.02638 *
## hairother 2.8994 9.5813 0.303 0.76285
## vegetarian 1.1065 3.0354 0.365 0.71628
## glasses -3.7446 1.9348 -1.935 0.05591 .
## athlete -1.9608 4.2997 -0.456 0.64941
## coffee -1.0687 1.9591 -0.545 0.58670
## height 0.2292 0.3424 0.670 0.50478
## exercise -0.5370 0.2382 -2.254 0.02647 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.655 on 95 degrees of freedom
## Multiple R-squared: 0.2968, Adjusted R-squared: 0.1784
## F-statistic: 2.507 on 16 and 95 DF, p-value: 0.003071
b) It may be the case that a predictor that shows up as significant from a regression model is
not actually associated with the response on the population level; i.e., represents an instance
of Type I error. For each of the significant predictors from part a), briefly explain whether
you think a model fit using the second half of the dataset will also demonstrate that the
predictor is significantly associated with heart rate.
The observed association between resting heart rate and sex might be observed in the second
half of the data set, because biological differences in heart anatomy between sexes contribute
to females having on average higher resting heart rate than males. The observed association
between resting heart rate and glasses is probably spurious and not likely to be observed in
the second half, because there does not seem to be a logical explanation for the association.
The observed association between resting heart rate and brown hair is probably spurious, for
the same reason. The observed association between resting heart rate and exercise is most
likely to hold up in the second half of the data set, because exercising is known to have an
impact on resting heart rate.
c) Using exercise_half2.csv, fit a regression model to predict resting heart rate (heartrate) in
beats per minute (bpm) from the 10 predictors listed above. Name this model2. Are any of
the slope coefficients identified as significant in Model 1 also significant in Model 2? If
so, which one(s)?
The exercise predictor from the previous model is still significant.
#load the data
exercise2 = read.csv("datasets/exercise_half2.csv")
#fit model
model2 = lm(heartrate ~ classyear + sex + conc + hair + vegetarian +
glasses + athlete + coffee + height + exercise, data = exercise2)
summary(model2)
##
## Call:
## lm(formula = heartrate ~ classyear + sex + conc + hair + vegetarian +
## glasses + athlete + coffee + height + exercise, data = exercise2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.052 -5.369 0.392 5.915 35.926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.7374 27.6145 1.874 0.064001 .
## classyearjunior 11.5120 3.1945 3.604 0.000497 ***
## classyearsenior 1.5930 3.6839 0.432 0.666389
## classyearsophomore 6.0452 2.3297 2.595 0.010931 *
## sexmale 3.0522 3.4365 0.888 0.376644
## conceconomics 1.9218 3.3787 0.569 0.570804
## concnatural_sciences 0.8544 3.3756 0.253 0.800724
## concsocial_sciences 0.1084 3.0958 0.035 0.972147
## hairblonde -2.7632 4.1315 -0.669 0.505201
## hairbrown -1.7374 2.1548 -0.806 0.422042
## hairother 9.4918 11.1990 0.848 0.398769
## vegetarian -4.5235 3.4048 -1.329 0.187114
## glasses 0.8748 2.0154 0.434 0.665217
## athlete 2.7822 3.7554 0.741 0.460575
## coffee 2.5431 2.4362 1.044 0.299135
## height 0.1971 0.4385 0.449 0.654110
## exercise -0.5555 0.2261 -2.457 0.015795 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.883 on 97 degrees of freedom
## Multiple R-squared: 0.237, Adjusted R-squared: 0.1112
## F-statistic: 1.883 on 16 and 97 DF, p-value: 0.03108
d) The code shown in the template creates a plot to visualize the results of the two models.
It plots a horizontal line connecting the t-statistics for each coefficient in the two models
and compares it to the critical t-value and 0 (plotted as dashed vertical lines). You are not
expected to know how to recreate a similar plot.
i. Briefly describe the most interesting features of the plot.
The three coefficients with the most consistent results are exercise, height, and
hairblonde. Of these, only exercise has a value outside the critical region in both
models. The majority of the coefficients have values that vary widely between the two
models, so it makes sense that a coefficient could, simply by chance, have a significant
statistic in one model and not the other.
ii. Explain what might have caused the differences between Model 1 and Model 2; i.e., why
were some predictors significant in one model and not the other?
The use of these two datasets demonstrates the multiple testing problem in a regression
context. When testing several predictors, it is possible that some are significantly
associated due to random chance; such signals may not be replicated in another sample.
Using a more stringent significance level can help with this problem but cannot solve
it completely; for example, the associations with class year in the second model were
"highly significant", yet were not supported by the first model.
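One way to account for the multiple testing described above is to apply a multiplicity adjustment to the coefficient p-values. The following is a sketch using base R's p.adjust with a Bonferroni correction, applied to model1 from part a):

```r
#extract p-values for the slope coefficients of model1 (dropping the intercept)
p.values = coef(summary(model1))[-1, 4]
#apply a Bonferroni correction for multiple testing
p.adjust(p.values, method = "bonferroni")
```

Under a Bonferroni correction at α = 0.10, none of the adjusted p-values for the Model 1 slope coefficients would remain significant, consistent with the failure of most predictors to replicate in Model 2.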
#extract t-statistics from each model
t1 = coef(summary(model1))[,3]
t2 = coef(summary(model2))[,3]
#define axes
plot(NA, xlim = c(min(t1, t2), max(t1, t2)), ylim = c(1, length(t1)),
xlab = "t-stats", ylab = "", yaxt = "n")
tstar = qt(0.950, model1$df.residual)
#add y-labels
mtext(names(coef(model1)), side = 2, at = 1:length(t1), cex = 0.5, las = 1)
[Figure: t-statistics for each model coefficient in Model 1 and Model 2, one row per coefficient (intercept, class year, sex, concentration, hair color, vegetarian, glasses, athlete, coffee, height, exercise), with dashed vertical lines at 0 and the critical t-value.]