Question 1 (15 Marks)
Post-traumatic stress disorder (or PTSD) is a severe anxiety disorder that
can develop after exposure to a traumatic event. Diagnostic symptoms for PTSD
include re-experiencing the original trauma(s) through flashbacks or nightmares.
A psychologist gathers data from a sample of Vietnam War veterans in Australia.
The veterans had all been diagnosed with PTSD at some stage following their
service in the Vietnam War. All subjects completed a General Health
Questionnaire with 28 questions.
There were 107 males in the sample.
The variables considered are described below:
Variable      Description
Subject_Id    Identification number of subject
Age           Age in years
GHQ-Total     Total score for all four sub-sections (GHQ-A, GHQ-B, GHQ-C and GHQ-D)
Combat        Number of days exposed to combat
PTSD          1: Early Onset PTSD, 2: Late Onset PTSD
a) Identify the following variables as Continuous, Ordinal or Nominal.
Variable      Type
Age:
GHQ-Total:
Combat:
PTSD:
b) Is this an experimental or observational study? Explain.
c) What is the target population for this study?
d) Why may the results from this study not be used to draw conclusions
about the population of all people with PTSD in Australia?
e) If you were to carry out a similar study about Australian veterans of the
current war in Afghanistan, describe how you might obtain a
representative sample of 200 subjects from the population of interest.
f) What type of display may be used to investigate the relation between
GHQ-Total and Age in the target population?
g) If we wanted to investigate whether we can predict general health
(GHQ-Total) from age, what would be the dependent variable? Explain.
Question 2 (9 Marks)
Total scores on the General Health Questionnaire (GHQ-Total) in the general
adult population are known to be normally distributed with a mean of 19 and
standard deviation of 2.51. Answer the following questions showing your
working.
a) What is the probability that an individual from the general adult
population would obtain a score which is higher than 22 on the
GHQ-Total?
b) What is the probability that the GHQ-Total scores of five randomly
selected individuals have an average score which is lower than 18?
c) What is the upper quartile of the distribution of GHQ-Total scores in the
general adult population?
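The three calculations in this question can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` and the mean of 19 and standard deviation of 2.51 stated in the question; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

MU, SIGMA = 19.0, 2.51               # population mean and SD given in the question
ghq = NormalDist(MU, SIGMA)

# a) P(X > 22) for a single individual
p_above_22 = 1 - ghq.cdf(22)

# b) The mean of n = 5 independent scores is normal with SD sigma / sqrt(n)
n = 5
mean_of_5 = NormalDist(MU, SIGMA / sqrt(n))
p_mean_below_18 = mean_of_5.cdf(18)

# c) The upper quartile is the 75th percentile of the distribution
upper_quartile = ghq.inv_cdf(0.75)

print(f"a) {p_above_22:.4f}  b) {p_mean_below_18:.4f}  c) {upper_quartile:.2f}")
```

Working by hand with standard normal tables should agree with these values up to table rounding.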
Question 3 (9 Marks)
Research Question: Is the Ecological Footprint for countries in the Americas,
Central Asia and Europe less than 3.0 on average?
According to the World Wildlife Foundation, a country's Ecological Footprint is
the sum of all the cropland, grazing land, forest and fishing grounds required to
produce the food, fibre and timber it consumes, to absorb the wastes emitted
when it uses energy and to provide space for its infrastructure.
The average Ecological Footprint (measured in hectares per person) for a
random sample of 25 countries from the Americas, Central Asia and Europe was
found to be 2.832 with a standard deviation of 1.281.
A histogram and a boxplot for the Ecological Footprints for these countries are
given below.
[Figure: Histogram of Eco Footprint (ha per person); x-axis: Eco Footprint (ha per person), y-axis: Frequency]
[Figure: Boxplot of Eco Footprint (ha per person)]
Source: Watkins et al., Statistics, from data to decision, Second Edition (2011), Wiley (adapted).
Use the above information to test the claim that the average Ecological Footprint
for countries in the Americas, Central Asia and Europe is less than 3.0 hectares
per person. For any calculations, show your working.
Hypothesis Test:
Question 4 (8 Marks)
Research Question: Is a new diet effective in reducing cholesterol?
Fifteen people began a new diet to reduce their cholesterol levels. The table
below shows the cholesterol readings for these fifteen people both before the
new diet and again one month after the diet began. The differences between the
two readings are given as well.
Dieter    Before    After    Differences (Before - After)
1         255       197      58
2         230       225       5
3         290       250      40
4         242       215      27
5         300       270      30
6         250       235      15
7         215       190      25
8         230       220      10
9         225       200      25
10        219       203      16
11        236       223      13
12        240       220      20
13        215       180      35
14        217       195      22
15        231       235      -4
Stem-and-Leaf Display: Differences (Before - After)
Stem-and-leaf of Differences (Before - After)   N = 15
Leaf Unit = 1.0

  1   -0  4
  2    0  5
  6    1  0356
 (5)   2  02557
  4    3  05
  2    4  0
  1    5  8
Test of mu = 0 vs > 0

                                                          95% Lower
Variable                        N    Mean   StDev  SE Mean    Bound   T      P
Differences (Before - After)   15   22.47   15.05     3.89    15.62   *  0.000
Use the stem-and-leaf plot for the differences and the Minitab output (which
gives results based on a one-sample t-test for the differences) to answer the
following questions.
a) State the null and the alternative hypothesis being tested in the Minitab
output.
b) Comment on the normality assumption for the differences.
c) Calculate the test statistic. Show your working.
d) Do you reject or not reject the null hypothesis stated in part a? Give a
reason for your answer and clearly state a conclusion.
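The summary figures in the Minitab output can be reproduced from the table of readings. The sketch below assumes Python's standard library only and recomputes the differences as Before - After; the list and variable names are illustrative, not part of the original output.

```python
from math import sqrt
from statistics import mean, stdev

# Cholesterol readings for the fifteen dieters, in table order
before = [255, 230, 290, 242, 300, 250, 215, 230, 225, 219, 236, 240, 215, 217, 231]
after = [197, 225, 250, 215, 270, 235, 190, 220, 200, 203, 223, 220, 180, 195, 235]

diffs = [b - a for b, a in zip(before, after)]   # Before - After

n = len(diffs)
d_bar = mean(diffs)          # sample mean of the differences
s_d = stdev(diffs)           # sample standard deviation (n - 1 divisor)
se = s_d / sqrt(n)           # standard error of the mean difference
t_stat = (d_bar - 0) / se    # one-sample t statistic for mu = 0, with n - 1 = 14 df

print(f"mean={d_bar:.2f}  sd={s_d:.2f}  se={se:.2f}  t={t_stat:.2f}")
```

The mean, standard deviation and standard error should match the Mean, StDev and SE Mean columns of the output; the t statistic is the value replaced by * there.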
Question 5 (8 Marks)
Two samples of female students participated in an experiment to investigate
alternative treatments for the eating disorder, bulimia. One sample consisted of
11 students known to suffer from bulimia; the other sample consisted of 14
students with normal eating habits. Each student completed a questionnaire
from which a fear of negative evaluation (FNE) score was produced. The higher
the score, the greater was the fear of negative evaluation. A summary table and
a histogram for the results from this experiment are presented below:
Eating habits                 n     Mean    StDev
Normal eating habits (N)     14    14.14     5.29
Bulimic eating habits (B)    11    17.82     4.92

Degrees of freedom = 22

[Figure: Histogram of FNE score, one panel per group (Bulimic, Normal); x-axis: FNE score, y-axis: Frequency; panel variable: Eating habits]
Source: McClave, J.T. and T. Sincich, Statistics, Eleventh Edition (2009), Pearson (adapted).
a) Calculate a two-sided 95% confidence interval for the difference between
the population means of the FNE scores for students with bulimic eating
habits (B) and students with normal eating habits (N). Interpret the
results.
b) Based on your results in the previous part, what conclusion can you make
regarding the following hypotheses:
H0: μB = μN,   H1: μB ≠ μN
c) What assumptions are required for the interval of part a to be statistically
valid? Are these assumptions reasonably satisfied? Explain.
Question 6 (11 Marks)
Research Question: Do more than 50% of car crashes occur within 8km of
home?
A large insurance company conducted a study into car crashes. It found that out
of a random selection of 2200 car crashes, 1144 of them occurred within 8km of
home.
Source: Triola, M.F., Elementary Statistics, Third Edition (2007), Pearson (adapted).
a) Use a z-test for a proportion to test the claim that more than 50% of car
crashes occur within 8km of home.
Hypothesis Test
b) The lower 95% confidence bound is found to be 0.5025. Interpret the
meaning of this confidence bound in the context of this problem.
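The arithmetic behind the z-test and the quoted lower bound can be sketched as follows. This is an illustrative check using Python's standard library, assuming the usual large-sample normal approximation; the lower bound here uses the sample proportion in its standard error, which is an assumption about how the 0.5025 figure was obtained.

```python
from math import sqrt
from statistics import NormalDist

x, n = 1144, 2200            # crashes within 8km of home, crashes sampled
p0 = 0.5                     # proportion under the null hypothesis

p_hat = x / n
se_null = sqrt(p0 * (1 - p0) / n)          # standard error under H0
z = (p_hat - p0) / se_null
p_value = 1 - NormalDist().cdf(z)          # upper-tailed test

# One-sided lower 95% confidence bound (assumed to use p_hat in the SE)
se_hat = sqrt(p_hat * (1 - p_hat) / n)
lower_bound = p_hat - NormalDist().inv_cdf(0.95) * se_hat

print(f"p_hat={p_hat:.3f}  z={z:.3f}  p-value={p_value:.4f}  lower bound={lower_bound:.4f}")
```

Under that assumption the computed lower bound agrees with the 0.5025 quoted in part b.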
Question 7 (8 marks)
Recall the data used in Assignment 1. The data, which are described below, were
recorded on 1151 AIDS patients between 1996 and 1997. These patients were
recruited to take part in a clinical trial to compare survival times for AIDS
patients treated with a standard two-drug regimen, with survival times for
patients treated with a new three-drug regimen. For this problem, we will only
use the variables Time (which has been categorised here) and Treatment.
Variable Name    Variable Description
ID               Subject ID
Time             Time to AIDS diagnosis or death:
                 0 = Less than 6 months
                 1 = 6 months or longer
Treatment        Treatment:
                 0 = Two-drug treatment regime
                 1 = Three-drug treatment regime
Source: Hosmer, D.W. and Lemeshow, S. and May, S. (2008), Applied Survival Analysis: Regression Modeling of
Time to Event Data: Second Edition, John Wiley and Sons Inc., New York, NY
The following Minitab output was obtained from the AIDS study described above.
Use this output to answer all parts of this question.
Rows: Treatment   Columns: Time

           <6 months   6 months+       All
2 Drug           166         411       577
               150.4       426.6     577.0
              1.6201      0.5711

3 Drug           134         440       574
               149.6       424.4     574.0
              1.6285      0.5741

All              300         851      1151
               300.0       851.0    1151.0

Cell Contents:   Count
                 Expected count
                 Contribution to Chi-square

[Figure: Chart of Time, Treatment (clustered bar chart); y-axis: Count; bars for 2 Drug and 3 Drug within each Time category (< 6 months, 6 months+)]

a) Comment on the clustered bar chart.
b) Use the Minitab output above to carry out an appropriate
hypothesis test to answer the research question below.
Research Question: Is there an association between the time to diagnosis or
death and the treatment regime prescribed to AIDS patients?
Hypothesis Test
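The expected counts and chi-square contributions in the Minitab output can be reproduced from the observed counts alone. The sketch below is an illustrative check in standard-library Python; it relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, and the variable names are not from the original output.

```python
from math import sqrt
from statistics import NormalDist

# Observed counts: rows = treatment (2 Drug, 3 Drug), columns = (<6 months, 6 months+)
observed = [[166, 411],
            [134, 440]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total   # expected count under independence
        chi_sq += (count - expected) ** 2 / expected       # contribution to chi-square

df = (len(observed) - 1) * (len(observed[0]) - 1)          # (rows - 1)(columns - 1) = 1

# For df = 1 only: P(chi-square > x) = 2 * P(Z > sqrt(x))
p_value = 2 * (1 - NormalDist().cdf(sqrt(chi_sq)))

print(f"chi-square={chi_sq:.3f}  df={df}  p-value={p_value:.4f}")
```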
Question 8 (8 marks)
An AIDS specialist in 1997 claimed that 5% of patients presenting with AIDS
were aged under 25 years and 10% were aged over 50 years, with the
remainder aged between 25 and 50 years. Of the 1151 patients in the study
described in the last question, 34 patients were aged under 25 years and 113
patients were aged over 50 years. Assuming the sample is representative of
AIDS patients in 1997, carry out an appropriate hypothesis test to test the claim
made by the AIDS specialist.
Research Question: Were 5% of AIDS patients in 1997 aged under 25, 10%
aged over 50 years, and the remainder aged between 25 and 50 years?
Hypothesis Test
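A goodness-of-fit calculation for the specialist's claim could be laid out as below. This is a sketch in standard-library Python, assuming the count for the 25 to 50 group is obtained by subtraction from 1151; the closed-form p-value applies only because the test has 2 degrees of freedom.

```python
from math import exp

n = 1151
# Age groups in the order: under 25, 25 to 50, over 50
observed = [34, n - 34 - 113, 113]
claimed = [0.05, 0.85, 0.10]            # proportions claimed by the specialist

expected = [n * p for p in claimed]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                  # categories - 1 = 2

# For df = 2 only: P(chi-square > x) = exp(-x / 2)
p_value = exp(-chi_sq / 2)

print(f"expected={[round(e, 2) for e in expected]}")
print(f"chi-square={chi_sq:.3f}  df={df}  p-value={p_value:.4f}")
```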
Question 9 (24 marks)
Research Question: What variables are useful to predict oxygen consumption
during exercise?
During exercise, your body uses large amounts of oxygen. It is both difficult and
expensive to measure the volume of oxygen used. A study was undertaken to
determine whether the amount of oxygen used during exercise could be
predicted from other variables which were easier to measure. A random sample
of males enrolled in a physical fitness course was selected for the study. Each of
the 31 males in the study was asked to run 2.4km and the following information
was recorded:
Variable Name
Variable Description
Oxygen
Oxygen consumption (ml per kg bodyweight per minute)
Age
Age (years)
Runtime
Time to run 2.4km (minutes)
RestPulse
Heart rate while resting (beats per minute)
RunPulse
Heart rate while running (beats per minute at same time
oxygen rate measured)
Age, running time, resting pulse rate and running pulse rate were all
investigated as possible determinants of oxygen consumption.
The following descriptive statistics were obtained:
Variable      N     Mean    StDev
Oxygen       31    47.38     5.33
Age          31    47.68     5.21
Runtime      31    10.59     1.39
RestPulse    31    53.45     7.62
RunPulse     31   169.95    10.25
The following Minitab outputs were obtained to investigate the relationship
between Oxygen consumption and Age.
The regression equation is
Oxygen = 62.2 - 0.311 Age

Predictor      Coef    SE Coef       T       P
Constant     62.221      8.670    7.18   0.000
Age         -0.3114     0.1808   -1.72   0.096

R-Sq = 9.3%

[Figure: Scatterplot of Oxygen vs Age; x-axis: Age, y-axis: Oxygen]
a) Describe the target population for this study.
b) Use the information above to:
i. Calculate the correlation coefficient to assess the linear relation
between Oxygen and Age.
ii. Comment on the linear relationship (or lack thereof) between
Oxygen and Age.
iii. Give the best possible prediction for the oxygen consumption
during exercise for a male in the target population who is aged
42 years.
Now we will consider running time as a predictor of oxygen consumption. The
following Minitab outputs were obtained to investigate the relationship between
Oxygen and RunTime. The values for the test statistics and the p-value have
deliberately been replaced by a *. Use the output below to answer the
questions that follow.
The regression equation is
Oxygen = 82.4 - 3.31 RunTime

Predictor      Coef    SE Coef       T       P
Constant     82.422      3.855   21.38   0.000
RunTime     -3.3106     0.3612       *       *

R-Sq = 74.3%

[Figure: Scatterplot of Oxygen vs RunTime; x-axis: RunTime, y-axis: Oxygen]
c) Use the scatterplot provided to comment on the relation between oxygen
consumption and running time.
d) Write down the goodness of fit statistic and clearly explain what this value
means in relation to oxygen consumption and running time.
e) Predict the oxygen consumption for a male in the target population who
runs for 10 minutes.
f) Use the output above to carry out an appropriate hypothesis test to
determine whether running time is a useful predictor of oxygen
consumption.
Hypothesis Test:
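The slope test statistic and a fitted value can be computed directly from the coefficients in the RunTime output. The sketch below is an illustrative check in standard-library Python; `predict_oxygen` is a hypothetical helper name, and the R-Sq line uses the identity R-Sq = t^2 / (t^2 + df), which holds for simple linear regression with one predictor.

```python
n = 31                          # males in the study
b0, b1 = 82.422, -3.3106        # Constant and RunTime coefficients from the output
se_b1 = 0.3612                  # SE Coef for RunTime

t_stat = (b1 - 0) / se_b1       # t statistic for the test that the slope is zero
df = n - 2                      # degrees of freedom for simple linear regression

# Consistency check against the reported R-Sq of 74.3%
r_sq = t_stat ** 2 / (t_stat ** 2 + df)


def predict_oxygen(runtime_min: float) -> float:
    """Fitted oxygen consumption (ml per kg per minute) for a given run time in minutes."""
    return b0 + b1 * runtime_min


print(f"t={t_stat:.2f}  df={df}  R-Sq={100 * r_sq:.1f}%  fitted at 10 min={predict_oxygen(10):.2f}")
```

The p-value for the t statistic is not computed here, since the standard library has no t distribution; a t table with 29 degrees of freedom or a statistics package would be needed for that step.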
Finally, we will consider resting pulse rates and running pulse rates as predictors
of oxygen consumption. The following Minitab outputs were obtained to
investigate these relations.
[Figure, left: Scatterplot of Oxygen vs RestPulse; x-axis: RestPulse, y-axis: Oxygen]
[Figure, right: Scatterplot of Oxygen vs RunPulse; x-axis: RunPulse, y-axis: Oxygen]

The regression equation is
Oxygen = 62.3 - 0.279 RestPulse

Predictor       Coef    SE Coef       T       P
Constant      62.300      6.425    9.70   0.000
RestPulse    -0.2792     0.1190   -2.35   0.026

R-Sq = 15.9%

The regression equation is
Oxygen = 82.5 - 0.207 RunPulse

Predictor        Coef    SE Coef       T       P
Constant        82.46      15.04    5.48   0.000
RunPulse     -0.20680    0.08552   -2.42   0.022

R-Sq = 20.3%
g) Circle the observation in the plot on the right hand side that represents a
male with an average running pulse rate of 170 beats per minute and
oxygen consumption of 60.055 ml per kg. This male had the largest
residual. Calculate the residual for this observation.
h) Which is the better predictor of oxygen consumption during exercise,
resting pulse rate or running pulse rate? You must give a reason for your
answer.
i) Write a thorough summary statement of your findings in regard to the
relation between oxygen consumption and each of the four predictors you
have investigated. Your summary should explain which predictors are
useful and which are not. Your summary should also explain which, if
any, of the predictors gives the best predictions on oxygen consumption
and give a reason for your choice. Your summary only needs to be one
paragraph.