STA4026S Analytics.
Continuous Assessment 2 2021
Statistical Report Writing Conventions and Instructions
This is an individual assessment: you may not discuss it, share content, or ask
your classmates questions about it. If something is unclear, direct
your questions to me (Etienne) via email.
You may use any typesetting software to compile your report. Rmarkdown and
LaTeX are preferred for the obvious reasons, but you are welcome to use whatever
you are comfortable with as long as your final hand-in is a legible PDF file.
Clearly delineate the questions to which your responses apply.
You may include code responses (copies of the code relevant to delineated
questions) either interspersed in your write-up at the relevant positions where you
answer questions, or in an appendix. Either way, you should include the code in your
write-up.
Provide comments in your R code indicating roughly to which question your code
applies, even if the code is interspersed in the write-up.
Do not include any R console output! And definitely do not paste screenshots of it
into the body of your write-up. You are the analyst, not the reader. A well-written
statistical report would not contain any console output. Tabulate and typeset or
plot your output properly. (I've included an example of how to tabulate R objects
in the Rmarkdown file.)
Do not include figures in an appendix. Figures are supplemental to your
writing and should be included in the body of the write-up. Also, figures presented
on their own are rarely of any value. The only species of figure that can live on
its own in this context is an infographic. Figures, on the other hand, are graphical
mechanisms which support discussion in scientific reports.
Include a plagiarism declaration as the very last page in your report. No signed
declaration, no mark. I've included an example Rmarkdown file showing how you
can incorporate a PDF directly in your markdown compilation.
Use the naming convention STDNUM001_STA4026S_CA2.pdf for your PDF file.
Note the underscores.
Though you should include your code in the write-up, a single file with all
of your R code must also be uploaded separately to the code tab for the assessment.
Use the naming convention STDNUM001_STA4026S_CA2.R for your file. Note
the underscores. Your R code should NOT contain any of the following:
install.packages()
rm()
setwd()
I want to be able to run your code on my computer without having to manually
edit your code, install libraries, or supply external files that I don't have. It
may refer to datasets which I have provided as part of this assessment.
Question 1 (25 marks)
You are a Statistician consulting on behalf of a sports science institute.
Researchers at the institute are interested in the relationship between
nutritional intake over race distance and speed. The researchers have provided
you with data from the Rondebosch Half Marathon (21km) which consist of 725
observations on the following variables:
Variable      Description
Speed_21km    Average speed over the full race distance, 21.1km.
Nutrition     Variable indicating how much water/liquid nutrition was consumed. Possible values ∈ [0, 2.5].
Age_Scl       Age of participant in 100s of years. (So, years divided by 100.) Possible values ∈ [0.2, 0.8].
Sex           Sex of participant. Factor variable, either 'Male' or 'Female'.
ShoeBrand     Shoe brand used. Factor variable with levels 'Nike', 'NewBalance'.
The data are already split into training, validation, and test sets. See, e.g.:
> rm(list = ls(all = TRUE))
> dat_train = read.table('Rondebosch21km_2021_Train.txt', h = TRUE)
> dat_val = read.table('Rondebosch21km_2021_Validate.txt', h = TRUE)
> dat_test = read.table('Rondebosch21km_2021_Test.txt', h = TRUE)
> head(dat_train, 5)
  Speed_21km Nutrition Age_Scl    Sex  ShoeBrand
1      10.39      0.74    0.47   Male NewBalance
2      10.00      0.30    0.52 Female       Nike
3       9.06      1.40    0.39 Female       Nike
4       8.74      1.03    0.48 Female NewBalance
5      10.31      0.54    0.47   Male       Nike
(a) Code and Write-up: Conduct an exploratory data analysis. Use relevant
plots to probe the empirical relationship between the predictors and the response
and interpret these figures. (4)
(b) Code: Encode the input data in an appropriate design matrix. Note: no further
scaling is required for the input variables here. Hint: model.matrix()
Write-up: Give mathematical expressions for the encoding of the input vector,
$x_i$, where $i$ denotes the $i$th observation. (2)
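As a hedged illustration of the model.matrix() hint, the sketch below encodes the first four rows shown in the head() output above. The object names toy and X are made up for this example, and whether you keep the intercept column depends on how your own network handles bias terms, so treat this as one possible encoding rather than the required one.

```r
# First four rows from the head() output above; 'toy' and 'X' are
# hypothetical names used only for this illustration.
toy <- data.frame(
  Nutrition = c(0.74, 0.30, 1.40, 1.03),
  Age_Scl   = c(0.47, 0.52, 0.39, 0.48),
  Sex       = factor(c("Male", "Female", "Female", "Female")),
  ShoeBrand = factor(c("NewBalance", "Nike", "Nike", "NewBalance"))
)

# One-sided formula: only the predictors are encoded.
# Each two-level factor becomes a single 0/1 indicator column, with the
# alphabetically first level ('Female', 'NewBalance') as the baseline.
X <- model.matrix(~ Nutrition + Age_Scl + Sex + ShoeBrand, data = toy)

print(colnames(X))
print(dim(X))
```

Here the indicator columns are named SexMale and ShoeBrandNike, which makes the baseline level of each factor explicit when you write down the encoding mathematically.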
(c) Code: Write an R function that evaluates the updating equations that define
a neural network with a single hidden layer with m hidden nodes with logistic
activation functions on all hidden nodes. Full marks can only be obtained for
evaluating the forward equations in matrix form.
Write-up: Briefly motivate your choice of activation functions, cost function
and regularisation mechanism. (5)
(d) Code: Fit two neural networks to the data, each with a single hidden layer,
containing three and five nodes respectively. Do this by conducting an
appropriate validation analysis under an appropriately chosen regularisation mechanism.
You may use any standard R optimisation routines in order to fit the models.
Write-up: Plot the validation error vs. λ for both models on the same figure
and interpret the results. Use this figure to motivate your choice of regularisation
level and model (amongst the two fitted here). That is, first determine which
model to use and then report the level of regularisation which you will apply to
the chosen model. (7)
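The mechanics of a validation curve can be sketched on a much simpler stand-in. The example below is an assumption-laden illustration, not a neural network: it uses ridge-penalised polynomial regression on simulated data, with two polynomial degrees standing in for the two model sizes. All names (ridge_fit, val_curve, lambdas) and the simulated data are hypothetical.

```r
set.seed(1)
n <- 60
x <- runif(n, -1, 1)
y <- sin(2 * x) + rnorm(n, 0, 0.3)   # simulated data, not the race data
train <- 1:40                         # indices used for fitting
val   <- 41:60                        # indices held out for validation

# Closed-form ridge estimate; the intercept (first column) is not penalised.
ridge_fit <- function(X, y, lambda) {
  pen <- diag(c(0, rep(1, ncol(X) - 1)))
  solve(crossprod(X) + lambda * pen, crossprod(X, y))
}

lambdas <- exp(seq(log(1e-4), log(10), length.out = 25))

# Validation MSE over the lambda grid for a polynomial of a given degree.
val_curve <- function(degree) {
  X <- cbind(1, poly(x, degree = degree, raw = TRUE))
  sapply(lambdas, function(l) {
    beta <- ridge_fit(X[train, , drop = FALSE], y[train], l)
    mean((y[val] - X[val, , drop = FALSE] %*% beta)^2)
  })
}

err_a <- val_curve(3)   # stand-in for the smaller model
err_b <- val_curve(6)   # stand-in for the larger model

# Both curves on one figure, lambda on a log scale.
plot(lambdas, err_a, log = "x", type = "b", pch = 1,
     ylim = range(err_a, err_b),
     xlab = expression(lambda), ylab = "Validation MSE")
lines(lambdas, err_b, type = "b", pch = 2, col = "red")
legend("topleft", legend = c("degree 3", "degree 6"),
       pch = c(1, 2), col = c("black", "red"))

best_a <- lambdas[which.min(err_a)]
best_b <- lambdas[which.min(err_b)]
```

For the networks, the same loop applies with the ridge solve replaced by your own penalised fit at each λ; the comparison of the two minima is what motivates the choice of model and regularisation level.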
(e) Code & Write-up: Use the model selected in (d) to plot response curves over
Age and Nutrition for male runners who use Nike shoes. Do the same for female
runners. Use these figures to formulate a response that describes the relationship
between the predictors and the response to the researchers for whom you are
consulting. Hint: use a 2D lattice over Age and Nutrition and visualise using
filled.contour(). Failover: If you can't get the 2D lattice to work, draw the
response curves over Nutrition with Age fixed at 40 years (0.4, scaled). (5)
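A minimal sketch of the lattice hint follows, assuming a hypothetical prediction function toy_predict() in place of a fitted network. The surface it draws is invented purely to show how the grid is built and reshaped; the grid ranges come from the variable table above.

```r
# Hypothetical stand-in for the fitted model's predictions; the functional
# form is made up and carries no meaning for the real data.
toy_predict <- function(age, nutrition) {
  10 - 4 * (age - 0.4)^2 - (nutrition - 1)^2
}

# Grid over the stated ranges of Age_Scl and Nutrition.
age_grid <- seq(0.2, 0.8, length.out = 50)
nut_grid <- seq(0, 2.5, length.out = 50)
lattice  <- expand.grid(Age_Scl = age_grid, Nutrition = nut_grid)

# expand.grid varies its first argument fastest, so filling a matrix
# column by column gives rows indexed by age and columns by nutrition.
z <- matrix(toy_predict(lattice$Age_Scl, lattice$Nutrition),
            nrow = length(age_grid), ncol = length(nut_grid))

filled.contour(age_grid, nut_grid, z,
               xlab = "Age_Scl", ylab = "Nutrition",
               main = "Predicted speed (toy surface)")
```

With a real model, the lattice would first be expanded with the fixed Sex and ShoeBrand values and encoded in the same way as the training inputs before predicting.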
(f) Code: Use the network fitted in (d) to predict the responses for the test dataset.
Write your predictions to a .csv file to be handed in with your report using the
following naming convention (replace 'STDNUM001' with your student number):
R> pred = data.frame(predictions = matrix(predictions, ncol = 1))
R> write.table(pred, 'STDNUM001_STA4026S_CA2.csv', quote = FALSE, row.names = FALSE, sep = ',')
The .csv file is to be uploaded to the CA2 predictions assignment tab on Vula.
Make sure your file contains a single column of predictions! Important: if you
did not get to this point or your predictions did not work for whatever reason,
change the name of the example file given to reflect your student ID and return
that without altering its contents. (2)
Question 2 (9 marks)
A subject of interest in modern research on neural networks is that of input sensitivity.
For our purposes, we'll focus specifically on the sensitivity measured as the gradient
of the 1st output with respect to the inputs of the model for a given parameter set.
By simple modification of the elements in the backprop algorithm we can calculate
these sensitivities directly. Alternatively, this can be achieved by approximating the
gradient of the output variable: Let $x = (x_j)_{1 \times p}$ denote a vector of inputs, then define
two new vectors $x^{k+} = (x^{k+}_j)_{1 \times p}$ and $x^{k-} = (x^{k-}_j)_{1 \times p}$ where
$$x^{k+}_j = \begin{cases} x_j + h/2 & \text{if } j = k, \\ x_j & \text{otherwise,} \end{cases}$$
for some index $k$ and $h$ suitably small. Likewise:
$$x^{k-}_j = \begin{cases} x_j - h/2 & \text{if } j = k, \\ x_j & \text{otherwise.} \end{cases}$$
Evaluate
$$\nabla_k = \frac{a^L_1(x^{k+}, \hat{\theta}) - a^L_1(x^{k-}, \hat{\theta})}{h}$$
for all $k$ variables, where $a^L_1(x, \theta)$ denotes the first output ($L$ is the number of layers
in the network) evaluated for the input vector $x$ and parameter set $\theta$. Note that we
have to estimate $\theta$ here first.
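The central-difference scheme above can be sketched on a toy function whose gradient is known, which makes the approximation error easy to see. This is only an illustration under stated assumptions: central_diff() and the function f are hypothetical names, and f stands in for the network output evaluated at the estimated parameters.

```r
# Central-difference approximation of the gradient of a scalar function f
# at the input vector x: perturb one coordinate at a time by +/- h/2.
central_diff <- function(f, x, h = 0.01) {
  sapply(seq_along(x), function(k) {
    x_plus  <- x
    x_minus <- x
    x_plus[k]  <- x[k] + h / 2
    x_minus[k] <- x[k] - h / 2
    (f(x_plus) - f(x_minus)) / h
  })
}

# Toy function with a known gradient (NOT a neural network).
f  <- function(x) sin(x[1]) + x[2]^2
x0 <- c(0.5, 2)

grad_approx <- central_diff(f, x0)
grad_exact  <- c(cos(0.5), 2 * 2)

# Largest absolute discrepancy between approximation and exact gradient.
print(max(abs(grad_approx - grad_exact)))
```

Because the difference is centred, the leading error term is of order h squared, which is why a moderately small h already gives a close match on smooth functions.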
Consider now the simulation exercise where we conducted gradient checking in class
(Lecture 6):
# Let’s fake a dataset and see if the network evaluates:
set.seed(2020)
N = 50
x = runif(N,-1,1)
e = rnorm(N,0,1)
y = 2*sin(3*pi*x)+e
plot(y~x, pch = 16, col = 'blue')
...
Use the template code provided to fit a (10)-network to the simulated data, with no
regularisation. (I've given you everything up to fitting the model.)
(a) R-code: Approximate the gradient of the output with respect to the input
at a regularly spaced set of coordinates for the input variable using h = 0.01.
Write-up: Plot the values of $\nabla_1$ for the regularly spaced set of coordinates in
the input space (evaluate this quantity at different values for the input) and
compare these to the derivative of the true target function. (You may plot the
derivative of the true target function and then comment.) (3)
(b) Does the plot in (a) suggest a means for conducting regularisation which does
not involve penalising the parameters? Clearly motivate your response. (2)
(c) R-code & write-up: We can streamline the above procedure by calculating
the gradients of the outputs w.r.t. the inputs exactly. For these purposes, do the
following: Modify your R-code in Q3 (a) to calculate and return the gradients
of the outputs w.r.t. the input variables using back-propagation. You may
define a new function/copy of an existing function if you like. Verify that these are
correct by superimposing the values calculated on your gradient testing plot in
the previous question. Note: this part might require some careful thought, but
it is actually very easy. (4)