PO687 End of term project
Dr Raluca Popp
December 7, 2020
The rationale behind the project:
• it will test all the statistical skills you acquired this term: formulating
hypotheses, visualising relationships, running statistical analyses,
presenting and interpreting the results, and data management, such as
recoding variables where needed;
• you have some freedom over the analysis you will run: you have to pick
one of the three datasets available to you, and you get to pick the variables
you will use in the analysis;
• we will not tell you exactly which methods to apply; you will need to
think about the variables you are using and decide which statistical
techniques are appropriate for testing the relationship(s) between the
variables you chose;
• think about it as a miniature research project, but one in which you don’t
need a theory and literature review part. Treat this as practice for your
dissertation next year (if you choose to write one).
Your seminar leaders will not show you how to run the analysis on the three
datasets for the final project. Statistical analysis is run the same way, following
the same principles, whatever the dataset. If you learn which functions to run
when, you will be able to apply them to any dataset.
A word on R code:
• It is not mandatory to add your R code to the assignment, but it is
recommended. It does not count towards the word limit (which is not
strict, anyway) and you will not be marked on it. However, it helps us
when marking the assignment.
• If you produce your document in Word, then you can add the code at the
end of the assignment.
• If you produce the assignment using RMarkdown, then you don’t need to
include the code at the end, as it is part of the document.
Formulate hypotheses
1. Pick a dataset among gss, nes and world. Inspect it, have a look at the
variables it contains and at the codebook. Select an outcome and a predictor
variable. These will be the central elements of your assignment. Remember
that the outcome variable needs to be interval, ratio or high-level ordinal - what
we call a continuous variable. Feel free to recode variables where you need to.
Formulate the working and the null hypotheses. (15 points)
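As a rough illustration of inspecting and recoding, here is a minimal sketch. It uses R's built-in `mtcars` data purely as a stand-in (it is not one of the course datasets), and the new variable name `transmission` is made up for the example. The same functions apply to whichever dataset you choose.

```r
# Stand-in data: R's built-in mtcars, NOT one of the course datasets.
data(mtcars)

# Inspect the structure and the first few rows.
str(mtcars)
head(mtcars)

# Recode the numeric 0/1 indicator `am` into a labelled factor.
# `transmission` is a hypothetical name chosen for this example.
mtcars$transmission <- factor(mtcars$am,
                              levels = c(0, 1),
                              labels = c("automatic", "manual"))

# Cross-tabulate old against new to check that the recode worked.
table(mtcars$am, mtcars$transmission)
```

Cross-tabulating the original and recoded variables is a quick way to confirm that no category was dropped or mislabelled.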
Univariate statistics and visualisations
2. Describe the two variables. Create appropriate visualisations for each
variable, accompanied by the appropriate descriptive statistics (hint: it all
depends on the level of measurement). (15 points)
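To illustrate how the level of measurement drives the choice, the sketch below describes one continuous and one categorical variable. It again assumes the built-in `mtcars` data as a stand-in, with `transmission` as a hypothetical recoded factor; it uses base R graphics only.

```r
# Stand-in data: built-in mtcars, NOT one of the course datasets.
data(mtcars)
mtcars$transmission <- factor(mtcars$am,
                              levels = c(0, 1),
                              labels = c("automatic", "manual"))

# Continuous variable: centre, spread, and a histogram.
summary(mtcars$mpg)
sd(mtcars$mpg)
hist(mtcars$mpg,
     main = "Distribution of fuel economy",
     xlab = "Miles per gallon")

# Categorical variable: frequencies, proportions, and a bar chart.
freq <- table(mtcars$transmission)
prop <- prop.table(freq)
freq
round(prop, 2)
barplot(freq,
        main = "Cars by transmission type",
        ylab = "Number of cars")
```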
Visualise a bivariate relationship
3. Thinking about the types of variables you selected, create a graph that
illustrates the relationship between your dependent and independent variables.
Remember that visualisations have to be nice to look at, represent the data
truthfully, be clear and informative. In other words, do not forget to add titles,
labels and so on. (15 points)
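As a sketch of how the graph type follows from the variable types, the example below draws a boxplot for a categorical predictor and a scatterplot for a continuous one. The `mtcars` data and the `transmission` factor are stand-ins assumed for illustration, not the course data.

```r
# Stand-in data: built-in mtcars, NOT one of the course datasets.
data(mtcars)
mtcars$transmission <- factor(mtcars$am,
                              levels = c(0, 1),
                              labels = c("automatic", "manual"))

# Continuous outcome by categorical predictor: boxplot.
bp <- boxplot(mpg ~ transmission, data = mtcars,
              main = "Fuel economy by transmission type",
              xlab = "Transmission",
              ylab = "Miles per gallon")

# Continuous outcome by continuous predictor: scatterplot.
plot(mpg ~ wt, data = mtcars,
     main = "Fuel economy by vehicle weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")
```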
Hypothesis testing with a t-test or a non-parametric test
4. Test the hypothesis you formulated in Step 1 using a t-test or a non-
parametric test, depending on which one is appropriate (hint: remember it
depends on whether the variable is normally distributed or not). Report the
test statistic and its associated p-value. Use the .05 cut-off point for statistical
significance and interpret the results. (15 points)
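One possible way to organise this decision in R is sketched below. It assumes a two-group comparison on the stand-in `mtcars` data, uses the Shapiro-Wilk test as the normality check, and falls back on the Wilcoxon rank-sum test as the non-parametric alternative. These are illustrative choices, not the only acceptable ones.

```r
# Stand-in data: built-in mtcars, NOT one of the course datasets.
data(mtcars)
mtcars$transmission <- factor(mtcars$am,
                              levels = c(0, 1),
                              labels = c("automatic", "manual"))

# Check normality of the outcome within each group (Shapiro-Wilk).
normal_by_group <- tapply(mtcars$mpg, mtcars$transmission,
                          function(x) shapiro.test(x)$p.value)
normal_by_group

# Pick the test based on the normality check.
if (all(normal_by_group > 0.05)) {
  # t.test() runs Welch's t-test by default (unequal variances allowed).
  result <- t.test(mpg ~ transmission, data = mtcars)
} else {
  # Non-parametric alternative: Wilcoxon rank-sum test.
  result <- wilcox.test(mpg ~ transmission, data = mtcars, exact = FALSE)
}

# Report the test statistic and p-value, then compare with .05.
result$statistic
result$p.value
result$p.value < 0.05
```

A histogram or Q-Q plot of the outcome is a useful complement to the formal normality test, especially in large samples, where Shapiro-Wilk flags even trivial departures from normality.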
Bivariate regression
5. Test the hypothesis you formulated in Step 1 using a regression model.
Present the regression results in a table and interpret them. Use the .05 cut-off
point for statistical significance. (15 points)
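A minimal sketch of fitting a bivariate OLS model and extracting the numbers that go into a results table follows. It assumes the stand-in `mtcars` data; `model1` and `coef_table` are names chosen for the example.

```r
# Stand-in data: built-in mtcars, NOT one of the course datasets.
data(mtcars)

# Bivariate OLS regression: outcome ~ predictor.
model1 <- lm(mpg ~ wt, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- summary(model1)$coefficients
round(coef_table, 3)

# Model fit and sample size, usually reported under the table.
summary(model1)$r.squared
nobs(model1)
```

The slope is read as the expected change in the outcome for a one-unit increase in the predictor; its p-value is what gets compared with the .05 cut-off.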
Multiple regression
6. Expand on the relationship you tested above, by choosing another two
variables that could improve your model. Feel free to recode variables.
6a. Formulate a hypothesis for the relationship between each new variable and
your outcome variable. (5 points)
6b. Present univariate analysis on the new variables (descriptive statistics
and visualisations). (5 points)
6c. Run a regression model that includes the new variables. Present the
regression results in a table and interpret them. Use the .05 cut-off point for
statistical significance. Run regression diagnostics for your model and discuss
whether your model meets the OLS assumptions. If it violates any assumptions,
you need to indicate how you would fix the issue. You don’t need to re-run the
model. (10 points)
6d. Compare the new regression model to the model from Step 5, using
the appropriate statistical test. Report the results and interpret them. Is the
second regression model more informative? (5 points)
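Steps 6c and 6d can be sketched together as follows, again on the stand-in `mtcars` data with arbitrarily chosen predictors. The diagnostics shown (the standard `plot()` panels for `lm` objects, a Shapiro-Wilk test on the residuals, and variance inflation factors computed by hand) are one reasonable selection rather than a required list. The F-test from `anova()` is a common choice for comparing nested OLS models, assumed here to be suitable because the smaller model's predictors are a subset of the larger model's.

```r
# Stand-in data: built-in mtcars, NOT one of the course datasets.
data(mtcars)

# Model from Step 5 and the expanded model with two extra predictors.
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp + qsec, data = mtcars)
round(summary(model2)$coefficients, 3)

# 6c: diagnostic plots (residuals vs fitted, Q-Q, scale-location, leverage).
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))

# Normality of residuals.
resid_normality <- shapiro.test(residuals(model2))
resid_normality$p.value

# Multicollinearity: VIF by hand, 1 / (1 - R^2), where R^2 comes from
# regressing each predictor on the remaining predictors.
predictors <- c("wt", "hp", "qsec")
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = mtcars))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)

# 6d: F-test comparing the nested models.
comparison <- anova(model1, model2)
comparison

# Adjusted R-squared for both models, as a complementary comparison.
c(model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared)
```

A p-value below .05 in the `anova()` output suggests that the added variables jointly improve the fit. Both models must be estimated on the same observations, so missing values on the new variables need handling before the comparison.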