Data Interpretation Techniques Explained
Jonathan Bestwick
Wolfson Institute of Population Health
Interpreting Data
• Summarising data
• Sampling from a population
• Describing associations
• Comparisons and p-values
Types of data
• Qualitative
– Nominal (unordered): e.g. patient may live or die (yes/no); colours
– Ordinal (ordered): e.g. short, medium, tall
• Quantitative: e.g. 10 graduates; length 6.49 cm
Summarising data:
Measures of location
8, 4, 2, 11, 8, 3, 7
Median = middle value when values are ordered from smallest to largest
2, 3, 4, 7, 8, 8, 11, so the median = 7
Mean = average = sum of all the values divided by the number of values
(2+3+4+7+8+8+11)/7 = 43/7 = 6.1
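The two measures of location can be checked with a few lines of Python. This is a minimal sketch using the standard library `statistics` module on the seven example values from the slide.

```python
from statistics import mean, median

values = [8, 4, 2, 11, 8, 3, 7]

# Median: sort the values and take the middle one.
ordered = sorted(values)
print(ordered)
print(median(values))

# Mean: sum of all the values divided by the number of values.
print(sum(values), len(values))
print(round(mean(values), 1))
```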
Example: BMI in 163 1st year students
[Histogram: number of students (0 to 40) by BMI (14 to 40); mode = 21 = median, mean = 22]
Summarising data:
Measures of spread
8, 4, 2, 11, 8, 3, 7
• Standard deviation
– Mean was 6.1
– Distance from the mean, squared: (2−6.1)², (3−6.1)², (4−6.1)², (7−6.1)², (8−6.1)², (8−6.1)², (11−6.1)²
– Average squared distance = (16.81+9.61+4.41+0.81+3.61+3.61+24.01)/7 = 62.87/7 ≈ 9
– Standard deviation = square root of 9 = 3
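The same steps can be sketched in Python. Like the slide, this divides the summed squared distances by the number of values (7); many packages divide by n − 1 by default, which gives a slightly larger answer. The code uses the unrounded mean, so intermediate figures differ slightly from the slide's.

```python
from math import sqrt
from statistics import pstdev

values = [8, 4, 2, 11, 8, 3, 7]
mean_value = sum(values) / len(values)

# Squared distance of each value from the mean.
squared_distances = [(v - mean_value) ** 2 for v in values]

# Average squared distance, dividing by the number of values as on the slide.
variance = sum(squared_distances) / len(values)
sd = sqrt(variance)
print(round(variance, 1), round(sd, 1))

# statistics.pstdev also divides by n; statistics.stdev divides by n - 1.
print(round(pstdev(values), 1))
```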
Summarising data:
Measures of spread
8, 4, 2, 11, 8, 3, 7
• Interquartile range
2, 3, 4, 7, 8, 8, 11
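A minimal sketch of finding the quartiles of the seven example values. Software packages use slightly different quartile conventions; this uses the default method of Python's `statistics.quantiles`, which is an assumption rather than something stated on the slide.

```python
from statistics import quantiles

values = [8, 4, 2, 11, 8, 3, 7]

# quantiles(..., n=4) returns the three cut points that split the ordered
# data into quarters: lower quartile, median, upper quartile.
lower_quartile, middle, upper_quartile = quantiles(values, n=4)

print(sorted(values))
print(f"Interquartile range = {lower_quartile} to {upper_quartile}")
```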
[Histogram: number of students by BMI (14 to 40); interquartile range = 20 to 23]
Mean or median?
1, 2, 3, 4, 5, 6, 7
Median = 4, Mean = (1+2+3+4+5+6+7)/7 = 4
Either would do
1, 2, 3, 4, 5, 6, 100
Median = 4, Mean = (1+2+3+4+5+6+100)/7 = 17.3
Better to use the median in this case, to avoid the influence of outliers
Standard deviation or interquartile range?
1, 2, 3, 4, 5, 6, 7
Standard deviation = √[((1−4)² + (2−4)² + (3−4)² + (4−4)² + (5−4)² + (6−4)² + (7−4)²)/7] = 2
Interquartile range (IQR) = 2 to 6
Either would do
1, 2, 3, 4, 5, 6, 100
Standard deviation = 33.8
Interquartile range (IQR) = 2 to 6
Better to use IQR in this case, to avoid the influence of outliers
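The effect of a single outlier on each summary can be sketched as follows. The helper name `summarise` is made up for illustration; the standard deviation divides by n, as in the worked examples, and the quartiles use Python's default convention.

```python
from statistics import mean, median, pstdev, quantiles

def summarise(values):
    """Return mean, median, SD (dividing by n) and the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": round(mean(values), 1),
        "median": median(values),
        "sd": round(pstdev(values), 1),
        "iqr": (q1, q3),
    }

without_outlier = summarise([1, 2, 3, 4, 5, 6, 7])
with_outlier = summarise([1, 2, 3, 4, 5, 6, 100])

print(without_outlier)
print(with_outlier)
```

The mean and standard deviation change substantially when 7 is replaced by 100, while the median and interquartile range do not move.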
Which measures to use?
Example: AFP levels
[Histogram of alphafetoprotein level (0 to 3.5), with the value 1.0 marked]
A. Mean
B. Median
C. Mode
Which measures to use?
Example: AFP levels
[Histogram of alphafetoprotein level (0 to 3.5), with the values 0.8, 1.0 and 1.3 marked]
A. Standard deviation
B. Interquartile range
Distribution of TSH
(sample of pregnant women in Britain)
[Histogram: number of women (0 to 1500) by TSH (mIU/L), 0 to 6]
How should this data be summarised?
A. Median and standard deviation
B. Mean and interquartile range
C. Mean and standard deviation
D. Median and interquartile range
Distribution of free thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women (0 to 1250) by FT4 (pmol/L), 8 to 20]
How should this data be summarised?
A. Median and standard deviation
B. Mean and interquartile range
C. Mean and standard deviation
D. Median and interquartile range
Distribution of thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women (0 to 1250) by FT4 (pmol/L), 8 to 20, labelled "Gaussian distribution"]
Carl Friedrich Gauss 1777-1855
• German mathematician and scientist
• The formula for the Gaussian distribution is
$$y = \frac{1}{\sqrt{2\pi}\,\mathrm{sd}}\; e^{-\frac{(x-m)^2}{2\,\mathrm{sd}^2}}$$
• The Gaussian distribution is determined only by the mean (m)
and standard deviation (sd)
• Abraham de Moivre actually specified the formula 100 years
before
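As a check on the formula, the sketch below evaluates it directly and compares the result with `NormalDist.pdf` from the Python standard library. The mean of 100 and standard deviation of 15 are illustrative values only, not taken from the slides, and `gaussian_density` is a made-up helper name.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def gaussian_density(x, m, sd):
    """Height of the Gaussian curve at x for mean m and standard deviation sd."""
    return exp(-((x - m) ** 2) / (2 * sd ** 2)) / (sqrt(2 * pi) * sd)

# Illustrative (hypothetical) mean and SD.
m, sd = 100.0, 15.0

for x in (70.0, 85.0, 100.0, 115.0, 130.0):
    by_formula = gaussian_density(x, m, sd)
    by_library = NormalDist(mu=m, sigma=sd).pdf(x)
    print(x, round(by_formula, 5), round(by_library, 5))
```

The curve depends only on the mean and standard deviation, and it is symmetric: values equally far above and below the mean have the same height.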
Gaussian Distribution
[Figure: Gaussian curve with the variable shown in SDs from the mean (-4 sd to +4 sd); 68% of values lie within 1 SD of the mean, with 16% in each tail]
Gaussian Distribution
Useful properties
[Figure: Gaussian curve; 90% of values lie within 1.64 SD of the mean, with 5% in each tail]
Gaussian Distribution
Useful properties
[Figure: Gaussian curve; 95% of values lie within 1.96 SD of the mean, with 2.5% in each tail]
Gaussian Distribution
Useful properties
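A constant proportion of values will lie within any specified number of standard deviations above or below the mean. This is the basis of reference ranges, which get narrower as the proportion covered falls:
99% range (0.5th to 99.5th centile) = mean ± 2.58 SDs
95% range (2.5th to 97.5th centile) = mean ± 1.96 SDs
90% range (5th to 95th centile) = mean ± 1.64 SDs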
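The multipliers quoted for Gaussian reference ranges can be recovered from the standard Gaussian distribution (mean 0, SD 1). A minimal sketch using `statistics.NormalDist`; `sd_multiplier` is a made-up helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def sd_multiplier(coverage):
    """Number of SDs either side of the mean that contains `coverage` of values."""
    tail = (1 - coverage) / 2          # proportion left in each tail
    return standard_normal.inv_cdf(1 - tail)

for coverage in (0.90, 0.95, 0.99):
    print(f"{coverage:.0%} range = mean ± {sd_multiplier(coverage):.2f} SDs")

# Proportion of values within 1 SD of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"{within_one_sd:.0%}")
```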
Distribution of free thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women (0 to 1250) by FT4 (pmol/L), 8 to 20; mean = 14]
Sampling from a population
What can a sample tell us about the population?
• Population: e.g. true mean BMI in all 1st year students in English universities
• Sample: e.g. sample mean of BMI in 163 QMUL 1st year students
• Statistics: links what is observed in the sample to the population
Repeated sampling from a population
[Diagram: Sample 1, Sample 2, ..., Sample 100 are drawn from the population, giving sample mean 1, sample mean 2, ..., sample mean 100, which are distributed around the true mean]
• If the sample size isn’t too small then the distribution of the
sample mean will be Gaussian
• The standard deviation of this distribution is called the
standard error
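Repeated sampling can be imitated by simulation. The sketch below is illustrative only: it assumes a hypothetical Gaussian population with mean 22 and SD 4, draws 2000 samples of 163 values, and compares the standard deviation of the sample means with the usual single-sample estimate (SD divided by the square root of the sample size).

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: Gaussian with mean 22 and SD 4 (an assumption).
TRUE_MEAN, TRUE_SD, SAMPLE_SIZE, N_SAMPLES = 22.0, 4.0, 163, 2000

sample_means = []
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# The SD of the sample means is the standard error.
observed_se = pstdev(sample_means)
predicted_se = TRUE_SD / sqrt(SAMPLE_SIZE)
mean_of_means = sum(sample_means) / N_SAMPLES

print(round(mean_of_means, 2), round(observed_se, 2), round(predicted_se, 2))
```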
Standard error of the mean
• The standard error is a measure of the statistical accuracy of
an estimate
• The standard error of the mean is the standard deviation of
the distribution of all possible sample means
• This can be estimated from a single sample as
Standard error of the mean = standard deviation / √(sample size)
Example: BMI in 163 first year QMUL students
• n=163
• mean=22
• standard deviation=4
Standard error = standard deviation / √n = 4 / √163 = 0.3
Confidence interval for the mean
The 95% confidence interval (CI) of a sample mean is
95% CI = sample mean ± 1.96 × standard error
In our example: 95% CI = mean ± 1.96 × SE
= 22 ± 1.96 × 0.3
= 21.4 to 22.6
We are 95% sure that the true mean BMI lies between 21.4 and 22.6: if we took many samples of the same size, about 95% of the intervals calculated this way would contain the true mean
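The standard error and confidence interval for the BMI example can be reproduced in a few lines. This is a minimal sketch using the figures from the slides (n = 163, mean = 22, standard deviation = 4).

```python
from math import sqrt

n = 163          # sample size
sample_mean = 22
sd = 4

standard_error = sd / sqrt(n)
lower = sample_mean - 1.96 * standard_error
upper = sample_mean + 1.96 * standard_error

print(f"Standard error = {standard_error:.1f}")
print(f"95% CI = {lower:.1f} to {upper:.1f}")
```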
[Figure: four panels showing distributions of systolic blood pressure (mmHg), 50 to 190]
Does the 95% CI
A. Get wider
B. Get narrower
C. Stay the same
Describing associations
Example data: BRadykinesia-Akinesia INcoordination (BRAIN) Test
Computerised alternate finger tapping test
Best 2 measures
Kinesia score (number of keystrokes)
Akinesia time (mean dwell time on each key)
Correlation
[Scatter plot: Total Motor UPDRS (0 to 100) against Kinesia score (0 to 80); r = -0.53]
• Total motor UPDRS scores are negatively correlated with Kinesia scores
• The correlation coefficient, usually denoted r, takes a value between -1 and +1
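A minimal sketch of how a Pearson correlation coefficient is calculated. The data are made up for illustration and are not the BRAIN test data; `pearson_r` is a hypothetical helper written from the textbook definition.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data.
x = [1, 2, 3, 4, 5]
rising = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
falling = pearson_r(x, [10, 8, 6, 4, 2])   # y falls in step with x
no_trend = pearson_r(x, [3, 1, 4, 1, 3])   # no consistent trend

print(rising, falling, round(no_trend, 6))
```

The three cases correspond to the extremes of the scale: close to +1, close to -1, and close to 0.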
Correlation (Pearson)
[Scatter plots of variable y against variable x (both 1 to 5):
r = 0: no correlation
r = 1: perfect positive correlation
r = -1: perfect negative correlation]
Linear regression
[Scatter plot against Kinesia score (0 to 80) with a fitted regression line; the values 43.8 (Kinesia score) and 25.5 (y-axis) are marked. Labels: dependent variable and independent variable]
[Scatter plot of y (0 to 3000) against x (0 to 100) with the regression line drawn; r = 0]
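A straight regression line can be fitted by least squares. The sketch below uses made-up data and a hypothetical helper, `fit_line`; it is not the analysis behind the slides. The second example shows how a strongly curved relationship can still give a flat fitted line, which is the situation where r = 0 despite a clear pattern.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x.
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = intercept + slope * 2.5   # predicted y at x = 2.5
print(intercept, slope, predicted)

# Made-up curved data, y = (x - 50)^2: the best straight line is flat
# even though y clearly depends on x.
xs = [0, 25, 50, 75, 100]
flat_intercept, flat_slope = fit_line(xs, [(x - 50) ** 2 for x in xs])
print(flat_intercept, flat_slope)
```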
Some other regression models
• Many different regression curves can be fitted where the data pattern is not linear
• Polynomial: quadratic (y = b0 + b1x + b2x²), cubic (y = b0 + b1x + b2x² + b3x³) etc
• Exponential growth (y = b^x; b > 1), exponential decay (y = b^x; b < 1)
• Sigmoid
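One common way to fit an exponential curve, not described on the slides and shown here only as an assumption, is to take logarithms so that y = b^x becomes the straight line ln(y) = x × ln(b), then fit that line by least squares. The helper name and data below are made up, and the approach requires every y to be positive.

```python
from math import exp, log

def fit_exponential_base(xs, ys):
    """Estimate b in y = b**x by least squares on ln(y) = x * ln(b)."""
    # Slope of the straight line through the origin relating ln(y) to x.
    slope = sum(x * log(y) for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return exp(slope)

xs = [1, 2, 3, 4, 5]
growth = [2 ** x for x in xs]      # made-up data with b = 2 (growth)
decay = [0.5 ** x for x in xs]     # made-up data with b = 0.5 (decay)

b_growth = fit_exponential_base(xs, growth)
b_decay = fit_exponential_base(xs, decay)
print(round(b_growth, 3), round(b_decay, 3))
```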
Multivariate linear regression
Dependent variable, independent variables
[Plots with axes crown-rump length (mm), 40 to 85, and gestational age from LMP (days), 70 to 105]
Comparisons and p-values
Statistical significance
• An observed sample difference between groups might be due to chance
Interpretation: We are 95% sure that the difference in mean BMI between male and female 1st year students in English universities is between 0.36 and 2.84 kg/m²
Possible outcomes of three coin throws (H = heads, T = tails)
Throw 1        H  H  H  H  T  T  T  T
Throw 2        H  H  T  T  H  H  T  T
Throw 3        H  T  H  T  H  T  H  T
Total heads    3  2  2  1  2  1  1  0
Total tails    0  1  1  2  1  2  2  3
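The eight possible sequences can be listed and counted by computer. This sketch assumes a fair coin, so that each sequence is equally likely; that assumption is what turns the counts into probabilities.

```python
from collections import Counter
from itertools import product

# Every possible sequence of three throws, e.g. ('H', 'T', 'H').
sequences = list(product("HT", repeat=3))

# Count how many sequences give each total number of heads.
heads_counts = Counter(seq.count("H") for seq in sequences)

for heads in (3, 2, 1, 0):
    ways = heads_counts[heads]
    print(f"{heads} heads: {ways} of {len(sequences)} sequences")
```

With a fair coin, three heads in a row occurs in 1 of the 8 equally likely sequences, so it would happen by chance alone 12.5% of the time.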
Are alcohol drinkers more likely to smoke?
Sample of first year medical students

Observed
                        Ever smoked?
                        Yes    No     Total
Drink alcohol?   Yes    36     75     111
                 No      5     47      52
                 Total  41    122     163

Expected
                        Ever smoked?
                        Yes    No     Total
Drink alcohol?   Yes    28     83     111
                 No     13     39      52
                 Total  41    122     163

Expected count = row total × column total / overall total, e.g. 111×41/163 = 28
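The expected counts follow from the row and column totals of the observed table: each one is the row total multiplied by the column total, divided by the overall total. A minimal sketch using the observed counts from the slide:

```python
observed = [[36, 75],   # drink alcohol: ever smoked yes / no
            [5, 47]]    # do not drink:  ever smoked yes / no

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

# Expected count = row total x column total / overall total.
expected = [[r * c / n for c in column_totals] for r in row_totals]

for row in expected:
    print([round(count) for count in row])
```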
                        Ever smoked?
                        Yes    No     Total
Drink alcohol?   Yes    a      b      a+b
                 No     c      d      c+d
                 Total  a+c    b+d    a+b+c+d (= n)

The probability of observing the data given the fixed row and column totals is
$$\frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$
The same calculation is done for all possible tables that can be created given the fixed row and column totals. The p-value is then the sum of all probabilities less than or equal to the probability of the data observed
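The procedure described above can be sketched in Python. The function names are made up for illustration. Exact fractions are used so that the "less than or equal to" comparison is not affected by rounding, and the table probability is written with binomial coefficients, which is algebraically the same as the factorial formula.

```python
from fractions import Fraction
from math import comb

def table_probability(a, b, c, d):
    """Probability of one 2x2 table given its fixed row and column totals."""
    # Equal to (a+b)!(c+d)!(a+c)!(b+d)! / (a! b! c! d! n!).
    n = a + b + c + d
    return Fraction(comb(a + b, a) * comb(c + d, c), comb(n, a + c))

def fisher_exact_p(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed one."""
    observed = table_probability(a, b, c, d)
    row1, row2, col1 = a + b, c + d, a + c
    total = Fraction(0)
    # x is the top-left cell; the fixed totals then determine the other cells.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p <= observed:
            total += p
    return total

# Observed table from the smoking and alcohol example.
p_value = fisher_exact_p(36, 75, 5, 47)
print(float(p_value))
```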