Interpreting Data

Jonathan Bestwick
Wolfson Institute of Population Health
Interpreting Data

• Summarising data
• Sampling from a population
• Describing associations
• Comparisons and p-values

Types of data
• Qualitative
  – Nominal (unordered)
    – Binary: e.g. patient may live or die (yes/no)
    – Categorical: e.g. colours
  – Ordinal (ordered): e.g. short, medium, tall
• Quantitative
  – Discrete: e.g. 10 graduates
  – Continuous: e.g. length, 6.49 cm
Summarising data:
Measures of location
8, 4, 2, 11, 8, 3, 7
Median = middle value when values ordered from smallest to largest
2, 3, 4, 7, 8, 8, 11

Mode = most common value


2, 3, 4, 7, 8, 8, 11

Mean = average = sum of all the values divided by the number of values
(2+3+4+7+8+8+11)/7 = 43/7 = 6.1
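The three measures can be checked directly in code. Below is a minimal sketch in Python using only the standard library; the values are the ones from the example above.

```python
# Measures of location for the example data 8, 4, 2, 11, 8, 3, 7.
import statistics

values = [8, 4, 2, 11, 8, 3, 7]

print(statistics.median(values))          # 7: middle value once sorted
print(statistics.mode(values))            # 8: most common value
print(round(statistics.mean(values), 1))  # 6.1: sum (43) divided by 7
```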
Example: BMI in 163 1st year students
[Histogram: number of students by BMI (14 to 40). Mode = 21 = median; mean = 22.]
Summarising data:
Measures of spread
8, 4, 2, 11, 8, 3, 7
• Standard deviation
  – Mean was 6.1
  – Distance from the mean, squared: (2−6.1)² = 16.81, (3−6.1)² = 9.61, etc.
  – Average squared distance from the mean:
    (16.81+9.61+4.41+0.81+3.61+3.61+24.01)/7 = 62.87/7 ≈ 9
  – Standard deviation = square root of 9 = 3
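The same calculation written out in Python, as a minimal sketch; like the slide it divides by n (the population form of the standard deviation).

```python
# Standard deviation of 8, 4, 2, 11, 8, 3, 7, step by step.
import math

values = [8, 4, 2, 11, 8, 3, 7]
mean = sum(values) / len(values)                      # about 6.1
squared_distances = [(v - mean) ** 2 for v in values]
variance = sum(squared_distances) / len(values)       # about 9 (divide by n)
sd = math.sqrt(variance)                              # about 3

print(round(mean, 1), round(variance, 1), round(sd, 1))
```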
Summarising data:
Measures of spread
8, 4, 2, 11, 8, 3, 7
• Interquartile range
  – 25th to 75th centile (the median is the 50th centile)
  – Ordered values 2, 3, 4, 7, 8, 8, 11: the 25th centile is 3 and the 75th centile is 8
Example: BMI in 163 1st year students
[Histogram: number of students by BMI (14 to 40), with the 25th and 75th centiles marked. Interquartile range = 20 to 23.]
Mean or median?
1, 2, 3, 4, 5, 6, 7
Median = 4, Mean = (1+2+3+4+5+6+7)/7 = 4
Either would do

1, 2, 3, 4, 5, 6, 100
Median = 4, Mean = (1+2+3+4+5+6+100)/7 = 17.3
Better to use the median in this case, to avoid the influence of outliers
Standard deviation or interquartile range?
1, 2, 3, 4, 5, 6, 7
Standard deviation = √[((1−4)²+(2−4)²+(3−4)²+(4−4)²+(5−4)²+(6−4)²+(7−4)²)/7] = 2
Interquartile range (IQR) = 2 to 6
Either would do

1, 2, 3, 4, 5, 6, 100
Standard deviation = 33.8
Interquartile range (IQR) = 2 to 6
Better to use the IQR in this case, to avoid the influence of outliers
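A quick numerical check of this comparison, as a sketch assuming numpy is available; note that np.percentile interpolates, so its quartiles (2.5 and 5.5) differ slightly from the 2 and 6 quoted above.

```python
# How an outlier affects the standard deviation versus the interquartile range.
import numpy as np

for data in ([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 100]):
    sd = np.std(data)                        # divide-by-n, as on the slides
    q25, q75 = np.percentile(data, [25, 75])
    print(data, "SD =", round(float(sd), 1), "IQR =", q25, "to", q75)

# The SD jumps from 2.0 to about 33.8; the IQR is unchanged by the outlier.
```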
Which measures to use?
Example: AFP levels
[Histogram: number of women by alphafetoprotein level (0 to 3.5), with the value 1.0 marked.]
Which measure of central location?
A. Mean
B. Median
C. Mode
Which measures to use?
Example: AFP levels
[Same histogram of alphafetoprotein level, with the values 0.8, 1.0 and 1.3 marked.]
Which measure of spread?
A. Standard deviation
B. Interquartile range
Distribution of TSH
(sample of pregnant women in Britain)
[Histogram: number of women by TSH (0 to 6 mIU/L).]
How should this data be summarised?
A. Median and standard deviation
B. Mean and interquartile range
C. Mean and standard deviation
D. Median and interquartile range
Distribution of free thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women by FT4 (8 to 20 pmol/L).]
How should this data be summarised?
A. Median and standard deviation
B. Mean and interquartile range
C. Mean and standard deviation
D. Median and interquartile range
Distribution of free thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women by FT4 (8 to 20 pmol/L); the distribution is Gaussian.]
Carl Friedrich Gauss 1777-1855
• German mathematician and scientist
• The formula for the Gaussian distribution is
  y = (1 / (√(2π) × sd)) × e^(−(x − m)² / (2 sd²))
• The Gaussian distribution is determined only by the mean (m) and standard deviation (sd)
• Abraham de Moivre actually specified the formula 100 years before
Gaussian Distribution
[Figure: Gaussian curves of systolic blood pressure (mmHg) with means 110, 120 and 130.]

Gaussian Distribution
[Figure: Gaussian curves of systolic blood pressure (mmHg) with standard deviations 10, 15 and 20.]
Gaussian Distribution
Useful properties
• A constant proportion of values will lie within any specified number of standard deviations above or below the mean
Gaussian Distribution
Useful properties
[Figure: 68% of values lie within 1 SD of the mean (16% in each tail).]
[Figure: 90% of values lie within 1.64 SDs of the mean (5% in each tail).]
[Figure: 95% of values lie within 1.96 SDs of the mean (2.5% in each tail).]
Gaussian Distribution
Useful properties
A constant proportion of values will lie within any
specified number of Standard Deviations above or
below the mean: reference ranges
99% range (0.5th to 99.5th centile) = mean ± 2.58 SDs

95% range (2.5th to 97.5th centile) = mean ± 1.96 SDs

90% range (5th to 95th centile) = mean ± 1.64 SDs

(getting narrower)
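These multipliers come from the standard Gaussian distribution and can be checked with a short sketch, assuming scipy is available.

```python
# Reference-range multipliers: the central X% of a Gaussian lies within
# norm.ppf(1 - (1 - X)/2) standard deviations of the mean.
from scipy.stats import norm

for coverage in (0.99, 0.95, 0.90):
    multiplier = norm.ppf(1 - (1 - coverage) / 2)
    print(f"{coverage:.0%} range = mean ± {multiplier:.2f} SDs")
# Prints 2.58, 1.96 and 1.64, matching the ranges above.
```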
Distribution of free thyroxine
(sample of pregnant women in Britain)
[Histogram: number of women by FT4 (8 to 20 pmol/L).]
Mean = 14, SD = 1.7
95% reference range = 14 ± 1.96×1.7 = 10.7 to 17.3
Observed middle 95% of women in the sample = 10.9 to 17.5
Interpreting Data

• Summarising data
• Sampling from a population
• Describing associations
• Comparisons and p-values

What can a sample tell us about the population?
Population: e.g. the true mean BMI of all 1st year students in English universities
Sample: e.g. the sample mean BMI of 163 QMUL 1st year students
[Diagram: statistics are used to make inferences from the sample about the population.]
Repeated sampling from a population
[Diagram: from a population with a true mean, samples 1, 2, …, 100 are drawn, each with its own sample mean.]
• If the sample size isn’t too small then the distribution of the sample mean will be Gaussian
• The standard deviation of this distribution is called the standard error
Standard error of the mean
• The standard error is a measure of the statistical accuracy of an estimate
• The standard error of the mean is the standard deviation of the distribution of all possible sample means
• This can be estimated from a single sample as

  Standard error of the mean = standard deviation / √(sample size)
Example: BMI in 163 first year QMUL students
• n=163
• mean=22
• standard deviation=4

Standard error = standard deviation / √n = 4 / √163 = 0.3
Confidence interval for the mean
The 95% confidence interval (CI) of a sample mean is
95% CI = sample mean ± 1.96 × standard error
In our example: 95% CI = mean ± 1.96 x SE
= 22 ± 1.96 x 0.3
= 21.4 to 22.6
If we took repeated samples of the same size, we would expect 95% of the confidence intervals calculated in this way to contain the true mean BMI

In the population we are 95% sure that the mean BMI could be as low as 21.4 or as high as 22.6
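A minimal sketch of the whole calculation, using the summary figures from the BMI example (n = 163, mean = 22, SD = 4); with raw data you would compute the mean and SD first.

```python
# Standard error and 95% confidence interval for a sample mean.
import math

n, mean, sd = 163, 22, 4

se = sd / math.sqrt(n)          # about 0.3
lower = mean - 1.96 * se
upper = mean + 1.96 * se

print(f"SE = {se:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")  # 21.4 to 22.6
```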
95% confidence interval for the mean weight
of a sample of 30 adult men is 75kg to 81kg

Which is the correct definition?

A. In the population we are 95% sure that the mean weight could be as low as 75kg or as high as 81kg

B. In this study 95% of men weighed between 75kg and 81kg
Confidence intervals

99% CI = sample mean ± 2.58 × standard error
95% CI = sample mean ± 1.96 × standard error
90% CI = sample mean ± 1.64 × standard error
(getting narrower)

• Use the standard deviation for ranges (for individual values)
• Use the standard error for confidence intervals (for means)
What happens as sample size increases?
Example: systolic blood pressure
Measured in samples of 25, 50 and 100; each time mean = 120, SD = 15 mmHg

Sample size   95% range        95% CI
25            90.6 to 149.4    114.1 to 125.9
50            90.6 to 149.4    115.8 to 124.2
100           90.6 to 149.4    117.1 to 122.9

Does the 95% range A. Get wider B. Get narrower C. Stay the same
Does the 95% CI A. Get wider B. Get narrower C. Stay the same
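The table can be reproduced with a short sketch: the 95% range uses the SD (and so does not change), while the 95% CI uses the standard error, which shrinks with the square root of the sample size.

```python
# 95% range versus 95% confidence interval as the sample size grows.
import math

mean, sd = 120, 15

for n in (25, 50, 100):
    lo_r, hi_r = mean - 1.96 * sd, mean + 1.96 * sd        # 95% range
    se = sd / math.sqrt(n)
    lo_ci, hi_ci = mean - 1.96 * se, mean + 1.96 * se      # 95% CI
    print(f"n={n:3d}  range {lo_r:.1f} to {hi_r:.1f}  CI {lo_ci:.1f} to {hi_ci:.1f}")
```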
Interpreting Data

• Summarising data
• Sampling from a population
• Describing associations
• Comparisons and p-values

Example data: BRadykinesia-Akinesia INcoordination (BRAIN) Test
Computerised alternate finger tapping test

Best 2 measures:
• Kinesia score (number of keystrokes)
• Akinesia time (mean dwell time on each key)
Correlation
[Scatter plot: Total Motor UPDRS (0 to 100) against Kinesia score (0 to 80).]
Correlation
[Scatter plot: Total Motor UPDRS against Kinesia score; r = −0.53.]

• Total motor UPDRS scores are negatively correlated with Kinesia scores
• The correlation coefficient, usually denoted r, takes a value between −1 and +1
Correlation (Pearson)
[Scatter plots of variable y against variable x:
  r = 0: no correlation
  r = 1: perfect positive correlation
  r = −1: perfect negative correlation]

r = cov[X, Y] / (sd_X × sd_Y), where cov[X, Y] = E[(X − μ_X)(Y − μ_Y)]
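A minimal sketch of this formula in code, assuming numpy is available; the x and y values below are made up for illustration, and the result is checked against numpy's built-in correlation.

```python
# Pearson correlation: covariance divided by the product of the SDs.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # E[(X − μ_X)(Y − μ_Y)]
r = cov_xy / (x.std() * y.std())                    # divide by sd_X × sd_Y

print(round(float(r), 3), round(float(np.corrcoef(x, y)[0, 1]), 3))  # identical
```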
Correlation (Spearman’s rank)
• If the data do not follow a Gaussian distribution or there is a non-linear monotonic relationship, Spearman’s rank correlation can be used
• Both variables are ranked and the correlation is calculated using the difference (d) in ranks:
  r = 1 − 6Σd² / (n(n² − 1))
[Scatter plot of y against x showing a non-linear monotonic relationship: Pearson r = 0.88, Spearman’s rank r = 0.96.]
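A sketch of the rank formula, assuming scipy is available and the data have no tied ranks (the formula above assumes untied ranks; scipy's spearmanr also handles ties). The data are illustrative.

```python
# Spearman's rank correlation from the difference in ranks.
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([10, 20, 30, 40, 50, 60])
y = np.array([1.2, 1.8, 3.5, 2.9, 20.0, 55.0])

d = rankdata(x) - rankdata(y)                 # difference in ranks
n = len(x)
r_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

rho, p = spearmanr(x, y)                      # scipy's version, with a p-value
print(round(float(r_formula), 3), round(float(rho), 3))   # both about 0.943
```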
Linear regression
[Scatter plot: Total Motor UPDRS against Kinesia score, with the fitted regression line; for example, a kinesia score of 43.8 corresponds to a predicted UPDRS of 25.5.]

The regression line minimises the sum of the squared vertical distances from each point to the line
Linear regression
[Scatter plot: Total Motor UPDRS against Kinesia score, with the fitted regression line crossing the vertical axis at 55.8.]

y = b0 + b1x  (y is the dependent variable, x the independent variable)
UPDRS = b0 + b1×KS
UPDRS = 55.8 – 0.7×KS

UPDRS decreases by 0.7 for every unit increase in kinesia score
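A minimal sketch of fitting a line of this form by least squares, assuming scipy is available; the kinesia score and UPDRS values below are illustrative, not the study data.

```python
# Simple linear regression: y = b0 + b1*x fitted by least squares.
from scipy.stats import linregress

kinesia_score = [20, 30, 40, 50, 60, 70]
updrs = [50, 38, 33, 24, 18, 12]

fit = linregress(kinesia_score, updrs)
print(f"b0 (intercept) = {fit.intercept:.1f}, b1 (slope) = {fit.slope:.2f}")
# The slope is negative: predicted UPDRS falls as the kinesia score rises.
```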


Correlation (Pearson) and linear regression

Pearson’s correlation and linear regression only describe linear associations
[Scatter plot: y against x with a strong non-linear relationship for which r = 0 and the fitted regression line is flat.]
Some other regression models

• Many different regression curves can be fitted where the data pattern is not linear
• Polynomial – quadratic (y = b0 + b1x + b2x²), cubic (y = b0 + b1x + b2x² + b3x³) etc
• Exponential growth (y = bˣ; b > 1), exponential decay (y = bˣ; b < 1)
• Sigmoid
Multivariate linear regression

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + …
(y is the dependent variable; x1, x2, … are the independent variables)

• Independent variables can be continuous, categorical or a mix
• Example: kinesia score according to age and sex
  – Kinesia score decreases by an average of 0.2 points per year of age (p=0.005)
  – Kinesia score is also on average 1.5 points higher in females (p=0.035)
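A minimal sketch of fitting such a model by least squares with numpy; the age, sex and outcome values below are illustrative, not the study data (a library such as statsmodels would also report p-values).

```python
# Multivariable linear regression: y = b0 + b1*age + b2*sex.
import numpy as np

age = np.array([35, 42, 50, 58, 63, 70, 75, 80], dtype=float)
sex = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)   # 0 = male, 1 = female
y = np.array([62, 60, 58, 57, 55, 54, 51, 50], dtype=float)

X = np.column_stack([np.ones_like(age), age, sex])       # intercept, x1, x2
(b0, b_age, b_sex), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"b0 = {b0:.1f}, change per year of age = {b_age:.2f}, "
      f"female vs male = {b_sex:.2f}")
```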
Predicting gestational age from crown-rump length
Which regression should you be doing?
A: GA = 61 + 0.4×CRL      B: CRL = -13 + 0.9×GA
[Two scatter plots: gestational age from LMP (days) against crown-rump length (mm), and crown-rump length (mm) against gestational age from LMP (days).]

What is being predicted should always be on the vertical axis
Interpreting Data

• Summarising data
• Sampling from a population
• Describing associations
• Comparisons and p-values

Statistical significance
• An observed sample difference between groups might be due to chance
• We want to know whether a result is statistically significant, i.e. unlikely to be due to chance
• To determine whether an observed difference was due to chance we look at confidence intervals and p-values
Statistical significance
Example: mean BMI in males and females
In our sample of 163 QMUL students, suppose we are interested in whether BMI differs between male and female students

          N    Mean (SD) BMI in kg/m²   Difference   95% CI for difference
Male      82   23.1 (4.2)               1.6          0.36 to 2.84
Female    81   21.5 (3.9)

What does this tell us about the difference in BMI between males and females in the population?
We look at the confidence interval:
95% CI = mean difference ± 1.96 × SE of mean difference
Statistical significance
Example: mean BMI in males and females
Difference in means (95% CI): 1.6 (0.36 to 2.84)

Interpretation:
We are 95% sure that the difference in mean BMI between male and female 1st year students in English universities is between 0.36 and 2.84 kg/m²

Is this a true difference in the population or is it likely to be a chance finding in this sample?
P-values
• If there was truly no difference between the males and females (the underlying assumption), then the p-value is the probability of observing a difference of at least 1.6 kg/m²
• In general, a p-value for a result is the probability of observing a result as or more extreme than the sample result if the underlying assumption in the population is true
Tossing a coin
• If you throw a fair coin 3 times, there are 8 possible combinations of outcomes (throw 1, throw 2, throw 3):

Outcome       HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
Total heads   3    2    2    1    2    1    1    0
Total tails   0    1    1    2    1    2    2    3

• The probability of throwing 3 heads is 1/8
• The probability of throwing 2 heads and 1 tail is 3/8
• The probability of throwing at least 2 heads is 4/8 = 1/2
Biased coin?
• We suspect a coin of being biased towards heads. We throw it three times and it lands head side up 3 times. Is there enough evidence to say it is biased?
• The probability of this occurring is 1/8 or 0.125
• We would need to have thrown the coin more times for there to be sufficient evidence that it was biased
Another coin
• We suspect another coin is biased (but don’t know to which side). To
test this we threw it 22 times and it landed head side up 17 times.
• Is there enough evidence to say the coin is biased?
Coin lands heads side up      Probability of getting that number of heads*
17 0.00628
18 0.00174
19 0.000367
20 0.0000551
21 0.00000525
22 0.000000238
Total 0.008
*Assuming the coin is fair
• Assuming the coin is fair the probability of getting at least 17 heads
from 22 throws is 0.008
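These binomial probabilities can be checked with a short sketch, assuming scipy is available; the two-sided total differs slightly from 0.016 because the slide sums rounded values.

```python
# Probability of at least 17 heads (or at least 17 tails) in 22 throws of a
# fair coin.
from scipy.stats import binom

p_at_least_17_heads = binom.sf(16, 22, 0.5)   # P(X >= 17) = 1 - P(X <= 16)
p_two_sided = 2 * p_at_least_17_heads         # tails side is equally extreme

print(round(p_at_least_17_heads, 4))   # about 0.0085
print(round(p_two_sided, 4))           # about 0.0169, i.e. the p-value
```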
P-values
• A p-value for a result is the probability of observing a result as or more extreme than the sample result if the underlying assumption in the population is true

• If the coin was fair (the underlying assumption):
  – The probability of observing 17 heads is 0.006
  – The probability of observing 17 tails is also 0.006 (as extreme)
  – The probability of observing at least 17 heads is 0.008 (as or more extreme)
  – The probability of observing at least 17 tails is also 0.008 (as or more extreme)

• So the probability of observing at least 17 heads or at least 17 tails is 0.008 + 0.008 = 0.016. This is the p-value
Statistical significance
[P-value scale from 0.0001 to 1, annotated (assuming the coin is fair) from "very unlikely to be a chance effect" at small p-values, through "probably not a chance effect", to "cannot rule out a chance effect" at large p-values. The observed p-value of 0.016 lies below 0.05.]
• The p-value is less than 0.05 so we can be reasonably confident the coin is biased
When can p-values be calculated?

• When there is a comparison
  – 2 means – are they different, i.e. is their difference different from 0?
  – Association – are the observed results different from those expected?
  – Regression – is the slope different from 0?
Statistical significance
• In a study, after 1 year patients receiving propranolol had on average a heart rate 10bpm lower than patients receiving a placebo, p=0.0003
[Same p-value scale: p = 0.0003 lies well below 0.05, towards the "very unlikely to be a chance effect" end.]
• The p-value is much lower than 0.05 so we can be very confident that propranolol reduces heart rate
Are girls smarter than boys?!
• IQ measured at age 3 in a sample of 150 girls and 156 boys

             Mean   Standard error   95% CI
Girls        110    0.98             108 to 112
Boys         106    1.00             104 to 108
Difference   4      1.40             1 to 7

• On average girls in this study had an IQ 4 points higher than boys
• In the population we are 95% sure that the mean difference is between 1 and 7
• The test used to compare the means is the two-sample t-test
• P-value calculated by comparing the t-statistic (4/1.40) to Student’s t-distribution with 304 degrees of freedom
Degrees of freedom
• What is meant by degrees of freedom?
• These are the number of values that are free to vary
• Say you had to pick three numbers that had a mean of 10
• Could pick 9, 10 and 11
• Or 8, 10 and 12
• Or 5, 9 and 16
• Once you have picked the first two numbers, to get a mean of 10, the third
number has to be fixed – only 2 of the 3 numbers are free to vary so the
degrees of freedom is 2
• In the example we have 150 girls and 156 boys, so there are 149+155=304
degrees of freedom.
Are girls smarter than boys?!
• IQ measured at age 3 in a sample of 150 girls and 156 boys

             Mean   Standard error   95% CI
Girls        110    0.98             108 to 112
Boys         106    1.00             104 to 108
Difference   4      1.40             1 to 7

• P-value calculated by comparing the t-statistic (4/1.40 = 2.85) to Student’s t-distribution with 304 degrees of freedom; p=0.007
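A minimal sketch of the two-sample t-test from summary statistics, assuming scipy is available; the SDs are back-calculated from the standard errors above (SD = SE × √n), so the result is approximate.

```python
# Two-sample t-test for the difference in mean IQ between girls and boys.
import math
from scipy.stats import ttest_ind_from_stats

n_girls, mean_girls, se_girls = 150, 110, 0.98
n_boys, mean_boys, se_boys = 156, 106, 1.00

t, p = ttest_ind_from_stats(mean_girls, se_girls * math.sqrt(n_girls), n_girls,
                            mean_boys, se_boys * math.sqrt(n_boys), n_boys)
print(round(float(t), 2), round(float(p), 3))   # t about 2.85, p below 0.01
```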
P-values and confidence intervals
In our example:
95% CI: 1 to 7 (doesn’t contain 0)
P-value: 0.007 (<0.05)

They are consistent:
• If the 95% CI for a difference excludes 0 then p<0.05
• If the 95% CI for a difference contains 0 then p≥0.05
P-values and confidence intervals
In general:
• If the 99% CI for a difference excludes 0 then p<0.01; if it contains 0 then p≥0.01
• If the 95% CI for a difference excludes 0 then p<0.05; if it contains 0 then p≥0.05
• If the 90% CI for a difference excludes 0 then p<0.1; if it contains 0 then p≥0.1
P-values and confidence intervals
The p-value for the difference in birth weight of
children born to smokers compared with non-
smokers is 0.02
Which is the correct 95% confidence interval for the
difference in birth weight?
1. -0.70 to 0.06kg
2. -0.06 to 0.70kg
3. 0.06 to 0.70kg
P-values and confidence intervals
In a study, a group of patients took statins and another group
placebo. The mean difference in LDL cholesterol was 1 mmol/L:

The 95% CI was 0.2 to 1.8


The 99% CI was -0.1 to 2.1
Which is correct?
1. P-value is less than 0.01
2. P-value is less than 0.05 but greater than 0.01
3. P-value is greater than 0.05
T-tests and non-parametric tests
• T-tests can also be performed where two measurements are made on the same group of people – the paired t-test
  – Calculate the individual differences – is their mean different from 0?
• T-tests assume the data follow a Gaussian distribution
• If not, a non-parametric test can be performed
  – These are based on the ranks of the data rather than the actual values of the data
  – Instead of a 2-sample t-test, do a Mann-Whitney U-test (Wilcoxon rank sum test)
  – Instead of a paired t-test, do a Wilcoxon signed rank test
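A minimal sketch comparing the two unpaired tests on illustrative data, assuming scipy is available; the values are made up, with one extreme value in the second group to show where the rank-based test is preferred.

```python
# Two-sample t-test versus its rank-based alternative, the Mann-Whitney U-test.
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [4.1, 5.0, 5.2, 6.3, 6.8, 7.1, 7.9]
group_b = [5.9, 6.4, 7.2, 8.0, 8.8, 9.5, 30.0]   # contains an extreme value

t_stat, t_p = ttest_ind(group_a, group_b)
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(round(float(t_p), 3), round(float(u_p), 3))   # the rank-based p-value is
                                                    # less affected by the outlier
```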
Are alcohol drinkers more likely to smoke?
Sample of first year medical students

Observed                       Ever smoked?
                               Yes    No     Total
Drink alcohol?        Yes      36     75     111
                      No       5      47     52
                      Total    41     122    163

Expected (e.g. 111×41/163 = 28 for the first cell)
                               Ever smoked?
                               Yes    No     Total
Drink alcohol?        Yes      28     83     111
                      No       13     39     52
                      Total    41     122    163
Are alcohol drinkers more likely to smoke?
Sample of first year medical students
• The test statistic is calculated by comparing the observed counts with the expected counts
• For the first cell this is (36−28)²/28 = 2.3
• Do the same for the other three cells and add up to get 9.8, the test statistic
• Compare to the chi-squared distribution to get the p-value; p=0.002
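A short check of this calculation, assuming scipy is available; correction=False turns off the Yates continuity correction so that the result matches the hand calculation above.

```python
# Chi-squared test for the 2x2 table of drinking and ever smoking.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[36, 75],
                     [5, 47]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(float(chi2), 1), round(float(p), 3))   # about 9.8 and 0.002
print(expected.round(1))                           # about [[27.9, 83.1], [13.1, 38.9]]
```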
Chi-squared test and Fisher’s exact test
If the data are sparse (fewer than 5 in at least one of the cells of the table) then Fisher’s exact test is more appropriate than the chi-squared test

                               Ever smoked?
                               Yes    No     Total
Drink alcohol?        Yes      a      b      a+b
                      No       c      d      c+d
                      Total    a+c    b+d    a+b+c+d (=n)

The probability of observing the data given the fixed row and column totals is

  P = [(a+b)! × (c+d)! × (a+c)! × (b+d)!] / [a! × b! × c! × d! × n!]

The same calculation is done for all possible tables that can be created given the fixed row and column totals. The p-value is then the sum of all probabilities less than or equal to the probability of the data observed
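A minimal sketch of running Fisher's exact test, assuming scipy is available; the 2×2 table below is an illustrative sparse example (one cell below 5), not the data from the slides.

```python
# Fisher's exact test on a sparse 2x2 table.
from scipy.stats import fisher_exact

table = [[2, 10],
         [9, 4]]

odds_ratio, p = fisher_exact(table)   # two-sided by default
print(round(float(odds_ratio), 2), round(float(p), 3))
```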
