BCBB Workshop
Applied Statistics II-2 & III
Jingwen Gu ([Link]@[Link])
Clinical Statistician
Bioinformatics and Computational Biosciences Branch (BCBB) /OCICB /NIAID
Contact us: bioinformatics@[Link]
Outline
1. Contingency table
• Sensitivity, Specificity, Type I/II error, Positive/Negative Predictive Value
• Joint, Marginal, Conditional probability distribution
2. Strength of association
• Odds ratio
• Relative Risk
3. Test of independence
• Cases with nominal large sample size and small sample size
• Cases with stratified and paired data
• Cases with ordinal data
4. Generalized linear model
• Logistic regression model
• Loglinear model
Introduction
A categorical variable is one for which the measurement scale
consists of a set of categories.
Categorical data is the statistical data type consisting of
categorical variables, or of data that has been converted into
that form.
In categorical data analysis,
• Response or dependent variable Y: two or more categories
• Explanatory or independent variables X: discrete, continuous,
or both
Example
Y = vote in election (Democrat, Republican, Independent)
X’s - income, education, gender, race
Categorical data type
Nominal and ordinal are categorical data.
• Nominal: unordered categories
– Examples: Gender, race, hair color
– Measures: counts, frequency, mode
• Ordinal: ordered categories
– Examples: Highest education degree, levels of satisfaction
– Measures: counts, frequency, mode, median
Contingency table
Cases: a data frame where each row represents one case (e.g.
patient-level data).
Count: a data frame where each row gives the frequency of one combination
of categories.
Smoking   Lung Cancer   Count
Yes       Case          688
Yes       Control       650
No        Case          21
No        Control       59
Contingency table:
          Lung Cancer
Smoking   Case   Control
Yes       688    650
No        21     59
A table whose cells represent the IJ possible outcomes of two categorical
variables (I row categories, J column categories) is called a contingency
table when the cells contain frequency counts of outcomes for a sample.
Sensitivity and Specificity
Sensitivity of a test is the ability to identify correctly those who have the
disease and it is the proportion of patients with disease in whom the test
is positive.
Specificity of a test is the ability to identify correctly those who do not
have the disease and it is the proportion of patients without disease in
whom the test is negative.
Type I error is the rejection of a true null hypothesis, also known as a false
positive.
Type II error is the failure to reject a false null hypothesis, also known as a
false negative.
Example – Screening test
Screening
Test Result   Diseased   Not Diseased   Total
Positive      10         400            410
Negative      1          4500           4501
Total         11         4900           4911
The sensitivity of the screening test: 10/11 = 0.909 (90.9%).
The specificity of the screening test: 4500/4900 = 0.918 (91.8%).
A good screening exam has both high sensitivity and high specificity.
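The two proportions can be checked with a few lines of Python. This is a minimal sketch using the counts from the screening table; the variable names are chosen here for illustration.

```python
# Counts from the screening-test table: rows are test results,
# columns are true disease status.
true_pos, false_pos = 10, 400      # test positive
false_neg, true_neg = 1, 4500      # test negative

# Sensitivity: proportion of diseased subjects who test positive.
sensitivity = true_pos / (true_pos + false_neg)
# Specificity: proportion of non-diseased subjects who test negative.
specificity = true_neg / (true_neg + false_pos)

print(f"sensitivity = {sensitivity:.3f}")  # 10 / 11
print(f"specificity = {specificity:.3f}")  # 4500 / 4900
```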
PPV and NPV
Positive predictive value (PPV) of a test is the probability that an
individual with a positive test has the disease.
Negative predictive value (NPV) of a test is the probability that an
individual with a negative test does not have the disease.
                Diseased   Not Diseased   Total
Test Positive   100        900            1000
Test Negative   50         5000           5050
Total           150        5900           6050
Here PPV = 100/1000 = 0.10 and NPV = 5000/5050 = 0.99.
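A small sketch of the PPV and NPV calculation for the table on this slide. The function name `predictive_values` is a hypothetical helper introduced for illustration.

```python
def predictive_values(true_pos, false_pos, false_neg, true_neg):
    """Return (PPV, NPV) from the four cells of a test-by-disease table."""
    ppv = true_pos / (true_pos + false_pos)   # P(diseased | test positive)
    npv = true_neg / (true_neg + false_neg)   # P(not diseased | test negative)
    return ppv, npv

ppv, npv = predictive_values(true_pos=100, false_pos=900,
                             false_neg=50, true_neg=5000)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Unlike sensitivity and specificity, both values depend on how common the disease is in the sample (here 150 of 6050 subjects).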
Partial table
In a three-way contingency table that cross-classifies X, Y, and Z, we
control for Z by studying the XY relationship at fixed levels of Z.
• A partial table splits the original three-way table according to the
levels of Z.
• The associations in partial tables are called conditional associations.
They refer to the association between X and Y conditional on fixing Z at
some level.
• The two-way contingency table obtained by combining the partial
tables is called the XY marginal table.
Layout of the partial and marginal tables (Z = Gender, X = Smoke, Y = Case/Control):
Gender   Smoke   Case   Control
Male     Yes
         No
Female   Yes
         No
Total    Yes
         No
Probability distribution
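Joint distribution
– Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that (X, Y) occurs in the cell in row i and column j.
$\{\pi_{ij}\}$ is the joint distribution of X and Y.
Marginal distribution
– The marginal distributions are the row totals $\pi_{i+} = \sum_j \pi_{ij}$ or the column totals $\pi_{+j} = \sum_i \pi_{ij}$.
– Each marginal distribution sums to 1: $\sum_i \pi_{i+} = \sum_j \pi_{+j} = 1$.
Conditional distribution
– Given that a subject is classified in row i of X, we use $\pi_{j|i}$ to denote the
probability of classification in column j of Y, j = 1, . . . , J. Then, $\pi_{j|i} = \pi_{ij} / \pi_{i+}$ and $\sum_j \pi_{j|i} = 1$.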
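As an illustration, the sample versions of these three distributions can be computed from a table of counts. The sketch below uses the smoking and lung cancer counts from the contingency table slide; treating sample proportions as estimates of the probabilities is an assumption of the example.

```python
# Observed counts from the smoking (rows) by lung cancer (columns) table.
counts = [[688, 650],
          [21, 59]]
n = sum(sum(row) for row in counts)

# Sample joint distribution: p_ij = n_ij / n.
joint = [[cell / n for cell in row] for row in counts]

# Marginal distributions: row sums and column sums of the joint distribution.
row_marginal = [sum(row) for row in joint]
col_marginal = [sum(col) for col in zip(*joint)]

# Conditional distribution of the column variable given each row: p_ij / p_i+.
conditional = [[p / row_marginal[i] for p in row] for i, row in enumerate(joint)]

print(row_marginal, col_marginal)
print(conditional)
```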
Example - Probability distribution
A new drug is being tested on a group of 800 people (400 men and 400
women) with a particular disease. We wish to establish whether there is a
link between taking the drug and recovery from the disease.
Drug trial results:
                Drug taken
Recovered       Yes    No
Yes             200    160
No              200    240
Recovery rate   50%    40%
From the pooled table we would conclude that the drug has a positive effect.
But if the results are broken down by gender…
Cont
Gender          Male          Female
Drug taken      Yes    No     Yes    No
Recovered
Yes             180    70     20     90
No              120    30     80     210
Recovery rate   60%    70%    20%    30%
For both males and females, the recovery rate is better without the drug. Gender
influences drug taking, because men in this study were much more likely than women
to have been given the drug.
The result that a marginal association can have a different direction from each
conditional association is called Simpson's paradox.
Can it be avoided? Yes, if we are certain that we know every possible variable that
can affect the outcome variable. If we are not certain (and in general we simply
cannot be), then Simpson’s paradox is theoretically unavoidable.
Reference: Simpson’s Paradox and the implications for medical trials
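The reversal can be verified numerically. This is a sketch only: the female "no drug" counts are taken as 90 recovered and 210 not recovered, which is what the stated 30% recovery rate and the pooled totals (160 and 240) imply.

```python
# (recovered, not recovered) counts by gender and treatment arm.
# Female "no drug" counts are taken as 90 recovered / 210 not recovered,
# the values implied by the 30% recovery rate and the pooled totals.
counts = {
    "male":   {"drug": (180, 120), "no drug": (70, 30)},
    "female": {"drug": (20, 80),   "no drug": (90, 210)},
}

def recovery_rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)

# Conditional (within-gender) recovery rates.
conditional = {
    gender: {arm: recovery_rate(*cells) for arm, cells in arms.items()}
    for gender, arms in counts.items()
}

# Marginal (pooled over gender) recovery rates.
marginal = {}
for arm in ("drug", "no drug"):
    recovered = sum(counts[g][arm][0] for g in counts)
    not_recovered = sum(counts[g][arm][1] for g in counts)
    marginal[arm] = recovery_rate(recovered, not_recovered)

print(conditional)
print(marginal)
```

Within each gender the drug arm has the lower recovery rate, while the pooled rates point the other way.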
Confounding
A confounding variable is a variable that influences both
the dependent and the independent variable, causing a
spurious association.
To reduce the effects of a confounding variable:
• In experimental studies: randomly assign subjects to
different levels.
• In observational studies: control for confounding variables that
can influence the relationship.
• Statistical control: collect data on the potential
confounders and include them as variables in your model.
Strength of association
To measure the strength of association, use measures such as:
• Odds ratio
• Relative risk
These are also measures of risk, so they are useful in safety
and efficacy studies.
These measures are informative when confounding variables are controlled.
Odds ratio
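Odds is the probability of an outcome divided by the probability of not
having that outcome. If $\pi$ is the probability of the outcome, the odds equal $\pi / (1 - \pi)$.
The odds of the outcome when the exposure is present: $\text{odds}_1 = \pi_1 / (1 - \pi_1)$
Similarly, the odds of the outcome when the exposure is absent: $\text{odds}_2 = \pi_2 / (1 - \pi_2)$
The odds ratio is the ratio of the odds of the two groups:
$\theta = \frac{\pi_1 / (1 - \pi_1)}{\pi_2 / (1 - \pi_2)}$, estimated from a 2×2 table of counts by $\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
The asymptotic standard error of $\log \hat\theta$: $SE(\log \hat\theta) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$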
Confidence intervals
The general form of a confidence interval is: estimate ± critical value × standard error.
The confidence level of a confidence interval is the probability that
the interval-building procedure produces an interval containing the true parameter.
• Usually we use a 95% confidence interval, $\alpha = 0.05$. In this lecture, the critical value is
usually $z_{\alpha/2}$. For a 95% confidence interval, $z_{0.025}$ = 1.96.
• Sometimes a confidence interval for $\theta$ can be obtained indirectly: we
first calculate a confidence interval $(L, U)$ for $\log \theta$, and then a confidence
interval for $\theta$ is obtained as $(e^L, e^U)$.
Example – Odds ratio
In a study, patients admitted with lung cancer in the preceding year were queried
about their smoking behavior.
For each of the 709 patients admitted, the researchers recorded the smoking behavior of a
non-cancer patient at the same hospital, of the same gender and within the same
5-year age group (a smoker was defined as a person who had smoked at least one
cigarette a day for at least a year).
Evaluate the strength of association by calculating the odds ratio and its 95% confidence
interval. Is the odds ratio significantly different from 1?
Cont
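From the example, $\hat\theta = \frac{688 \times 59}{650 \times 21} = 2.97$
The odds of lung cancer for smokers are about three times the odds for
non-smokers.
The asymptotic standard error of the log odds ratio is:
$SE(\log \hat\theta) = \sqrt{\frac{1}{688} + \frac{1}{650} + \frac{1}{21} + \frac{1}{59}} = 0.26$
95% CI for $\log \theta$ is $\log(2.97) \pm 1.96 \times 0.26 = (0.58, 1.60)$
95% CI for $\theta$ is $(e^{0.58}, e^{1.60}) = (1.79, 4.95)$
Since the 95% confidence interval does not include 1, the odds ratio is
significantly different from 1. The odds of lung cancer in the smoking group are
between 1.79 and 4.95 times the odds in the non-smoking group.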
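The odds ratio, its standard error on the log scale, and the back-transformed interval can be reproduced as follows. This is a minimal sketch with the counts from the smoking example; the variable names are illustrative.

```python
import math

# 2x2 counts: smokers (cases, controls) and non-smokers (cases, controls).
n11, n12 = 688, 650
n21, n22 = 21, 59

odds_ratio = (n11 * n22) / (n12 * n21)

# Asymptotic standard error of the log odds ratio.
se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

# 95% Wald interval on the log scale, then exponentiate the endpoints.
z = 1.96
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```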
Relative risk
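Relative risk is the ratio of the probability of an outcome in an
exposed group to the probability of the outcome in an unexposed
group.
Relative risk equals: $RR = \pi_1 / \pi_2$, estimated by $r = p_1 / p_2$ with $p_1 = n_{11} / n_{1+}$ and $p_2 = n_{21} / n_{2+}$
An estimated standard error for $\log r$: $SE(\log r) = \sqrt{\frac{1 - p_1}{n_{1+} p_1} + \frac{1 - p_2}{n_{2+} p_2}}$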
Example – Relative risk
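In the smoking and lung cancer example, the proportions having lung cancer were $p_1 = 688/1338 = 0.514$ for smokers and
$p_2 = 21/80 = 0.263$ for non-smokers.
The sample relative risk is $r = \frac{688/1338}{21/80} = 1.96$
Participants who smoke are 1.96 times as likely to develop lung cancer as non-smokers.
An estimated standard error for $\log r$ is $\sqrt{\frac{1}{688} - \frac{1}{1338} + \frac{1}{21} - \frac{1}{80}} = 0.189$
95% confidence interval of $\log RR$ is $0.672 \pm 1.96 \times 0.189 = (0.301, 1.043)$
95% confidence interval of $RR$ is $(e^{0.301}, e^{1.043}) = (1.35, 2.84)$.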
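A matching sketch for the relative risk, using the same smoking counts and the standard error formula for the log of the sample relative risk. Variable names are illustrative.

```python
import math

# Cases and row totals for smokers and non-smokers.
cases_exposed, total_exposed = 688, 1338
cases_unexposed, total_unexposed = 21, 80

p1 = cases_exposed / total_exposed
p2 = cases_unexposed / total_unexposed
relative_risk = p1 / p2

# Standard error of log(relative risk).
se_log_rr = math.sqrt((1 - p1) / (total_exposed * p1)
                      + (1 - p2) / (total_unexposed * p2))

# 95% interval on the log scale, then exponentiate the endpoints.
z = 1.96
ci_low = math.exp(math.log(relative_risk) - z * se_log_rr)
ci_high = math.exp(math.log(relative_risk) + z * se_log_rr)

print(f"RR = {relative_risk:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```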
Similarity between OR and RR
The relationship between odds ratio and relative risk: $OR = RR \times \frac{1 - \pi_2}{1 - \pi_1}$
When $\pi_1$ and $\pi_2$ are small, the odds ratio approximately equals the relative risk.
OR and RR are useful for prospective study designs.
When dealing with small probabilities, RR is easier to interpret.
Test of Independence
Cases covered: large sample size, small sample size, ordinal data, stratified data, paired data.
Large sample size:
          Lung Cancer
Smoker    Case   Control   Total
Yes       688    650       1338
No        21     59        80
Total     709    709       1418
Stratified data:
            CVD   Non-CVD   Total
Obese       10    90        100
Not Obese   35    465       500
Total       45    555       600
            CVD   Non-CVD   Total
Obese       36    164       200
Not Obese   25    175       200
Total       61    339       400
Test of Independence
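Pearson’s chi-square test and the likelihood ratio test are used for testing
independence by evaluating the closeness between observed and expected
frequencies.
– Assumption: large samples and independence of individual observations.
– $H_0$: The two variables are independent. $H_a$: They are not independent.
– $\mu_{ij}$ is the expected frequency, $n_{ij}$ is the observed frequency.
Pearson chi-square statistic: $X^2 = \sum_{i,j} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$, where $\hat\mu_{ij} = \frac{n_{i+} n_{+j}}{n}$
Likelihood ratio statistic: $G^2 = 2 \sum_{i,j} n_{ij} \log\frac{n_{ij}}{\hat\mu_{ij}}$
Degrees of freedom: $df = (I - 1)(J - 1)$
Under $H_0$, $X^2$ and $G^2$ approximately follow $\chi^2_{df}$. The larger the values of $X^2$ and $G^2$, the more evidence exists
against independence. If the p-value is less than the significance level, reject the null
hypothesis.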
Example – Chi-square test
In the example of the case-control study of lung cancer and smoking:
Is there a significant association between smoking and lung cancer?
$H_0$: Smoking and lung cancer are independent. $H_a$: They are not independent.
Assume significance level $\alpha = 0.05$.
Observed table
          Lung Cancer
Smoker    Case   Control   Total
Yes       688    650       1338
No        21     59        80
Total     709    709       1418
Expected table (each cell is row total × column total / n, e.g. 1338 × 709 / 1418 = 669)
          Lung Cancer
Smoker    Case   Control
Yes       669    669
No        40     40
Total     709    709
Cont – Pearson Chi-square test
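The Pearson statistic equals $X^2 = \frac{(688 - 669)^2}{669} + \frac{(650 - 669)^2}{669} + \frac{(21 - 40)^2}{40} + \frac{(59 - 40)^2}{40} = 19.13$
Degrees of freedom: $df = (2 - 1)(2 - 1) = 1$
The p-value is about $1.2 \times 10^{-5}$, less than $\alpha = 0.05$, so we reject the null hypothesis. (calculated in R)
At the 0.05 significance level, the data give evidence that smoking
and lung cancer are not independent.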
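The slide computes the p-value in R; the sketch below is an equivalent calculation in plain Python, offered as an illustration rather than the workshop's own code. It relies on the fact that, for one degree of freedom, the chi-squared upper tail equals `erfc(sqrt(x / 2))`.

```python
import math

observed = [[688, 650],
            [21, 59]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-squared statistic.
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability; this closed form is valid for df = 1 only.
p_value = math.erfc(math.sqrt(x2 / 2))

print(f"X^2 = {x2:.2f}, df = {df}, p = {p_value:.2e}")
```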
Cont – Likelihood ratio test
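The likelihood ratio statistic equals
$G^2 = 2\left[688 \log\frac{688}{669} + 650 \log\frac{650}{669} + 21 \log\frac{21}{40} + 59 \log\frac{59}{40}\right] = 19.88$
Degrees of freedom: $df = 1$
The p-value is about $8 \times 10^{-6}$, less than $\alpha = 0.05$, so we reject the null hypothesis.
At the 0.05 significance level, the data give evidence that
smoking and lung cancer are not independent.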
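A corresponding sketch for the likelihood ratio statistic, using the observed and expected counts from the lung cancer example. As before, the `erfc` shortcut for the p-value assumes one degree of freedom.

```python
import math

observed = [688, 650, 21, 59]
expected = [669, 669, 40, 40]   # row total * column total / n for each cell

# Likelihood ratio statistic: G^2 = 2 * sum(observed * log(observed / expected)).
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

# Upper tail of a chi-squared distribution with df = 1.
p_value = math.erfc(math.sqrt(g2 / 2))

print(f"G^2 = {g2:.2f}, p = {p_value:.2e}")
```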
Properties
$X^2$ and $G^2$ have the same limiting chi-squared distribution
(asymptotically equivalent).
Under $H_0$, $X^2 - G^2$ converges in probability to zero.
The order of the rows or columns does not change the
result of a chi-square or likelihood ratio test of independence.
Small sample size
Fisher’s exact test can be used to test independence when $n$ is small.
– Assumption: independence of individual observations and fixed totals (the
row and column totals are fixed, or “conditioned”). When the row or column totals are
not fixed by design, the test is less powerful.
Under $H_0$, the exact probability assigned to each possible outcome is given by the
hypergeometric probability mass function:
$P(n_{11} = t) = \frac{\binom{n_{1+}}{t} \binom{n_{2+}}{n_{+1} - t}}{\binom{n}{n_{+1}}}$
Calculate the p-value as the total probability of the observed table
and all more extreme tables. Reject the null hypothesis if the p-value is less than the
significance level.
Example – Fisher’s test for small size data
Example: Lady Tasting Tea
R. A. Fisher described the following experiment from his
days working at Rothamsted Experimental Station. His
colleague, Dr. Muriel Bristol, declared that by tasting a
cup of tea made with milk she could discriminate whether
the milk or the tea infusion was added to the cup first.
Experiment design: eight cups of tea, four with milk
poured first and four with tea poured first, served in a
random order. She knew there were four cups of each
type and had to predict which four had the milk added
first.
Cont – Fisher’s test for small size data
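Distinguishing the order of pouring better than with pure guessing corresponds to $\theta > 1$,
reflecting a positive association between order of pouring and the prediction.
$H_0: \theta = 1$ against $H_a: \theta > 1$
The five possible 2×2 tables, from all incorrect to all correct:
All incorrect               Even                All correct
0 4        1 3        2 2        3 1        4 0
4 0        3 1        2 2        1 3        0 4
With three of the four milk-first cups identified correctly, the probability of the observed table is $P(3) = \binom{4}{3}\binom{4}{1} / \binom{8}{4} = 16/70 = 0.229$. The only more extreme table in the direction of $H_a$ has all four correct, with $P(4) = 1/70 = 0.014$.
The P-value is $p = P(3) + P(4) = 0.243$.
We cannot reject $H_0$ at significance level 0.05, which means that the result does not establish
an association between the actual order of pouring and her predictions.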
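The exact probabilities for the tea-tasting design follow from the hypergeometric distribution. The sketch below assumes that three of the four milk-first cups were identified correctly, which is consistent with the slide's conclusion that the null hypothesis cannot be rejected; `hypergeom_pmf` is a hypothetical helper name.

```python
from math import comb

def hypergeom_pmf(t, row1_total, row2_total, col1_total):
    """P(n11 = t) for a 2x2 table with fixed row and column totals."""
    n = row1_total + row2_total
    return (comb(row1_total, t) * comb(row2_total, col1_total - t)
            / comb(n, col1_total))

# Four cups of each type, and four cups guessed as "milk first".
pmf = {t: hypergeom_pmf(t, 4, 4, 4) for t in range(5)}

# Assumed observed outcome: 3 of the 4 milk-first cups guessed correctly.
observed_correct = 3

# One-sided p-value: the observed table and every more extreme one.
p_value = sum(pmf[t] for t in range(observed_correct, 5))

print(pmf)
print(f"p-value = {p_value:.3f}")
```

With all four cups correct the p-value would be 1/70, which is below 0.05.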
Stratified data
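The Cochran-Mantel-Haenszel (CMH) test is used for the analysis of stratified or matched
categorical data.
– Often used in observational studies where random assignment of subjects to different
treatments cannot be controlled, but confounding covariates can be measured.
– $H_0$: There is no association between the two inner variables.
According to the stratification, create 2×2 contingency tables. Assume there
are $K$ tables; the $k$-th table is:
        Y1          Y2          Total
X1      $n_{11k}$   $n_{12k}$   $n_{1+k}$
X2      $n_{21k}$   $n_{22k}$   $n_{2+k}$
Total   $n_{+1k}$   $n_{+2k}$   $n_{++k}$
The test statistic can be calculated by:
$CMH = \frac{\left[\sum_k (n_{11k} - \mu_{11k})\right]^2}{\sum_k \mathrm{Var}(n_{11k})}$, where $\mu_{11k} = \frac{n_{1+k} n_{+1k}}{n_{++k}}$ and $\mathrm{Var}(n_{11k}) = \frac{n_{1+k} n_{2+k} n_{+1k} n_{+2k}}{n_{++k}^2 (n_{++k} - 1)}$
It follows a $\chi^2$ distribution asymptotically with one degree of freedom under $H_0$.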
Example – CHM test for stratified data
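To examine the association between obesity and cardiovascular disease (CVD),
the data are stratified into two age groups, split at age 50:
            CVD   Non-CVD   Total                 CVD   Non-CVD   Total
Obese       10    90        100       Obese       36    164       200
Not Obese   35    465       500       Not Obese   25    175       200
Total       45    555       600       Total       61    339       400
$H_0$: There is no association between obesity and CVD.
The observed minus expected counts of obese subjects with CVD sum to $(10 - 7.5) + (36 - 30.5) = 8$, and their variances sum to $5.79 + 12.96 = 18.75$. With a continuity correction of 0.5, $CMH = \frac{(|8| - 0.5)^2}{18.75} = 3.00$
p-value = 0.08, so we fail to reject the null hypothesis that there is no association between
obesity and CVD.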
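The following sketch computes the CMH statistic for the two age strata. The continuity correction of 0.5 is an assumption made here because it reproduces the reported p-value of 0.08; without it the statistic is about 3.41. The function name `cmh_test` is hypothetical.

```python
import math

# Each stratum is ((obese CVD, obese non-CVD), (not obese CVD, not obese non-CVD)).
strata = [
    ((10, 90), (35, 465)),
    ((36, 164), (25, 175)),
]

def cmh_test(strata, correct=True):
    """Cochran-Mantel-Haenszel statistic and p-value for a set of 2x2 tables."""
    diff_sum = 0.0
    var_sum = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n                       # observed - expected
        var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    # Optional continuity correction (an assumption of this sketch).
    numerator = abs(diff_sum) - (0.5 if correct else 0.0)
    stat = numerator ** 2 / var_sum
    p_value = math.erfc(math.sqrt(stat / 2))                  # chi-squared, df = 1
    return stat, p_value

stat, p_value = cmh_test(strata)
stat_uncorrected, _ = cmh_test(strata, correct=False)
print(f"CMH = {stat:.2f}, p = {p_value:.3f}")
```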
Paired data
McNemar’s test is used for comparing categorical responses for two
samples that are statistically dependent.
Such data commonly occur in studies with repeated measurement of subjects.
McNemar statistic with continuity correction:
$\chi^2 = \frac{(|n_{12} - n_{21}| - 1)^2}{n_{12} + n_{21}}$
where $n_{12}$ and $n_{21}$ are the two discordant (off-diagonal) counts of the paired 2×2 table.
For large samples, $\chi^2$ has a chi-squared distribution with $df = 1$. If the
p-value is less than the significance level, reject the null hypothesis that the
paired responses have the same distribution (marginal homogeneity).
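A minimal sketch of the continuity-corrected McNemar statistic. The discordant counts used here are hypothetical and chosen only to illustrate the calculation; they are not taken from any study in these slides.

```python
import math

def mcnemar(n12, n21):
    """McNemar chi-squared statistic with continuity correction and its p-value."""
    stat = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)
    # Upper tail of a chi-squared distribution with df = 1.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(n12=25, n21=10)
print(f"McNemar chi-squared = {stat:.2f}, p = {p_value:.3f}")
```

Only the discordant pairs enter the statistic; pairs with the same response on both occasions carry no information about a shift.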
Example – McNemar’s test for matched pair
data
In the 2010 General Social Survey, subjects were asked whether they voted for
the Democrat or the Republican in the 2004 and 2008 Presidential elections. Was
there a shift in one direction between the two elections?
The McNemar statistic, with $df = 1$, has a very small p-value: extremely strong evidence of a
shift in the Democrat direction.
Ordinal data
The chi-square tests ignore some information when used to test
independence between ordinal classifications. Tests that take the ordering
into account are usually more powerful.
In ordinal data analysis, we can assign scores to the levels of
ordinal variables by using:
• Average of the category interval
• Midrank
Then the linear trend test statistic can be used:
$M^2 = (n - 1) r^2$
where $r$ is the Pearson correlation between the two variables.
For large samples, it is approximately chi-squared with df = 1.
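The linear trend statistic can be sketched as below. The table of counts and the scores are hypothetical, used only to show the mechanics; `linear_trend` is an illustrative helper, not a library function.

```python
import math

def linear_trend(table, row_scores, col_scores):
    """Return (r, M^2, p-value) for a two-way table with assigned scores."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    mean_u = sum(u * t for u, t in zip(row_scores, row_totals)) / n
    mean_v = sum(v * t for v, t in zip(col_scores, col_totals)) / n
    # Weighted sums of cross-products and squares around the means.
    cov = sum(count * (u - mean_u) * (v - mean_v)
              for u, row in zip(row_scores, table)
              for v, count in zip(col_scores, row))
    var_u = sum(t * (u - mean_u) ** 2 for u, t in zip(row_scores, row_totals))
    var_v = sum(t * (v - mean_v) ** 2 for v, t in zip(col_scores, col_totals))
    r = cov / math.sqrt(var_u * var_v)
    m2 = (n - 1) * r ** 2
    p_value = math.erfc(math.sqrt(m2 / 2))   # chi-squared upper tail, df = 1
    return r, m2, p_value

# Hypothetical 2x3 table of counts, for illustration only.
table = [[20, 15, 10],
         [10, 15, 20]]
r, m2, p_value = linear_trend(table, row_scores=[0, 1], col_scores=[1, 2, 3])
# Scores that are a linear transform of the first set give the same M^2.
_, m2_alt, _ = linear_trend(table, row_scores=[0, 1], col_scores=[0, 2, 4])
print(f"r = {r:.3f}, M^2 = {m2:.2f}, p = {p_value:.3f}")
```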
Example – Linear trend test for ordinal data
491 subjects are cross-classified according to three
factors: hypertension (hyp; 2 levels), obesity (obe; 3 levels)
and alcohol (alc; 4 levels).
• Alc: the classification of alcohol intake in drinks per day
(0, 1-2, 3-5, 6+)
• Obe: the classification of obesity (low, average, high)
• Hyp: the classification of hypertension (yes, no)
Objective: test whether two ordinal variables are correlated.
$H_0: \rho = 0$ versus $H_a: \rho \neq 0$
• Use the linear trend test to test for independence between
two variables.
• Check the p-value of the statistic $M^2$ with 1 degree of freedom.
Source: Knuiman, M.W. & Speed, T.P. (1988) Incorporating
Prior Information into the Analysis of Contingency
Tables. Biometrics, 44 (4), 1061–1071.
Cont
Assign scores to the levels of the ordinal variables
• Average of the category interval
– Obesity: low, average, high are assigned 1, 2, 3
– High BP: no, yes are assigned 0, 1
– Alcohol: 0, 1-2, 3-5, 6+ are assigned 0, 1.5, 4, 7
(The graphic on the slide shows the recoded data.)
• Midrank
Rank the observations and apply the midranks as scores.
– Obesity: 83 for low (average of ranks 1-165), 246 for average (ranks 166-326),
409 for high (ranks 327-491)
Create the contingency table
(rows for obesity and columns for alcohol)
Compute the Pearson correlation $r$ between obesity and alcohol, then $M^2 = (n - 1) r^2$ and its p-value with $df = 1$.
Compare the p-value with the significance level and draw a conclusion.
At the 0.05 significance level, we conclude that there is an association
between obesity and alcohol.
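Midrank scores follow directly from the category counts. The sketch below derives the counts (165, 161, 165) from the rank ranges quoted on the slide; `midranks` is a hypothetical helper name.

```python
def midranks(category_counts):
    """Return the average rank of the observations in each ordered category."""
    scores = []
    first_rank = 1
    for count in category_counts:
        last_rank = first_rank + count - 1
        scores.append((first_rank + last_rank) / 2)
        first_rank = last_rank + 1
    return scores

# Obesity counts implied by the rank ranges 1-165, 166-326, 327-491.
obesity_counts = [165, 161, 165]
print(midranks(obesity_counts))  # [83.0, 246.0, 409.0]
```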
Compare result with other statistic
p-values for each pair of variables (Pearson $X^2$, likelihood ratio $G^2$, linear trend $M^2$):
                      $X^2$    $G^2$    $M^2$
Obesity vs Alcohol    0.325    0.317    0.022
High BP vs Alcohol    0.026    0.022    0.003
Obesity vs High BP    0.003    0.003    0.001
Sensitivity to choice of scores
• Scores that are linear transforms of each other, such as (1, 2,
3, 4) and (0, 2, 4, 6), have the same absolute correlation and
hence the same $M^2$.
• Results may depend on the scores when the data are highly
unbalanced.
Summary of Statistical Testing Applied
to Specific Cases
Large and independent data    Pearson chi-square test / Likelihood ratio test
Small and conditioned data    Fisher’s exact test
Stratified data               Cochran-Mantel-Haenszel (CMH) test
Paired data                   McNemar’s test / CMH test
Ordinal data                  Linear trend test
** Many more statistical tests than those listed here can be applied to categorical data. Compare them and select the one that best fits
your case, and check its assumptions before applying it.
Bernoulli and Binomial distribution
Bernoulli trial: two possible outcomes for one trial (success, failure).
Let $Y = 1$ denote a success, with probability $P(Y = 1) = \pi$. Let $Y = 0$ denote a failure, with probability
$P(Y = 0) = 1 - \pi$.
Bernoulli probability mass function (PMF): $P(Y = y) = \pi^y (1 - \pi)^{1 - y}$, $y = 0, 1$
The binomial distribution with parameters $n$ and $\pi$ is the discrete probability distribution of the number
of successes in a sequence of n independent experiments.
• Binomial: n Bernoulli trials, with two possible outcomes for each trial.
• Y follows a binomial distribution, $Y \sim \text{Bin}(n, \pi)$.
Binomial PMF: $P(Y = y) = \binom{n}{y} \pi^y (1 - \pi)^{n - y}$, $y = 0, 1, \ldots, n$
It represents the probability of having y successes in n independent trials.
Example – Binomial distribution
Three students are registered for a class. Each student decides independently
whether to attend. Assume the probability of attending class is
0.5. What is the probability that none, one, two, or three students attend the class?
From the example, $\pi = 0.5$. For sample size $n = 3$, let $Y$ be the number of students who attend. The probability mass function is:
$P(Y = y) = \binom{3}{y} 0.5^y (1 - 0.5)^{3 - y}$, $y = 0, 1, 2, 3$
The probability that no student attends class is: $P(Y = 0) = 0.5^3 = 0.125$
The probability that 1, 2, or 3 students attend is: $P(Y = 1) = 0.375$, $P(Y = 2) = 0.375$, $P(Y = 3) = 0.125$
Since these are all the possible outcomes, the probabilities of none, 1, 2, and 3 students attending sum to 1.
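The four probabilities in the class-attendance example can be reproduced with a short function. This is a minimal sketch; `binomial_pmf` is an illustrative name.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(Y = y) for Y ~ Bin(n, pi)."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)

# Three students, each attending independently with probability 0.5.
probabilities = [binomial_pmf(y, n=3, pi=0.5) for y in range(4)]
print(probabilities)        # [0.125, 0.375, 0.375, 0.125]
print(sum(probabilities))   # 1.0
```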
Properties – Binomial distribution
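• $Y \sim \text{Bin}(n, \pi)$ has mean $E(Y) = n\pi$ and variance $\mathrm{Var}(Y) = n\pi(1 - \pi)$.
• The sample proportion of successes $p = Y / n$, also denoted $\hat\pi$, has mean $\pi$ and standard
deviation $\sqrt{\pi(1 - \pi) / n}$.
• If each trial has more than two possible outcomes, say $c$ categories, the
counts $(n_1, \ldots, n_c)$ follow a multinomial distribution. The multinomial probability
mass function is: $P(n_1, \ldots, n_c) = \frac{n!}{n_1! \cdots n_c!} \pi_1^{n_1} \cdots \pi_c^{n_c}$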
Poisson distribution
The Poisson distribution expresses the probability of a number of events that
occur randomly over a fixed interval of time or space, when outcomes in
disjoint periods or regions are independent.
Examples: the number of emails you get in an hour; the number of red cars
that pass by in an hour; the number of earthquakes in a year in some region.
Poisson probabilities depend on a single parameter, the mean $\mu$.
Probability mass function: $P(Y = y) = \frac{e^{-\mu} \mu^y}{y!}$, $y = 0, 1, 2, \ldots$
Where
$\mu$ = mean number of occurrences in the given interval or space
$e$ = Euler’s number, approximately 2.71828
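A short sketch of the Poisson probability mass function, with a numerical check that the mean and the variance both equal the parameter. The value `mu = 2.0` is an arbitrary choice for illustration, and the sums are truncated at 50 terms, where the remaining tail is negligible for this mean.

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson distribution with mean mu."""
    return exp(-mu) * mu ** y / factorial(y)

mu = 2.0  # arbitrary illustrative mean
probabilities = [poisson_pmf(y, mu) for y in range(50)]

total = sum(probabilities)
mean = sum(y * p for y, p in enumerate(probabilities))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probabilities))

print(f"total = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```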
Cont
[Figure omitted. Source: Poisson distribution]
Property: the mean equals the variance, $E(Y) = \mathrm{Var}(Y) = \mu$.
Limitation
Overdispersion: count observations often exhibit variability exceeding that
predicted by the binomial or Poisson.
When the mean $\mu$ varies across conditions, the event counts display more
variation.
The negative binomial is a related distribution for count data that has a
second parameter and permits the variance to exceed the mean:
$E(Y) = \mu$, $\mathrm{Var}(Y) = \mu + D\mu^2$
where $\mu$ and $D$ are parameters.
Or use other methods, such as generalized linear models…
Generalized Linear Models
Generalized linear models (GLMs) extend ordinary regression models to
encompass non-normal response distributions and modeling functions of
the mean.
• Random component
– Consists of a response variable Y with independent observations
from a distribution in the natural exponential family.
• Systematic component
– Specifies the explanatory variables used in a linear predictor function.
• Link function
– A function of the mean $\mu = E(Y)$ that equals a linear function of the explanatory
variables: $g(\mu) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$
Introduction to Logistic Regression (LR)
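The logistic model describes data and explains the relationship between one
binary dependent variable and one or more categorical or continuous independent
variables.
Suppose we have a binary response variable, denoted $Y$, and a single explanatory
variable $x$. The distribution of $Y$ is Bernoulli with success probability $\pi(x) = P(Y = 1 \mid x)$.
Logistic regression model: $\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$
As $x$ increases, $\pi(x)$ increases when $\beta > 0$ and decreases when $\beta < 0$.
The log odds has the linear relationship: $\text{logit}[\pi(x)] = \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$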
Coefficient trend
A fixed change in $x$ often has less impact when $\pi(x)$ is near 0
or 1 than when $\pi(x)$ is near 0.5.
Multiplicative effect
The odds multiply by $e^\beta$ for every 1-unit increase in $x$.
In other words, $e^\beta$ is an odds ratio: the
odds at $x + 1$ divided by the odds at $x$.
Example – logistic model
An epidemiological survey investigated snoring as a risk factor
for heart disease. The scores (0, 2, 4, 5) are used for the levels of snoring,
treating the last two levels as closer together. Build a logistic model to describe the
relationship.
                               Heart Disease
Snoring              Score     Yes    No
Never                0         24     1355
Occasionally         2         35     603
Nearly every night   4         21     192
Every night          5         30     224
Cont
Software reports the logistic regression ML fit $\text{logit}[\hat\pi(x)] = \hat\alpha + \hat\beta x$.
Logistic model interpretation
Interpret:
• The positive $\hat\beta$ reflects the increased incidence of heart disease
at higher snoring levels.
• The estimated probability of heart disease is about 0.02 for
non-snorers (calculated using $\hat\pi(x) = \frac{\exp(\hat\alpha + \hat\beta x)}{1 + \exp(\hat\alpha + \hat\beta x)}$ at $x = 0$); it increases to 0.04 for
occasional snorers, to 0.09 for those who snore nearly every
night, and to 0.13 for those who always snore.
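The ML fit for the snoring data can be obtained with a few Newton-Raphson iterations on the grouped binomial log-likelihood. This is an illustrative sketch in plain Python, not the software used for the slides; `fit_logistic` is a hypothetical helper, and starting from zero with a fixed number of iterations is an assumption that works for these data.

```python
import math

scores = [0, 2, 4, 5]                 # snoring scores
disease_yes = [24, 35, 21, 30]
disease_no = [1355, 603, 192, 224]
totals = [y + n for y, n in zip(disease_yes, disease_no)]

def fit_logistic(x, successes, totals, iterations=25):
    """Fit logit(pi) = alpha + beta * x by Newton-Raphson on grouped data."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, totals):
            p = 1 / (1 + math.exp(-(alpha + beta * xi)))
            w = ni * p * (1 - p)
            g0 += yi - ni * p               # score for alpha
            g1 += xi * (yi - ni * p)        # score for beta
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 information system in closed form.
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

alpha_hat, beta_hat = fit_logistic(scores, disease_yes, totals)
fitted = [1 / (1 + math.exp(-(alpha_hat + beta_hat * x))) for x in scores]
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
print([round(p, 2) for p in fitted])
```

The fitted probabilities at scores 0, 2, 4 and 5 correspond to the values quoted in the interpretation.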
Example: Multiple logistic regression
The following data are from a study on the effects of AZT in slowing
the development of AIDS symptoms. In the study, 338 veterans
whose immune systems were beginning to falter after infection with
HIV were randomly assigned either to receive AZT immediately or to
wait until their T cells showed severe immune weakness.
Symptoms
Race AZT Use Yes No
White Yes 14 93
No 32 81
Black Yes 11 52
No 12 43
Cont
Logistic regression for a binary response: $\text{logit}[P(Y = 1)] = \alpha + \beta_1 x + \beta_2 z$
where x represents AZT treatment and z represents race, predicting the
probability that AIDS symptoms develop.
In this model, we assume there is no interaction between race (z)
and AZT treatment (x): the effect of one factor is the same at each
level of the other factor.
$e^{\beta_i}$ is the multiplicative effect on the odds of a 1-unit increase in the $i$-th predictor, when
the levels of the other predictors are kept fixed.
Cont
$z$ = 1 for white race, 0 for black race
$x$ = 1 for immediate AZT use, otherwise 0
Multiple logistic model Interpretation
Parameter interpretation
x   z   logit
1   1   $\alpha + \beta_1 + \beta_2$
1   0   $\alpha + \beta_1$
0   1   $\alpha + \beta_2$
0   0   $\alpha$
• $\alpha$ is the log odds of developing AIDS symptoms for black
subjects without immediate AZT use.
• $\beta_1$ is the increment to the log odds for those with immediate AZT
use.
• $\beta_2$ is the increment to the log odds for white subjects.
Cont
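At a fixed level of z, the effect on the logit of changing
categories of x is: $[\alpha + \beta_1(1) + \beta_2 z] - [\alpha + \beta_1(0) + \beta_2 z] = \beta_1$
The estimated odds ratio between immediate AZT use and
development of AIDS symptoms equals $\exp(\hat\beta_1)$, about 0.5.
For each race, the estimated odds of symptoms are half as
high for those who took AZT immediately.
The Wald confidence interval for this effect is $\exp[\hat\beta_1 \pm z_{\alpha/2}\, SE(\hat\beta_1)]$.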
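The estimated AZT odds ratio and its Wald interval can be reproduced by fitting the main-effects model to the four rows of the AZT table. The sketch below uses Newton-Raphson with a small Gauss-Jordan solver written for this example; both helpers (`solve`, `fit`) are hypothetical names, and this is an illustration rather than the software used for the slides.

```python
import math

# (x = immediate AZT, z = white race, symptoms yes, symptoms no)
data = [(1, 1, 14, 93), (0, 1, 32, 81), (1, 0, 11, 52), (0, 0, 12, 43)]

def solve(a, b):
    """Solve a small square linear system by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def fit(data, iterations=25):
    """Newton-Raphson for logit(pi) = alpha + beta1 * x + beta2 * z."""
    coef = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for x, z, yes, no in data:
            design = (1.0, x, z)
            n = yes + no
            eta = sum(c * d for c, d in zip(coef, design))
            p = 1 / (1 + math.exp(-eta))
            for i in range(3):
                grad[i] += design[i] * (yes - n * p)
                for j in range(3):
                    info[i][j] += design[i] * design[j] * n * p * (1 - p)
        coef = [c + s for c, s in zip(coef, solve(info, grad))]
    return coef, info

coef, info = fit(data)
beta1 = coef[1]
# Variance of beta1 is the matching diagonal entry of the inverse information.
se_beta1 = math.sqrt(solve(info, [0.0, 1.0, 0.0])[1])
odds_ratio = math.exp(beta1)
ci_low = math.exp(beta1 - 1.96 * se_beta1)
ci_high = math.exp(beta1 + 1.96 * se_beta1)
print(f"OR = {odds_ratio:.2f}, 95% Wald CI = ({ci_low:.2f}, {ci_high:.2f})")
```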
Loglinear model
Poisson GLMs are used to model count or rate data for a single
nonnegative integer-valued response variable.
The Poisson loglinear GLM assumes a Poisson distribution for $Y$ and uses the
log link.
The Poisson loglinear model with explanatory variable $x$ is $\log \mu = \alpha + \beta x$
The mean satisfies the exponential relationship $\mu = \exp(\alpha + \beta x) = e^\alpha (e^\beta)^x$
A 1-unit increase in $x$ has a multiplicative impact of $e^\beta$: the mean at $x + 1$ equals
the mean at $x$ times $e^\beta$.
Example – Loglinear model
In this example, the response outcome for each of 173 female crabs is her
number of satellites. Explanatory variables are the female crab's color, spine
condition, weight, and carapace width. Table below shows a small set of the
data.
Create a model to predict the number of satellites using carapace width. Let $\mu$ be the
expected number of satellites at width $x$. The ML fit of the Poisson loglinear model
is $\log \hat\mu = \hat\alpha + \hat\beta x$.
$\exp(\hat\beta) = 1.18$ is the multiplicative effect on $\hat\mu$ for a 1-unit increase in $x$. For example, a fitted mean of 2.81 at one width
becomes $1.18 \times 2.81 = 3.32$ at a width one unit larger.
References
Agresti, A. (2018). An introduction to categorical data analysis. Wiley.
Sullivan, L. M. (2011). Essentials of biostatistics in public health. Jones & Bartlett Publishers.
Knuiman, M.W. & Speed, T.P. (1988) Incorporating Prior Information into the Analysis of
Contingency Tables. Biometrics, 44 (4), 1061–1071.
Peng Zeng (2012) Categorical Data Analysis - More discussions on Logistic Regression
[Link]
Fenton, N., Neil, M. and Constantinou, A. (2015) Simpson’s Paradox and the implications for
medical trials
[Link]
Wikipedia Cochran-Mantel-Haenszel test
[Link]