Biostatistics for Clinical and Public Health Research, 1st
Edition
Contents
4 Discrete probability distributions
Terms
Introduction
Examples of discrete random variables
Examples of continuous random variables
Measures of location and spread for random variables
Permutations and combinations
Binomial distribution
Poisson distribution
References
5 Continuous probability distributions
Terms
Introduction
Distribution functions
Normal distribution
Review of probability distributions
References
Lab B: Probability distributions
6 Estimation
Terms
Introduction
Statistical inference
Sampling
Randomized clinical trials
Population and sample mean
Confidence intervals for means
Using the standard normal distribution for a mean
The t-distribution
Obtaining critical values in SAS and Stata
Sampling distribution for proportions
Confidence intervals for proportions
References
7 One-sample hypothesis testing
Terms
Introduction
Basics of hypothesis testing
Confidence intervals and hypothesis tests
Inference for proportions
Determining power and calculating sample size
References
Lab C: One-sample hypothesis testing, power, and sample size
8 Two-sample hypothesis testing
Terms
Introduction
Dependent samples (paired tests)
Independent samples
Sample size and power for two-sample test of means
References
9 Nonparametric hypothesis testing
Terms
Introduction
Types of data
Parametric vs. nonparametric tests
References
Lab D: Two-sample (parametric and nonparametric) hypothesis testing
10 Hypothesis testing with categorical data
Terms
Introduction
Two-sample test for proportions
References
11 Analysis of variance (ANOVA)
Terms
Introduction
Within- and between-group variation
ANOVA assumptions
Testing for significance
References
12 Correlation
Term
Introduction
Population correlation coefficient (ρ)
Pearson correlation coefficient (r)
Spearman rank correlation coefficient (rs)
References
13 Linear regression
Terms
Simple linear regression
Multiple linear regression
Model evaluation
Other explanatory variables
Model selection
References
14 Logistic regression
Term
Introduction
Interpretation of coefficients
References
15 Survival analysis
Terms
Introduction
Comparing two survival functions
References
Lab E: Data analysis project
Appendix: Statistical tables
Index
Acknowledgments
It takes a village to write a book even if there is only one name on the cover. I want to
thank the Goodman Lab team at the Washington University School of Medicine (Nicole
Ackermann, Goldie Komaie, Sarah Lyons, and Laurel Milam). Although I often get
the credit and accolades, you make it possible for me to look good. You are an amazing
group of women and have been an enormous support to me, for which I am forever
grateful. I know that working for me can be challenging. This was especially the case
when I decided to write two books in addition to everything else that we already had
going on as a team, but you smiled through it all.
A special thanks to the talented young female statisticians who worked in the
Goodman Lab and served as teaching assistants for my course. Each of you made significant
contributions that helped turn my course notes into an actual book: Nicole
Ackermann (Chapters 1 through 5, 15, and Lab A), Sarah Lyons (Chapters 11 through
13), and Laurel Milam (Chapters 6 through 10, 14, and Labs B through E). Thank you
for creating examples, datasets, code books, figures, tables, and callout boxes; thank you
for truly making this a team effort.
Thanks to Sharese T. Wills for being brave enough to take on the challenge of editing
a biostatistics book. You took what I thought were solid drafts and completely reworked
them, making them much better.
To my family, friends, mentors, and colleagues who have supported me through this
journey—there are too many of you to name (not really; I am just too private to do
so), but I hope that you know who you are. It is rare for someone to have such strong
personal and professional support systems, and I am lucky to have both. It is great to
feel wanted in your department and supported by your institution, but I think it is rare
to feel supported by your profession. Thank you to all the academics who have helped
me to get to the point of publishing this workbook on biostatistics. It was not an easy
journey, and there is still more on the road ahead. However, I never would have made it
this far without you.
Author
Dr. Melody S. Goodman is an associate professor in the Department of Biostatistics at
New York University College of Global Public Health. She is a biostatistician with expe-
rience in study design, developing survey instruments, data collection, data management,
and data analysis for public health and clinical research projects. She has taught introductory
biostatistics for master of public health and medical students for over 10 years
at multiple institutions (Stony Brook University School of Medicine, Washington
University in St. Louis School of Medicine, New York University College of Global
Public Health).
List of abbreviations
ACS American Community Survey
BMI Body mass index
BRFSS Behavioral Risk Factor Surveillance System
CBT Cognitive behavioral therapy
CDC Centers for Disease Control and Prevention
CDF Cumulative distribution function
CL Confidence limit
CLM Confidence limit for mean
CLT Central limit theorem
CRP C-reactive protein
df Degrees of freedom
DRE Digital rectal examination
ED Emergency department
GED General education diploma
GIS Geographic information system
HIV Human immunodeficiency virus
LCLM Lower confidence limit for mean
MRCI Medication Regimen Complexity Index
mRFEI Modified retail food environment index
MSA Metropolitan Statistical Area
NATA National Air Toxics Assessment
NHANES National Health and Nutrition Examination Survey
NIH National Institutes of Health
PDF Probability density function
PK Pyruvate kinase
PSA Prostate-specific antigen
ROC Receiver operating characteristic
RR Relative risk
RT-PCR Reverse transcription polymerase chain reaction
SEM Standard error of the mean
SES Socioeconomic status
SSA Social Security Administration
TRUS Transrectal ultrasound
UCLM Upper confidence limit for mean
VPA Vigorous physical activity
YRBSS Youth Risk Behavior Surveillance System
Introduction
This workbook started as a set of course notes and handouts that I used while teaching
Introduction to Biostatistics. Although I love technology, I just think that “old school”
is better for some things. I remember learning math in elementary school with work-
books. I am almost sure that none of my current students have even seen a workbook.
Nonetheless, there is something about working through problems that helps people to
grasp the concepts. We try to use examples in this workbook that are easy to under-
stand, and we walk through problems step by step.
This introductory workbook is designed like a good set of notes from the best stu-
dent in the class—the professor. Its outline format at the beginning of each chapter
points to key concepts in a concise way, and chapters include highlights, bold text,
and italics to point out other areas of focus. In addition, tables provide important
information. Labs that include real-world clinical and public health examples walk
readers through exercises, ensuring that students learn essential concepts and know
how to apply them to data. The workbook provides the reader with the statistical
foundation needed to pass medical boards and certification exams in public health.
Also, those enrolled in online courses may find this workbook to be a great resource
to supplement course textbooks. Researchers in the field, particularly those new to
quantitative methods and statistical software (e.g., SAS or Stata), will find that this
book starts at an appropriate level and covers a breadth of needed material with proper
depth for a beginner. The workbook will also serve as a great reference to consult after
the initial reading.
The workbook provides a solid foundation of statistical methods, allowing math-
phobic readers to do basic statistical analyses, know when to consult a biostatistician,
understand how to communicate with a biostatistician, and interpret quantitative study
findings in the contexts of the hypotheses addressed. Many introductory biostatistics
books spend considerable time explaining statistical theory, but what students and
researchers really need to know is how to apply these theories in practice. This work-
book walks readers through just that, becoming a lifelong reference.
This workbook covers the basics—from descriptive statistics to regression analysis—
providing a survey of topics, including probability, diagnostic testing, probability distributions,
estimation, hypothesis testing (one-sample, two-sample, means, proportions,
nonparametric, and categorical), correlation, regression (linear and logistic), and survival
analysis. Examples are used to teach readers how to conduct analyses and interpret the
results. There is no fluff or extra verbiage. The workbook provides readers with exactly
what they need to know and shows them how to apply their knowledge to a problem.
The workbook provides the reader not only with an introduction to statistical methods
but also with a step-by-step how-to guide for using the SAS and Stata statistical software
packages to apply these methods to data, with many practical hands-on examples.
Statistical package: A collection of statistical programs that describe data and perform
various statistical tests on the data.
Some of the most widely used statistical packages include the following:
• SAS—used in this book
• R
• Stata—used in this book
• SPSS
• MATLAB®
• Mathematica
• Minitab
• Excel
In addition, this workbook provides a solid foundation with concisely written text
and minimal reading required. Although it is designed for an academic course, the
workbook can be used as a self-help book that allows the user to learn by doing. The
real-world practical examples show the user how to place results in context and demonstrate
that the outcomes of an analysis do not always match the researcher's predictions.
The workbook walks readers through problems, both by hand and with statistical
software. Readers can learn how the software performs the calculations, and they can
gain the ability to read and interpret SAS and Stata output. The SAS and Stata code
provided in the workbook gives readers a solid foundation from which to start
other analyses and apply the methods to their own datasets.
General overview
What is statistics?
Statistics
1 “The science whereby inferences are made about specific, random phenomena on the
basis of relatively limited sample material.”1
2 “The art of learning from data. It is concerned with the collection of data, their
subsequent description, and their analysis, which often lead to the drawing of
conclusions.”2
The two main branches of statistics
1 Mathematical statistics: The branch of statistics concerned with the development
of new methods of statistical inference; it requires detailed knowledge of abstract
mathematics for its implementation.
2 Applied statistics: The branch of statistics involved with applying the methods of
mathematical statistics to specific subject areas such as medicine, economics, and
public health.
A Biostatistics: The branch of applied statistics that applies statistical methods to
medical, biological, and public health problems. The study of biostatistics explores
the collection, organization, analysis, and interpretation of numerical data.
Basic problem of statistics
Consider a sample of data x1, x2,…,xn where x1 corresponds to the first sample point and
xn corresponds to the nth sample point. Presuming that the sample is drawn from some
population P, what inferences or conclusions can be made about P from the sample? (See
figure below.)
Data
Data are often used to make key decisions. Such decisions are said to be data driven. With
the use of technology, we are able to collect, merge, and store large amounts of data from
multiple sources. Data are often the core of any public health or clinical research study,
which often starts with a research question and the collection and analysis of data to answer
that question. It is important for researchers and practitioners to understand how to col-
lect (or extract from other sources), describe, and analyze data. Furthermore, the ability to
understand and critique data and methods used by others is increasing in importance as
the volume of research studies increases while the quality remains inconsistent.
Units and variables
In most instances, a dataset is structured as a data matrix with the unit of analysis (e.g.,
research participants, schools, research papers) in the rows and the variables in the columns.
• Units (cases): The research participants or objects for which information is collected.
• Variables: Systematically collected information on each unit or research participant.
Types of variables
• Nominal: Unordered categories (e.g., male, female).
• Ordinal: Ordered categories (e.g., mild, moderate, severe).
• Ranked: Data transformation where numerical or ordinal values are replaced by the
rank when the data are sorted (e.g., top five causes of death, top three favorite
movies).
• Discrete: Takes a countable number of distinct values (typically whole-number
counts) where both ordering and magnitude are important (e.g., number of
accidents, number of new AIDS cases in a one-year period).
• Continuous: Has an infinite number of possible values between its minimum and
maximum values (e.g., volume of tumor, cholesterol level, time).
In this book, we will discuss ways to describe and analyze a single variable and the
relationship between two or more variables. We demonstrate and walk through key
principles by hand and supplement this with the use of a statistical package. A statisti-
cal package does what the user tells it to do. It is important to understand key concepts
so that you can arrive at accurate outcomes from a software package, read statistical out-
put, and properly interpret the results. In Chapter 1, we discuss methods for describing
sample data, and, in the rest of the chapters, we discuss ways of analyzing data to test
hypotheses.
References
1. Rosner B. Fundamentals of Biostatistics. 8th ed. Boston, MA: Cengage Learning. 2016.
2. Ross SM. Introductory Statistics. 2nd ed. Burlington, MA: Elsevier Academic Press. 2005.
1 Descriptive statistics
This chapter will focus on descriptive statistics and will include the following:
• Measures of central tendency (measures of location)
• Measures of spread (measures of dispersion)
• Measures of variability
• Graphic methods
• Outliers and standard distribution rules
Terms
• arithmetic mean (average)
• bar graph
• box plot
• Chebyshev Inequality
• decile
• descriptive statistics
• empirical rule
• geometric mean
• GIS map
• histogram
• interquartile range
• median
• mode
• outlier
• percentiles
• quartile
• quintile
• range
• scatterplot
• standard deviation
• stem-and-leaf plot
• tertile
• variance
Introduction
A complete statistical analysis of data has several components. A good first step in
data analysis is to describe the data in some concise way, which allows the data analyst
a chance to learn about the data being considered. Descriptive statistics is the part
of statistics that is concerned with the description and summarization of data. Initial
descriptive analysis quickly provides the researcher an idea of the principal trends and
suggests where a more detailed look is necessary. The measures used in describing data
include measures of central tendency, spread, and the variability of the sample. All of
these measures can be represented in both tabular and graphic displays. We will go over
different types of graphs and displays in this chapter.
Measures of central tendency (Measures of location)
One type of measure that is useful for summarizing data defines the center, or middle,
of the sample. Thus, this type of measure is called a measure of central tendency (also
“measure of location”). Several measures of central tendency exist, but four measures of
central tendency will be discussed in this section:
1 Arithmetic mean (average)
2 Median
3 Mode
4 Geometric mean
Arithmetic Mean: The sum of all observations divided by the number of observations.
• The arithmetic mean is what is commonly referred to as an average.
• This is the most widely used measure of central tendency. However, the arithmetic
mean, or average, is sensitive to extreme values, meaning that the mean can
be pulled toward a value that is much higher or much lower than the other
values in the dataset.
• We use the notation $\mu$ to denote the mean of a population and $\bar{x}$ to denote the
mean of a sample.
Equation 1.1 shows the calculation of the arithmetic mean.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n} \tag{1.1}$$
Because the mean is based on summation, knowing several properties of summation
is often useful as you begin to analyze data, specifically as the data relate to the mean.
BOX 1.1 PROPERTIES OF THE SAMPLE MEAN
Equation 1.2 shows the multiplicative property of summations:
$$\sum_{i=1}^{n} c\,x_i = c\sum_{i=1}^{n} x_i \tag{1.2}$$
Three important properties of the arithmetic mean:
1. If $y_i = x_i + c$, where $i = 1, \dots, n$, then $\bar{y} = \bar{x} + c$
2. If $y_i = c x_i$, where $i = 1, \dots, n$, then $\bar{y} = c\bar{x}$
3. If $y_i = c_1 x_i + c_2$, where $i = 1, \dots, n$, then $\bar{y} = c_1\bar{x} + c_2$
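The three properties in Box 1.1 can be checked numerically. The following is a minimal Python sketch rather than code from the workbook, which uses SAS and Stata; the sample values and the constants `c1` and `c2` are arbitrary choices made for illustration.

```python
import math


def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations (Equation 1.1)."""
    return sum(values) / len(values)


x = [2.15, 2.25, 2.30]  # illustrative sample values
c1, c2 = 10.0, 3.0      # arbitrary constants, chosen only for this illustration

x_bar = arithmetic_mean(x)

# Property 1: adding a constant to every observation shifts the mean by that constant.
shifted_mean = arithmetic_mean([xi + c2 for xi in x])

# Property 2: multiplying every observation by a constant scales the mean by it.
scaled_mean = arithmetic_mean([c1 * xi for xi in x])

# Property 3: a linear transformation of the data transforms the mean the same way.
linear_mean = arithmetic_mean([c1 * xi + c2 for xi in x])

# math.isclose is used because floating-point sums are not exact.
print(math.isclose(shifted_mean, x_bar + c2))
print(math.isclose(scaled_mean, c1 * x_bar))
print(math.isclose(linear_mean, c1 * x_bar + c2))
```

A practical use of Property 3 is unit conversion: the mean of converted data can be obtained by converting the mean directly instead of converting every observation.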
Median: The median is the value in the middle of the sample variable such that 50%
of the observations are greater than or equal to the median and 50% of the observations
are less than or equal to the median.
• The median is an alternate measure of central tendency (measure of location) and is
second to the arithmetic mean in familiarity.
• The median is useful in data that have outliers and extreme values; the median is
insensitive to these values.
• Calculation of the median uses only the middle points in a sample and is less sensitive
to the actual numeric values of the remaining data points.
Calculation: Suppose that there are n observations in a sample and that these obser-
vations are ordered from smallest (1) to largest (n). The sample median is defined as
follows:
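If n is odd,
$$\text{Median} = \text{the } \left(\frac{n+1}{2}\right)\text{th observation}$$
If n is even,
$$\text{Median} = \text{the average of the } \left(\frac{n}{2}\right)\text{th and the } \left(\frac{n}{2}+1\right)\text{th observations}$$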
EXAMPLE PROBLEM 1.1
Calculate the arithmetic mean and median of Sample 1.
Sample 1: 2.15, 2.25, 2.30
To find the mean, we add all the values and divide the sum by n, which equals 3.
$$\text{Mean} = \frac{2.15 + 2.25 + 2.30}{3} = \frac{6.7}{3} = 2.23$$
To find the median, we put all values in numerical order and find the
$\left(\frac{3+1}{2}\right)$th = 2nd value.
Median = 2nd value of Sample 1 (2.15, 2.25, 2.30) = 2.25.
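The odd and even cases of the median rule can be sketched in a few lines of Python. This is an illustration only, not the workbook's SAS or Stata code; the function name `sample_median` and the even-sized sample are hypothetical.

```python
def sample_median(values):
    """Median using the odd/even rule: sort, then take the middle observation(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th observation (subtract 1 for 0-based indexing)
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n / 2)th and (n / 2 + 1)th observations
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


sample_1 = [2.15, 2.25, 2.30]  # Sample 1 from Example Problem 1.1
even_sample = [4, 1, 3, 2]     # hypothetical sample with an even number of values

print(sample_median(sample_1))     # 2.25, the 2nd ordered value
print(sample_median(even_sample))  # 2.5, the average of 2 and 3
```

Sorting first matters: the rule refers to positions in the ordered sample, not to the order in which the data were recorded.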
PRACTICE PROBLEM 1.1
Calculate the arithmetic mean and median of Sample 2.
Sample 2: 2.15, 2.25, 2.30, 2.60
Mode: The mode is the observation that occurs most frequently.
• This measure of central tendency (measure of location) is not a useful measure if
there are a large number of possible values, each of which occurs infrequently.
• Some distributions can have more than one mode. We can classify a distribution by
the number of modes in the data.
• If there is one mode, the distribution is unimodal. For example, in the follow-
ing sequence of numbers, there is one mode because 7 appears the most out of
any data value:
– 1 2 3 5 7 7 7 8 8 9
• If there are two modes, the distribution is bimodal. For example, the following
sequence of numbers has two modes because 5 and 6 both appear the most out
of any data value in the sequence:
– 2 3 4 5 5 6 6 7 10
• If there are three modes, the distribution is trimodal. For example, the follow-
ing sequence of numbers has three modes because 1, 2, and 6 appear the most
out of any data value in the sequence:
– 1 1 2 2 5 6 6 8 9
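The classification by number of modes can be reproduced by counting how often each value occurs. The sketch below is in Python, which is not one of the packages used in the workbook; the helper name `modes` is hypothetical, and the three sequences are the ones listed above.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


unimodal = [1, 2, 3, 5, 7, 7, 7, 8, 8, 9]
bimodal = [2, 3, 4, 5, 5, 6, 6, 7, 10]
trimodal = [1, 1, 2, 2, 5, 6, 6, 8, 9]

print(modes(unimodal))  # [7]
print(modes(bimodal))   # [5, 6]
print(modes(trimodal))  # [1, 2, 6]
```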
Arithmetic mean versus median
Because the mean is sensitive to outliers and extreme values, it is important to deter-
mine when to use the arithmetic mean versus the median. The distribution of the data
is a key factor in making this decision.
Arithmetic mean
For a symmetric distribution, the arithmetic mean is approximately the same as the
median.
For a positively skewed distribution, the arithmetic mean tends to be larger than the
median.
For a negatively skewed distribution, the arithmetic mean tends to be smaller than
the median.
Median
If the distribution is symmetric, the relative position of the points on each side of the
sample median is the same. The mean or median can be used to describe this
sample.
If the distribution is positively skewed (skewed to the right), the points above the median
tend to be farther from the median in absolute value than points below the
median. This is sometimes referred to as “having a heavy right tail.”
If a distribution is negatively skewed (skewed to the left), points below the median tend
to be farther from the median in absolute value than points above the median.
This is sometimes referred to as “having a heavy left tail.”
See Figure 1.1 for a demonstration of the relationship between the arithmetic mean
and the median and the skewed versus nonskewed distributions. The mode is also represented
in the symmetric distribution on the figure.
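The relationship between the mean and the median under skew can be seen with small numeric samples. The Python sketch below uses three hypothetical samples invented for illustration; they are not data from the workbook.

```python
import statistics

# Hypothetical samples, invented only to illustrate the three shapes.
samples = {
    "symmetric": [1, 2, 3, 4, 5],
    "positive skew": [1, 2, 2, 3, 3, 3, 4, 20],       # one large value in the right tail
    "negative skew": [1, 17, 18, 18, 18, 19, 19, 20],  # one small value in the left tail
}

summary = {}
for name, sample in samples.items():
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    summary[name] = (mean, median)
    print(f"{name}: mean = {mean}, median = {median}")
```

In the positively skewed sample the single value of 20 pulls the mean (4.75) above the median (3), and in the negatively skewed sample the value of 1 pulls the mean (16.25) below the median (18). The symmetric sample has the same mean and median.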
Geometric Mean: The geometric mean is the antilogarithm of $\overline{\log x}$ (see Equation 1.3).
This measure of central tendency is not often used in practice but can be useful when
dealing with biological or environmental data that are based on concentrations (e.g.,
biomarkers, blood lead levels, C-reactive protein [CRP], cortisol).
[Figure: three panels labeled "Positive skew", "Negative skew", and "Symmetric distribution", with the mean, median, and mode marked.]
Figure 1.1 Positively and negatively skewed distributions. In the distribution with a positive skew
(top left) and the distribution with a negative skew (top right), the median is a better measure
of central tendency than the mean. The mean more accurately captures the central tendency of
the data when the distribution is symmetric (center bottom) with thin tails than when the
data are skewed. The mean follows the tail of a skewed distribution. The relationship between
the sample mean and the sample median can be used to assess the symmetry of a distribution.
Equation 1.3 shows the $\overline{\log x}$ calculation.
$$\overline{\log x} = \frac{1}{n}\sum_{i=1}^{n} \log x_i \tag{1.3}$$
Example of geometric mean using data from the 2014 National
Health and Nutrition Examination Survey
The 2014 National Health and Nutrition Examination Survey (NHANES) measures
participants’ blood lead levels.1 Using these data, we computed the geometric mean for
the participants. The geometric mean for blood lead levels for the entire group of 2014
participants with nonmissing data was 0.83 µg/dL compared to the arithmetic mean
of 1.1 µg/dL. We also categorized participants into age groups and computed the geometric
mean of blood lead levels by age category (see Figure 1.2).
EXAMPLE PROBLEM 1.2
Calculate the geometric mean for the dataset in Sample 1 from Example Problem 1.1.
log(2.15) = 0.33
log(2.25) = 0.35
log(2.30) = 0.36
$$\overline{\log x} = \frac{0.33 + 0.35 + 0.36}{3} = \frac{1.04}{3} = 0.3467$$
$$\text{Geometric mean} = \text{antilog}(0.3467) = 10^{0.3467} = 2.22$$
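The same calculation can be sketched in Python (an illustration only; the workbook itself uses SAS and Stata). The function name `geometric_mean` is hypothetical, and base-10 logarithms are assumed because the example takes the antilog as a power of 10.

```python
import math


def geometric_mean(values):
    """Antilog (base 10) of the mean of the base-10 logs (Equation 1.3)."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


sample_1 = [2.15, 2.25, 2.30]  # Sample 1 from Example Problem 1.1

# Full-precision calculation.
gm = geometric_mean(sample_1)

# Hand calculation: round each log to two decimals before averaging.
rounded_logs = [round(math.log10(v), 2) for v in sample_1]
gm_by_hand = 10 ** (sum(rounded_logs) / len(rounded_logs))

print(round(gm, 2))          # 2.23
print(round(gm_by_hand, 2))  # 2.22
```

The small difference between 2.23 and 2.22 comes only from rounding the logs to two decimals in the hand calculation. The base of the logarithm does not affect the result as long as the antilog uses the same base.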
[Figure: chart titled "Geometric mean of blood lead by patients’ age category". Y-axis: geometric mean of blood lead (µg/dL), from 0.0 to 1.6. X-axis: age (years), in categories ≤5, 6 to 10, 11 to 18, 19 to 30, 31 to 64, and ≥65.]
Figure 1.2 Geometric mean of blood lead levels by patients’ age category for participants of
the 2014 National Health and Nutrition Examination Survey (NHANES). (Data
from the National Health and Nutrition Examination Survey, Centers for Disease Control
and Prevention website, https://wwwn.cdc.gov/Nchs/Nhanes/Search/nhanes13_14.aspx,
accessed February 23, 2016.)
PRACTICE PROBLEM 1.2
Calculate the geometric mean of Sample 2.
Measures of spread
Many variables can be well described by a combination of a measure of central tendency
(measure of location) and a measure of spread. Measures of spread tell us how far or how
close together the data points are in a sample. Six measures of spread will be discussed
in this section:
1 Range
2 Quantiles
3 Percentiles
4 Interquartile range
5 Variance
6 Standard deviation
Range: The range is the difference between the largest and smallest observations of a variable.
The range measures the spread of a variable as the distance from the minimum to the
maximum value. Although the range is very easy to compute, it is sensitive to extreme
observations. The range depends on the sample size (n). The larger the n, the larger the
range tends to be. This makes it hard to compare ranges from datasets of different sizes.
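Both weaknesses of the range, its sensitivity to extreme observations and its dependence on sample size, show up in a small Python sketch. The data below are hypothetical values invented for illustration, and `sample_range` is a hypothetical helper name.

```python
def sample_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)


small_sample = [4, 7, 9]                 # hypothetical data
larger_sample = small_sample + [2, 15]   # same data plus two more observations
with_outlier = small_sample + [100]      # same data plus one extreme observation

print(sample_range(small_sample))   # 5
print(sample_range(larger_sample))  # 13
print(sample_range(with_outlier))   # 96
```

Adding observations can leave the range unchanged or increase it, but can never decrease it, which is why ranges from samples of different sizes are hard to compare.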
Quantiles and Percentiles: Quantiles and percentiles are measures of spread that are determined
by the numerical ordering of the data. They are cut points that divide the frequency
distribution into equal groups, each containing the same fraction of the total population.
Quantiles and percentiles can also be used to describe the spread of a variable. They
are less sensitive to outliers and are not greatly affected by the sample size, which is an
advantage over the range.
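As a rough illustration of cut points, the Python standard library function `statistics.quantiles` returns the n − 1 values that divide ordered data into n equal groups. This is a sketch under assumptions: the data are hypothetical, and the function's default interpolation rule may differ from the convention used elsewhere in this workbook or by SAS and Stata, so results on other datasets may not match exactly.

```python
import statistics

data = list(range(1, 12))  # hypothetical ordered data: 1, 2, ..., 11

# Quartiles: three cut points dividing the data into four equal groups.
quartiles = statistics.quantiles(data, n=4)

# Tertiles: two cut points dividing the data into three equal groups.
tertiles = statistics.quantiles(data, n=3)

print(quartiles)  # [3.0, 6.0, 9.0]
print(len(tertiles))
```

The second quartile is the median (6 for these data), which ties the measures of spread back to the measures of central tendency described earlier.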