02 Book 2 - Biostatistics - Linares 2019
Fifth edition
USFXCh - Faculty of Medicine - Public Health Notes II - Biostatistics - Dr. Gróver Linares Ph.D - 2015
Index
Chapter
1 Biostatistics - Introduction
A. Descriptive Biostatistics
B. Inferential Biostatistics
9 B.1 Sampling
10 B.2 Determination of sample size
11 B.3 Basic notions of Normal Distribution
12 B.4 Basic notions of probability
13 B.5 Basic notions of correlation
14 B.6 Chi squared
15 B.7 Confidence interval
1 BIOSTATISTICS
1.1 Introduction
A high school student who wishes to continue studying at the university, who generally dislikes numbers and feels a vocation for the health sciences, decides to study medicine, dentistry, nursing, biochemistry, pharmacy, nutrition, physiotherapy or imaging, hoping to leave numbers as far behind as possible.
What a mistake: no one is ever far from numbers, because we are born with numbers, we live with numbers and we will die with them. We are born with an Apgar score of 8 (assessed by the neonatologist), a weight of 3,200 grams at the 50th percentile, and a heart rate of 120 beats per minute, where the normal range is 110 to 140 with a 95% confidence interval; in our first blood test we had a hemoglobin of 17 g/100 ml, the normal range being 16.5-19.5 g/100 ml, and so on.
Someone could die from a myocardial infarction because they had an elevated cardiovascular risk due to a total cholesterol above 240 mg/dl, LDL cholesterol above 160 mg/dl, HDL cholesterol below 35 mg/dl, or triglycerides above 150 mg/dl.
Whatever degree you study, all of them are part of the sciences, and science grows and is nourished by the new knowledge gained through research using the scientific method (which we will study in the following chapters), which cannot dispense with statistics.
Everything is measurable. As children we were taught that distance is measured in meters, liquids in liters and weight in kilos; later we learned that besides the meter there are also centimeters, millimeters, microns, nanometers, etc. Now that we study the health sciences, we know that a red blood cell lives in our body for only 100 to 120 days and measures 7 to 7.5 μm in diameter (micrometer = one millionth of a meter), and that a cubic millimeter of blood holds more than 4 million red blood cells. If we have 5 liters of blood, a simple calculator cannot display the total number of red blood cells and will only give the result in scientific notation: 2.5 × 10^13 (taking 5 million cells per cubic millimeter).
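The red blood cell total quoted above can be checked with a few lines of Python. This is only a sketch: it assumes 5 million red blood cells per cubic millimeter (the count that yields 2.5 × 10^13; the text itself says only "more than 4 million"), and the variable names are illustrative.

```python
# Assumption: 5 million red blood cells per cubic millimeter of blood.
RBC_PER_MM3 = 5_000_000
MM3_PER_LITER = 1_000_000  # 1 liter = 1,000,000 cubic millimeters
BLOOD_LITERS = 5

total_rbc = RBC_PER_MM3 * MM3_PER_LITER * BLOOD_LITERS

# The "e" format shows the number the way a calculator would
print(f"{total_rbc:.1e}")  # 2.5e+13
```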
There are different measures and indicators of well-being (social or economic) in health, and certain indices of 'positive health' have been developed, both for operational purposes and for research and the promotion of healthy conditions, in dimensions such as mental health, self-esteem, job satisfaction, physical exercise, etc. Data collection and the estimation of indicators aim to generate, systematically, evidence that allows patterns and trends to be identified, which in turn helps to undertake actions for the protection and promotion of health and for the prevention and control of disease in the population.
Among the most useful and common ways of measuring the general health conditions of the population are the national censuses, conducted every ten years, which provide a periodic count of the population and of several of its characteristics, whose analysis allows estimates and projections to be made.
Medical students often ask themselves the following questions: Why is it necessary to study statistics in medicine? Why study numbers if throughout the program we are only going to study muscles, bones, or tissues? Is it really a subject that will help me in my professional life, or is it simply a filler in the curriculum?
If, as a new professional, you research 'the evolution of HIV/AIDS in Bolivia', you will surely have to study the population (sex, race, religion, age, occupation, income, level of education, marital status, etc.), look up the diagnosed HIV-positive cases in the different hospitals, record that information, organize it, tabulate it and, with the data you have, answer the following questions: How many cases of AIDS are there in Bolivia at present? Will the number of AIDS cases increase in the next five years? Which department will have the highest rate of HIV-positive cases? Are the disease control mechanisms working?
For a long time, the word statistic referred to numerical information about
the political states or territories. The word comes from the Latin "statisticus" which means
"of the state." In the past, statistics were only used to know the number
of inhabitants of a certain region, for the collection of taxes.
CLASSIFICATION OF STATISTICS
1. DESCRIPTIVE STATISTICS: the set of procedures needed to collect, classify, represent, and summarize (through numerical and graphical methods) the data that make up a sample obtained from a population.
PROBABILITY
a) Ratios
c) Rates
Measures of central tendency
a) Mean
b) Median
c) Mode
3. Measures of dispersion or variation
a) Range
b) Mean deviation
c) Variance
d) Standard deviation
e) Coefficient of variation
4. Measures of shape
a) Kurtosis
b) Coefficient of skewness
2. Normal distribution
3. Probability
5. Chi-square test
6. Confidence interval
In my many years as a teacher, I have noticed that students have serious difficulties with simple details that are often overlooked. For this reason, I will explain those simple yet important details:
A person who uses the calculator incorrectly without realizing it believes the answer is correct because it is the result the calculator gave, yet may be making serious mistakes.
In many countries a comma is used as the decimal separator, while in others a point is used for the same purpose.
Some write 3,256 and others 3.256; unless the convention is known, some will read 3 units and 256 thousandths while others will read 3256 units, two totally different figures.
Likewise, some countries use a point to separate thousands, while others use a comma.
When we buy a scientific calculator, it has been manufactured or programmed for a specific country, so it may display data in one system or the other; that is, it expresses decimals with either a decimal point or a decimal comma. We must identify which system our new calculator uses so that we do not make mistakes.
Generally, calculators that come from Asia (China, Japan, etc.) use the decimal point; if our calculator is of this type, we must mentally convert that decimal point into a decimal comma when we write these numbers for use in Bolivia.
3.256
For very large or very small values, many scientific calculators display the results in scientific notation, so it is important to know and interpret it. For this reason, we will do a brief review.
Scientific notation is very useful for expressing very large or very small numbers.
This abbreviation can also be used with very small numbers. When scientific notation is used with numbers less than one, the exponent on 10 is negative, and the decimal point moves to the left instead of to the right.
For example:
6.5 × 10^-3 = 0.0065
Consequently, using scientific notation, the diameter of a red blood cell is 6.5 × 10^-3 mm (0.0065 mm); the distance from the earth to the sun is 1.5 × 10^8 km (150,000,000 km); and the number of molecules in 1 gram of water is 3.34 × 10^22 (33 400 000 000 000 000 000 000).
- 1.56234 × 10^29 = 156 234 000 000 000 000 000 000 000 000
- 0.000000000000000000000000000000910939 kg (mass of an electron) can be written as 9.10939 × 10^-31 kg.
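As a small illustration of how a computer displays the same values, the sketch below converts three of the numbers quoted above to scientific notation. The helper name `to_scientific` is hypothetical; the `e` in the output plays the role of "× 10^".

```python
def to_scientific(text: str, decimals: int = 2) -> str:
    """Format a plain decimal string in scientific (e) notation."""
    return f"{float(text):.{decimals}e}"

print(to_scientific("0.0065"))                   # 6.50e-03
print(to_scientific("150000000"))                # 1.50e+08
print(to_scientific("33400000000000000000000"))  # 3.34e+22
```

A negative exponent (e-03) means the decimal point moved to the left; a positive one (e+08) means it moved to the right.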
1.5.3 Rounding:
Rounding depends on the number of significant figures we want to use in the answer. In theory, it should always match the number of significant figures of the expression that has the fewest.
We count the number of digits we want to keep and look at the next one: if it is 5 or greater, the last kept digit is increased by one unit; if it is 4 or less, the last kept digit is left as it is.
Digit less than 5: if the next decimal is less than 5, the previous one is not modified. Example: 12.612. Rounding to 2 decimals, we look at the third decimal: 12.612 ≈ 12.61.
Digit 5 or greater: if the next decimal is greater than or equal to 5, the previous one is increased by one unit. Example: 12.618. Rounding to 2 decimals, we look at the third decimal: 12.618 ≈ 12.62. Example: 12.615. Rounding to 2 decimals, we look at the third decimal: 12.615 ≈ 12.62.
If you want to practice rounding on your computer, you can visit the following web page; I am sure that you will not only learn but also have fun.
Oh, and don't forget to use the decimal comma and not the decimal point!
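The rounding rule described above (5 or more rounds up, 4 or less leaves the digit unchanged) can be sketched in Python with the standard `decimal` module. The function name `round_half_up` is an illustrative choice; note that the code uses the decimal point, since that is what the language requires. Python's built-in `round()` on ordinary floating-point numbers follows a different tie-breaking rule and may give another result for values such as 12.615, which is why the sketch works on decimal strings.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value: str, places: int) -> Decimal:
    """Round a decimal string so that a next digit of 5 or more rounds up."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal("0.01")
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

print(round_half_up("12.612", 2))  # 12.61
print(round_half_up("12.618", 2))  # 12.62
print(round_half_up("12.615", 2))  # 12.62
```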
2 A. DESCRIPTIVE BIOSTATISTICS
2.1 Introduction
Measuring requires a process that consists, in brief, of moving from a theoretical entity to a conceptual scale and then to an operational scale. In general, the steps followed during measurement are: a) the part of the event to be measured is defined; b) the scale with which it will be measured is selected; c) the measured attribute is compared with the scale; and d) finally, a value judgment is issued about the results of the comparison. To measure the growth of a child, for example, the variables to be measured are selected first (age, weight, height); then the measurement scales are selected (completed months, centimeters, grams); next, the attributes are compared with the selected scales (an age of 6 months, 60 cm in height, 4,500 grams in weight); and, finally, a value judgment is issued that summarizes the comparison between the magnitudes found and the health criteria accepted as valid at that time. As a result, the infant is classified as well nourished, malnourished, or overnourished.
2.2 Number:
It is a mathematical concept that expresses quantity. For example, we say that 120 cases of tuberculosis have been detected in a certain population.
Absolute figures give an idea of the magnitude or real volume of an event. They are useful for allocating resources (for example, the monthly number of births in a hospital gives an idea of the number of beds, staff, and physical resources needed to meet this demand). For comparisons, absolute figures have limitations, since they do not refer to the population from which they are obtained (thus, 40 deaths per year in a population of 15,000 inhabitants are proportionally more than 50 deaths in a population of 20,000 inhabitants: about 2.7 versus 2.5 per 1,000). However, comparing absolute figures for the same population over short periods of time can give a good estimate of risk, because the denominator remains roughly constant.
2.3 Rates:
A rate is a measure that relates the number of times an event occurs in a defined area and period of time to the number of inhabitants of the population in which it can occur.
Rates are composed of a numerator that expresses the frequency with which the event occurs (for example, 564 deaths from breast cancer in 2014 in Bolivia) and a denominator given by the population exposed to that event (4,583,443 women). The resulting quotient represents the mathematical probability of the occurrence of an event in a defined population and time. In the example, the rate obtained estimates the risk that a woman over 30 years old in Bolivia died of breast cancer during 2014.
When the denominator refers to the general population, the exposed population is taken by convention as the population existing in that place on June 30 of that year (mid-year). Since the numerator of a rate can never be greater than its denominator, the result will be less than one; to avoid decimals, the result is multiplied by an amplification factor that is a power of 10 (1,000, 10,000 or 100,000). Pre-established amplification factors also make it possible to compare rates internationally.
In this way, the breast cancer mortality rate in women in 2014 was 12.31 deaths per 100,000 women:
$$\text{Breast cancer mortality rate} = \frac{564}{4{,}583{,}443} \times 100{,}000 = 12.31$$
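The same calculation can be sketched in Python. The function name `rate` and its `factor` parameter are illustrative; the figures are the ones given in the example.

```python
def rate(events: int, exposed_population: int, factor: int = 100_000) -> float:
    """Events divided by the exposed population, scaled by the amplification factor."""
    return events / exposed_population * factor

breast_cancer_rate = rate(564, 4_583_443)
print(round(breast_cancer_rate, 2))  # 12.31 deaths per 100,000 women
```

Changing `factor` to 1,000 or 10,000 expresses the same rate per 1,000 or per 10,000 inhabitants.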
The numerator and the denominator must have strict correspondence in three
aspects:
a) Gross, general or crude rates: when they refer to the total population.
b) Specific rates: when they refer to particular segments of the population that are specifically related to the event under study.
c) Adjusted or standardized rates: when they are adjusted to a standard population.
2.4 Ratios:
Example:
Maternal mortality ratio: it measures the number of maternal deaths per 100,000 births. It is obtained by dividing the number of maternal deaths by the number of births and multiplying by 100,000.
If in a population there were 400 maternal deaths and 65,000 births during 2009, then 400 / 65,000 × 100,000 = 615.4.
Therefore, we say that the maternal mortality ratio is approximately 615 per 100,000 births.
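As a quick check of this example, the sketch below repeats the division with the figures quoted (400 maternal deaths, 65,000 births). With those inputs the quotient works out to about 615 per 100,000 births. The variable names are illustrative.

```python
maternal_deaths = 400
births = 65_000

# Deaths per 100,000 births
maternal_mortality = maternal_deaths / births * 100_000
print(round(maternal_mortality, 1))  # 615.4
```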
2.5 Proportions:
Proportions are relative figures or magnitudes that relate two categories of the same phenomenon in which one is contained within the other; that is, one is the part and the other the whole.
Proportion = Numerator / Denominator = PART / WHOLE
Proportions express the frequency with which an event occurs in relation to the total population in which it can occur. They are calculated by dividing the number of events that occurred by the population in which they occurred. Since each element of the population can contribute only one event, and the numerator (the volume of events) is part of the denominator (the population in which the events occurred), the numerator can never be greater than the denominator. For this reason the result can never be greater than one and always lies between zero and one.
Proportions express only the relationship between the number of times an event occurs and the total number of occasions on which it could occur.
For example, what proportion of the deaths that occurred in the city of Sucre in 2013 was caused by cardiovascular diseases? It is calculated as the quotient of the number of deaths due to cardiovascular causes (740) and the total number of deaths that occurred that year (4,432), amplified by 100: 16.70% of the deaths in 2013 were caused by cardiovascular diseases. Proportions are not interpreted as a probability, nor do they provide a risk, since they are not calculated with the population exposed to the risk. A proportion can be considered an estimate of a probability when it is calculated in a representative sample of a given population.
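The Sucre example can be reproduced as follows. This is a minimal sketch with illustrative variable names; it also shows that the proportion itself lies between zero and one before it is amplified by 100.

```python
cardiovascular_deaths = 740   # the part
total_deaths = 4_432          # the whole

proportion = cardiovascular_deaths / total_deaths
print(f"{proportion:.4f}")  # 0.1670
print(f"{proportion:.2%}")  # 16.70%
```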
2.6 Indices:
Indices arise from the comparison of two rates or two ratios, for example the quotient of the overall mortality rate in men and that in women in 2010.
This indicator gives an idea of a greater or lesser risk of a condition, depending on whether its value is greater or less than 1 (or 100%). In this case, we have:
The arithmetic mean of a statistical variable is the value obtained by adding all the data and dividing the result by the total number of data.
Its calculation aims to obtain a value toward which the individual data or observations tend.
To represent the population mean and the sample mean, the following symbols are used:
µ: the Greek letter 'mu', which denotes the mean of a population
X̄: 'X bar', which denotes the mean of a sample
For educational purposes, we will henceforth use this last symbol to refer to the 'arithmetic mean' in general.
Formula:
$$\bar{X} = \frac{\sum X_i}{n}$$
X̄ = Mean
Σ = Summation
Xi = All the values of the distribution
n = Number of data
Formula:
$$\bar{X} = \frac{9 + 11 + 10 + 8 + 12 + 9 + 13 + 10 + 10}{9} = \frac{92}{9} = 10.2 \text{ years}$$
Each of the ages is added together and the sum is divided by the number of patients.
The average age of this population is 10.2 years.
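The calculation above can be written in a few lines of Python. This is a simple sketch using the nine ages of the example; the variable names are illustrative.

```python
ages = [9, 11, 10, 8, 12, 9, 13, 10, 10]

total = sum(ages)       # sum of all the values (ΣXi)
n = len(ages)           # number of data
mean_age = total / n

print(total, n, round(mean_age, 1))  # 92 9 10.2
```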
Another example:
Formula:
Formula:
$$\bar{X} = \frac{\sum X_i f_i}{n}$$
X̄ = Mean
Σ = Summation
Xi = All the values of the distribution
fi = All frequencies
n = Number of data, that is: Σfi
Age in years of students (Xi)    Number of students (fi)
20                               8
21                               7
22                               9
23                               6
24                               5
TOTAL:                           35 (n = Σfi)
Age in years of students (Xi)    Number of students (fi)    Xi * fi
20                               8                          20 * 8 = 160
21                               7                          21 * 7 = 147
22                               9                          22 * 9 = 198
23                               6                          23 * 6 = 138
24                               5                          24 * 5 = 120
TOTAL:                           n = 35
Age in years of students (Xi)    Number of students (fi)    Xi * fi
20                               8                          160
21                               7                          147
22                               9                          198
23                               6                          138
24                               5                          120
TOTAL:                           n = 35                     Σ Xi * fi = 763
$$\bar{X} = \frac{\sum X_i f_i}{n} = \frac{763}{35} = 21.8 \text{ years}$$
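The frequency-table calculation above can be sketched in Python by storing each age with its frequency. The dictionary layout and variable names are illustrative choices, not part of the original notes.

```python
# Age in years (Xi) -> number of students (fi)
frequencies = {20: 8, 21: 7, 22: 9, 23: 6, 24: 5}

n = sum(frequencies.values())                             # Σfi
weighted_sum = sum(x * f for x, f in frequencies.items()) # Σ Xi * fi
mean_age = weighted_sum / n

print(n, weighted_sum, round(mean_age, 1))  # 35 763 21.8
```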
Total: 62
Formula:
$$\bar{X} = \frac{\sum X' f_i}{n}$$
X̄ = Mean
Σ = Summation
X' = Midpoint or class mark of the class interval
fi = All frequencies
n = Number of data, that is to say: Σfi
Total: 62
Total: 62 809
4 Median
It is the value that occupies the central position of all the data when they are ordered from smallest to largest, and it is represented by the symbol Me.
According to this definition, the data less than or equal to the median represent 50% of the data, and those greater than the median represent the other 50% of the total sample data.
Since we have 9 values, to find the median we leave the same number of values on the right and on the left; that is, in our example we leave 4 to the right and 4 to the left, with the number 10 in the middle, which is the median.
Since we have 12 values, to find the median we leave 5 on the right and 5 on the left, with two values left in the middle because 12 is even: the numbers 7 and 8, whose average, (7 + 8) / 2 = 7.5, is the median.
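The two cases (odd and even number of values) can be sketched with Python's standard `statistics` module. The first list reuses the nine ages from the arithmetic mean example, whose central value is also 10; the second list is hypothetical, chosen only so that its two middle values are 7 and 8.

```python
from statistics import median

# Odd number of values: the middle one is the median
ages = [9, 11, 10, 8, 12, 9, 13, 10, 10]
print(sorted(ages))   # [8, 9, 9, 10, 10, 10, 11, 12, 13]
print(median(ages))   # 10

# Even number of values (hypothetical data): average of the two middle ones
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
print(median(values)) # 7.5
```

`median` sorts the data internally, so the input does not need to be ordered beforehand.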
24 20 70
25 30 100
Total: 100 ---
13 15 61
14 5 66
15 25 91
Total: 91 ---
The first cumulative frequency equal to or greater than 45.5 is 46, corresponding to the value Xi = 12, which represents Xk.
Third step:
Applying the formula: $$Me = \frac{12 + 12 + 1}{2} = \frac{25}{2} = 12.5$$
95.0 - 99.0       4     24
100.0 - 104.0     2     26
105.0 - 109.0     7     33
110.0 - 114.0     5     38
115.0 - 119.0     1     39
120.0 - 124.0     1     40
TOTAL:            n = 40    ---
1. First, we find the position of the median observation: n / 2 = 40 / 2 = 20
2. The cumulative frequency that contains position 20 is exactly 20, which is the one we highlight, corresponding to the interval 90.0 - 94.0
3. We calculate the amplitude of the interval or class:
a = 5 (there are 5 points of range from 90 to 94)
Formula:
$$Me = L_i + \left(\frac{\frac{n}{2} - F_{i-1}}{f_i}\right) \cdot a$$
where Li is the lower limit of the median class, F(i-1) the cumulative frequency of the class before it, fi the frequency of the median class, and a the class amplitude.
$$Me = 90.0 + \left(\frac{\frac{40}{2} - 14}{6}\right) \cdot 5 = 90 + \left(\frac{20 - 14}{6}\right) \cdot 5 = 90 + \frac{6}{6} \cdot 5 = 95$$
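The grouped-data median formula can be sketched as a small Python function. The function and parameter names are illustrative; the first call uses the numbers of the worked example above (lower limit 90, n = 40, previous cumulative frequency 14, class frequency 6, amplitude 5), and the second uses those of Exercise 7 (lower limit 10, n = 96, previous cumulative frequency 26, class frequency 24, amplitude 5).

```python
def grouped_median(lower_limit, n, cum_freq_before, class_freq, amplitude):
    """Me = Li + ((n/2 - cumulative frequency before the class) / fi) * a"""
    return lower_limit + ((n / 2 - cum_freq_before) / class_freq) * amplitude

print(grouped_median(90.0, 40, 14, 6, 5))           # 95.0
print(round(grouped_median(10, 96, 26, 24, 5), 2))  # 14.58
```

The second result is 14.58 when no intermediate rounding is applied; rounding 22/24 to 0.92 first, as in the hand calculation, gives 14.6.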
Exercise 7: What is the median of the following ages?
Age (years)    Absolute frequency    Cumulative frequency
Xi             fi                    Fi
0 - 4          10                    10
5 - 9          16                    26 (Fi-1)
10 - 14        24                    50
15 - 19        27                    77
20 - 24        13                    90
25 - 29        6                     96
Total:         96
1. First, we find the position of the median observation: n / 2 = 96 / 2 = 48. Since no cumulative frequency matches 48 exactly, we take the next higher one, which is 50, and highlight the entire row: its lower limit (Li) is 10 and its frequency (fi) is 24.
2. We calculate the amplitude of the interval or class:
a = 5 (there are 5 points of range from 10 to 14)
$$Me = 10 + \left(\frac{\frac{96}{2} - 26}{24}\right) \cdot 5 = 10 + \left(\frac{48 - 26}{24}\right) \cdot 5 = 10 + \frac{22}{24} \cdot 5 = 10 + (0.92) \cdot 5 = 10 + 4.6 = 14.6$$
5 Mode
The mode is the value that appears most frequently in a distribution. If two scores in a group appear with the same frequency and that frequency is the maximum, the distribution is bimodal. If there are three, it is trimodal; when there are more than three, we speak of a multimodal distribution; but when all the scores of a group have the same frequency, there is no mode.
8; 9; 9;10;10;10;11;12;13
1; 1; 2; 2; 3; 3; 3; 4; 4; 4; 5; 6
In this case, the most frequent values are 3 and 4, but since they are adjacent, their average is calculated, that is, (3 + 4) / 2 = 3.5; therefore the mode is 3.5.
If they were not adjacent, the distribution would be bimodal, with modes 3 and 4.
In this case, the most frequent values are 1 and 3; since they are not adjacent, no average is taken, and the distribution is bimodal, with modes 1 and 3.
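The rules above can be sketched in Python. The function names are hypothetical, and `mode_by_notes_convention` implements the convention used in these notes (averaging two modes when they are adjacent values), which is not what general-purpose statistics libraries do. The first two lists are the data shown above; the third is a hypothetical list whose most frequent values are 1 and 3.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or [] when every value is equally frequent."""
    counts = Counter(data)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

def mode_by_notes_convention(data):
    """Average two modes when they are adjacent values; otherwise keep them all."""
    found = modes(data)
    distinct = sorted(set(data))
    if len(found) == 2 and distinct.index(found[1]) - distinct.index(found[0]) == 1:
        return [(found[0] + found[1]) / 2]
    return found

print(mode_by_notes_convention([8, 9, 9, 10, 10, 10, 11, 12, 13]))       # [10]
print(mode_by_notes_convention([1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]))    # [3.5]
print(mode_by_notes_convention([1, 1, 1, 2, 3, 3, 3, 4]))                # [1, 3]
```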
Exercise 5:
Ages of students (Xi)    Absolute frequency (fi)
17                       5
18                       10
19                       20
20                       15
21                       26    <- highest frequency
22                       4
23                       10
24                       3
TOTAL                    93
The mode (Mo) is 21 years, since its frequency, 26, is the highest.
Since 19 and 20 are consecutive, the mode is the average of those values: (19 + 20) / 2 = 19.5
Exercise 7:
Ages of students (Xi)    Absolute frequency (fi)
17                       5
18                       7
19                       15
20                       14
21                       28    <- highest frequency
22                       28    <- highest frequency
23                       3
24                       2
TOTAL                    102
The mode (Mo) is 21.5 years: the highest frequency, 28, occurs for two consecutive values, 21 and 22 years, so their mean is taken: (21 + 22) / 2 = 21.5
When the two highest frequencies are equal but their values are not consecutive, the arithmetic mean is not calculated; both values are kept as modes, and the distribution is bimodal.
Exercise 9:
25 - 29    12
30 - 34    11
35 - 39    8
TOTAL      69
d1 = 28 - 10 = 18
d2 = 28 - 12 = 16
Li = 20
a = 5 (20 to 24 = 5)
$$Mo = 20 + \left(\frac{18}{18 + 16}\right) \cdot 5 = 20 + \left(\frac{18}{34}\right) \cdot 5 \approx 22.6 \text{ years}$$
Exercise 11:
Age groups (Xi)    Absolute frequency (fi)
15 - 20            12
20 - 25            13
25 - 30            26    <- modal class, since its frequency is the highest
30 - 35            14
35 - 40            6
TOTAL              71
d1 = 26 - 13 = 13
d2 = 26 - 14 = 12
Li = 25
a = 5 (25 to 30 = 5)
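Formula:
$$Mo = L_i + \left(\frac{d_1}{d_1 + d_2}\right) \cdot a$$
where Li is the lower limit of the modal class, d1 is the difference between the frequency of the modal class and that of the class before it, d2 is the difference between the frequency of the modal class and that of the class after it, and a is the class amplitude.
$$Mo = 25 + \left(\frac{13}{13 + 12}\right) \cdot 5 = 25 + \left(\frac{13}{25}\right) \cdot 5 = 27.6 \text{ years}$$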
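The grouped-data mode formula can be sketched as a short Python function. The function and parameter names are illustrative; the two calls use the values of Exercise 11 (Li = 25, d1 = 13, d2 = 12, a = 5) and Exercise 9 (Li = 20, d1 = 18, d2 = 16, a = 5).

```python
def grouped_mode(lower_limit, d1, d2, amplitude):
    """Mo = Li + (d1 / (d1 + d2)) * a"""
    return lower_limit + d1 / (d1 + d2) * amplitude

# Exercise 11: modal class 25-30 with frequency 26, neighbors 13 and 14
print(round(grouped_mode(25, 26 - 13, 26 - 14, 5), 1))  # 27.6

# Exercise 9: modal class starting at 20 with frequency 28, neighbors 10 and 12
print(round(grouped_mode(20, 28 - 10, 28 - 12, 5), 1))  # 22.6
```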
6 A.2.2 QUANTILES
Quartiles, Deciles, and Percentiles
6.1 Introduction
So far, we have studied the measures of central tendency (mean, median and mode), which give a central value (and only a central one) that represents the data set, regardless of what happens with the rest of the values.
For example:
Two groups of 10 patients each go for a cardiological check-up, and the following resting heart rates are recorded:
Group A: 62 63 64 65 70 70 75 76 77 78
Group B: 50 54 64 69 70 70 71 76 86 90
Knowing that the normal resting heart rate is between 60 and 80 beats per minute, and analyzing both groups, we find that both have a mean, median and mode of 70 beats per minute. If we looked only at these measures of central tendency, we could mistakenly conclude that both groups of patients are equal, that all have normal heart rates, and that no patient stands out with a probable pathology. (Incorrect diagnosis!)
However, by looking not only at the measures of central tendency but at all the data, patient by patient, we find that in group B there are 4 patients with a probable cardiac disorder, 2 below 60 (50 and 54) and 2 above 80 (86 and 90) beats per minute, who need to be studied to find the cause of these alterations.
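The comparison of the two groups can be reproduced with Python's standard `statistics` module. This is a sketch with illustrative names; the normal range of 60 to 80 beats per minute is the one stated in the text.

```python
from statistics import mean, median, mode

group_a = [62, 63, 64, 65, 70, 70, 75, 76, 77, 78]
group_b = [50, 54, 64, 69, 70, 70, 71, 76, 86, 90]

def summarize(rates, low=60, high=80):
    """Central tendency plus the individual values outside the normal range."""
    outside = [r for r in rates if r < low or r > high]
    return mean(rates), median(rates), mode(rates), outside

summary_a = summarize(group_a)
summary_b = summarize(group_b)
print("Group A:", summary_a)
print("Group B:", summary_b)
```

Both groups share the same mean, median and mode, but only the list of out-of-range values reveals the four patients in group B who need further study.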
With the measures of position (quartiles, deciles, and percentiles) we can make cuts (3, 9, or 99 cuts, to obtain 4, 10, or 100 equal parts) at different places in the chain of data values ordered from lowest to highest, observe the value at each cut, almost patient by patient or by subgroups of patients, and diagnose what happens with each of them rather than relying on a single measure of central tendency that represents everyone.
Therefore, the measures of position (quartiles, deciles, and percentiles) allow a detailed study of the values at different positions in the data chain, giving a diagnosis that is not general but specific to each patient and/or subgroup of patients. (An important analysis tool that keeps us from losing sight of what happens with each patient!)
With a series of data arranged from smallest to largest, we can divide it into 4 equal parts, into 10 equal parts or into 100 equal parts and know exactly what value and position corresponds to each cut.
When we divide into 4 equal parts, the cuts are called quartiles; when we divide into 10 equal parts, they are called deciles; and when we divide into 100 equal parts, they are called percentiles.
The quartiles are represented by the symbol 'Q', the deciles by the symbol 'D' and the percentiles by the symbol 'P'.
To obtain 4 equal parts, we use 3 cuts, called Q1, Q2 and Q3
To obtain 10 equal parts, we use 9 cuts, called D1, D2, ... and D9
To obtain 100 equal parts, we use 99 cuts, called P1, P2, ... and P99
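[Figure: the 12 ordered data values 32, 35, 37, 39, 44, 48, 53, 55, 57, 59, 70, 74, with the quartile cuts Q1, Q2, Q3 (4 sectors) and the decile cuts D1 to D9 (10 sectors). Q1 = P25 = 38; Q2 = D5 = P50 = Me = 50.5; Q3 = P75 = 58.]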
6.2 Quartiles:
With 3 cuts, the fractions are equal fourths of the total data.
Since our example has 12 data points, to divide them into 4 equal parts each sector must have 3 data points (4 × 3 = 12). To leave 4 equal parts, the first quartile cut falls between the third and fourth data points, the second between the sixth and seventh, and the third between the ninth and tenth. In this way:
Q1 represents the first cut, called the first quartile, leaving 25% of the values below and 75% of the values above the cut. In our example, the cut falls exactly between the values 37 and 39; to find the value of Q1 we take their average, (37 + 39) / 2 = 38. Therefore, the first quartile (Q1) is equal to 38 years, which coincides with P25.
Q2 represents the second cut, called the second quartile, leaving 50% of the values below and 50% above the cut. In our example, the cut falls exactly between the values 48 and 53; to find the value of Q2 we take their average, (48 + 53) / 2 = 50.5. Therefore, the second quartile (Q2) is equal to 50.5 years. Q2 coincides with the median, 50.5.
Q3 represents the third cut, called the third quartile, leaving 75% of the values below and 25% of the values above the cut. In our example, the cut falls exactly between the values 57 and 59; to find the value of Q3 we take their average, (57 + 59) / 2 = 58. Therefore, the third quartile (Q3) is equal to 58 years, which coincides with P75.
6.3 Deciles:
Since our example has 12 data points, to divide them into 10 equal parts each sector should have 1.2 data points (1.2 × 10 = 12).
Intuitively, we can obtain the fifth decile (D5), which corresponds to the middle of the chain of values, since it leaves 50% of the values below and 50% above the cut. In our example, the cut falls exactly between the values 48 and 53; to find the value of D5 we take their average, (48 + 53) / 2 = 50.5. Therefore, the fifth decile (D5) is equal to 50.5 years. D5 coincides with Q2, with P50 and with the median.
The cuts for the other deciles would be very complicated to obtain this way; therefore we must use formulas, which we will apply later, to know exactly what value corresponds to each cut.
6.4 Percentiles
With 99 cuts, the fractions are hundredths of the total. The percentiles are the 99 values that divide the data series into 100 equal parts.
Since our example has 12 data points, to divide them into 100 equal parts each sector should have 0.12 data points (0.12 × 100 = 12).
Intuitively, as we did with the quartiles, we can obtain the percentiles 25, 50, and 75, which correspond to the 1st, 2nd, and 3rd quartiles. The 25th percentile lies between the third and fourth data points, the 50th percentile between the sixth and seventh, and the 75th percentile between the ninth and tenth. In this way:
P25 represents the 25th cut, leaving 25% of the values below and 75% of the values above the cut. In our example, the cut falls exactly between the values 37 and 39; to find the value of P25 we take their average, (37 + 39) / 2 = 38. Therefore, the 25th percentile (P25) is equal to 38 years.
P50 represents the 50th cut, leaving 50% of the values below and 50% above the cut. In our example, the cut falls exactly between the values 48 and 53; to find the value of P50 we take their average, (48 + 53) / 2 = 50.5. Therefore, the 50th percentile (P50) is equal to 50.5 years. P50 coincides with the median, 50.5.
P75 represents the 75th cut, leaving 75% of the values below and 25% of the values above the cut. In our example, the cut falls exactly between the values 57 and 59; to find the value of P75 we take their average, (57 + 59) / 2 = 58. Therefore, the 75th percentile (P75) is equal to 58 years.
As with the deciles, the cuts for the other percentiles would be very complicated to obtain this way; therefore we must use formulas, which we will apply next, to know exactly what value corresponds to each cut.
As we observed in the previous example, the use of quartiles, deciles, and percentiles is
very useful for diagnosis in medicine. All measurable parameters in the medical
sciences, according to specialty, have curves distributed by percentiles.
CJ = Quantile to be extracted

Data: 5 - 8 - 10 - 12 - 14 - 16 - 18 - 20 - 25 - 30 - 35 (n = 11)

\[ Q_1 = \frac{J(n + 1)}{C} = \frac{1(11 + 1)}{4} = \frac{1(12)}{4} = \frac{12}{4} = 3 \]

Since the result is a whole number, Q1 is the 3rd observation: Q1 = 10.
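Data: 2 - 3 - 7 - 15 - 24 - 30 (n = 6)

\[ Q_1 = \frac{J(n + 1)}{C} = \frac{1(6 + 1)}{4} = 1.75 \]

1.75 rounded down to the nearest whole number is 1, so i = 1 (1st place of the data) and Xi = 2.
The next value is 3, so the difference between the two values is 3 - 2 = 1.

\[ Q_1 = 2 + 0.75 \times 1 \]
\[ Q_1 = 2.75 \]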
Data: 2 - 3 - 7 - 15 - 24 - 30 (n = 6)

First step: Locate the position

\[ Q_2 = \frac{J(n + 1)}{C} = \frac{2(6 + 1)}{4} = 3.5 \]

3.5 rounded down to the nearest whole number is 3, so i = 3 (3rd place of the data) and Xi = 7.

Second step: Apply the complete formula

\[ C_J = X_i + \left( \frac{J(n + 1)}{C} - i \right) (X_{i+1} - X_i) \]

Xi+1 = 3rd place + 1 place = 4th place = 15

\[ Q_2 = 7 + (3.5 - 3)(15 - 7) \]
\[ Q_2 = 7 + 0.5 \times 8 \]
\[ Q_2 = 7 + 4 = 11 \]
\[ Q_2 = 11 \]
Take Q3 from the following data:
2 - 3 - 7 - 15 - 24 - 30
\[ Q_3 = 24 + 0.25 \times 6 \]
\[ Q_3 = 25.5 \]
Take D7 from the following data:
2 - 3 - 7 - 15 - 24 - 30
\[ D_7 = 15 + 0.9 \times 9 \]
\[ D_7 = 23.1 \]
Take P80 from the following data:
2 - 3 - 7 - 15 - 24 - 30
\[ P_{80} = 24 + 0.6 \times 6 \]
\[ P_{80} = 27.6 \]
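The two-step procedure used in these exercises (locate the position J(n + 1)/C, round it down, then interpolate between neighbouring values) can be sketched in Python. This is only an illustrative sketch: the function name `quantile` and the handling of positions that fall outside the series are assumptions, not part of the original notes. J is taken as the order of the quantile and C as the number of parts (4 for quartiles, 10 for deciles, 100 for percentiles).

```python
import math

def quantile(data, j, c):
    """J-th quantile of ungrouped data when the series is cut into C parts."""
    xs = sorted(data)
    n = len(xs)
    pos = j * (n + 1) / c              # first step: position J(n + 1) / C
    if pos <= 1:                       # assumption: clamp to the first value
        return xs[0]
    if pos >= n:                       # assumption: clamp to the last value
        return xs[-1]
    i = math.floor(pos)                # round down to the nearest whole place
    x_i, x_next = xs[i - 1], xs[i]     # places are 1-based in the notes
    # second step: CJ = Xi + (J(n + 1)/C - i) * (Xi+1 - Xi)
    return x_i + (pos - i) * (x_next - x_i)

ages = [2, 3, 7, 15, 24, 30]
print(quantile(ages, 1, 4))               # Q1 -> 2.75
print(quantile(ages, 2, 4))               # Q2 -> 11.0
print(quantile(ages, 3, 4))               # Q3 -> 25.5
print(round(quantile(ages, 7, 10), 2))    # D7
print(round(quantile(ages, 80, 100), 2))  # P80
```

When the position is already a whole number, the interpolation term is zero and the function simply returns the value at that place, as in the first worked example.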
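Xi          fi    Fi
30 - 34      3     3
35 - 39      8    11
40 - 44     11    22
45 - 49      9    31   (Fi-1 = 31)
50 - 54      4    35   (Li = 50, fi = 4)
n = 35

\[ C_J = L_i + \frac{J(n / C) - F_{i-1}}{f_i} \cdot a \]

For D9, J(n / C) = 9(35 / 10) = 31.5, which falls in the interval 50 - 54 (Fi = 35), and 31.5 - 31 = 0.5:

\[ D_9 = 50 + \frac{0.5}{4} \cdot 5 \]
\[ D_9 = 50 + 0.125 \cdot 5 \]
\[ D_9 = 50 + 0.625 \]
\[ D_9 = 50.625 \]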
Xi          fi    Fi
30 - 34      3     3
35 - 39      8    11
40 - 44     11    22   (Fi-1 = 22)
45 - 49      9    31   (Li = 45, fi = 9)
50 - 54      4    35
n = 35

\[ C_J = L_i + \frac{J(n / C) - F_{i-1}}{f_i} \cdot a \]

First step: Use the following part of the presented formula

\[ P_{76} = J \frac{n}{C} = 76 \cdot \frac{35}{100} = 26.6 \]

We mark the 'Fi' immediately above 26.6, which is 31.

Second step: Apply the complete formula

\[ P_{76} = 45 + \frac{26.6 - 22}{9} \cdot 5 \]
\[ P_{76} = 45 + \frac{4.6}{9} \cdot 5 \]
\[ P_{76} = 45 + 0.51 \cdot 5 \]
\[ P_{76} = 45 + 2.55 \]
\[ P_{76} = 47.55 \]

(Without rounding the intermediate quotient, the result is 47.56.)
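The grouped-data formula above can be checked with a short Python sketch. The function name `grouped_quantile` and its argument layout are hypothetical; the sketch assumes equal class widths and takes Li as the lower class limit exactly as written in the table (30, 35, 40, ...), with amplitude a = 5.

```python
def grouped_quantile(lower_limits, freqs, width, j, c):
    """CJ = Li + (J(n/C) - Fi-1) / fi * a for data grouped in class intervals."""
    n = sum(freqs)
    target = j * n / c                  # first step: J(n / C)
    cum_prev = 0                        # Fi-1, cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cum_prev + f >= target:      # first Fi that reaches the target
            return lower + (target - cum_prev) / f * width
        cum_prev += f
    raise ValueError("quantile position exceeds the total frequency")

lower_limits = [30, 35, 40, 45, 50]
freqs = [3, 8, 11, 9, 4]

print(grouped_quantile(lower_limits, freqs, 5, 9, 10))              # D9 -> 50.625
print(round(grouped_quantile(lower_limits, freqs, 5, 76, 100), 2))  # P76 -> 47.56
```

The small difference from the hand-calculated 47.55 comes only from rounding 4.6 / 9 to 0.51 before multiplying.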
The degree to which numerical data tend to spread around some average value
is called variation or dispersion. A measure of dispersion is important from two
points of view:
a) It can be used to show the degree of variation between the values of the
observed data; thus, a small dispersion in the grades of a group of
students indicates that they are approximately equal in their performance; on the other
hand, a greater dispersion implies that the students are very
unequal in their performance.
b) It can be used to complement an average, in order
to describe a data set or to compare one series of information with
another. When the dispersion is low, the average value becomes highly
significant; on the other hand, if the dispersion is high, the mean (or the measure of central
tendency) becomes barely representative, or not representative at all.
The measures of dispersion are:
The range
The mean deviation
The variance
The standard deviation
One of the simplest measures of dispersion is the range, also called the total
amplitude: it is the difference between the maximum and minimum values of the data
set. For example, suppose there are two groups of 7 children, A and B,
and that both have an average age of 6 years; if we only have this information,
we could say that there is no difference between the two groups. But if we are given
the additional information on the extreme ages (group A ranges from 2 to
10 years old and group B from 5 to 7 years old), it is clear that, although
both groups have the same mean, they are very different in the variability of their
ages. Let's see the following:
Group A: 10 - 2 = 8 years of range
Group B: 7 - 5 = 2 years of range
[Figure: dot plots of the ages of groups A and B on a scale from 1 to 10 years]
This observation indicates that in group A the ages of the children are
distributed between 2 and 10 years, and in group B between 5 and 7 years.
However, this measure only considers the extreme data, which is why
it does not tell us how the data are distributed as a whole
(intermediate data).
To calculate the range, the following formula is used:
\[ R = X_{max} - X_{min} \]
Exercise 1:
a) 4, 5, 5, 6, 7   Range = 7 - 4 = 3
Another measure of dispersion is the mean deviation, which includes all the data in the
calculation: it is the average of the deviations (or differences) from some central
value, such as the mean, the median, or the mode. When the mean is taken as the central
value, the mean deviation is obtained, that is, the arithmetic mean of the deviations
around the mean. If the median is taken as the central value, the median deviation
is obtained, and so on.
\[ DM = \frac{\sum |X_i - \bar{X}|}{n} \]

Steps and procedure, for the data 4, 4, 5, 7:

\[ \bar{X} = \frac{4 + 4 + 5 + 7}{4} = \frac{20}{4} = 5 \]

\[ DM = \frac{1 + 1 + 0 + 2}{4} = \frac{4}{4} = 1 \]
For the calculation of the mean deviation in grouped data, the following
formula is used:

\[ DM = \frac{\sum |X_i - \bar{X}| f_i}{n} \]

Exercise 3: Calculate the Mean Deviation from the following grouped data.
The value of the arithmetic mean, applying the learned procedures, is 5.36.

Steps and procedures:

1st The absolute value (without sign) of each difference between the values that the
variable takes and its arithmetic mean (5.36) is determined.

Grade (Xi)   Freq. (fi)   Xi - X              |Xi - X|
3            1            3 - 5.36 = -2.36    2.36
4            5            4 - 5.36 = -1.36    1.36
5            8            5 - 5.36 = -0.36    0.36
6            6            6 - 5.36 =  0.64    0.64
7            5            7 - 5.36 =  1.64    1.64

2nd According to the formula, the absolute values of the differences are multiplied by the
absolute frequencies, and the partial products are added.

Grade (Xi)   Freq. (fi)   |Xi - X|   |Xi - X| fi
3            1            2.36       2.36 x 1 = 2.36
4            5            1.36       1.36 x 5 = 6.80
5            8            0.36       0.36 x 8 = 2.88
6            6            0.64       0.64 x 6 = 3.84
7            5            1.64       1.64 x 5 = 8.20
             n = 25                  Σ = 24.08

3rd To obtain the final result, the previous sum is divided by the total number of cases:

\[ DM = \frac{24.08}{25} = 0.96 \]
Applying the formula (omitting some steps that are taken as understood):

Grade (Xi)   Freq. (fi)   |Xi - X|   |Xi - X| fi
3            1            2.36       2.36
4            5            1.36       6.80
5            8            0.36       2.88
6            6            0.64       3.84
7            5            1.64       8.20
Σ            n = 25                  24.08

\[ DM = \frac{\sum |X_i - \bar{X}| f_i}{n} = \frac{24.08}{25} = 0.96 \]
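The three steps of Exercise 3 can be reproduced with a short Python sketch. The function name `mean_deviation_grouped` is hypothetical, chosen only for this illustration.

```python
def mean_deviation_grouped(values, freqs):
    """DM = sum(|Xi - mean| * fi) / n for grouped (value, frequency) data."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # absolute deviations weighted by their absolute frequencies
    total = sum(abs(x - mean) * f for x, f in zip(values, freqs))
    return total / n

grades = [3, 4, 5, 6, 7]
freqs = [1, 5, 8, 6, 5]

print(round(mean_deviation_grouped(grades, freqs), 2))  # -> 0.96
```

For data grouped in class intervals the same function can be used by passing the class midpoints as `values`.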
For the calculation of the mean deviation of data grouped in class intervals,
the same steps are followed as in the previous case, taking care to
determine beforehand the midpoint of each interval, which replaces Xi in the
formula.
Exercise 4: Calculate the Mean Deviation of the following data, where the mean is
4.78 according to the learned procedure.

\[ DM = \frac{\sum |X_i - \bar{X}| f_i}{n} = \frac{27.28}{25} = 1.09 \]
When all elements of the population are taken, the symbols σ² and σ are used
to indicate the population variance and standard deviation; on the other hand, if the
data come from a sample, S² and S are used to indicate the sample variance and
sample standard deviation, respectively.
For the calculation of these measures, the following formulas are used:
For the calculation of the variance and standard deviation of grouped data, the
following formulas are used:

Variance formula:

\[ \sigma^2 = \frac{\sum (X_i - \bar{X})^2 f_i}{n} \]

Standard deviation formula:

\[ \sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2 f_i}{n}} \]
Taking into account that the arithmetic mean, according to the learned procedure,
is 5.36.

Steps and procedure:

3rd The previous results (the squared differences) are multiplied by the absolute
frequencies that correspond to them, and the products are added.

Grade (Xi)   Freq. (fi)   (Xi - X)²   (Xi - X)² fi
3            1            5.57        5.57 x 1 = 5.57
4            5            1.84        1.84 x 5 = 9.20
5            8            0.13        0.13 x 8 = 1.04
6            6            0.41        0.41 x 6 = 2.46
7            5            2.69        2.69 x 5 = 13.45
Σ            25                       31.72

4th The previous result is divided by the total number of cases; this result is the
variance:

\[ \sigma^2 = \frac{31.72}{25} = 1.27 \]
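As a cross-check of the table above, here is a minimal Python sketch of the grouped variance and standard deviation, dividing by n as the notes do. The function name `variance_grouped` is hypothetical.

```python
import math

def variance_grouped(values, freqs):
    """Variance = sum((Xi - mean)^2 * fi) / n for grouped data (divides by n)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n

grades = [3, 4, 5, 6, 7]
freqs = [1, 5, 8, 6, 5]

variance = variance_grouped(grades, freqs)
print(round(variance, 2))             # -> 1.27
print(round(math.sqrt(variance), 2))  # standard deviation -> 1.13
```

Working without intermediate rounding gives a sum of products of 31.76 instead of 31.72, but the variance still rounds to 1.27.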
For the calculation of the variance and standard deviation of data grouped in class
intervals, the same steps are followed as in the previous case, except that the midpoint (X') of
each interval must be determined first.

\[ \sigma = \sqrt{\frac{\sum (X' - \bar{X})^2 f_i}{n}} = \sqrt{5.16} = 2.27 \quad \text{(Standard Deviation)} \]
Another commonly used measure is the coefficient of variation (CV). It is a measure
of the relative dispersion of the data and is calculated by dividing the sample standard
deviation by the mean and multiplying the quotient by 100. Its usefulness lies in that it
allows us to compare the dispersion or variability of two or more groups. The coefficient
of variation is used to compare the homogeneity of two data series, even
when they are expressed in different units of measurement.
Thus, for example, suppose we have the weight of 5 patients (70, 60, 56, 83, and 79 kg), whose mean
is 69.6 kg and whose standard deviation (S) is 10.44 kg, and the height of the same patients (150,
170, 135, 180 and 195 cm), with a mean of 166 cm and a standard deviation of 21.3
cm. The question would be: which distribution is more dispersed, weight or height? If
we compare the standard deviations, we observe that the standard deviation of
height is much greater; however, we cannot directly compare two variables that have
different measurement scales, so we calculate the coefficients of variation:
CV (weight) = (10.44 / 69.6) x 100 = 15.0%
CV (height) = (21.3 / 166) x 100 = 12.8%
In relative terms, therefore, weight is the more dispersed variable.
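The comparison of weight and height can be reproduced with Python's standard `statistics` module. This is a sketch under one assumption: the standard deviations quoted in the text (10.44 kg and 21.3 cm) correspond to dividing by n, so `statistics.pstdev` is used rather than `statistics.stdev`. The helper name `coefficient_of_variation` is hypothetical.

```python
import statistics

def coefficient_of_variation(data):
    """CV = (standard deviation / mean) * 100, with the deviation divided by n."""
    return statistics.pstdev(data) / statistics.mean(data) * 100

weights_kg = [70, 60, 56, 83, 79]
heights_cm = [150, 170, 135, 180, 195]

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)

print(round(cv_weight, 1))  # -> 15.0
print(round(cv_height, 1))  # -> 12.8
```

Although height has the larger standard deviation in absolute terms, its coefficient of variation is smaller, so relative to its mean it is the less dispersed of the two variables.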
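Summary of formulas (simple data; grouped data; data grouped in class intervals):

Mean deviation:
\[ DM = \frac{\sum |X_i - \bar{X}|}{n} \qquad DM = \frac{\sum |X_i - \bar{X}| f_i}{n} \qquad DM = \frac{\sum |X' - \bar{X}| f_i}{n} \]

Variance:
\[ S^2 = \frac{\sum (X_i - \bar{X})^2}{n} \qquad S^2 = \frac{\sum (X_i - \bar{X})^2 f_i}{n} \qquad S^2 = \frac{\sum (X' - \bar{X})^2 f_i}{n} \]

Standard deviation:
\[ S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}} \qquad S = \sqrt{\frac{\sum (X_i - \bar{X})^2 f_i}{n}} \qquad S = \sqrt{\frac{\sum (X' - \bar{X})^2 f_i}{n}} \]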
8 Coefficient of skewness and kurtosis
Where (g1) represents Fisher's skewness coefficient, (Xi) each of the values,
(X̄) the sample mean and (ni) the frequency of each value. The results of this
equation are interpreted as follows:
a) (g1 = 0): The distribution is accepted as symmetrical, that is, there is
approximately the same number of values on both sides of the mean. This exact
value is difficult to achieve, which is why values close to zero, whether
positive or negative (within ± 0.5), are usually accepted.
b) (g1 > 0): The curve is positively skewed, so the values tend to
gather more on the left side of the mean than on the right side.
c) (g1 < 0): The curve is negatively skewed, so the values tend to
gather more on the right side of the mean.
Naturally, the larger the number (positive or negative), the greater the distance
that separates the clustering of values from the mean.
8.2 Kurtosis
This measure determines the degree of concentration of the values in the central
region of the distribution. Through the coefficient of kurtosis, we can identify whether
there is a high concentration of values (leptokurtic), a normal concentration
(mesokurtic) or a low concentration (platykurtic).
Where (g2) represents the kurtosis coefficient, (Xi) each of the values, (X̄) the
sample mean and (ni) the frequency of each value. The results of this formula are
interpreted as follows:
When the data distribution has a skewness coefficient within ±0.5 (g1 = ±0.5) and
a kurtosis coefficient within ±0.5 (g2 = ±0.5), it is referred to as a Normal Curve. This
criterion is of utmost importance, since most inferential statistical procedures
require that the data be normally distributed.
The main advantage of the normal distribution lies in the assumption that about 95% of the
values lie within two standard deviations of the arithmetic
mean; that is, if we take the mean, add two standard deviations to it and
then subtract two standard deviations from it, 95% of the cases will be found
within the range bounded by these two values.
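The coefficients g1 and g2 discussed above can be sketched in Python. An important caveat: the book's own formulas are not reproduced in the text, so this sketch assumes the usual moment-based definitions, g1 = m3 / S³ and g2 = m4 / S⁴ - 3, where mk is the k-th central moment divided by n and S is the standard deviation divided by n. The function name `fisher_shape` is hypothetical.

```python
def fisher_shape(values, freqs):
    """Return (g1, g2): assumed moment-based skewness and excess kurtosis."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n

    def central_moment(k):
        # k-th central moment, weighting each value by its frequency ni
        return sum(f * (x - mean) ** k for x, f in zip(values, freqs)) / n

    s = central_moment(2) ** 0.5           # standard deviation (divides by n)
    g1 = central_moment(3) / s ** 3        # skewness
    g2 = central_moment(4) / s ** 4 - 3    # kurtosis relative to the normal curve
    return g1, g2

# Grades and frequencies from the earlier grouped-data exercises
g1, g2 = fisher_shape([3, 4, 5, 6, 7], [1, 5, 8, 6, 5])
looks_normal = abs(g1) <= 0.5 and abs(g2) <= 0.5   # the ±0.5 criterion above
print(round(g1, 3), round(g2, 3), looks_normal)
```

Subtracting 3 in g2 makes a normal curve score zero, which is what allows the same ±0.5 tolerance to be applied to both coefficients.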
B. INFERENTIAL BIOSTATISTICS
Inferential statistics results from applying probability to the results that we already
know from descriptive statistics. The results of that application will therefore be
expressed in probabilistic language.
The result may seem strange, diffuse yet precise; based on the results we obtain
with inferential statistics we can, for example, state that: "There is a statistically
significant association between the Municipal Health Index (ISM) and Maternal Mortality
(p < 0.001, that is, with a confidence above 99.9%). The municipalities with a very low
Municipal Health Index have a Maternal Mortality Rate 5.79 times higher (95% CI: 5.59 to
5.99) than the municipalities with an average ISM."
The claims that inferential statistics allows us to make carry a risk, and whoever
uses them must know this. It is not difficult, in any case, because all these statements
are formulated in terms of risk, certainty, and uncertainty: that is, of probability.
The two types of problems that statistical techniques solve are "estimation" and
"hypothesis testing". In both cases, the aim is to generalize the information obtained
in a sample to a population. These techniques require that the sample be as random as
possible.
9 B.1 Sampling
9.1 Introduction
A fundamental aspect in the design of clinical studies is the determination of an
appropriate sample size. If the sample size is very small, the study will have low
statistical power; consequently, the estimates will be less precise and the
probability of finding significant differences between treatments or groups will be
smaller. On the other hand, if the sample size is very large, research resources will be
misused and more patients will be subjected to tests than is
strictly necessary.
But first, it is important to study the terminology and the concepts that we will use in
these two chapters:
9.2 Individual:
9.3 Population:
The population, depending on whether its total number is known or unknown, is classified as
finite, if the population number is known, or infinite, if the population number is
unknown.
9.4 Sample:
a) Speed
b) Cost
c) Feasibility
d) Accuracy
Regarding the first three reasons, it is obvious that studying a hundred people is faster
and cheaper than studying a thousand or more, and more feasible in terms
of human and physical resources and logistical support. As for accuracy,
with a smaller workload it is possible to employ better qualified personnel,
who guarantee a more precise measurement of the phenomenon of interest, and to
supervise the work better, producing more accurate results.
[Figure: population of 38 individuals, numbered 1 to 38]
The members of the sample cannot be selected at will; whenever possible, they must
be chosen at random.
To choose and know who the selected ones are, there are 2 types of sampling that
can be used: probabilistic and non-probabilistic.
Probabilistic sampling: all individuals of the population are in equal conditions and have
the same chance of being part of the sample.
Non-probabilistic sampling: the individuals of the population are incorporated according to
the personal, subjective criteria of the researcher.
Probabilistic:
1. Simple random
2. Systematic random
3. Stratified sampling
4. Cluster sampling
5. Single-stage sampling
6. Multi-stage sampling
Non-probabilistic:
1. Accidental sampling
2. Purposive or convenience sampling
3. Sample of volunteers
If possible, it is better to use probabilistic methods because statistically they have better
support and reliability, whereas non-probabilistic methods tend to introduce unwanted
biases, which can confound the results.
A. Probabilistic or Random
a) Urn method:
A simple, although not very practical, way of obtaining a random sample is the "urn"
technique. It consists of placing in an urn tokens with the names or numbers of each
element of the population and then, after mixing them properly, extracting as many
elements as the sample size that has been decided. Because of this careful mixing before
each extraction, each element has the same chance of being selected.
b) Use of the random digit table:
If the population is 38 and we must choose 3 people, we start at number 12 and continue to the
right; taking the next two numbers and skipping those greater than 38, we get 27 and 5;
therefore, the persons marked with the numbers 12, 27, and 5 are chosen.
[Figure: population of 38 individuals, numbered 1 to 38]
It is also possible to use a computer program, such as STATS(tm) v.2 or others, where
it is necessary to enter the sample size, the lower limit number (which in our
previous example is 1) and the upper limit number (which in our example is 38).
Result: 35, 10, 30
Entering the data into the computer, we observe that the selected members of the sample
are the people numbered 35, 10 and 30, who will be the subjects of the
research study.
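The same kind of selection can be sketched with Python's standard `random` module in place of the program mentioned above. The function name `simple_random_sample` and the seed value are assumptions made only for this illustration; the numbers drawn depend on the seed and will generally not be 35, 10 and 30.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct people from those numbered 1..population_size."""
    rng = random.Random(seed)   # a fixed seed makes the draw reproducible
    # sampling without replacement: every person has the same chance
    return rng.sample(range(1, population_size + 1), sample_size)

chosen = simple_random_sample(population_size=38, sample_size=3, seed=2015)
print(chosen)
```

Like the urn method, `random.sample` draws without replacement, so no person can be selected twice.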
[Figure: population of 38 people, numbered 1 to 38; sample of 3 people: numbers 35, 10 and 30]
Example:
1st, 2nd, and 3rd year of the Faculty of Medicine.
According to the number, calculate percentages.
Types of sampling
B. Non-Probabilistic or Non-Random
10.1 Introduction
Every research study inherently involves determining, during the design phase, the
sample size necessary for its execution. Not carrying out this process
can lead us to two different situations: the first is that we carry out the study without the
appropriate number of patients, so we will not be able to estimate the
parameters precisely and we will not find significant differences when in reality
they do exist. The second situation is that we could study an unnecessary number of
patients, which implies not only a loss of time and an unnecessary increase in
resources, but also that the quality of the study, given this increase, may be
negatively affected.
A question that researchers are frequently asked is: what percentage of the
population makes a good sample? Unfortunately, there is no answer that is satisfactory
for all cases; the appropriate sample size is determined by various
factors, so the optimal size must be determined in each case, taking into
account the particularities of the study.
In statistics, the sample size is the number of subjects that make up the
sample extracted from a population, necessary for the data obtained to be
representative of that population.
The parameters taken into account for the calculation of sample size are:
a) Level of confidence
b) Proportion
c) Margin of error (absolute precision)
d) Value of Q
e) Population or study universe
a) Level of confidence
The confidence level is represented by the letter Z and measures, as its name indicates, the
degree of confidence in a result from a sample study, which allows us to
generalize it and to expect the same data in the rest of the population that
the sample represents. Therefore, logically, a study will have a
100% confidence level if the research is conducted on 100% of the population.

Confidence level    Z
90%                 1.65
91%                 1.695
92%                 1.751
93%                 1.812
94%                 1.881
95%                 1.96
96%                 2.054
97%                 2.170
98%                 2.326
99%                 2.576
b) Proportion
The proportion is represented by the letter 'P'. It is the percentage or proportion
of cases that we expect to find in our research, based on the percentage
or proportion of cases found in other studies in populations similar to the one where
we want to conduct our research study.
The literature review conducted for the "Theoretical Framework" of the research protocol
will provide us with information on the results or proportions found in different
parts of the world. Different results may exist. For example, if
we want to conduct a study on the prevalence of diabetes in the city of Sucre,
and we observe in the literature that one study in Mexico found 3% of
diabetics, one in Ecuador 2.5% and one in Tarija 1%, which of the 3 values do we adopt as
the value 'P' for determining our sample size? The one from
Tarija, of course, since it is the closest to the city of Sucre. It is also possible to carry out a
preliminary pilot study and obtain a more realistic approximation in the city of Sucre itself.
If we cannot determine this proportion or percentage in the population,
we set the proportion or percentage at 50% by default.
c) Margin of error
The margin of error is represented by the letter 'd'. We will have noticed that
the value of the proportion studied above may differ from one place to another,
and even within the same place across different studies; therefore, the
value adopted for our sample size calculation may
differ from the one we find in our future research. To
cushion these differences, as well as differences in the
reading and interpretation of results from the equipment used, or possible human
errors, the statistical method introduces the parameter 'margin of error', which
ranges between 1 and 5%.
The smaller our margin of error, the larger our sample size will be.
Conversely, the larger our margin of error, the smaller our sample size
will be.
d) Value of Q
It is the value obtained from the difference between 100 and the proportion
"P" that we adopted. For example, with P = 1:
Q = 100 - P = 100 - 1 = 99
Q = 99
This value of Q is only used when the sample size formula is applied
manually and not with a software package, since the
computer calculates it automatically.
e) Population or study universe
Depending on whether this population is infinite or finite, the sample size calculation
differs, using a different formula, as we see below:
\[ n = \frac{Z^2 \cdot P \cdot Q}{d^2} \]
n = Sample size
Z² = Level of confidence or security sought
P = Percentage or proportion of cases that are assumed to exist in the population that we
are interested in studying, according to previous studies in the same research site or in a
similar one. If it is not known, it is assumed to be 50%.
Q = Complement of the percentage or proportion to be studied. That is, Q = 100 - P
d = Desired precision or estimated tolerable margin of error
Exercise:
In a certain population, it is desired to estimate the % of women who use contraceptive methods.
What sample size is required to ensure, with a 95% confidence level,
that the estimation error does not exceed 3%? Previous studies indicate that
the percentage or proportion of women using contraceptive methods reaches 25%.
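The exercise above can be worked through with a short Python sketch. It assumes the usual formula for an unknown or infinite population, n = Z² · P · Q / d², with P, Q and d expressed in percent, and it rounds the result up to the next whole person. The function name `sample_size_infinite` and the rounding rule are assumptions, not taken from the notes.

```python
import math

def sample_size_infinite(z, p, d):
    """n = Z^2 * P * Q / d^2, with P and d in percent and Q = 100 - P."""
    q = 100 - p
    n = z ** 2 * p * q / d ** 2
    return math.ceil(n)   # assumption: round up so precision is not lost

# Exercise: 95% confidence (Z = 1.96), P = 25%, d = 3%
print(sample_size_infinite(z=1.96, p=25, d=3))  # -> 801
```

The unrounded value is about 800.3, so under these assumptions roughly 801 women would need to be surveyed.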
10.3.2 Calculation of sample size with known population and/or finite universe
The formula used to determine the sample size with a known population
is the following:
n = Sample size
N = Known population (number of inhabitants) of the place where the research
will take place.
Exercise:
In the locality of 'Rio Hondo', with 4500 inhabitants over the age of 35, it is proposed
to determine the blood glucose level of the population over 35 years old, in order to decide whether
it is necessary to establish a food education program.
There is a precedent for this measurement in a similar locality, which found a proportion
or percentage of hyperglycemia of 14%.
N = 4500
Z = 1.96
P = 14
Q = 100 - 14 = 86
d = 2
The sample size for this study is 920 people.
A higher level of confidence and a smaller margin of error require a larger sample size.
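The result of 920 people can be reproduced with a Python sketch. The formula is not reproduced in the text, so the sketch rests on an assumption: it first computes the infinite-population size n0 = Z² · P · Q / d² and then applies one common finite-population correction, n = n0 / (1 + n0 / N), rounding up. Other variants of the correction exist (for example with N - 1 in the denominator) and give almost the same value. The function name `sample_size_finite` is hypothetical.

```python
import math

def sample_size_finite(population, z, p, d):
    """Finite-population sample size (assumed correction: n0 / (1 + n0 / N))."""
    q = 100 - p
    n0 = z ** 2 * p * q / d ** 2     # size for an infinite population
    n = n0 / (1 + n0 / population)   # shrink it for a known population N
    return math.ceil(n)              # assumption: round up to a whole person

# Rio Hondo exercise: N = 4500, Z = 1.96, P = 14, d = 2
print(sample_size_finite(population=4500, z=1.96, p=14, d=2))  # -> 920
```

Raising Z or lowering d in the call shows the relationship stated above: more confidence or a smaller margin of error demands a larger sample.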
Immediately, a dialog with 3 windows opens: in the first one that appears, the pointer is
placed over 'Sampling'; a second window then appears, where the pointer is
placed on "Sample Size Calculation"; and then, in the third window, the
left mouse button is pressed on 'Proportion'.
Another window immediately appears asking for the data needed to make the
sample size calculation:
Each of the requested values is entered. In the previous example these are a population
size of 4500; an expected proportion of 14 (the program uses a
decimal point to indicate a fraction, therefore 14.000 means
14 units and 000 thousandths, which it adds automatically when the
percentage 14 is entered); then the confidence level is entered as a percentage, which
in our example is 95 (the program automatically adds the fraction ".0");
finally, the margin of error is entered, which the program calls
'Absolute Precision' and which in our example is 2% (the program automatically
adds the fraction ".000").
The phrase 'Design effect' that appears automatically is not taken into account.
As we can observe, with the same input parameters as those
used manually, the sample size obtained in a few seconds
is also 920 people. Therefore, we verify that the Epidemiological Analysis
Program achieves exactly the same result, saving a lot of time and leaving no
room for procedural errors.
To be reasonably sure that our study will detect the association, the
study must be sufficiently large so that the sampling error is
controlled.
In academic practice, to demonstrate and help the student of research methodology
understand the relationship between sample size and statistical
power, we use the 'Program for epidemiological analysis of tabulated data
version 3.0' (EPIDAT) to perform sample size calculations with
different parameters (confidence level, and margin of error or absolute precision) for
the same population or universe and expected proportion. We assume we want to carry out
a research study on the use of emergency services during the year
2011 in a neighborhood of the city of Sucre where there is a population of 680 people,
with the background that in another neighborhood of the city of Sucre it was determined that
10% of the population used this service:

Calculation 1              Calculation 2
N = 680                    N = 680
Z = 90                     Z = 99
d = 5                      d = 5
p = 10                     p = 10
Sample size n = 86         Sample size n = 177

Calculation 3              Calculation 4

However, if we increase the confidence level to 99% and lower the margin of error
to only 1%, as in "Calculation 4", we achieve maximum statistical power and, as
we can see, the sample size increases to 611 people.
11.1 Introduction
The graph of the density function of the normal distribution is bell-shaped and symmetrical
about a certain parameter. This curve is known as the bell curve.
[Figure: Distribution of triglycerides in students of the Medicine degree; horizontal axis: triglycerides (mg/dl)]
For example: we consider it normal for an adult to have a systolic blood pressure of
130 mm of mercury and abnormal to have a systolic pressure of 210 mm of mercury.
To establish the boundaries between what is normal and what is pathological, it is necessary to
know the distribution of the variable under study in normal individuals.
The graph of the normal distribution resembles a symmetrical bell. The mean, the median
and the mode of the distribution have the same value. The distribution is completely
defined by the mean and the standard deviation.
[Figure: normal curve showing the areas within 1, 2 and 3 standard deviations (S) on either side of the mean (X̄)]
Between X̄ - 1S and X̄ + 1S = 68.27%
Between X̄ - 2S and X̄ + 2S = 95.45%
Between X̄ - 3S and X̄ + 3S = 99.73%
1S, 2S and 3S mean adding to or subtracting from the mean (±) the value of the standard deviation
multiplied by 1, 2 or 3.
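These three percentages can be reproduced numerically. The following is a minimal sketch using the standard normal distribution from Python's standard library; the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

def area_within(k):
    """Area under the standard normal curve between -k and +k standard deviations."""
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k) * 100, 2))  # 68.27, 95.45, 99.73
```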
To calculate the area under the normal curve for a given value of the variable 'x',
normal distribution area tables have been constructed with the following
characteristics:
The total area under the normal curve is equal to 1 (which is equivalent to 100%)
3. In the table, the 1st column contains the integer and first decimal of Z; the second decimal is
found at the top (1st row).
4. The values of the 1st column and the 1st row represent the values of Z, while
the values in the body of the table represent the probabilities.
CALCULATION OF AREAS
To calculate the area under the normal curve
from a certain value of the variable 'x',
it is necessary to transform the original variable into
one whose mean and standard deviation take the values
on which the table is built (mean 0 and standard
deviation 1). This transformed variable is
called the standard normal variable and is symbolized
by 'Z', that is:
$$Z = \frac{X - \bar{X}}{S}$$
Where:
Z = Number of standard deviations from the mean
X = Some value of interest
X̄ = Arithmetic mean of the normal distribution
S = Standard deviation of the normal distribution.
Example: Suppose that, faced with a blood hematocrit
determination, we must decide whether the value
is normal or not. We accept that hematocrit follows a
normal distribution with a mean of 48% and a standard
deviation of 4%. Suppose that a value of 56% is
found in a patient. What is the probability of
this occurring in a healthy person?
X̄ = 48
S = 4
X = 56
$$Z = \frac{X - \bar{X}}{S} = \frac{56 - 48}{4} = \frac{8}{4} = 2.00$$
This means that a hematocrit of 56%
lies 2 standard deviations above the
mean.
In the normal distribution table, the area
found at the intersection of the row
for 2.0 in the first column and
the column for 0.00 in the first row
is 0.0228.
This means that, according to the
normal distribution model, the probability of finding
hematocrit values equal to or greater than 56% is
0.0228; multiplied by 100, this is
2.28%, which means that 2.28% of
healthy individuals are expected to have
hematocrit values equal to or greater than 56%.
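The table lookup can be cross-checked in code. This sketch standardizes the value and computes the upper-tail area directly; the function names are hypothetical, and the mean of 48 and standard deviation of 4 come from the example.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations between x and the mean."""
    return (x - mean) / sd

def upper_tail(x, mean, sd):
    """Probability of a value equal to or greater than x under a normal model."""
    return 1 - NormalDist(mean, sd).cdf(x)

print(z_score(56, 48, 4))               # 2.0
print(round(upper_tail(56, 48, 4), 4))  # 0.0228
```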
Similarly, the table allows other probabilities
to be calculated, such as that of finding
values within a certain interval of the variable "x",
for which it must be kept in mind that the
total area equals 1.
For example: we would like to know the probability
of finding hematocrit values between 45% and
50%. We find "Z" for both values:
X̄ = 48
S = 4
X = 45 and 50
$$Z_1 = \frac{45 - 48}{4} = \frac{-3}{4} = -0.75 \qquad P_1 = 0.2266$$
$$Z_2 = \frac{50 - 48}{4} = \frac{2}{4} = 0.50 \qquad P_2 = 0.3085$$
Adding the two tail areas P1 and P2 and subtracting their sum from the total area of 1,
we find the probability sought:
P1 + P2 = 0.2266 + 0.3085 = 0.5351
1 - 0.5351 = 0.4649
So the probability of finding hematocrit values between 45% and 50%
is 0.4649; in other words, 46.49% of
healthy individuals have a hematocrit between 45% and 50%.
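The same interval probability can be obtained without a table, as the difference between two cumulative areas. This is a sketch with a hypothetical helper name; because the table values are rounded to four decimals, the result may differ from the hand calculation in the last decimal.

```python
from statistics import NormalDist

def probability_between(low, high, mean, sd):
    """Area under the normal curve between two values of the variable."""
    dist = NormalDist(mean, sd)
    return dist.cdf(high) - dist.cdf(low)

p = probability_between(45, 50, 48, 4)
print(round(p, 3))  # 0.465
```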
12.1 Introduction
6 Possible Outcomes
1 2 3 4 5 6
1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 =1
0.5 + 0.5 = 1
50% chance of getting heads and 50% chance of getting tails
50 + 50 = 100%
Properties of probability
1st Experiment
2nd Experiment
[Figure: probability of heads (0.5) and probability of tails (0.5) plotted against the number of tosses, from 1 to 10,000]
Law of chance:
Properties of probability:
Laplace's rule
P(A) + P(Ā) = 1
Examples:
1. When flipping a coin, the probability of
getting heads is:
P(A) = 1/2 = 0.5
P(Ā) = 1/2 = 0.5
0.5 + 0.5 = 1
2. What is the probability of getting a 5 when rolling
a die?
P(A) = 1/6 = 0.17
P(Ā) = 5/6 = 0.83
0.17 + 0.83 = 1
3. In a group formed by 7 patients with arterial
hypertension and 3 with diabetes, 2 people are chosen at random.
What is the probability that any one person chosen has
diabetes?
n = 10 people
h = 3 with diabetes
d = 7 with hypertension
P(A) = 3/10 = 0.30 or 30%
P(Ā) = 7/10 = 0.70 or 70%
0.30 + 0.70 = 1
4. In a group made up of 3 tuberculosis patients and 9
healthy people, 4 people are chosen at random.
What is the probability that any one person chosen has
tuberculosis?
n = 12 people
h = 3 with tuberculosis
d = 9 healthy
P(A) = 3/12 = 0.25 or 25%
P(Ā) = 9/12 = 0.75 or 75%
0.25 + 0.75 = 1
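The examples above all divide favorable cases by possible cases and then take the complement. A minimal sketch of that arithmetic, using exact fractions, could look like this; the function names `laplace` and `complement` are hypothetical.

```python
from fractions import Fraction

def laplace(favorable, possible):
    """Probability as favorable cases divided by equally likely possible cases."""
    return Fraction(favorable, possible)

def complement(p):
    """Probability that the event does not occur."""
    return 1 - p

p_diabetes = laplace(3, 10)  # 3 people with diabetes in a group of 10
print(float(p_diabetes), float(complement(p_diabetes)))  # 0.3 0.7

p_five = laplace(1, 6)       # one face showing 5 out of six faces
print(round(float(p_five), 2), round(float(complement(p_five)), 2))  # 0.17 0.83
```

Working with `Fraction` keeps P(A) + P(Ā) exactly equal to 1, which rounded decimals only approximate.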
a) Determine whether the two variables are correlated, that is, whether the values of
one variable tend to be higher or lower for higher or lower values
of the other variable.
b) To be able to predict the value of one variable given a specified value of the other
variable.
c) Assess the level of agreement between the values of the two variables.
The strength of the linear relationship between two quantitative variables is
quantified by calculating the Pearson correlation coefficient. This coefficient
ranges between -1 and +1. A value of +1 indicates a perfect positive linear relationship
(a straight line) and a value of -1 a perfect negative one. A correlation close to zero
indicates that there is no linear relationship between the two variables.
Plotting the data is fundamental to show the relationship between the value
of the correlation coefficient and the shape of the graph, since
non-linear relationships exist.
The Pearson correlation coefficient (r) can be calculated for any data set;
however, the validity of the hypothesis test on the correlation between the variables
strictly requires: a) that the two variables come from a random sample
of individuals; b) that at least one of the variables has a normal distribution in the
population from which the sample is drawn. For the valid calculation of a confidence
interval for the correlation coefficient r, both variables must have a normal
distribution. If the data do not have a normal distribution, one or both variables can be
transformed (logarithmic transformation); otherwise, a non-parametric correlation
coefficient (Spearman's correlation coefficient) would be calculated, which has the same
interpretation as the Pearson correlation coefficient and is calculated using the ranks
of the observations.
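To illustrate the idea that Spearman's coefficient is Pearson's coefficient applied to ranks, here is a small sketch. The data are hypothetical (y is the cube of x, a monotonic but non-linear relationship), and tied values receive their average rank, which is one common convention.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5, 6]        # hypothetical data
y = [v ** 3 for v in x]       # monotonic, non-linear
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

In this hypothetical data set the ranks agree perfectly, so the rank-based coefficient reaches 1 while Pearson's r stays below 1.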
The calculation of the correlation coefficient (r) between the weight and height of 20 children is
shown in the attached table. The covariance, which in this example is based on the product of weight
(kg) by height (cm), is divided by the
standard deviation of X (height) and the standard deviation of Y (weight) so that it becomes a
dimensionless coefficient. This gives
the Pearson correlation coefficient, which in this case is 0.885 and indicates a
strong correlation between the two variables. A strong correlation
does not imply causation. If we square the correlation
coefficient we obtain the coefficient of determination (r² = 0.783), which tells us that
78.3% of the variability in weight is explained by the child's height. Therefore, there are
other variables that modify and explain the variability in the weight of these children.
Introducing more variables with multivariate analysis techniques will allow us to identify
the importance that other variables may have on weight.
Y Weight   X Height   X - X̄     Y - Ȳ    (X - X̄)(Y - Ȳ)
(kg)       (cm)
 9         72          5.65      1.4       7.91
10         76          9.65      2.4      23.16
 6         59         -7.35     -1.6      11.76
 8         68          1.65      0.4       0.66
10         60         -6.35      2.4     -15.24
 5         58         -8.35     -2.6      21.71
 8         70          3.65      0.4       1.46
 7         65         -1.35     -0.6       0.81
 4         54        -12.35     -3.6      44.46
11         83         16.65      3.4      56.61
 7         64         -2.35     -0.6       1.41
 7         66         -0.35     -0.6       0.21
 6         61         -5.35     -1.6       8.56
 8         66         -0.35      0.4      -0.14
 5         57         -9.35     -2.6      24.31
11         81         14.65      3.4      49.81
 5         59         -7.35     -2.6      19.11
 9         71          4.65      1.4       6.51
 6         62         -4.35     -1.6       6.96
10         75          8.65      2.4      20.76
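The coefficient can be recomputed from the 20 pairs in the table. This sketch uses the sum of products of deviations divided by the square root of the product of the sums of squared deviations, which is algebraically the same as dividing the covariance by the two standard deviations; the variable names are hypothetical.

```python
import math

# Weight (kg) and height (cm) of the 20 children in the table
weight = [9, 10, 6, 8, 10, 5, 8, 7, 4, 11, 7, 7, 6, 8, 5, 11, 5, 9, 6, 10]
height = [72, 76, 59, 68, 60, 58, 70, 65, 54, 83, 64, 66, 61, 66, 57, 81, 59, 71, 62, 75]

mean_w = sum(weight) / len(weight)   # 7.6 kg
mean_h = sum(height) / len(height)   # 66.35 cm

sum_products = sum((h - mean_h) * (w - mean_w) for h, w in zip(height, weight))
sum_sq_h = sum((h - mean_h) ** 2 for h in height)
sum_sq_w = sum((w - mean_w) ** 2 for w in weight)

r = sum_products / math.sqrt(sum_sq_h * sum_sq_w)
print(round(r, 3))  # 0.885
```

Squaring the unrounded r gives about 0.784; the 0.783 quoted in the text corresponds to squaring the rounded value 0.885.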
a non-linear relationship such as the weight of the newborn and the time of
gestation. In this case, r underestimates the association because it measures only its linear
component. Non-parametric methods are better suited in this case to show whether
the variables tend to rise together or move in opposite
directions.
[Figure: scatter plot against the Human Development Index (horizontal axis, 0 to 0.8); n = 28; r = -0.628; p < 0.01]
The higher the Human Development Index, the lower the Maternal Mortality Ratio
tends to be.
Pearson's correlation test (r = -0.628) confirmed, for
the population under study, the existence of a significant inverse relationship at the 0.01 level.
Kendall's Tau-b correlation coefficient: -0.484, significant at the 0.01 level.
Spearman's Rho correlation coefficient: -0.654, significant at the 0.01 level.
Suppose we want to study the possible association between a
pregnant woman smoking during pregnancy and her child being born with low birth weight.
The aim, therefore, is to see whether the probability of low birth weight differs between pregnant
women who smoke and those who do not smoke during pregnancy. To answer this question,
a follow-up study is carried out on a cohort of 2000 pregnant women, who are
asked about their smoking habits during pregnancy and whose newborns
are weighed. The results of this study are shown in Table 2.
Given a contingency table like the one above, different questions can be
posed. First, we will want to determine whether there is a statistically
significant relationship between the variables studied. Second, we will be interested in quantifying
that relationship and studying its clinical relevance.
This last question can be resolved through the so-called measures of association
or effect (relative risk (RR), odds ratio (OR), absolute risk reduction (ARR)),
which have already been addressed in other works. To answer the first
question, on the other hand, the methodology for analyzing contingency tables depends on several
aspects, such as the number of categories of the variables being compared, whether
the categories are ordered or not, the number of independent groups of
subjects being considered, and the question one wishes to answer.
This article presents the calculation and interpretation of the χ² test as the standard method
of analysis in the case of independent groups.
The chi-squared test allows us to determine whether or not two qualitative variables are associated. If at
the end of the study we conclude that the variables are not related, we can say, with
a certain previously established level of confidence, that they are independent.
To compute it, it is necessary to calculate the expected frequencies (those that should
have been observed if the independence hypothesis were true) and compare them with the
frequencies actually observed. In general, for an r x k table (r rows and k
columns), the value of the χ² statistic is calculated as follows:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (1)$$
where:
O_ij denotes the observed frequencies: the number of observed cases
classified in row i and column j.
E_ij denotes the expected frequencies: the number of cases expected
in row i and column j if the two variables were independent.
Thus, the χ² statistic measures the difference between the values that would be expected if the two
variables were independent and those actually observed. The greater
that difference (and, therefore, the value of the statistic), the stronger the relationship between
the two variables. The differences between observed and
expected values are squared so that every difference becomes positive.
The χ² test is therefore a non-directional (two-sided) test, which indicates whether or not
there is a relationship between two factors but not the direction of that association.
The expected values E are obtained as the product of the corresponding
marginal totals divided by the total number of cases (n). For the simplest case of
a 2x2 table like Table 1, we have:
For the example data in Table 2, the expected values would be calculated as follows:
The observed and expected values for the data of the proposed example are
shown in Table 3.
The value of the χ² statistic for this specific example is then
χ² = 40.04.
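The calculation of expected frequencies and of the χ² statistic can be sketched as follows. Table 2 is not reproduced in this text, so the four counts below are assumed values: they were chosen to be consistent with a cohort of 2000 women and with the χ² of 40.04 reported for this example, and should not be read as the original table. The function names are hypothetical.

```python
# ASSUMED counts (not taken from Table 2, which is not reproduced here).
# Rows: smokers, non-smokers. Columns: low birth weight, normal weight.
observed = [[43, 207],
            [105, 1645]]

def expected_frequencies(table):
    """Expected count per cell: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

def chi_squared(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

print(expected_frequencies(observed))   # [[18.5, 231.5], [129.5, 1620.5]]
print(round(chi_squared(observed), 2))  # 40.04
```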
H0: There is no association between the variables (in the example, the child's low birth weight and
smoking during pregnancy are independent; they are not associated).
Ha: There is an association between the variables, that is, low birth weight and smoking during
pregnancy are associated.
Under the null hypothesis of independence, the values of the χ² statistic are known to
follow a known distribution called chi-squared, which depends on a
parameter called degrees of freedom (d.f.). For a contingency table
of r rows and k columns, the degrees of freedom equal the product of the number of rows minus 1 (r-1) and
the number of columns minus 1 (k-1). Thus, when the relationship
between two dichotomous variables is studied (2x2 table), there is 1 degree of freedom.
If the null hypothesis is true, the value obtained should fall within the range of highest
probability of the corresponding chi-squared distribution. The p-value usually
reported by most statistical packages is simply the probability of obtaining,
according to that distribution, a value at least as extreme as the one given by the test, that is,
the probability of obtaining data like those observed, or more extreme, if the
independence hypothesis were true. If the p-value is very small (usually
p<0.05 is used), the data are unlikely under the null hypothesis, which should be rejected.
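For a 2x2 table (1 degree of freedom) the p-value can be obtained from the standard library alone, because a chi-squared variable with 1 degree of freedom is the square of a standard normal variable. The sketch below relies on that identity and therefore applies only to 1 degree of freedom; the function names are hypothetical.

```python
import math

def degrees_of_freedom(rows, columns):
    """Degrees of freedom of an r x k contingency table."""
    return (rows - 1) * (columns - 1)

def chi2_p_value_df1(chi2):
    """Upper-tail probability of a chi-squared value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

print(degrees_of_freedom(2, 2))           # 1
print(round(chi2_p_value_df1(3.841), 3))  # 0.05
print(chi2_p_value_df1(40.04) < 0.01)     # True
```

The value 3.841 is the usual tabulated critical value for α = 0.05 with 1 degree of freedom, and 40.04 is the statistic reported for the smoking example.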
In Table 4, the degrees of freedom are found in the first column and the value of
α in the first row. The number at their intersection is the critical value.
For a 2x2 table, expression (1) of the χ² statistic can be simplified and
obtained as:
When the sample size is small, using the chi-squared distribution to
approximate the frequencies may introduce some bias into the calculations, so that
the value of the χ² statistic tends to be larger. A correction is sometimes used to
eliminate this bias which, in the case of 2x2 tables, is known as Yates'
correction:
In the previous example, calculating the χ² statistic with Yates' correction would give
a value of χ²Y = 38.43 (p<0.01) instead of χ² = 40.04. There is no consensus in the
literature on whether or not to use this conservative Yates' correction, which with
small samples makes it harder to reject the null hypothesis, although its effect is practically
imperceptible when working with larger samples.
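A sketch of the corrected statistic is shown below, assuming the usual form of Yates' correction, in which 0.5 is subtracted from each absolute difference before squaring. As Table 2 is not reproduced in this text, the observed counts are assumed values chosen to be consistent with a cohort of 2000 and with the statistics reported for this example; the function name is hypothetical.

```python
# ASSUMED counts (not taken from Table 2). Rows: smokers, non-smokers.
# Columns: low birth weight, normal weight.
observed = [[43, 207],
            [105, 1645]]

def chi_squared_yates(table):
    """Chi-squared for a 2x2 table with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            total += (abs(o - e) - 0.5) ** 2 / e   # corrected contribution
    return total

print(round(chi_squared_yates(observed), 2))  # 38.43
```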
However, it is worth mentioning that the use of Yates' correction does not remove
certain requirements on the sample size needed to use the
χ² statistic. As a general rule, 80% of the cells in a contingency
table should have expected values greater than 5. Thus, in a 2x2 table
all cells must meet this condition, although in practice one of them is usually
allowed to show expected frequencies slightly below
this value. For cases in which this requirement is not met, there is a test,
proposed by R.A. Fisher, that can be used as an alternative to the χ² test and is
known as Fisher's exact test. The procedure consists of evaluating the probability
associated with each of the 2x2 tables that can be formed with the same marginal totals
as the observed data, under the assumption of independence. The calculations, although
elementary, are somewhat cumbersome, so they are not included in this work;
many references can be consulted on the subject.
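The procedure just described can be sketched in a few lines. The example table is hypothetical, and the two-sided p-value is taken here as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common convention among several.

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    total = comb(n, col1)

    def table_probability(x):
        # Hypergeometric probability of x cases in the top-left cell,
        # with all marginal totals held fixed
        return Fraction(comb(row1, x) * comb(row2, col1 - x), total)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = table_probability(a)
    return sum(table_probability(x)
               for x in range(lowest, highest + 1)
               if table_probability(x) <= p_observed)

p = fisher_exact_two_sided(3, 1, 1, 3)  # hypothetical small table
print(p, round(float(p), 4))  # 17/35 0.4857
```

Exact fractions avoid rounding problems when comparing the probability of each table with that of the observed one.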
To conclude, it is important to emphasize that other statistical methods allow us to analyze
the relationship between qualitative variables and complement the information
obtained from the χ² statistic. On the one hand, the analysis of standardized residuals
makes it possible to verify the direction of the relationship between the variables studied.
On the other, there are other measures of association, many of them
especially useful when one of the variables is measured on a nominal or
ordinal scale, which allow the degree of relationship between the two factors to be quantified.
[Figure: diagram linking the population, the sample and the results, with the labels "Confidence" and "p > 0.05"]
C.I. Confidence interval
The confidence interval describes the variability between the measurement obtained in a
study (sample) and the true measure in the population (the real value). It corresponds to
a range of values, whose distribution is normal, within which the real value of a
given variable is found with high probability. This 'high
The probability that the true value of the parameter lies within the
constructed interval is called the confidence level and is denoted 1 - α.
The probability of being wrong is called the significance level and is symbolized by α.
Generally, confidence intervals are constructed with 1 - α = 95% (or significance
α = 5%). Intervals with α = 10% or α = 1% are less frequent.
Example:
38 39 39 40 41 41 43 45 45 45
45 45 45 46 46 46 47 47 47 47
47 48 48 48 49 50 50 51 51 51
$$CI_{95} = \bar{X} \pm 1.96 \times \frac{S}{\sqrt{n}}$$
Lower limit:
$$45.7 - 1.96 \times \frac{3.7}{\sqrt{30}} = 45.7 - 1.96 \times \frac{3.7}{5.48} = 44.38$$
Upper limit:
$$45.7 + 1.96 \times \frac{3.7}{\sqrt{30}} = 45.7 + 1.96 \times \frac{3.7}{5.48} = 47.02$$
Therefore, with 95% confidence, the mean hematocrit of
the population studied lies between 44.38% and
47.02%.
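The interval for the mean can be recomputed from the summary values used in the example (mean 45.7, standard deviation 3.7, n = 30). This is a sketch with a hypothetical function name; hand calculations that round the square root of 30 may differ from it in the second decimal.

```python
import math

def ci_mean(mean, sd, n, z=1.96):
    """Confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = ci_mean(45.7, 3.7, 30)
print(round(low, 2), round(high, 2))  # 44.38 47.02
```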
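Formula:
$$CI_{95} = \hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Example:
CI95: Z = 1.96
Lower limit:
$$0.26 - 1.96 \times \sqrt{\frac{0.19}{825}} = 0.23$$
Upper limit:
$$0.26 + 1.96 \times \sqrt{\frac{0.19}{825}} = 0.29$$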
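The interval for a proportion follows the same pattern. The sketch below uses the values of the example (proportion 0.26 in a sample of 825) and a hypothetical function name; it computes p(1 - p) exactly instead of using the rounded 0.19.

```python
import math

def ci_proportion(p, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p * (1 - p) / n)."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = ci_proportion(0.26, 825)
print(round(low, 2), round(high, 2))  # 0.23 0.29
```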
For example, suppose the hypothesis is raised that the mean birth length
of girls in the city of Sucre is equal to the national
mean of 52 centimeters.
X̄ = 50 centimeters
S = 2
n = 30
When constructing a 95% confidence interval for the population mean, one
obtains:
Formula:
$$CI_{95} = \bar{X} \pm Z \times \frac{S}{\sqrt{n}}$$
As Z for 95% is equivalent to 1.96:
$$CI_{95} = \bar{X} \pm 1.96 \times \frac{S}{\sqrt{n}}$$
Lower limit:
$$50 - 1.96 \times \frac{2}{\sqrt{30}} = 50 - 1.96 \times \frac{2}{5.48} = 49.28$$
Upper limit:
$$50 + 1.96 \times \frac{2}{\sqrt{30}} = 50 + 1.96 \times \frac{2}{5.48} = 50.72$$
Therefore, the mean birth length of girls from Sucre lies between 49.28 and 50.72 centimeters, with
95% confidence.
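One way to relate this interval to the hypothesis stated at the start of the example is to check whether the national mean of 52 centimeters falls inside it. The sketch below does that check; reading the interval as a test of the hypothesis is an assumption about how the example is meant to be used, and the function name is hypothetical.

```python
import math

def ci_mean(mean, sd, n, z=1.96):
    """Confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

low, high = ci_mean(50, 2, 30)
national_mean = 52  # hypothesized value from the example

print(round(low, 2), round(high, 2))  # 49.28 50.72
print(low <= national_mean <= high)   # False
```

Under that reading, a hypothesized value outside the interval is not compatible with the sample at the 95% confidence level.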