
02 Book 2 - Biostatistics - Linares 2019

This document presents an introduction to biostatistics in the context of health sciences. It explains that biostatistics is essential for measuring and analyzing health-related data, conducting medical research, and comparing populations. It also provides a brief statistical summary of the HIV/AIDS epidemic in Bolivia, including the number of reported and estimated cases, the main routes of transmission, and the distribution by departments.


MAJOR, ROYAL AND PONTIFICAL UNIVERSITY OF SAN FRANCISCO XAVIER OF CHUQUISACA
SCHOOL OF MEDICINE
PUBLIC HEALTH I

Dr. Gróver Linares Padilla Ph.D.

Fifth edition

USFXCh - Faculty of Medicine - Public Health Notes II - Biostatistics - Dr. Gróver Linares Ph.D - 2015

Index

Chapter

1 Biostatistics - Introduction

A. Descriptive Biostatistics

2 A.1 Frequency measures in health
3 A.2 Measures of position: A.2.1 Measures of central tendency: Arithmetic mean
4 Median
5 Mode
6 A.2.2 Quantiles: Quartiles, deciles, percentiles
7 A.3 Measures of dispersion
8 A.4 Measures of shape: Coefficient of skewness and kurtosis

B. Inferential Biostatistics

9 B.1 Sampling
10 B.2 Determination of sample size
11 B.3 Basic notions of Normal Distribution
12 B.4 Basic notions of probability
13 B.5 Basic notions of correlation
14 B.6 Chi squared
15 B.7 Confidence interval


1 BIOSTATISTICS
1.1 Introduction

A high school student who wishes to continue studying at university, who generally
dislikes numbers and feels a vocation for the health sciences, decides to study
medicine, dentistry, nursing, biochemistry, pharmacy, nutrition, physiotherapy or
imaging, hoping to leave numbers as far behind as possible.

What a mistake. No one will ever be far from numbers, because we are born with
numbers, we live with numbers and we will die with them. We are born with an Apgar
score of 8 (assessed by the neonatologist), a weight of 3,200 grams, which lies at the
50th percentile, and a heart rate of 120 beats per minute, the normal range being
110 to 140 with a 95% confidence interval. In our first blood test we had a
hemoglobin of 17 g/100 ml, knowing that the normal range is 16.5 to 19.5 g/100 ml,
and so on.

Someone could die from a myocardial infarction because they had an elevated
cardiovascular risk, due to total cholesterol above 240 mg/dl, LDL cholesterol above
160 mg/dl, HDL cholesterol below 35 mg/dl and triglycerides above 50 mg/dl.

Where do we get the values used to classify people as normal or abnormal? From
research studies with statistical calculations on populations, of course.

No matter which degree you study, they are all part of the sciences, and science
grows and is nourished by the new knowledge gained through research using the
scientific method (which we will study in the following chapters), which cannot
dispense with statistics.

Everything is measurable. As children we were taught that distance is measured in
meters, liquids in liters and weight in kilos; later we learned that the meter was not
the only unit, and that there were also centimeters, millimeters, micrometers,
nanometers, etc. Now that we study health sciences, we know that a red blood cell
lives in our body for only 100 to 120 days and measures 7 to 7.5 μm in diameter
(micrometer = one millionth of a meter), and that a cubic millimeter of blood holds
more than 4 million red blood cells. If we have 5 liters of blood, our calculator
cannot display the total number of red blood cells in full and will only give us a
result in scientific notation: $2.5 \times 10^{13}$.
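
The estimate above can be reproduced with a few lines of Python. This is only a sketch: the figure of 5 million red blood cells per cubic millimeter is an assumption chosen to be consistent with "more than 4 million" and with the quoted result, and the function name is illustrative.

```python
# Assumed value: 5 million cells per mm^3 (the text only says "more than 4 million").
CELLS_PER_MM3 = 5_000_000
MM3_PER_LITER = 1_000_000  # 1 liter = 100 mm x 100 mm x 100 mm


def red_cells_in_blood(liters, cells_per_mm3=CELLS_PER_MM3):
    """Approximate number of red blood cells in the given volume of blood."""
    return liters * MM3_PER_LITER * cells_per_mm3


total = red_cells_in_blood(5)
print(total)           # 25000000000000
print(f"{total:.1e}")  # 2.5e+13, the form a calculator shows
```

The `e+13` in the output is the way calculators and programming languages usually display "times ten to the power of 13".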

There are different measures and indicators of well-being (social or economic) in
health, and certain indices of 'positive health' have been developed, both for
operational purposes and for research and the promotion of healthy conditions, in
dimensions such as mental health, self-esteem, job satisfaction, physical exercise,
etc. Data collection and the estimation of indicators aim to systematically generate
evidence that allows patterns and trends to be identified, which in turn helps to
undertake actions for the protection and promotion of health and for the prevention
and control of disease in the population.

Among the most useful and common ways of measuring the general health conditions of
the population are the national censuses, conducted every ten years, which provide a
periodic count of the population and of several of its characteristics, and whose
analysis allows estimates and projections to be made.

To allow comparisons over time within the same population, or between different
populations, standardized measurement procedures are required.

The measurement of health status requires standardized procedures that are
universally accepted and comparable, and that can be interpreted in the same way
anywhere in the world.

Medical students often ask themselves: Why is it necessary to study statistics in
medicine? Why study numbers if the whole program is about muscles, bones and
tissues? Is it really a subject that will help me in my professional life, or is it
simply filler in the curriculum?

Everyone, absolutely everyone (you too), wants to be an excellent professional.

Many will succeed, and I hope you will too. I can assure you that if you want to be
an excellent professional, one of the keys is not to remain in mediocrity, repeating
and accepting what your peers elsewhere in the world have researched and contributed
to science. You must do research yourself and become a reference for others, sharing
your work at conferences, forums and meetings, and publishing research papers in
scientific journals. To achieve all this you must do research, know research
methodology and use biostatistics.

When we talk about biostatistics, people often think of a list of numerical data
presented in an orderly and systematic manner.

If, as a new professional, you research 'the evolution of HIV/AIDS in Bolivia', you
will surely have to study the population (sex, race, religion, age, occupation,
income, level of education, marital status, etc.), look up the diagnosed HIV-positive
cases in the different hospitals, record that information, organize it, tabulate it
and, with the data you have, answer the following questions:

How many cases of AIDS are there currently in Bolivia? Will the number of AIDS cases
increase in the next five years? Which department will have the highest rate of
HIV-positive cases? Are the disease control mechanisms producing satisfactory
results? Is the available staff sufficient? How many deaths due to HIV/AIDS occur
per year? etc.

Summary of the AIDS epidemic in Bolivia

Number of people registered with HIV, 1984 to March 2012: 7,642 cases
Number of people estimated to have HIV as of March 2012
(5 persons for each diagnosed case): 38,210 cases

Transmission route of HIV
   Sexual           94%
   Perinatal         3%
   Blood-related     1%

HIV transmission according to sexual orientation
   Heterosexual     80%
   Homosexual       15%
   Bisexual          5%

Distribution by departments
   Santa Cruz       52%
   La Paz           20%
   Cochabamba       17%
   Beni              3%
   Tarija            2%
   Chuquisaca        2%
   Oruro             2%
   Potosí            1%
   Pando             1%

1 in every 262 inhabitants of Bolivia lives with HIV/AIDS.
6 out of 10 people infected with HIV/AIDS are between the ages of 15 and 34.

Biostatistics arises because the world is full of variation: people's age varies, as
do their height, marital status, eye color, the type of disease they may have, its
treatment, its prognosis, etc. Biostatistics is concerned with studying these
variations in order to draw conclusions.

Now that we are convinced of the usefulness of biostatistics, let us look at what
statistics is and at some basic concepts that will help us understand it better.

For a long time, the word statistics referred to numerical information about
political states or territories. The word comes from the Latin "statisticus", which
means "of the state". In the past, statistics were used only to know the number of
inhabitants of a given region, for the collection of taxes.

Statistics is the set of methods needed to collect, classify, represent and
summarize data, as well as to make inferences (draw scientific conclusions) from
them.

Biostatistics is the science that studies the methods and procedures applied to
vital facts for the collection, classification, presentation, analysis and
interpretation of data.


Biostatistics contributes to the analysis and solution of health problems by
building "Health Indicators".

Indicators: They are values or statistical expressions that attempt to quantify,
directly or partially, different phenomena under study.

1.2. Classification of biostatistics

CLASSIFICATION OF STATISTICS

1. DESCRIPTIVE STATISTICS: Set of procedures needed to collect, classify, represent
and summarize (through numerical and graphical methods) the data set that makes up a
sample obtained from a population.

PROBABILITY

2. INFERENTIAL STATISTICS: Set of methods that, relying on the calculation of
probabilities and based on the data of a sample, allow valid conclusions to be drawn
for the population under study.

Inferential statistics can be regarded as the methods that make it possible to
estimate a characteristic of a population, or to make a decision about a population,
based solely on sample results.


1.3 Measures in Descriptive Biostatistics

The measures used in descriptive biostatistics are as follows:

1. Measures of frequency
   a) Ratios
   b) Proportions
   c) Rates

2. Measures of position
   a) Measures of central tendency
      a) Mean
      b) Median
      c) Mode
   b) Quantiles
      a) Quartiles
      b) Deciles
      c) Percentiles

3. Measures of dispersion or variation
   a) Range
   b) Mean deviation
   c) Variance
   d) Standard deviation
   e) Coefficient of variation

4. Measures of shape
   a) Kurtosis
   b) Coefficient of skewness

1.4 Measures in Inferential Biostatistics

The measures used in inferential biostatistics are as follows:

1. Sampling and determination of sample size
   Sampling
      a) Probabilistic sampling
      b) Non-probabilistic sampling
   Determination of sample size
      a) Finite universe
      b) Infinite universe

2. Normal distribution

3. Probability

4. Linear regression and correlation

5. Chi-square test

6. Confidence interval

1.5 A necessary clarification before continuing with the following chapters of biostatistics

In my long years as a teacher, I have noticed that students have serious
difficulties with simple details that are often overlooked. For this reason, I will
explain those simple yet important details:

1.5.1 Is the calculator set to use a decimal point or a decimal comma?

A person who uses the calculator incorrectly without realizing it believes the
answer is correct because the calculator gave it, yet may be making serious
mistakes.

In many countries a decimal comma is used as the decimal separator, while in others
a decimal point is used for the same purpose.

For example, to represent 3 units and 256 thousandths, some write 3,256 and others
3.256. Unless these conventions are taken into account, some will read 3 units and
256 thousandths while others will read 3256 units, which are completely different
figures.

In some countries a point is used as the thousands separator, while in others a
comma is used.

To represent the year 2015, some write 2.015 and others 2,015.

Which system do we use in Bolivia? To express decimals we use the decimal comma, not
the decimal point. We use the point only to separate thousands, millions, etc.
Following the previous examples, in Bolivia three and a half is written '3,5', and
the year 2015 is written '2.015' or simply '2015'.
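
The difference between the two conventions matters as soon as figures are typed into a computer, because most programming languages expect a decimal point. The following Python sketch converts a numeral written in the convention described above (point for thousands, comma for decimals); the function name is hypothetical and chosen only for illustration.

```python
def parse_decimal_comma(text):
    """Convert a numeral written with '.' for thousands and ',' for decimals."""
    # Remove the thousands separators first, then turn the decimal comma into a point.
    return float(text.replace(".", "").replace(",", "."))


print(parse_decimal_comma("3,256"))    # 3.256  (3 units and 256 thousandths)
print(parse_decimal_comma("2.015"))    # 2015.0 (the year 2015)
print(parse_decimal_comma("1.234,5"))  # 1234.5
```

Read with the opposite convention, the same strings would give 3256 and 2.015, which is exactly the kind of error this section warns about.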

When we buy a scientific calculator, it has been manufactured or configured for a
specific country, so it may display numbers using one system or the other; that is,
it may express decimals with a decimal point or with a decimal comma. We must
identify which system our new calculator uses so that we do not make mistakes.

Generally, calculators that come from Asia (China, Japan, etc.) use the decimal
point; therefore, if our calculator is of this type, we must mentally convert that
decimal point into a decimal comma when we write these numerical expressions for use
in Bolivia.

[Figure: the same value, 3.256, displayed on a calculator with a decimal point and on a calculator with a decimal comma]

1.5.2 Scientific notation:

Many scientific calculators display very large or very small results in scientific
notation, so it is important to know how to read it. For this reason, we will do a
brief review.

Any number can be written as the product of two factors: the first is a number
greater than or equal to 1 and less than 10, and the second is a power of ten. This
form of writing is called scientific notation.

Scientific notation is very useful for expressing very large or very small numbers.

It has three parts:

- A single-digit integer part
- The other significant figures, as the decimal part
- A power of ten that gives the order of magnitude of the number

Example: $3.287 \times 10^{12}$ = 3 287 000 000 000


Each zero in the number above represents a factor of 10. For example, the number
100 contains 2 factors of 10 ($10 \times 10 = 100$). In scientific notation, 100 can
be written as 1 times 2 factors of 10:

$100 = 1 \times 10 \times 10 = 1 \times 10^{2}$ (in scientific notation)

For example:

$5.7 \times 10^{6}$ = 5 700 000

This abbreviation can also be used with very small numbers. When scientific notation
is used with numbers less than one, the exponent of 10 is negative, and the decimal
separator moves to the left instead of to the right.

For example:
$6.5 \times 10^{-3}$ = 0.0065

Consequently, using scientific notation, the diameter of a red blood cell is
$6.5 \times 10^{-3}$ cm (0.0065 cm); the distance from the earth to the sun is
$1.5 \times 10^{8}$ km (150 000 000 km); and the number of molecules in 1 gram of
water is $3.34 \times 10^{22}$ (33 400 000 000 000 000 000 000).

- $1.56234 \times 10^{29}$ = 156 234 000 000 000 000 000 000 000 000
- 0.000 000 000 000 000 000 000 000 000 000 910 939 kg (mass of an electron) can be
  written as $9.10939 \times 10^{-31}$ kg.

Final note: In scientific notation, the numerical base is always written as a single
digit followed by decimals if necessary. Therefore, the number 0.0065 is always
written as $6.5 \times 10^{-3}$, never as $0.65 \times 10^{-2}$ or $65 \times 10^{-4}$.
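
To check a conversion, the two parts of the notation can be obtained in Python. This is a minimal sketch that relies on the language's standard `e` number format; the function name `to_scientific` is an assumption introduced here for illustration.

```python
def to_scientific(x, digits=3):
    """Return (base, exponent) so that x is about base * 10**exponent, with 1 <= |base| < 10."""
    # The 'e' format always places exactly one digit before the decimal point.
    base_text, exponent_text = f"{x:.{digits - 1}e}".split("e")
    return float(base_text), int(exponent_text)


print(to_scientific(0.0065))     # (6.5, -3)
print(to_scientific(5_700_000))  # (5.7, 6)
print(to_scientific(100))        # (1.0, 2)
```

The `digits` parameter sets how many significant figures are kept in the base, so values with more figures are rounded.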

1.5.3 Rounding:

Rounding depends on the number of significant figures we want in the result. In
theory, the result should have as many significant figures as the term that has the
fewest.

We count the number of digits we want to keep and look at the next one: if it is 5
or greater, the last digit kept is increased by one unit; if it is 4 or less, the
last digit kept is left as it is.

Digit less than 5: If the following decimal is less than 5, the previous one is not
modified. Example: 12.612. Rounding to 2 decimals, we look at the third decimal:
12.612 ≈ 12.61.

Digit greater than or equal to 5: If the following decimal is greater than or equal
to 5, the previous one is increased by one unit. Example: 12.618. Rounding to 2
decimals, we look at the third decimal: 12.618 ≈ 12.62. Example: 12.615. Rounding to
2 decimals, we look at the third decimal: 12.615 ≈ 12.62.
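
The rule can be tried out in Python. The sketch below uses the standard `decimal` module with `ROUND_HALF_UP`, which matches the "5 or greater rounds up" rule described here; the helper name `round_half_up` is illustrative. Note that Python's built-in `round` follows a different convention for ties (for example, `round(2.5)` returns 2), so it is not used.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, decimals=2):
    """Round a numeral given as text: a next digit of 5 or more rounds up."""
    step = Decimal(10) ** -decimals  # 0.01 for two decimals
    return Decimal(value).quantize(step, rounding=ROUND_HALF_UP)


print(round_half_up("12.612"))  # 12.61
print(round_half_up("12.618"))  # 12.62
print(round_half_up("12.615"))  # 12.62
```

The value is passed as text so that it is read exactly as written, without the small binary errors of ordinary floating-point numbers.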

If you want to practice rounding on your computer, you can visit the following web
page; I am sure you will not only learn but also have fun.
And don't forget to use the decimal comma, not the decimal point!

2 A. DESCRIPTIVE BIOSTATISTICS

A1. FREQUENCY MEASURES IN HEALTH

Numbers, Rates, Ratios, Proportions and Indices

2.1 Introduction

Measurement consists of assigning a number or a rating to some specific property of
an individual, a population or an event, using certain rules. However, measurement
is a process of abstraction. Strictly speaking, one does not measure the individual
but abstracts a certain attribute, separating it from other properties. One does not
measure the child but obtains information about the child's height or weight. In
addition, the measured attribute is compared with that of other individuals (or of
the same individual at another time) in order to assess how it changes over time or
under conditions different from the original ones.

To measure, it is necessary to follow a process that consists, in short, of moving
from a theoretical entity to a conceptual scale and, subsequently, to an operational
scale.

In general, the steps followed in measurement are: a) the part of the event to be
measured is defined; b) the scale with which it will be measured is selected; c) the
measured attribute is compared with the scale; and d) finally, a value judgment is
issued on the results of the comparison. To measure the growth of an infant, for
example, first the variables to be measured are selected (age, weight, height); then
the measurement scales are selected (completed months, centimeters, grams); next,
the attributes are compared with the selected scales (an age of 6 months, 60 cm in
height, 4,500 grams in weight); and finally a value judgment is issued, which
summarizes the comparison between the magnitudes found and the health criteria
accepted as valid at that time. As a result, the infant is classified as well
nourished, malnourished or overnourished.

As can be seen, measurement is an instrumental process only in appearance, since the
choice of the part to be measured, of the measurement scale and of the health
criteria used as elements of judgment must result from a theoretical decision-making
process. In other words, only what has first been conceived theoretically can be
measured. Measurement nevertheless allows us to achieve a high degree of objectivity
when we use instruments, scales and criteria accepted as valid by most of the
scientific community.


The frequency of any event can be measured in five ways:

2.2 Number:

It is a mathematical concept that expresses quantity. For example, we say that 120
cases of tuberculosis were detected in a certain population.

Absolute numbers give an idea of the magnitude or real volume of an event. They are
useful for allocating resources (for example, the monthly number of births in a
hospital gives an idea of the number of beds, staff and physical resources needed to
meet this demand). For comparisons, absolute figures have limitations, since they do
not refer to the population from which they are obtained (thus, 40 deaths per year
in a population of 15,000 inhabitants, about 2.7 per 1,000, is proportionally
greater than 50 deaths in a population of 20,000 inhabitants, 2.5 per 1,000).
However, comparing absolute figures for the same population over short periods of
time can be a good estimator of risk, because the denominator remains practically
constant.

2.3 Rates:

Rates are magnitudes that express the dynamics of an event in a population over
time; that is, they measure the intensity, frequency or speed of a phenomenon in
relation to the universe that is capable of producing it or that is exposed to it,
in a specific place and during a specific period of time.

A rate is a measure that relates the number of times an event occurs in a defined
area and period of time to the number of inhabitants of the population in which it
can occur.

Rates are composed of a numerator, which expresses the frequency with which the
event occurs (for example, 564 deaths from breast cancer in 2014 in Bolivia), and a
denominator, given by the population exposed to that event (4,583,443 women). In
this way, a quotient is obtained that represents the mathematical probability of the
occurrence of an event in a defined population and time. In the example, the rate
obtained estimates the risk that each woman over 30 years old in Bolivia had of
dying of breast cancer during 2014.

When the denominator refers to the general population, the exposed population is
taken by convention to be the population existing in that place on June 30 of that
year (mid-year). For practical reasons, since the numerator of a rate can never be
greater than its denominator, the result will be less than one; to avoid decimals,
the result is multiplied by an amplification factor that is a power of 10 (1,000,
10,000 or 100,000). Using the same pre-established amplification factor also allows
rates to be compared internationally.


In this way, the breast cancer mortality rate in women in 2014 was 12.31 deaths per
100,000 women:

$\text{Breast cancer mortality rate} = \frac{564}{4583443} \times 100000 = 12.31$
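
The same calculation can be written as a small Python function. This is a sketch under the definitions given in this section; the function name `rate` and the default amplification factor of 100,000 are choices made here for illustration.

```python
def rate(events, population, factor=100_000):
    """Number of events per `factor` people exposed during the period."""
    if population <= 0:
        raise ValueError("the exposed population must be positive")
    return events / population * factor


breast_cancer_rate = rate(564, 4_583_443)
print(round(breast_cancer_rate, 2))  # 12.31 deaths per 100,000 women
```

Changing `factor` to 1,000 or 10,000 gives the same rate expressed per 1,000 or per 10,000 people.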

The numerator and the denominator must correspond strictly in three aspects:

a) Nature of the phenomenon: The event in the numerator must be capable of occurring
in, or affecting, the population given in the denominator.
b) Place: Both the event in the numerator and the population in the denominator must
correspond to the same geographical area or place.
c) Period: The frequency of the event and the exposed population must correspond to
the same period of time.

By their nature, rates are classified into three types:

a) Gross, general or crude rates: When they refer to the total population.
b) Specific rates: When they refer to particular segments of the population that are
specifically related to the event under study.
c) Adjusted or standardized rates: When they are adjusted to a standard population.

By the phenomenon they measure, rates can be:

a) Birth rates: They measure events related to births in the population.
b) Mortality rates: They measure events related to the deaths that occur in the
population.
c) Morbidity rates: They measure events related to the diseases or pathologies that
occur in the population.

2.4 Ratios:

A ratio is a mathematical indicator that relates two parts of a whole to each other.
It expresses the relationship between two events: it is the quotient of two
quantities in which the numerator is not included in the denominator (the number of
individuals in one category divided by the number of individuals in the other). In
this case, the quotient is not interpreted as a probability or a risk, as is the
case with a rate.


Example:

Maternal mortality ratio: It measures the number of maternal deaths per 100,000
births.

It results from dividing the number of maternal deaths by the number of births and
multiplying by 100,000.

If in a population there were 400 maternal deaths and 65,000 births during the year
2009, then:

$\frac{400}{65000} \times 100000 = 615$

Therefore, we say that the maternal mortality ratio is 615 per 100,000 births.

2.5 Proportions:

Proportions are relative figures or magnitudes that relate two categories of the
same phenomenon, one of which is contained in the other; that is, one is the part
and the other the whole.

$\frac{\text{Numerator}}{\text{Denominator}} = \frac{\text{PART}}{\text{WHOLE}}$

Proportions are measures that express the frequency with which an event occurs in
relation to the total population in which it can occur. A proportion is calculated
by dividing the number of events that occurred by the population in which they
occurred. Since each element of the population can contribute only one event, and
the numerator (the volume of events) is part of the denominator (the population in
which the events occurred), the numerator can never be larger than the denominator.
For this reason the result can never be greater than one and always lies between
zero and one.

Proportions express only the relationship between the number of times an event
occurs and the total number of occasions on which it could have occurred.

For example, what proportion of the deaths that occurred in the city of Sucre in
2013 were caused by cardiovascular diseases? This is calculated as the quotient
between the number of deaths due to cardiovascular causes (740) and the total number
of deaths that occurred that year (4,432), amplified by 100 (16.70% of the deaths in
2013 were caused by cardiovascular diseases). Proportions are not interpreted as a
probability, nor do they express a risk, since they are not calculated with the
population exposed to the risk. A proportion can be regarded as the estimate of a
probability when it is calculated in a representative sample of a given population.

Another example: If, in a population of 25,000 inhabitants, 1,500 patients are
diagnosed with diabetes, the proportion of diabetes in that population is
1,500/25,000 = 0.06 (multiplied by 100, 6%). The value of a proportion can thus
range from 0 to 1, and it is usually expressed as a percentage.
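
Both examples can be verified with a short Python sketch. The function name `proportion` is illustrative; the check inside it simply enforces the idea stated above that the part must be contained in the whole.

```python
def proportion(part, whole):
    """Fraction of the whole represented by the part (always between 0 and 1)."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("the part must be contained in a positive whole")
    return part / whole


# Deaths from cardiovascular causes among all deaths in Sucre, 2013
print(f"{proportion(740, 4_432):.2%}")     # 16.70%
# Patients with diabetes in a population of 25,000
print(f"{proportion(1_500, 25_000):.2%}")  # 6.00%
```

The `%` format multiplies by 100, which is the amplification by 100 described in the text.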

2.6 Indices:

Indices arise from the comparison of two rates or two ratios. For example, the
quotient between the overall mortality rate in men and that in women in 2010.

This indicator gives an idea of whether the risk of a condition is greater or
lesser, depending on whether its value is greater or less than 1 (or 100%). In this
case, we have:

Sex      Deaths   Population   Rate x 1000   Index
Men      2368     104,000      22.77         1.28 (128%)
Women    2064     116,000      17.79
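
The figures in the table can be reproduced as follows. This Python sketch uses the deaths and populations from the table; the function names are illustrative and not part of the original notes.

```python
def rate_per_thousand(deaths, population):
    """Mortality rate expressed per 1,000 inhabitants."""
    return deaths / population * 1000


def mortality_index(rate_a, rate_b):
    """Quotient of two rates; a value above 1 means group A has the higher rate."""
    return rate_a / rate_b


men = rate_per_thousand(2368, 104_000)
women = rate_per_thousand(2064, 116_000)
print(round(men, 2), round(women, 2))         # 22.77 17.79
print(round(mortality_index(men, women), 2))  # 1.28
```

An index of 1.28 means that the mortality rate in men was 28% higher than in women.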


3 A2. MEASURES OF POSITION

Arithmetic Mean

The arithmetic mean of a statistical variable is the value obtained by adding all
the data and dividing the result by the total number of data.

Its calculation aims to obtain a single value around which the individual data or
observations tend to cluster.

To represent the population mean and the sample mean, the following symbols are
used:

µ: the Greek letter 'mu', which denotes the mean of a population.

$\bar{X}$: the symbol used to denote the mean of an analyzed 'sample'.

For teaching purposes, we will henceforth use this last symbol to refer to the
'arithmetic mean' in general.

Formula:

$\bar{X} = \frac{\sum X_i}{n}$

$\bar{X}$ = Mean
Σ = Summation
$X_i$ = All the values of the distribution
n = Number of data


Example: Calculate the average age of 9 patients.

Ages: 9; 11; 10; 8; 12; 9; 13; 10; 10

Each of the ages is added together and the sum is divided by the number of patients:

$\bar{X} = \frac{9 + 11 + 10 + 8 + 12 + 9 + 13 + 10 + 10}{9} = \frac{92}{9} = 10.2$ years

The average age of this population is 10.2 years.

Another example: Calculate the average heart rate of 10 patients.

Number of beats per minute: 60; 62; 68; 70; 76; 76; 79; 79; 82; 82

Each of the heart rates is added together and the sum is divided by the number of
patients:

$\bar{X} = \frac{60 + 62 + 68 + 70 + 76 + 76 + 79 + 79 + 82 + 82}{10} = \frac{734}{10} = 73.4$ beats per minute
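
Both examples can be checked in Python. The sketch below writes the formula out by hand in a function named `arithmetic_mean` (a name chosen here for illustration) and compares it with `mean` from Python's standard `statistics` module.

```python
from statistics import mean


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


ages = [9, 11, 10, 8, 12, 9, 13, 10, 10]
heart_rates = [60, 62, 68, 70, 76, 76, 79, 79, 82, 82]

print(sum(ages), len(ages))                 # 92 9
print(round(arithmetic_mean(ages), 1))      # 10.2
print(round(arithmetic_mean(heart_rates), 1))  # 73.4
print(round(mean(heart_rates), 1))          # 73.4, same result from the standard library
```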


Age of students in years   Number of students
Xi                         fi
20                         8
21                         7
22                         9
23                         6
24                         5
TOTAL:                     35

Formula:

$\bar{X} = \frac{\sum X_i f_i}{n}$

$\bar{X}$ = Mean
Σ = Summation
$X_i$ = All the values of the distribution
$f_i$ = All frequencies
n = Number of data, that is: $\sum f_i$

The number of data is the sum of the frequencies ($n = \sum f_i = 35$). Each age is
then multiplied by its frequency, and the products are added:

Age of students in years   Number of students   Xi * fi
Xi                         fi
20                         8                    20 * 8 = 160
21                         7                    21 * 7 = 147
22                         9                    22 * 9 = 198
23                         6                    23 * 6 = 138
24                         5                    24 * 5 = 120
TOTAL:                     n = 35               763

$\bar{X} = \frac{\sum X_i f_i}{n} = \frac{763}{35} = 21.8$ years
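
The calculation for data grouped in frequencies can be sketched in Python as follows. The function name `frequency_mean` is illustrative; the data are the ages and numbers of students from the table.

```python
def frequency_mean(values, frequencies):
    """Mean of data grouped in frequencies: sum of Xi * fi divided by n."""
    n = sum(frequencies)  # n is the sum of the frequencies
    total = sum(x * f for x, f in zip(values, frequencies))
    return total / n


ages = [20, 21, 22, 23, 24]
students = [8, 7, 9, 6, 5]

print(sum(students))                             # 35
print(sum(x * f for x, f in zip(ages, students)))  # 763
print(round(frequency_mean(ages, students), 1))  # 21.8
```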


3. Arithmetic mean of data grouped in frequencies with class intervals

People's age in years   Number
Xi                      fi
0–4                     8
5–9                     12
10–14                   15
15–19                   13
20–24                   14
Total:                  62

Formula:

$\bar{X} = \frac{\sum X' f_i}{n}$

$\bar{X}$ = Mean
Σ = Summation
X' = Midpoint or class mark of the class interval
$f_i$ = All frequencies
n = Number of data, that is to say: $\sum f_i$

USFXCh - Faculty of Medicine - Public Health Notes II - Biostatistics - Dr. Gróver Linares Ph.D - 2015
23

People's age Number Midpoint


in years
XI fi X’
0–4 8 0 + 4/2 = 2
5-9 12 5 + 9/2 = 7
10–14 15 10 + 14/2 = 12
15–19 13 15 + 19/2 = 17
20 - 24 14 20 + 24/2 = 22

Total: 62

People's age Number Midpoint


in years
XI fi X’ Xi* fi
0–4 8 2 2 * 8 = 16
5–9 12 7 7 * 12 = 84
10–14 15 12 12 * 15 = 180
15–19 13 17 17 * 13 = 221
20 - 24 14 22 22 * 14 = 308

Total: 62 809

USFXCh - Faculty of Medicine - Public Health Notes II - Biostatistics - Dr. Gróver Linares Ph.D - 2015
24

People's age Number Midpoint


in years
Xi fi X' Xi* fI
0–4 8 2 2 * 8 = 16
5–9 12 7 7 * 12 = 84
10-14 15 12 12 * 15 = 180
15–19 13 17 17 * 13 = 221
20 - 24 14 22 22 * 14 = 308

Total: n=62 809


809
X= 13.05 years
62
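As a hedged illustration of the procedure with class intervals, the sketch below computes each midpoint X' and then the weighted mean. The name `interval_mean` and the representation of each class as a `(lower, upper)` pair are assumptions made for this example.

```python
def interval_mean(intervals, freqs):
    """Mean of data grouped in class intervals, using the class midpoints X'."""
    midpoints = [(lower + upper) / 2 for lower, upper in intervals]  # X' of each class
    n = sum(freqs)
    return sum(m * f for m, f in zip(midpoints, freqs)) / n


age_classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]   # Xi
people = [8, 12, 15, 13, 14]                                   # fi
print(round(interval_mean(age_classes, people), 2))
```

The result is an approximation of the true mean, because every observation in a class is treated as if it were equal to the class midpoint.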

Advantages of the Arithmetic Mean:

1. It is a concept familiar to most people and intuitively clear.
2. It is a measure that can be calculated and is unique, in that each dataset has one
   and only one mean.
3. In the calculation of the mean, each observation of the dataset is taken into account.
4. The mean is a reliable measure because it is determined with greater certainty than
   other characteristics of a dataset.
Moya Calderón 2001

Disadvantages of the Arithmetic Mean:

1. The arithmetic mean can be affected by extreme values that are not representative of
   the rest of the observations. Therefore, when this measure is used in an analysis, it
   is worth noting how representative the extreme values are and how much they
   influence the result.
2. Calculating the arithmetic mean is tedious because all observations are used in the
   calculation (unless, of course, the short method for grouped data is used to
   approximate the mean).
3. The arithmetic mean cannot be calculated for a set of data that has open class
   intervals at the ends.
Moya Calderón 2001

4 Median

It is the value that occupies the central position of all the data when they are ordered
from smallest to largest; it is represented by the symbol Me.

According to this definition, the data less than or equal to the median represent 50% of
the data, and those greater than the median represent the other 50% of the sample data.

1. Median of ungrouped series

Exercise 1: Determine the median of the following values:

Ages of 9 patients: 9; 11; 10; 8; 12; 9; 13; 10; 10

First sort from lowest to highest:

8; 9; 9; 10; 10; 10; 11; 12; 13

Since we have 9 values, to find the median we leave the same number of values on the
right and on the left; in our example, 4 to the right and 4 to the left, with the
number 10 in the middle, which is the median.

In this example we have 9 values, an odd number, so the middle position holds a single
value. If the number of values is even instead, the two middle values are added and
divided by 2 (their average is taken), as shown below:

Exercise 2: Determine the median of the following values:

Ages of 12 patients: 4; 8; 10; 9; 8; 7; 6; 6; 4; 10; 5; 8

First sort from lowest to highest:

4; 4; 5; 6; 6; 7; 8; 8; 8; 9; 10; 10

Since we have 12 values, to find the median we leave 5 on the right and 5 on the left.
Two values remain in the middle (because 12 is even), the numbers 7 and 8; their
average, (7 + 8) / 2 = 7.5, is the median.
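Both exercises can be verified with the `median` function of Python's standard `statistics` module, which sorts the data and averages the two middle values when their number is even. The variable names below are illustrative.

```python
import statistics

ages_9_patients = [9, 11, 10, 8, 12, 9, 13, 10, 10]          # odd number of values
ages_12_patients = [4, 8, 10, 9, 8, 7, 6, 6, 4, 10, 5, 8]    # even number of values

median_odd = statistics.median(ages_9_patients)     # the single middle value
median_even = statistics.median(ages_12_patients)   # average of the two middle values
print(median_odd, median_even)
```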

2. Median of grouped series

a) Simple grouped presentation

The following formula is used:

$$Me = \frac{X_k + X_{k+1}}{2}$$

Xk = Value in column Xi whose cumulative frequency is the first one equal to or greater
     than n/2
Xk+1 = Value that follows Xk in column Xi

The average is taken when that cumulative frequency is exactly n/2; if it is greater
than n/2, the median is Xk itself.

Exercise 3: Find the median age of the following group of people

First step: Obtain the cumulative frequency

Age (years)    Absolute frequency    Cumulative frequency
Xi             fi                    Fi
20             5                     5
21             10                    15
22             15                    30
23             20                    50
24             20                    70
25             30                    100
Total:         100                   ---

Second step: Divide the sum of the frequencies fi by 2: 100 / 2 = 50

The first cumulative frequency equal to or greater than 50 is 50, corresponding to the
value Xi = 23, which will be called Xk.

Third step: Apply the formula:

$$Me = \frac{X_k + X_{k+1}}{2} = \frac{23 + 24}{2} = \frac{47}{2} = 23.5$$

Therefore, the median is 23.5 years.

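Exercise 4: Find the median age of the following group of people

First step: Obtain the cumulative frequency

Age (years)    Absolute frequency    Cumulative frequency
Xi             fi                    Fi
10             10                    10
11             12                    22
12             24                    46
13             15                    61
14             5                     66
15             25                    91
Total:         91                    ---

Second step: Divide the sum of the frequencies fi by 2: 91 / 2 = 45.5

The first cumulative frequency equal to or greater than 45.5 is 46, corresponding to the
value Xi = 12, which represents Xk.

Third step: Because 46 is greater than 45.5, the middle observation (the 46th of the 91
ordered values) lies inside the group with Xi = 12, so it is not averaged with the next
value:

$$Me = X_k = 12$$

Therefore, the median is 12 years. The average of Xk and Xk+1 is taken only when the
cumulative frequency is exactly n/2, as in Exercise 3.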

b) Grouped presentation with class intervals

The following formula is used:

$$Me = L_i + \left( \frac{\frac{n}{2} - F_i}{f_i} \right) \cdot a$$

n = Total number of observations or sum of the absolute frequencies (fi)
Li = Lower limit of the class that contains the median
Fi = Cumulative absolute frequency of the class immediately preceding the one that
     contains the median
fi = Absolute frequency of the class that contains the median
a = Amplitude of the class interval that contains the median

Exercise 6: What is the median of the blood glucose data of 40 patients?

Blood glucose      Absolute frequency    Cumulative frequency
                   fi                    Fi
70.0 - 74.0        3                     3
75.0 - 79.0        4                     7
80.0 - 84.0        2                     9
85.0 - 89.0        5                     14    (Fi used in the formula)
90.0 - 94.0        6                     20    (class that contains the median)
95.0 - 99.0        4                     24
100.0 - 104.0      2                     26
105.0 - 109.0      7                     33
110.0 - 114.0      5                     38
115.0 - 119.0      1                     39
120.0 - 124.0      1                     40
TOTAL:             n = 40                ---

1. First, we find the position of the median observation: n / 2 = 40 / 2 = 20
2. The first cumulative frequency that reaches position 20 is exactly 20; we highlight
   that row, which corresponds to the interval 90.0 - 94.0
3. We calculate the amplitude of the interval or class:
   a = 5 (the interval 90 - 94 covers 5 values: 90, 91, 92, 93 and 94)
4. We apply the formula to determine the value of the median, substituting the values:

$$Me = L_i + \left( \frac{\frac{n}{2} - F_i}{f_i} \right) \cdot a = 90 + \left( \frac{\frac{40}{2} - 14}{6} \right) \cdot 5 = 90 + \frac{20 - 14}{6} \cdot 5 = 90 + \frac{6}{6} \cdot 5 = 95$$

Explanation:
Li = 90.0 is the lower limit of the interval 90.0 - 94.0
n = 40 is the total number of patients, which is divided by 2 according to the formula
Fi = 14 is the cumulative frequency of the class immediately before the class that
     contains the median (the one with cumulative frequency 20)
fi = 6 is the absolute frequency of the class that contains the median
a = 5 is the amplitude of the interval 90.0 - 94.0

Exercise 7: What is the median of the data for the following ages?

Age (years)    Absolute frequency    Cumulative frequency
Xi             fi                    Fi
0 - 4          10                    10
5 - 9          16                    26    (Fi used in the formula)
10 - 14        24                    50    (class that contains the median)
15 - 19        27                    77
20 - 24        13                    90
25 - 29        6                     96
Total:         n = 96

1. First, we find the position of the median observation: n / 2 = 96 / 2 = 48. Since no
   cumulative frequency is equal to 48, we look for the next higher one, which is 50,
   and highlight the entire row: 10 is the lower limit (Li) and 24 is the absolute
   frequency (fi).
2. We calculate the amplitude of the interval or class:
   a = 5 (the interval 10 - 14 covers 5 values)
3. We apply the formula to determine the value of the median, substituting the values:

$$Me = 10 + \left( \frac{\frac{96}{2} - 26}{24} \right) \cdot 5 = 10 + \left( \frac{48 - 26}{24} \right) \cdot 5 = 10 + \frac{22}{24} \cdot 5 \approx 10 + (0.92) \cdot 5 = 10 + 4.6 = 14.6$$

Explanation:
Li = 10 is the lower limit of the interval 10 - 14
n = 96 is the total number of people, which is divided by 2 according to the formula
Fi = 26 is the cumulative frequency immediately before the highlighted 50
fi = 24 is the absolute frequency in the same row as the cumulative frequency 50
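The steps of Exercises 6 and 7 can be sketched in Python as follows. The function name `grouped_median` is hypothetical, and the sketch assumes that all classes have the same amplitude and that the lower limits are taken exactly as written in the tables.

```python
def grouped_median(lower_limits, freqs, width):
    """Median for class intervals: Li + ((n/2 - F_previous) / fi) * a."""
    n = sum(freqs)
    half = n / 2
    cumulative = 0                      # cumulative frequency of the previous classes
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:      # first class whose Fi reaches n/2
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("the frequency table is empty")


# Exercise 6: blood glucose of 40 patients
glucose_limits = [70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120]
glucose_freqs = [3, 4, 2, 5, 6, 4, 2, 7, 5, 1, 1]
print(grouped_median(glucose_limits, glucose_freqs, 5))

# Exercise 7: ages of 96 people
age_limits = [0, 5, 10, 15, 20, 25]
age_freqs = [10, 16, 24, 27, 13, 6]
print(round(grouped_median(age_limits, age_freqs, 5), 1))
```

Without intermediate rounding, Exercise 7 gives 14.58, which rounds to the 14.6 obtained by hand.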

5 Mode

The mode is the value that appears most frequently in a distribution. If two scores in a
group appear with the same frequency and that frequency is the maximum, the distribution
is bimodal. If there are three, it is trimodal; when there are more than three, we speak
of a multimodal distribution; but when all the scores of a group have the same
frequency, there is no mode.

1. Mode of ungrouped data

Exercise 1: Determine the mode of the following data:

Ages of 9 patients: 9; 11; 10; 8; 12; 9; 13; 10; 10

First, the data are sorted from lowest to highest:

8; 9; 9; 10; 10; 10; 11; 12; 13

The mode is 10 because it is the value that is repeated most often.

Exercise 2: Determine the mode of the following data:

Number of children of 12 women: 6; 1; 5; 3; 4; 2; 3; 2; 3; 4; 1; 4

First sort from lowest to highest:

1; 1; 2; 2; 3; 3; 3; 4; 4; 4; 5; 6

In this case, the most repeated values are 3 and 4. Since they are adjacent, their
average is calculated, (3 + 4) / 2 = 3.5; therefore the mode is 3.5.
If they were not adjacent, the distribution would be bimodal, with modes 3 and 4.

Exercise 3: Determine the mode of the following data:

Number of children of 10 women: 1; 1; 1; 2; 2; 3; 3; 3; 4; 4

In this case, the most repeated values are 1 and 3. Since they are not adjacent, no
average is taken and the distribution is bimodal, with modes 1 and 3.
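The rule used in these exercises can be sketched as below. Both function names are hypothetical. Averaging two adjacent modal values is the convention of these notes; many other texts simply report such a distribution as bimodal, so the second function should be read as an illustration of this convention only. Exercise 2 is entered as the ordered list shown in the text.

```python
from collections import Counter


def modes(data):
    """Return every value that reaches the highest frequency, in ascending order."""
    counts = Counter(data)
    top = max(counts.values())
    # When several distinct values all share one frequency, there is no mode.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


def mode_with_adjacent_average(data):
    """Convention of these notes: two neighbouring modal values are averaged."""
    found = modes(data)
    distinct = sorted(set(data))        # the different values, in order
    if len(found) == 2 and distinct.index(found[1]) - distinct.index(found[0]) == 1:
        return [(found[0] + found[1]) / 2]
    return found


exercise_1 = [9, 11, 10, 8, 12, 9, 13, 10, 10]
exercise_2 = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 6]
exercise_3 = [1, 1, 1, 2, 2, 3, 3, 3, 4, 4]
for sample in (exercise_1, exercise_2, exercise_3):
    print(mode_with_adjacent_average(sample))
```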

2. Mode of simple grouped data

Exercise 4: Determine the mode of a group of second-year Medicine students according to
the age distribution:

Student ages    Absolute frequency
Xi              fi
17              2
18              5
19              22    (highest frequency)
20              16
21              8
22              6
23              3
24              1
TOTAL           63

The mode (Mo) is 19 years, since its frequency, 22, is the highest.

Exercise 5:

Student ages    Absolute frequency
Xi              fi
17              5
18              10
19              20
20              15
21              26    (highest frequency)
22              4
23              10
24              3
TOTAL           93

The mode (Mo) is 21 years, since its frequency, 26, is the highest.

Exercise 6: Determine the mode of another group of second-year Medicine students,
according to the age distribution:

Student ages    Absolute frequency
Xi              fi
17              3
18              4
19              22    (highest frequency)
20              22    (highest frequency)
21              7
22              5
23              3
24              2
TOTAL           68

Two values share the highest frequency, 22: the ages 19 and 20. Since 19 and 20 are
consecutive, the mode is the average of those values: (19 + 20) / 2 = 19.5. The mode
(Mo) is 19.5 years.

Exercise 7:

Student ages    Absolute frequency
Xi              fi
17              5
18              7
19              15
20              14
21              28    (highest frequency)
22              28    (highest frequency)
23              3
24              2
TOTAL           102

Two values share the highest frequency, 28: the ages 21 and 22. Since they are
consecutive, their mean is taken: (21 + 22) / 2 = 21.5. The mode (Mo) is 21.5 years.

Exercise 8: Determine the mode of another group of second-year Medicine students,
according to the age distribution:

Student ages    Absolute frequency
Xi              fi
17              2
18              5
19              20    (highest frequency)
20              16
21              14
22              20    (highest frequency)
23              3
24              1
TOTAL           81

The mode (Mo) is 19 and 22 years, since these ages are not consecutive (bimodal).

When the two highest frequencies are equal and their values are not consecutive, the
arithmetic mean is not calculated; both values are kept as modes, and the distribution
is bimodal.

Exercise 9:

Student ages    Absolute frequency
Xi              fi
17              14
18              26    (highest frequency)
19              17
20              15
21              12
22              9
23              26    (highest frequency)
24              2
TOTAL           121

The mode (Mo) is 18 and 23 years, since these ages are not consecutive (bimodal).

3. Mode with values grouped in class intervals

$$Mo = L_i + \left( \frac{d_1}{d_1 + d_2} \right) \cdot a$$

Li = Lower limit of the modal class.
d1 = Difference between the absolute frequency of the modal class and the absolute
     frequency of the previous class.
d2 = Difference between the absolute frequency of the modal class and the absolute
     frequency of the subsequent class.
a = Amplitude of the modal class interval

Exercise 10: Determine the mode of a group of people, according to age group:

Age groups    Absolute frequency
Xi            fi
15 - 19       10
20 - 24       28    (modal class, since its frequency, 28, is the highest)
25 - 29       12
30 - 34       11
35 - 39       8
TOTAL         69

d1 = 28 - 10 = 18
d2 = 28 - 12 = 16
Li = 20
a = 5 (20 to 24 = 5)

Substituting in the formula:

$$Mo = L_i + \left( \frac{d_1}{d_1 + d_2} \right) \cdot a = 20 + \left( \frac{18}{18 + 16} \right) \cdot 5 = 20 + \left( \frac{18}{34} \right) \cdot 5$$

$$Mo = 20 + (0.529411764) \cdot 5 \approx 20 + 2.6 = 22.6 \text{ years}$$

Therefore, Mo = 22.6 years.

Explanation of where the data came from:

Li = 20: lower limit of the modal class interval
d1 = difference between the absolute frequency of the modal class (28) and the
     frequency of the previous class (10): 28 - 10 = 18
d2 = difference between the absolute frequency of the modal class (28) and the
     frequency of the subsequent class (12): 28 - 12 = 16
a = 5: amplitude of the modal class interval 20 - 24

Exercise 11:

Age groups    Absolute frequency
Xi            fi
15 - 20       12
20 - 25       13
25 - 30       26    (modal class, since its frequency, 26, is the highest)
30 - 35       14
35 - 40       6
TOTAL         71

d1 = 26 - 13 = 13
d2 = 26 - 14 = 12
Li = 25
a = 5 (25 to 30 = 5)

Substituting in the formula:

$$Mo = L_i + \left( \frac{d_1}{d_1 + d_2} \right) \cdot a = 25 + \left( \frac{13}{13 + 12} \right) \cdot 5 = 25 + \left( \frac{13}{25} \right) \cdot 5 = 25 + 2.6 = 27.6 \text{ years}$$
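A minimal sketch of the formula for the mode with class intervals is given below. The name `grouped_mode` is hypothetical; treating a missing neighbouring class (when the modal class is the first or the last) as having frequency 0 is an assumption of this sketch, since the notes do not cover that case.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode for class intervals: Li + (d1 / (d1 + d2)) * a."""
    k = max(range(len(freqs)), key=freqs.__getitem__)        # index of the modal class
    previous_f = freqs[k - 1] if k > 0 else 0                # assumption: 0 if no class before
    next_f = freqs[k + 1] if k + 1 < len(freqs) else 0       # assumption: 0 if no class after
    d1 = freqs[k] - previous_f
    d2 = freqs[k] - next_f
    return lower_limits[k] + d1 / (d1 + d2) * width


# Exercise 10: classes 15-19, 20-24, 25-29, 30-34, 35-39
mode_10 = grouped_mode([15, 20, 25, 30, 35], [10, 28, 12, 11, 8], 5)
# Exercise 11: classes 15-20, 20-25, 25-30, 30-35, 35-40
mode_11 = grouped_mode([15, 20, 25, 30, 35], [12, 13, 26, 14, 6], 5)
print(round(mode_10, 1), round(mode_11, 1))
```

If two classes share the highest frequency, `max` keeps the first one; a bimodal table would need separate handling.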


6 A.2.2 QUANTILES
Quartiles, Deciles, and Percentiles

6.1 Introduction

So far, we have studied the measures of central tendency (mean, median and mode), which
give us a central value (and only a central one) that represents the data set,
regardless of what happens with the rest of the values.

For example:

Two groups of 10 patients each attend a cardiological check-up, and the following
resting heart rates are recorded:

Group A: 62 63 64 65 70 70 75 76 77 78    X̄ = 700/10 = 70

Group B: 50 54 64 69 70 70 71 76 86 90    X̄ = 700/10 = 70

The normal resting heart rate is between 60 and 80 beats per minute. Both groups have a
mean, median and mode of 70 beats per minute, so if we looked only at these measures of
central tendency we could mistakenly conclude that the two groups are equal, that all
patients have normal heart rates, and that no patient stands out with a probable
pathology. (Incorrect diagnosis!)

However, looking at all the data patient by patient, and not only at the measures of
central tendency, we find that in group B there are 4 patients with a probable cardiac
disorder: 2 patients below 60 (50 and 54) and 2 above 80 (86 and 90) beats per minute,
who should be studied to find the cause of these alterations.


With the measures of position (quartiles, deciles, and percentiles) we can make cuts
(3, 9, or 99 cuts, to obtain 4, 10, or 100 equal parts) at different places in the chain
of data ordered from lowest to highest, know the exact value at each cut, almost for
each patient or subgroup of patients, and diagnose what happens with each of them and
not just with a measure of central tendency that represents everyone.

Therefore, the measures of position (quartiles, deciles, and percentiles) allow the
detailed study of all the values at different positions of the data chain, providing a
diagnosis that is not general but specific to each patient and/or subgroup of patients.
(An important analysis tool, which keeps us from losing sight of what happens with each
patient!)

A series of data arranged from smallest to largest can be divided into 4 equal parts,
into 10 equal parts or into 100 equal parts, and we can know exactly what value and
position corresponds to each cut.

When we divide into 4 equal parts, the cuts are called quartiles; when we divide into 10
equal parts, deciles; and when we divide into 100 equal parts, percentiles.

The quartiles are represented by the symbol 'Q', the deciles by the symbol 'D' and the
percentiles by the symbol 'P'.

To achieve 4 equal parts, we use 3 cuts, called Q1, Q2 and Q3.
To achieve 10 equal parts, we use 9 cuts, called D1, D2, ... and D9.
To achieve 100 equal parts, we use 99 cuts, called P1, P2, ... and P99.

[Figure: a scale showing the cuts Q1, Q2, Q3 (4 sectors), D1 to D9 (10 sectors) and
P25, P50, P75, ... P99 (100 sectors).]

Example with real data from 12 patients of different ages:

32, 35, 37 | 39, 44, 48 | 53, 55, 57 | 59, 70, 74

Q1 = P25 = 38           (25% of the 12 data = 3 data to the left of 39)
Q2 = D5 = P50 = 50.5    (Me = 50.5)
Q3 = P75 = 58           (75% of the 12 data = 9 data to the left of 59)

6.2 Quartiles:

With 3 cuts, the fractions are equal fourths of the total data.

With the 12 data points of our example, to divide into 4 equal parts each sector must
have 3 data points (4 × 3 = 12). To leave 4 equal parts, the first quartile cut falls
between the third and fourth data points, the second quartile between the sixth and
seventh, and the third quartile between the ninth and tenth. In this way:

Q1 represents the first cut, called the first quartile; it leaves 25% of the values
below and 75% of the values above the cut. In our example, the cut falls exactly between
the values 37 and 39, so to find the value of Q1 we take their average,
(37 + 39) / 2 = 38; therefore, the first quartile (Q1) is equal to 38 years, which
coincides with P25.

Q2 represents the second cut, called the second quartile; it leaves 50% of the values
below and 50% above the cut. In our example, the cut falls exactly between the values 48
and 53, so to find the value of Q2 we take their average, (48 + 53) / 2 = 50.5;
therefore, the second quartile (Q2) is equal to 50.5 years. Q2 coincides with the
median, 50.5.

Q3 represents the third cut, called the third quartile; it leaves 75% of the values
below and 25% of the values above the cut. In our example, the cut falls exactly between
the values 57 and 59, so to find the value of Q3 we take their average,
(57 + 59) / 2 = 58; therefore, the third quartile (Q3) is equal to 58 years, which
coincides with P75.

6.3 Deciles:

With 9 cuts, the fractions are tenths of the total.

With the 12 data points of our example, to divide into 10 equal parts each sector should
have 1.2 data points (1.2 × 10 = 12).

D1 leaves 10% of the values below and 90% above
D2 leaves 20% of the values below and 80% above
D3 leaves 30% of the values below and 70% above
D4 leaves 40% of the values below and 60% above
D5 leaves 50% of the values below and 50% above
D6 leaves 60% of the values below and 40% above
D7 leaves 70% of the values below and 30% above
D8 leaves 80% of the values below and 20% above
D9 leaves 90% of the values below and 10% above

Intuitively, we can obtain the fifth decile (D5), which corresponds to the middle of the
chain of values, since it leaves 50% of the values below and 50% above the cut. In our
example, the cut falls exactly between the values 48 and 53, so to find the value of D5
we take their average, (48 + 53) / 2 = 50.5; therefore, the fifth decile (D5) is equal
to 50.5 years. D5 coincides with Q2, with P50 and with the median.

The cuts for the other deciles would be very complicated to obtain in this way;
therefore we must use formulas, which we will apply later, to know exactly what value
corresponds to each cut.

6.4 Percentiles

With 99 cuts, the fractions are hundredths of the total. The percentiles are the 99
values that divide the data series into 100 equal parts.

With the 12 data points of our example, to divide into 100 equal parts each sector must
have 0.12 data points (0.12 × 100 = 12).

P1 leaves 1% of the values below and 99% above
P2 leaves 2% of the values below and 98% above
P3 leaves 3% of the values below and 97% above
P34 leaves 34% of the values below and 66% above
P70 leaves 70% of the values below and 30% above
P86 leaves 86% of the values below and 14% above
And so on.

Intuitively, as we did with the quartiles, we can obtain the percentiles 25, 50, and 75,
which correspond to the 1st, 2nd, and 3rd quartiles. The 25th percentile lies between
the third and fourth data points, the 50th percentile between the sixth and seventh, and
the 75th percentile between the ninth and tenth. In this way:

P25 represents the 25th percentile; it leaves 25% of the values below and 75% of the
values above the cut. In our example, the cut falls exactly between the values 37 and
39, so to find the value of P25 we take their average, (37 + 39) / 2 = 38; therefore the
25th percentile (P25) is equal to 38 years.

P50 represents the 50th percentile; it leaves 50% of the values below and 50% above the
cut. In our example, the cut falls exactly between the values 48 and 53, so to find the
value of P50 we take their average, (48 + 53) / 2 = 50.5; therefore, the 50th percentile
(P50) is equal to 50.5 years. P50 coincides with the median, 50.5.

P75 represents the 75th percentile; it leaves 75% of the values below and 25% of the
values above the cut. In our example, the cut falls exactly between the values 57 and
59, so to find the value of P75 we take their average, (57 + 59) / 2 = 58; therefore the
75th percentile (P75) is equal to 58 years.

As with the deciles, the cuts for the other percentiles would be very complicated to
obtain in this way; therefore, we must use formulas, which we will apply below, to know
exactly what value corresponds to each cut.

As we observed in the previous example, the use of quartiles, deciles, and percentiles
is very useful for diagnosis in Medicine. All measurable parameters in the medical
sciences, according to specialty, have curves distributed by percentiles.

Percentiles are used, for example, in monitoring fetal growth and development.

6.6 Quantiles of ungrouped data

The following generic formula is used:

$$C_J = X_i + \left( \frac{J(n + 1)}{c} - i \right) (X_{i+1} - X_i)$$

CJ = Quantile to be obtained
J = Order of the requested quantile (Q: 1, 2, 3; D: 1, 2, ..., 9; P: 1, 2, 3, ..., 99)
n = Number of data or values X
+1 = Add 1 unit to the number of data
c = Number of sectors of the requested quantile: Q = 4; D = 10; P = 100
i = Indicated place: the whole-number part of J(n + 1)/c
Xi = Value at the indicated place
Xi+1 = Value at the indicated place + 1 place

a) If the position J(n + 1)/c is a whole number

Obtain Q1 from the following data:

5 - 8 - 10 - 12 - 14 - 16 - 18 - 20 - 25 - 30 - 35

First step: Use the following part of the formula presented.

$$\frac{J(n + 1)}{c} = \frac{1(11 + 1)}{4} = \frac{1(12)}{4} = \frac{12}{4} = 3$$

Since the result is a whole number, Q1 is the value of the 3rd observation: Q1 = 10

b) If the position J(n + 1)/c is a decimal number

Obtain Q1, Q2, Q3, D7 and P80 from the following data:

2 - 3 - 7 - 15 - 24 - 30

Obtain Q1:

First step: Use the following part of the formula presented

$$\frac{J(n + 1)}{c} = \frac{1(6 + 1)}{4} = 1.75$$

Rounded down to the nearest whole number: i = 1, the 1st place of the data, Xi = 2.
Xi+1 = 1st place + 1 place = 2nd place = 3.

Second step: Apply the complete formula

$$Q_1 = X_i + \left( \frac{J(n + 1)}{c} - i \right)(X_{i+1} - X_i) = 2 + (1.75 - 1)(3 - 2) = 2 + (0.75)(1) = 2.75$$

Q1 = 2.75

Obtain Q2:

First step:

$$\frac{J(n + 1)}{c} = \frac{2(6 + 1)}{4} = 3.5$$

Rounded down to the nearest whole number: i = 3, the 3rd place of the data, Xi = 7.
Xi+1 = 3rd place + 1 place = 4th place = 15.

Second step:

$$Q_2 = 7 + (3.5 - 3)(15 - 7) = 7 + (0.5)(8) = 7 + 4 = 11$$

Q2 = 11

Obtain Q3:

First step:

$$\frac{J(n + 1)}{c} = \frac{3(6 + 1)}{4} = 5.25$$

Rounded down to the nearest whole number: i = 5, the 5th place of the data, Xi = 24.
Xi+1 = 5th place + 1 place = 6th place = 30.

Second step:

$$Q_3 = 24 + (5.25 - 5)(30 - 24) = 24 + (0.25)(6) = 24 + 1.5 = 25.5$$

Q3 = 25.5

Obtain D7:

First step:

$$\frac{J(n + 1)}{c} = \frac{7(6 + 1)}{10} = 4.9$$

Rounded down to the nearest whole number: i = 4, the 4th place of the data, Xi = 15.
Xi+1 = 4th place + 1 place = 5th place = 24.

Second step:

$$D_7 = 15 + (4.9 - 4)(24 - 15) = 15 + (0.9)(9) = 15 + 8.1 = 23.1$$

D7 = 23.1

Obtain P80:

First step:

$$\frac{J(n + 1)}{c} = \frac{80(6 + 1)}{100} = 5.6$$

Rounded down to the nearest whole number: i = 5, the 5th place of the data, Xi = 24.
Xi+1 = 5th place + 1 place = 6th place = 30.

Second step:

$$P_{80} = 24 + (5.6 - 5)(30 - 24) = 24 + (0.6)(6) = 24 + 3.6 = 27.6$$

P80 = 27.6
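The two steps used for every quantile above (find the position J(n + 1)/c, then interpolate between the two neighbouring values) can be sketched in Python. The function name `quantile_n_plus_1` is hypothetical, and returning the first or last value when the position falls outside the data is an assumption of this sketch that the notes do not discuss.

```python
import math


def quantile_n_plus_1(sorted_data, j, c):
    """Quantile C_J of ordered data: Xi + (J(n+1)/c - i) * (X(i+1) - Xi)."""
    n = len(sorted_data)
    position = j * (n + 1) / c          # J(n + 1) / c, counted from 1
    i = math.floor(position)            # indicated place (rounded down)
    if i < 1:                           # assumption: position before the first value
        return sorted_data[0]
    if i >= n:                          # assumption: position at or after the last value
        return sorted_data[-1]
    x_i = sorted_data[i - 1]            # value at place i (lists start at 0)
    x_next = sorted_data[i]             # value at place i + 1
    return x_i + (position - i) * (x_next - x_i)


data_a = [5, 8, 10, 12, 14, 16, 18, 20, 25, 30, 35]
data_b = [2, 3, 7, 15, 24, 30]
print(quantile_n_plus_1(data_a, 1, 4))                   # Q1, whole-number position
print(quantile_n_plus_1(data_b, 1, 4))                   # Q1
print(quantile_n_plus_1(data_b, 2, 4))                   # Q2
print(quantile_n_plus_1(data_b, 3, 4))                   # Q3
print(round(quantile_n_plus_1(data_b, 7, 10), 1))        # D7
print(round(quantile_n_plus_1(data_b, 80, 100), 1))      # P80
```

Statistical packages offer several interpolation rules for quantiles, so a library function may return slightly different values from this (n + 1) rule.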

6.7 Quantiles of grouped data

The following generic formula is used:

$$C_J = L_i + \left( \frac{J \cdot \frac{n}{c} - F_{i-1}}{f_i} \right) \cdot a$$

With the following hematocrit data, obtain D9

Xi          fi     Fi
30 - 34     3      3
35 - 39     8      11
40 - 44     11     22
45 - 49     9      31    (Fi-1)
50 - 54     4      35    (Li = 50, fi = 4)
Total:      n = 35

First step: Use the following part of the formula presented

$$J \cdot \frac{n}{c} = 9 \cdot \frac{35}{10} = 31.5$$

We highlight the first cumulative frequency Fi equal to or greater than 31.5, which is 35.

Second step: Apply the complete formula

$$D_9 = 50 + \frac{31.5 - 31}{4} \cdot 5 = 50 + \frac{0.5}{4} \cdot 5 = 50 + 0.125 \cdot 5 = 50 + 0.625 = 50.625$$

D9 = 50.625

With the following hematocrit data, obtain P76

Xi          fi     Fi
30 - 34     3      3
35 - 39     8      11
40 - 44     11     22    (Fi-1)
45 - 49     9      31    (Li = 45, fi = 9)
50 - 54     4      35
Total:      n = 35

$$C_J = L_i + \left( \frac{J \cdot \frac{n}{c} - F_{i-1}}{f_i} \right) \cdot a$$

First step: Use the following part of the formula presented

$$J \cdot \frac{n}{c} = 76 \cdot \frac{35}{100} = 26.6$$

We highlight the first cumulative frequency Fi equal to or greater than 26.6, which is 31.

Second step: Apply the complete formula

$$P_{76} = 45 + \frac{26.6 - 22}{9} \cdot 5 = 45 + \frac{4.6}{9} \cdot 5 \approx 45 + 0.511 \cdot 5 \approx 45 + 2.56 = 47.56$$

P76 ≈ 47.56
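The grouped-data formula can be sketched in the same way. The name `grouped_quantile` is hypothetical, and the sketch assumes classes of equal amplitude with the lower limits written as in the hematocrit table.

```python
def grouped_quantile(lower_limits, freqs, width, j, c):
    """Quantile C_J for class intervals: Li + ((J*n/c - F_previous) / fi) * a."""
    n = sum(freqs)
    target = j * n / c                  # J * (n / c)
    cumulative = 0                      # F(i-1): cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= target:    # first Fi equal to or greater than the target
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("the requested quantile lies beyond the table")


hematocrit_limits = [30, 35, 40, 45, 50]
hematocrit_freqs = [3, 8, 11, 9, 4]
d9 = grouped_quantile(hematocrit_limits, hematocrit_freqs, 5, 9, 10)
p76 = grouped_quantile(hematocrit_limits, hematocrit_freqs, 5, 76, 100)
print(d9, round(p76, 2))
```

Calling the same function with j = 1 and c = 2 gives the median of a table with class intervals, since the median is the quantile that leaves half of the data below it.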

A.3. MEASURES OF DISPERSION OR VARIATION

7 Range, Mean Deviation, Variance, Standard Deviation, Coefficient of Variation
7.1 Introduction

The measures of central tendency, as seen earlier, give us information about the
behavior of the data through a value that tends to lie at a more or less central point.
However, they give no information about the dispersion or 'scattering' of the observed
data as a whole. For example: if we are told that two courses each had an average
performance of 60 points out of 100, what conclusions can we draw from this
information? At first sight, performance in the two courses is the same, but the average
alone does not show how the individual scores are spread. Hence the need to complement a
measure of central tendency with a measure of dispersion, to have broader information
about the data set under analysis.

The degree to which numerical data tend to spread around some average value is called
variation or dispersion. A measure of dispersion is important from two points of view:

a) It can be used to show the degree of variation between the values of the observed
   data; thus a small dispersion in the grades of a group of students indicates that
   they are approximately equal in their performance, whereas a greater dispersion
   implies that the students are very unequal in their performance.
b) It can be used to complement an average, to describe a data set or to compare one
   series of information with another. When the dispersion is low, the average value is
   highly representative; if the dispersion is high, the mean (or other measure of
   central tendency) is of little or no representative value.

To calculate the variation, a central point of the observed values is used as a
reference, that is, some measure of central tendency. In practice, the measures of
dispersion calculated around the arithmetic mean are the most widely applied. Among the
most widely used are:

The range
The mean deviation
The variance
The standard deviation

7.2 Range

One of the simplest measures of dispersion is the range, also called total amplitude; it
is the difference between the maximum and minimum values of the data set. For example,
suppose there are two groups of 7 children, A and B, and that both have a mean age of 6
years. With only this information we would say that there is no difference between the
two groups; but if we are also given the extreme ages, Group A from 2 to 10 years and
Group B from 5 to 7 years, it is clear that, although both groups have the same mean,
they are very different in the variability of the ages:

Group A: 10 - 2 = 8 years of range
Group B: 7 - 5 = 2 years of range

[Figure: dot plots of the ages of the 7 children of each group on a scale from 1 to 10
years; Group A is spread between 2 and 10, Group B is concentrated between 5 and 7.]

This indicates that in group A the ages of the children are distributed between 2 and
10 years, and in group B between 5 and 7 years. However, this measure only considers the
extreme data, so it does not tell us how the data as a whole (the intermediate data) are
distributed.

The range is calculated with the following formula:

$$\text{Range} = X_{max} - X_{min}$$

Exercise 1:
a) 4, 5, 5, 6, 7             Range = 7 - 4 = 3

b) 60, 30, 80, 90, 100       Range = 100 - 30 = 70

7.3 Mean deviation (DM)

Another measure of dispersion is the mean deviation, which includes all the data in the
calculation: it is the average of the deviations (or differences) from some central
value, such as the mean, median, or mode. When the mean is taken as the central value,
we have the mean deviation, that is, the arithmetic mean of the deviations around the
mean. If the median is taken as the central value, we have the median deviation, and so
on.

Theoretically, the sum of the deviations from the mean is zero; for this reason, the
mean deviation is calculated with the absolute deviations (without their signs).

7.3.1 The mean deviation in ungrouped data:

∑( )
DM =

Exercise 2: Calculate the DM of the following data on the number of children: 4, 4, 5, 7

Steps and procedure:

1st. Calculate the arithmetic mean:
     X̄ = ΣX / n = (4 + 4 + 5 + 7) / 4 = 20 / 4 = 5

2nd. Determine the absolute value of the difference (deviation) of each value of the variable from its mean, ignoring negative signs:
     (4 - 5), (4 - 5), (5 - 5), (7 - 5) = (-1), (-1), (0), (2)
     Absolute values: 1, 1, 0, 2

3rd. Add the absolute values:
     1 + 1 + 0 + 2 = 4

4th. Divide the previous result by the number of observed cases to obtain the final result:
     DM = 4 / 4 = 1


Developing the formula we have:

$$ DM = \frac{|4-5| + |4-5| + |5-5| + |7-5|}{4} = \frac{|-1| + |-1| + |0| + |2|}{4} = \frac{1 + 1 + 0 + 2}{4} = \frac{4}{4} = 1 $$
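The four steps of Exercise 2 can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `mean_deviation` is an assumption introduced here and does not come from the notes.

```python
def mean_deviation(values):
    """Mean of the absolute deviations about the arithmetic mean (ungrouped data)."""
    n = len(values)
    mean = sum(values) / n                      # step 1: arithmetic mean
    abs_devs = [abs(x - mean) for x in values]  # step 2: absolute deviations
    return sum(abs_devs) / n                    # steps 3 and 4: add and divide by n


children = [4, 4, 5, 7]  # data of Exercise 2
print(mean_deviation(children))
```

With the data of Exercise 2 the function reproduces the DM of 1 obtained by hand.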

7.3.2 Mean deviation with grouped data

For grouped data, each absolute deviation is weighted by its absolute frequency, using the following formula:

$$ DM = \frac{\sum |X_i - \bar{X}| \, f_i}{n} $$

7.3.2.1 In grouped data with a single class

Exercise 3: Calculate the mean deviation of the following grouped data. The arithmetic mean, obtained with the procedures already learned, is 5.36.

Steps and procedure:

1st. Determine the absolute value of each difference between the values taken by the variable and the arithmetic mean (5.36):

Score (Xi)   Abs. freq. (fi)   Xi - X̄               |Xi - X̄|
3            1                 3 - 5.36 = -2.36      2.36
4            5                 4 - 5.36 = -1.36      1.36
5            8                 5 - 5.36 = -0.36      0.36
6            6                 6 - 5.36 =  0.64      0.64
7            5                 7 - 5.36 =  1.64      1.64

2nd. According to the formula, multiply the absolute values of the differences by the absolute frequencies and add the partial products:

Score (Xi)   Abs. freq. (fi)   |Xi - X̄|   |Xi - X̄| fi
3            1                 2.36        2.36 x 1 = 2.36
4            5                 1.36        1.36 x 5 = 6.80
5            8                 0.36        0.36 x 8 = 2.88
6            6                 0.64        0.64 x 6 = 3.84
7            5                 1.64        1.64 x 5 = 8.20
Σ            n = 25                        24.08

3rd. To obtain the final result, divide the previous sum by the total number of cases:
     DM = 24.08 / 25 = 0.96


Applying the formula (omitting some steps that are taken as understood):

Score (Xi)   Abs. freq. (fi)   |Xi - X̄|   |Xi - X̄| fi
3            1                 2.36        2.36
4            5                 1.36        6.80
5            8                 0.36        2.88
6            6                 0.64        3.84
7            5                 1.64        8.20
Σ            n = 25                        24.08

$$ DM = \frac{\sum |X_i - \bar{X}| \, f_i}{n} = \frac{24.08}{25} = 0.96 $$
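The same calculation for grouped data can be sketched as follows. The function name `grouped_mean_deviation` is a hypothetical helper chosen for this illustration; the data are those of Exercise 3.

```python
def grouped_mean_deviation(values, freqs):
    """Mean deviation for data grouped by single values with absolute frequencies."""
    n = sum(freqs)
    # Weighted arithmetic mean of the grouped data
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # Sum of |Xi - mean| * fi, divided by the total number of cases
    return sum(abs(x - mean) * f for x, f in zip(values, freqs)) / n


scores = [3, 4, 5, 6, 7]   # Xi
freqs = [1, 5, 8, 6, 5]    # fi, n = 25
print(round(grouped_mean_deviation(scores, freqs), 2))
```

Rounded to two decimals, the result agrees with the 0.96 obtained in the table.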

7.3.2.2 In grouped data with class intervals

To calculate the mean deviation of data grouped in class intervals, the same steps are followed as in the previous case, taking care first to determine the midpoint (X') of each interval, which replaces Xi in the formula.

Note that the arithmetic mean must be calculated beforehand, as an auxiliary operation, before the formula can be applied.

Exercise 4: Calculate the mean deviation of the following data, whose mean, calculated with the procedure already learned, is 4.78.

Score (Xi)   Midpoint (X')   Abs. freq. (fi)   X' - X̄                |X' - X̄|   |X' - X̄| fi
2-3          2.5             3                 2.5 - 4.78 = -2.28     2.28        2.28 x 3 = 6.84
3-4          3.5             4                 3.5 - 4.78 = -1.28     1.28        1.28 x 4 = 5.12
4-5          4.5             6                 4.5 - 4.78 = -0.28     0.28        0.28 x 6 = 1.68
5-6          5.5             7                 5.5 - 4.78 =  0.72     0.72        0.72 x 7 = 5.04
6-7          6.5             5                 6.5 - 4.78 =  1.72     1.72        1.72 x 5 = 8.60
Σ                            n = 25                                               27.28

$$ DM = \frac{\sum |X' - \bar{X}| \, f_i}{n} = \frac{27.28}{25} = 1.09 $$
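For class intervals the only extra step is computing the midpoints. A minimal sketch, assuming each interval is given as a (lower, upper) pair; the function name `interval_mean_deviation` is hypothetical.

```python
def interval_mean_deviation(intervals, freqs):
    """Return (mean, mean deviation) for data grouped in class intervals."""
    # The midpoint X' of each interval replaces Xi in the formula
    midpoints = [(low + high) / 2 for low, high in intervals]
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    dm = sum(abs(m - mean) * f for m, f in zip(midpoints, freqs)) / n
    return mean, dm


intervals = [(2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]  # data of Exercise 4
freqs = [3, 4, 6, 7, 5]
mean, dm = interval_mean_deviation(intervals, freqs)
print(round(mean, 2), round(dm, 2))
```

Rounded to two decimals, the mean and the DM match the 4.78 and 1.09 of Exercise 4.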


7.4 Variance and Standard Deviation

One of the most useful measures of dispersion in statistical analysis is the standard deviation, which measures how far each observed value lies from the mean and is defined as the positive square root of the variance. The variance is therefore, in a sense, a step prior to the standard deviation. The advantage of these measures over the DM is that they use the differences with their respective signs.

Very important to take into account:

When all elements of the population are taken, the symbols σ² and σ are used to indicate the population variance and standard deviation; if the data come from a sample, S² and S are used to indicate the sample variance and standard deviation, respectively.

7.4.1 The variance and standard deviation with ungrouped data

For the calculation of these measures, the following formulas are applied:

Formula to calculate the variance:

$$ \sigma^2 = \frac{\sum (X_i - \bar{X})^2}{n} $$

Formula to calculate the standard deviation:

$$ \sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n}} $$

Exercise 5: Calculate σ² and σ for the following data: 4, 4, 5, 7

Before applying the formula, the arithmetic mean must be calculated: X̄ = 5

Steps and procedure (developing the operations that the formula indicates):

1st. Calculate the difference between each value of the variable and its mean:
     (4 - 5), (4 - 5), (5 - 5), (7 - 5) = (-1), (-1), (0), (2)

2nd. Square each difference:
     (-1)², (-1)², (0)², (2)² = 1, 1, 0, 4

3rd. Add the previous results and divide by the total number of cases:
     σ² = (1 + 1 + 0 + 4) / 4 = 6 / 4 = 1.5 (variance)

4th. The previous result is the variance; to obtain the standard deviation, take its positive square root:
     σ = √1.5 = 1.2247 ≈ 1.2 (standard deviation)
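The steps of Exercise 5 can be sketched in Python as follows. The function name `population_variance` is an assumption made for this illustration; it divides by n, as the exercise does.

```python
import math


def population_variance(values):
    """Variance with divisor n, as used in Exercise 5."""
    n = len(values)
    mean = sum(values) / n
    # Squared deviations from the mean, added and divided by n
    return sum((x - mean) ** 2 for x in values) / n


data = [4, 4, 5, 7]
variance = population_variance(data)
std_dev = math.sqrt(variance)  # positive square root of the variance
print(variance, round(std_dev, 4))
```

The printed values correspond to the variance of 1.5 and the standard deviation of 1.2247 found by hand.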

7.4.2 The variance and standard deviation with grouped data

For the calculation of the variance and standard deviation of grouped data, the following formulas are applied:

Variance formula:

$$ \sigma^2 = \frac{\sum (X_i - \bar{X})^2 f_i}{n} $$

Standard deviation formula:

$$ \sigma = \sqrt{\frac{\sum (X_i - \bar{X})^2 f_i}{n}} $$

Exercise 6: With grouped data with a single class.

Take into account that the arithmetic mean, according to the procedure already learned, is 5.36.

Steps and procedure:

1st. Calculate the difference between each value taken by the variable and its mean:

Score (Xi)   Abs. freq. (fi)   Xi - X̄
3            1                 3 - 5.36 = -2.36
4            5                 4 - 5.36 = -1.36
5            8                 5 - 5.36 = -0.36
6            6                 6 - 5.36 =  0.64
7            5                 7 - 5.36 =  1.64

2nd. Square the previous results:

Score (Xi)   Abs. freq. (fi)   (Xi - X̄)²
3            1                 (-2.36)² = 5.57
4            5                 (-1.36)² = 1.84
5            8                 (-0.36)² = 0.13
6            6                 (0.64)²  = 0.41
7            5                 (1.64)²  = 2.69

3rd. Multiply the previous results by the corresponding absolute frequencies and add the products:

Score (Xi)   Abs. freq. (fi)   (Xi - X̄)²   (Xi - X̄)² fi
3            1                 5.57         5.57 x 1 = 5.57
4            5                 1.84         1.84 x 5 = 9.20
5            8                 0.13         0.13 x 8 = 1.04
6            6                 0.41         0.41 x 6 = 2.46
7            5                 2.69         2.69 x 5 = 13.45
Σ            n = 25                         31.72

4th. Divide the previous result by the total number of cases; this result is the variance:
     σ² = 31.72 / 25 = 1.27 (variance)

5th. According to the formula, the standard deviation is the positive square root of the variance:
     σ = √1.27 = 1.13 (standard deviation)
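A sketch of the same grouped calculation in Python is given below. The function name `grouped_variance` is hypothetical. Because the code does not round the intermediate squares, its unrounded sum differs slightly from the 31.72 of the table, but the final results agree to two decimals.

```python
import math


def grouped_variance(values, freqs):
    """Variance (divisor n) for data grouped by single values with frequencies."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    # Squared deviations weighted by their absolute frequencies
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n


scores = [3, 4, 5, 6, 7]   # Xi of Exercise 6
freqs = [1, 5, 8, 6, 5]    # fi, n = 25
variance = grouped_variance(scores, freqs)
std_dev = math.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))
```

For class intervals, the same function can be called with the interval midpoints in place of the scores.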


Exercise 7: Variance and standard deviation with grouped data with class intervals.

To calculate the variance and standard deviation, the same steps are followed as in the previous case, except that the midpoint of each interval must be determined first.

No. of children (Xi)   Midpoint (X')   Abs. freq. (fi)   X' - X̄                (X' - X̄)²   (X' - X̄)² fi
1-2                    1.5             8                 1.5 - 4.75 = -3.25     10.6         10.6 x 8 = 84.80
3-4                    3.5             12                3.5 - 4.75 = -1.25     1.56         1.56 x 12 = 18.72
5-6                    5.5             7                 5.5 - 4.75 =  0.75     0.56         0.56 x 7 = 3.92
7-8                    7.5             13                7.5 - 4.75 =  2.75     7.6          7.6 x 13 = 98.80
Σ                                      n = 40                                                206.24

$$ \sigma^2 = \frac{\sum (X' - \bar{X})^2 f_i}{n} = \frac{206.24}{40} = 5.16 \quad \text{(variance)} $$

$$ \sigma = \sqrt{\frac{\sum (X' - \bar{X})^2 f_i}{n}} = \sqrt{5.16} = 2.27 \quad \text{(standard deviation)} $$

7.5 Coefficient of Variation

Another commonly used measure is the coefficient of variation (CV). It is a measure of the relative dispersion of the data and is calculated by dividing the standard deviation of the sample by the mean and multiplying the quotient by 100:

$$ CV = \frac{S}{\bar{X}} \times 100 $$

Its usefulness lies in allowing us to compare the dispersion or variability of two or more groups. The coefficient of variation is used to compare the homogeneity of two data series, even when they are expressed in different units of measurement.

Note that the smaller the coefficient of variation, the greater the homogeneity of the data; in other words, the data are more concentrated around the mean.

For example, suppose we have the weights of 5 patients (70, 60, 56, 83 and 79 kg), with a mean of 69.6 kg and a standard deviation (S) of 10.44 kg, and the heights of the same patients (150, 170, 135, 180 and 195 cm), with a mean of 166 cm and a standard deviation of 21.3 cm. The question is: which distribution is more dispersed, weight or height? If we compare the standard deviations, that of height is much greater; however, we cannot directly compare two variables measured on different scales, so we calculate the coefficients of variation:

CV (weight) = (10.44 / 69.6) x 100 = 15.0%
CV (height) = (21.3 / 166) x 100 = 12.8%

Answer: The more dispersed distribution is that of weight.
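The comparison between weight and height can be checked with a short Python sketch. The helper name `coefficient_of_variation` is an assumption; `statistics.pstdev` (divisor n) is used because it reproduces the standard deviations quoted in the example (10.44 kg and 21.3 cm).

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation (divisor n) as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100


weights_kg = [70, 60, 56, 83, 79]
heights_cm = [150, 170, 135, 180, 195]

cv_weight = coefficient_of_variation(weights_kg)
cv_height = coefficient_of_variation(heights_cm)
print(round(cv_weight, 1), round(cv_height, 1))
print("more dispersed:", "weight" if cv_weight > cv_height else "height")
```

Because the CV has no units, the two percentages can be compared directly even though weight is in kg and height in cm.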

Summary of formulas - Measures of dispersion

Range: R = Maximum Xi - Minimum Xi

Measure              Ungrouped data              Simple grouped data            Grouped data with intervals
Mean deviation       DM = Σ|Xi - X̄| / n         DM = Σ|Xi - X̄| fi / n         DM = Σ|X' - X̄| fi / n
Variance             S² = Σ(Xi - X̄)² / n        S² = Σ(Xi - X̄)² fi / n        S² = Σ(X' - X̄)² fi / n
Standard deviation   S = √[Σ(Xi - X̄)² / n]      S = √[Σ(Xi - X̄)² fi / n]      S = √[Σ(X' - X̄)² fi / n]


A.4 Measures of Shape

8 Coefficient of skewness and kurtosis

8.1 Coefficient of skewness

This measure allows us to identify whether the data are distributed evenly around the central point (the arithmetic mean). Skewness has three different states, as shown in the following figure, each of which describes concisely how the data are distributed with respect to the axis of symmetry. Skewness is said to be positive when most of the data accumulate at values lower than the arithmetic mean, with the tail of the curve extending toward the higher values; the curve is symmetrical when approximately the same number of values is distributed on both sides of the mean; and skewness is negative when most of the data accumulate at values higher than the mean, with the tail extending toward the lower values.

[Figure: three curves: positive skewness, symmetric curve, negative skewness.]

The coefficient of skewness is represented by the following mathematical equation:

$$ g_1 = \frac{\frac{1}{n}\sum (X_i - \bar{X})^3 \, n_i}{\left(\frac{1}{n}\sum (X_i - \bar{X})^2 \, n_i\right)^{3/2}} $$

where (g1) represents Fisher's skewness coefficient, (Xi) each of the values, (X̄) the sample mean and (ni) the frequency of each value. The results of this equation are interpreted as follows:

a) (g1 = 0): The distribution is accepted as symmetrical; that is, there is approximately the same number of values on both sides of the mean. This exact value is difficult to obtain, so values close to zero, whether positive or negative (± 0.5), are usually accepted.

b) (g1 > 0): The curve is positively skewed, so the values tend to cluster more on the left side of the mean than on the right.

c) (g1 < 0): The curve is negatively skewed, so the values tend to cluster more on the right side of the mean.

The larger the coefficient in absolute value (positive or negative), the greater the distance separating the cluster of values from the mean.
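The interpretation rules above can be tried out with a small Python sketch. It assumes the usual moment-based definition of Fisher's coefficient (third central moment divided by the standard deviation cubed, both with divisor n); the function name `fisher_skewness` and the three small data sets are hypothetical illustrations, not data from the notes.

```python
def fisher_skewness(values, freqs=None):
    """Moment-based skewness g1; freqs are optional absolute frequencies."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    m2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n  # variance
    m3 = sum(f * (x - mean) ** 3 for x, f in zip(values, freqs)) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
tail_right = [1, 1, 1, 2, 10]   # most values on the left, long tail to the right
tail_left = [1, 9, 10, 10, 10]  # most values on the right, long tail to the left

for name, data in [("symmetric", symmetric), ("tail right", tail_right), ("tail left", tail_left)]:
    print(name, round(fisher_skewness(data), 2))
```

The first data set gives a coefficient of zero, the second a positive value and the third a negative value, in line with cases a), b) and c).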

8.2 Kurtosis
This measure determines the degree of concentration of the values in the central region of the distribution. Through the coefficient of kurtosis we can identify whether there is a high concentration of values (leptokurtic), a normal concentration (mesokurtic) or a low concentration (platykurtic).

To calculate the kurtosis coefficient, the following formula is used:

$$ g_2 = \frac{\frac{1}{n}\sum (X_i - \bar{X})^4 \, n_i}{\left(\frac{1}{n}\sum (X_i - \bar{X})^2 \, n_i\right)^{2}} - 3 $$

where (g2) represents the kurtosis coefficient, (Xi) each of the values, (X̄) the sample mean and (ni) the frequency of each value. The results of this formula are interpreted as follows:

a) (g2 = 0): the distribution is mesokurtic. As with skewness, it is quite difficult to find a kurtosis coefficient of exactly zero, so close values (approximately ± 0.5) are usually accepted.

b) (g2 > 0): the distribution is leptokurtic.

c) (g2 < 0): the distribution is platykurtic.
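A minimal Python sketch of the kurtosis coefficient follows, assuming the usual moment-based definition in which 3 is subtracted so that a normal curve gives zero. The function name `excess_kurtosis` and both data sets are hypothetical.

```python
def excess_kurtosis(values, freqs=None):
    """Moment-based kurtosis g2 (fourth central moment / variance squared, minus 3)."""
    if freqs is None:
        freqs = [1] * len(values)
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    m2 = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    m4 = sum(f * (x - mean) ** 4 for x, f in zip(values, freqs)) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5]               # values spread evenly: low central concentration
peaked = [0, 5, 5, 5, 5, 5, 10]      # values piled up at the center

print(round(excess_kurtosis(flat), 2))    # negative: platykurtic
print(round(excess_kurtosis(peaked), 2))  # positive: leptokurtic
```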

When a data distribution has a skewness coefficient (g1) within ± 0.5 and a kurtosis coefficient (g2) within ± 0.5, it is referred to as a Normal Curve. This criterion is of the utmost importance, since most procedures of statistical inference require the data to be normally distributed.

The main advantage of the normal distribution lies in the fact that approximately 95% of the values lie within two standard deviations of the arithmetic mean; that is, if we add two standard deviations to the mean and subtract two standard deviations from the mean, about 95% of the cases fall within the interval bounded by these two values.


B. INFERENTIAL BIOSTATISTICS

We define it as the branch of statistics that uses a set of methods which, relying on probability calculations and on the data from a significant sample, allow us to generalize and obtain valid conclusions for the entire population under study.

Inferential statistics results from applying probability to the results already known from descriptive statistics. The results of that application are therefore expressed in probabilistic language.

The result may seem strange, diffuse yet precise. Based on the results obtained with inferential statistics we can state, for example: "There is a statistically significant association between the Municipal Health Index and Maternal Mortality (p < 0.001, that is, with a probability greater than 99.9%). The municipalities with a very low Municipal Health Index (ISM) have a Maternal Mortality 5.79 times higher (95% CI: 5.59 to 5.99) than the municipalities with an average ISM."

The statements that inferential statistics allows us to make carry a risk, and whoever uses it must know this. It is not difficult, in any case, because all these statements are formulated in terms of risk, certainty and uncertainty: in terms of probability.

Inference is always made in approximate terms and with a stated level of confidence. For example, if in a sample of n = 500 soldiers a mean height X̄ = 172 cm is obtained, a conclusion of the following type can be reached: the mean height of all the soldiers lies between 171 cm and 173 cm, and this statement is made with a confidence level of 95%. (This means that the statement will be correct in 95% of the studies conducted under the same conditions as this one, and that in the remaining 5% an error will occur.)

The two types of problems that inferential techniques solve are "estimation" and "hypothesis testing". In both cases, the aim is to generalize the information obtained from a sample to a population. These techniques require the sample to be random, as far as possible.

Since inferential statistics performs probability calculations for a whole population from a sample, for didactic purposes we will begin by studying the determination of sample size in the following chapter.

9 B.1 Sampling

9.1 Introduction

One of the important purposes of any research is to be able to generalize from a sample to a larger population. The quality and reliability of the results depend mainly on the quality and scientific rigor with which the sample was chosen.

A fundamental aspect in the design of clinical studies is the determination of an appropriate sample size. If the sample size is very small, the study will have low statistical power; consequently, the estimates will be less precise and the probability of finding significant differences between treatments or groups will be lower. On the other hand, if the sample size is very large, research resources will be misused and more patients than strictly necessary will be subjected to tests.

In the two chapters that follow we will deal with the classification and application of sampling and with the determination of sample size.

First, however, it is important to study the terminology and the concepts that we will use in these two chapters:

9.2 Individual:

It is defined as the elementary unit of study, which belongs to a population. It is the element that gives rise to the value of the variables. The individual and/or unit of study can be a person (man or woman), an animal, a plant or an object, a medical history, an X-ray, etc.

In Health Sciences we will not only study people, since their health also depends on their environment: on animals, plants or objects, which will therefore also be studied.

9.3 Population:

It is the set of individuals and/or units of study; these can be people, animals, plants or things.

Depending on whether its total number is known or unknown, the population is classified as finite, if the number of its members is known, or infinite, if that number is unknown.

This classification is important for choosing the formula used to determine the sample size.

9.4 Sample:

It is the selection of a number of study units from a defined population. It is an important part of the design and methodology of a research study, as it is strongly related to the degree of generalization that can be made from the results of a specific study.

When conducting research, there are several reasons to sample:

a) Speed
b) Cost
c) Feasibility
d) Accuracy

Regarding the first three reasons, it is obvious that studying a hundred people is faster and cheaper than studying a thousand or more, and more feasible in terms of human and physical resources and logistical support. As for accuracy, with a smaller workload it is possible to employ better qualified personnel, who guarantee a more precise measurement of the phenomenon of interest, and to supervise the work better, which produces more accurate results.

To carry out a sampling, we have to answer three questions:

a) What is the population under study?
b) How many people are needed in the sample?
c) How is the sample to be selected?

A sample must be:

a) Representative: It includes all the important characteristics of the population from which it was taken, in similar proportions. This allows the researcher to make valid inferences about the entire population from which the sample was obtained; that is, it meets one of the requirements for extrapolating the results of the sample to the population from which it was drawn.

b) Adequate: This refers to its size and answers the second question. The size is calculated with various established formulas, depending on whether the study seeks the proportion existing in a population, differences between means, or differences between the proportions of two populations.

To answer the third question, it is necessary to understand the different sampling methods, which we will study in this chapter.

Population

The members of the sample cannot be chosen at will; whenever possible, they must be chosen at random.

If from this population of 38 people we must choose 3, whom do we choose?

[Figure: a population of 38 people, numbered 1 to 38.]

To choose the members of the sample and know who they are, two types of sampling can be used: probabilistic and non-probabilistic.

9.4.1 Types of sampling

9.4.1.1 Probabilistic sampling:

All individuals of the population are in equal conditions: they have the same chance of being part of the sample.

9.4.1.2 Non-probabilistic sampling:

The individuals of the population are incorporated into the sample according to the personal or subjective criteria of the researcher.

The two types of sampling are classified as follows:

Probabilistic sampling:
1. Simple random
2. Systematic random
3. Stratified sampling
4. Cluster sampling
5. Single-stage sampling
6. Multi-stage sampling

Non-probabilistic sampling:
1. Accidental sampling
2. Purposive or convenience sampling
3. Sample of volunteers

Whenever possible, it is better to use probabilistic methods, because they have better statistical support and reliability; non-probabilistic methods tend to introduce unwanted biases, which can confound the results.

Next, we describe each of the types of probabilistic sampling, also called random sampling.

A. Probabilistic or Random: simple random sampling

a) Urn method:
A simple, although not very practical, way of obtaining a random sample is the "urn" technique. It consists of placing in an urn tokens bearing the name or number of each element of the population and, after mixing them properly, drawing as many tokens as the sample size that has been decided. Thanks to this careful mixing before each draw, every element has the same chance of being selected.

b) Use of the random digit table:
In these tables the digits are distributed at random; starting at any point of the table and continuing up, down, to the right or to the left, the desired random numbers are obtained. As in the previous example, if the population is 38 and we must choose 3 people, we start at number 12 and continue to the right, skipping the numbers greater than 38; the next two numbers obtained are 27 and 5. Therefore, the people marked with the numbers 12, 27 and 5 are chosen.

[Figure: the population of 38 people, with persons 12, 27 and 5 selected.]

It is also possible to use computer software, such as STATS™ v.2 or others, in which one enters the sample size, the lower limit (1 in our example) and the upper limit (38 in our example).

Numbers generated: 35, 10, 30

After entering the data into the computer, we observe that the people selected for the sample are those numbered 35, 10 and 30, who will be the subjects of the research study.
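As an alternative to a dedicated package, the same selection can be sketched with Python's standard `random` module. The function name `simple_random_sample` is a hypothetical helper; the numbers drawn will differ from run to run (and from the 35, 10 and 30 above) unless a seed is fixed.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct people numbered 1..population_size, all equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(range(1, population_size + 1), sample_size)


chosen = simple_random_sample(population_size=38, sample_size=3)
print(chosen)
```

Sampling is done without replacement, so no person can be selected twice, just as in the urn method.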

[Figure: population of 38 people, numbered 1 to 38; sample of 3 people: numbers 35, 10 and 30.]

A. Probabilistic or Random: systematic random sampling

The sampling interval is obtained by dividing the total population by the sample size.
Example:
Population: 2000 people
Sample: 100 people
Sampling interval = 2000 / 100 = 20

Start with a number between 1 and 20 and then keep adding 20 until the sample of 100 is complete.
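The example above can be sketched in Python as follows. The function name `systematic_sample` is an assumption, and the sketch assumes that the population size is an exact multiple of the sample size, as in the example.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Pick a random start within the first interval, then every k-th person."""
    interval = population_size // sample_size          # sampling interval k
    start = random.Random(seed).randint(1, interval)   # number between 1 and k
    return [start + i * interval for i in range(sample_size)]


sample = systematic_sample(population_size=2000, sample_size=100)
print(sample[:5], "...", sample[-1])
```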


A. Probabilistic or Random: stratified sampling

The population is divided into strata.

Example:
1st, 2nd and 3rd year of the Faculty of Medicine.
According to the number of individuals in each stratum, percentages are calculated.

Total population: 1200
Sample: 120
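A proportional allocation of the sample among the strata can be sketched as below. The notes give only the totals (population 1200, sample 120); the stratum sizes used here (500, 400 and 300 students) are purely hypothetical values chosen for illustration, as is the function name `proportional_allocation`.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Share the sample among strata in proportion to the size of each stratum."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}


# Hypothetical stratum sizes adding up to the total population of 1200
strata = {"1st year": 500, "2nd year": 400, "3rd year": 300}
print(proportional_allocation(strata, sample_size=120))
```

With other stratum sizes the rounded shares may not add up exactly to the sample size, in which case a small manual adjustment is needed.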

A. Probabilistic or Random: cluster sampling

The population is divided, as in the previous case, into strata or clusters.
Example: neighborhoods, municipalities.

The neighborhoods or municipalities can be chosen by sampling, and the people within them can also be chosen by sampling, in proportion to size.

A. Probabilistic or Random: single-stage sampling

Sampling is carried out in a single stage.
Example: only the faculties of a university.

A. Probabilistic or Random: multi-stage sampling

Sampling is carried out in several stages.
Example: first faculties, then courses, and then students.

Types of sampling
B. Non-Probabilistic or Non-Random: accidental sampling

It means collecting data from anyone who passes by on the street or in another place, under no established rule.
CAUTION! It generates biases!

B. Non-Probabilistic or Non-Random: purposive or convenience sampling

It seeks to make the sample representative.
The selected individuals have knowledge of the topic.

B. Non-Probabilistic or Non-Random

It gives equal opportunities to those involved in the investigation.
Example:
Sample of 200 people:
50% women and 50% men

B. Non-Probabilistic or Non-Random: sample of volunteers

Widely used in experimental medicine.
It requires the voluntary consent of the people and their acceptance of the conditions and risks.

10 B.2 Determination of sample size

10.1 Introduction

Every research study involves determining, during the design phase, the sample size needed to carry it out. Skipping this step can lead to two different situations. The first is that we carry out the study without the appropriate number of patients, so that we cannot estimate the parameters precisely and fail to find significant differences when they do in fact exist. The second is that we study an unnecessarily large number of patients, which implies not only a loss of time and an unnecessary increase in resources, but also that the quality of the study may be negatively affected by this increase.

A question researchers are frequently asked is: what percentage of the population makes a good sample? Unfortunately, there is no answer that is satisfactory for all cases; the appropriate sample size is determined by various factors, so the optimal size must be determined in each case, taking into account the particularities of the study.

In statistics, the sample size is the number of subjects that make up the sample drawn from a population, needed for the data obtained to be representative of that population.

10.2 Parameters for sample size calculation

The parameters taken into account for the calculation of sample size are:
a) Level of confidence
b) Proportion
c) Margin of error (absolute precision)
d) Value of Q
e) Population or study universe

a) Level of confidence

The confidence level is represented by the letter Z and measures, as its name indicates, the degree of confidence in a result obtained from a sample, which allows us to generalize it and expect to find the same data in the rest of the population that the sample represents. Logically, a study will have a 100% confidence level if the research is conducted on 100% of the population; however, since only a sample is studied, the results can no longer have a 100% confidence level, and this level decreases from 99% downward as the sample size becomes smaller.

For the results of an investigation to have sufficient statistical significance, it is recommended that the confidence level not be less than 90%; the confidence level for sample size determination therefore lies in the interval from 90% to 99%.

When a statistical package is used to calculate the sample size, only the percentage of the desired confidence level needs to be entered; when the formulas are applied manually, however, instead of the percentage we must write the value of 'Z' that corresponds to that confidence level, according to the following table:

Confidence level   Z
90%                1.65
91%                1.695
92%                1.751
93%                1.812
94%                1.881
95%                1.96
96%                2.054
97%                2.170
98%                2.326
99%                2.576
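The Z values of the table are the two-sided critical values of the standard normal distribution, so they can be reproduced with Python's standard library. The function name `z_for_confidence` is a hypothetical helper introduced for this sketch.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided critical value Z for a confidence level given as a fraction (e.g. 0.95)."""
    alpha = 1 - confidence
    # Leave alpha/2 in each tail of the standard normal distribution
    return NormalDist().inv_cdf(1 - alpha / 2)


for level in range(90, 100):
    print(f"{level}%  {z_for_confidence(level / 100):.3f}")
```

The value computed for 90% is 1.645, which the table rounds to 1.65.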

b) Proportion

The proportion is represented by the letter 'P'. It is the percentage or proportion of cases that we expect to find in our research, based on the percentage or proportion of cases found in other studies carried out in populations similar to the one where we want to conduct our study.

The literature review carried out for the "Theoretical Framework" of the research protocol will provide us with results or proportions found in different parts of the world. These results may differ. For example, suppose we want to conduct a study on the prevalence of diabetes in the city of Sucre, and we find in the literature that one study reported 3% of diabetics in Mexico, another 2.5% in Ecuador and another 1% in Tarija. Which of the three do we adopt as the value of 'P' to determine our sample size? The one from Tarija, of course, since it is the closest to the city of Sucre. It is also possible to carry out a preliminary pilot study and obtain a more realistic estimate in the city of Sucre itself. If this proportion or percentage cannot be determined for the population, it is set by default at 50%.

c) Margin of error (absolute precision)

The margin of error is represented by the letter 'd'. As we have seen, the value of the proportion discussed above may differ from one place to another, and even between different studies in the same place; therefore, the value adopted for our sample size calculation may differ from the one we eventually find in our own research. To cushion these differences, as well as differences in the reading and interpretation of results from the equipment used and possible human errors, the statistical method introduces the parameter 'margin of error', which ranges from 1% to 5%.

The smaller our margin of error, the larger our sample size will be. Conversely, the larger our margin of error, the smaller our sample size will be.

d) Value of Q

It is represented by the letter 'Q'.

It is the value obtained from the difference between 100 and the proportion 'P' that we adopted.

For example, if we adopt a percentage of 1% for the study of diabetes, then:

Q = 100 - P = 100 - 1 = 99
Q = 99

This value of Q is used only when the sample size formula is applied manually, not with a software package, since the computer calculates it automatically.

e) Population or study universe

When a sample size is calculated, the sample comes from a previously defined study population.

This population can be known (quantified or finite) or unknown (not quantified or infinite).

Depending on whether the population is infinite or finite, the sample size is calculated with a different formula, as we see below:

10.3 Sample size calculation with unknown and known population

10.3.1 Sample size calculation with unknown population and/or infinite universe

When the population size is unknown, the following formula is used:

n = Sample size
Z2Level of trust or security sought
P = Percentage or proportion of cases that are assumed to exist in the population that we
I am interested in studying due to previous studies, in the same research place or in another.
similar. If it is not known, it is assumed that there is 50%.
Q = Difference of the percentage or proportion to be studied. That is, Q = 100 - P
d = Desired precision or estimated tolerable margin of error

Exercise:

In a certain population, we wish to estimate the percentage of women who use contraceptive methods. What sample size is required to ensure, with a 95% confidence level, that the estimation error does not exceed 3%? Previous studies indicate that the percentage of women using contraceptive methods is 25%.

Z = 1.96 (95% confidence)
P = 25
Q = 100 - 25 = 75
d = 3

$$ n = \frac{1.96^2 (25 \times 75)}{3^2} = \frac{3.8416 \times 1875}{9} = \frac{7203}{9} \approx 800 $$

The sample size is 800 women.
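The exercise above can be checked with a few lines of Python. This is only a minimal sketch: the function name `sample_size_infinite` and its parameter names are illustrative and do not come from EPIDAT or any other package; percentages are entered as in the text (25 for 25%).

```python
def sample_size_infinite(z: float, p: float, d: float) -> float:
    """Sample size for estimating a proportion when the population size is unknown.

    z: Z value for the chosen confidence level (1.96 for 95%)
    p: expected percentage of cases (0-100)
    d: tolerable margin of error, in percentage points
    """
    q = 100 - p  # complement of P
    return (z ** 2) * p * q / d ** 2


# Exercise above: 95% confidence, P = 25%, d = 3%
n = sample_size_infinite(z=1.96, p=25, d=3)
print(round(n))  # 800 (the unrounded value is about 800.33)
```

The function returns the unrounded value so that the rounding rule (nearest whole number, or always upward) stays a visible choice of the user.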

10.3.2 Calculation of sample size with known population and/or finite universe

The formula used to determine the sample size with a known population is the following:

$$ n = \frac{N \cdot Z^2 \cdot P \cdot Q}{d^2 (N - 1) + Z^2 \cdot P \cdot Q} $$

n = Sample size
N = Known population (number of inhabitants) of the place where the research will take place
Z² = Square of the Z value for the confidence level sought (Z = 1.96 for 95%)
P = Percentage or proportion of cases assumed to exist in the population we are interested in studying, based on previous studies in the same research location or in a similar one. If it is not known, 50% is assumed.
Q = Complement of the percentage or proportion to be studied. That is, Q = 100 - P
d = Desired precision or tolerable margin of error

Exercise:

In the locality of 'Rio Hondo', which has 4500 inhabitants over the age of 35, it is proposed to measure the blood glucose level of the population over 35 years old in order to determine whether a food education program is needed.
A previous measurement in a similar locality gives a proportion of hyperglycemia of 14%.

How many subjects need to be studied if we want a margin of error of 2% and a confidence level of 95%?

N = 4500
Z = 1.96
P = 14
Q = 100 - 14 = 86
d = 2

$$ n = \frac{4500 (1.96)^2 (14 \times 86)}{2^2 (4500 - 1) + (1.96)^2 (14 \times 86)} = \frac{4500 (3.8416)(1204)}{4 (4499) + (3.8416)(1204)} = \frac{4500 \times 4625.29}{17996 + 4625.29} = \frac{20{,}813{,}805}{22{,}621.29} \approx 920 $$

Therefore, the sample size for this study is 920 people.

A higher level of confidence and a smaller margin of error require a larger sample size.
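The 'Rio Hondo' exercise can be reproduced in the same way. The sketch below assumes the finite-population formula implied by the worked arithmetic above, with d²(N - 1) in the denominator; the function name `sample_size_finite` is illustrative only.

```python
def sample_size_finite(N: int, z: float, p: float, d: float) -> float:
    """Sample size for estimating a proportion in a known (finite) population.

    N: population size
    z: Z value for the chosen confidence level (1.96 for 95%)
    p: expected percentage of cases (0-100)
    d: tolerable margin of error, in percentage points
    """
    q = 100 - p  # complement of P
    numerator = N * z ** 2 * p * q
    denominator = d ** 2 * (N - 1) + z ** 2 * p * q
    return numerator / denominator


# 'Rio Hondo' exercise: N = 4500, 95% confidence, P = 14%, d = 2%
n = sample_size_finite(N=4500, z=1.96, p=14, d=2)
print(round(n))  # 920 (the unrounded value is about 920.1)
```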


The same calculation that was performed manually can be carried out with the computer program 'Epidemiological Analysis Program' EPIDAT version 3.0, whose installer is included with this book.

To calculate the sample size with this program, left-click on 'Methods'.

A cascade of three menus opens: in the first, place the pointer over 'Sampling'; in the second, place it over 'Sample size calculation'; and in the third, left-click on 'Proportion'.

Methods > Sampling > Sample size calculation > Proportion

Another window immediately appears asking for the data needed to calculate the sample size.

Each requested value is entered. In the previous example these are: a population size of 4500; an expected proportion of 14 (the program uses a decimal point, so it automatically displays 14.000 when the percentage 14 is entered); a confidence level expressed as a percentage, which in our example is 95 (the program automatically adds the fraction ".0"); and finally the margin of error, which the program calls 'Absolute precision' and which in our example is 2% (the program automatically adds the fraction ".000").

In the 'Absolute precision' box, a minimum and a maximum margin of error can be entered, which makes it possible to calculate sample sizes with the same parameters but with 2 different margins of error, always between 1 and 5%. The 'Increase' box simply indicates automatically the difference between the minimum and maximum margins of error.

The 'Design effect' field, which appears automatically, is not taken into account.

As we can see, with the same input parameters that were used manually, the sample size obtained in a few seconds is also 920 people. We therefore verify that the Epidemiological Analysis Program reaches exactly the same result, saving a lot of time and leaving no room for procedural errors.

Statistical power and sample size

Statistical power refers to the ability to detect an association of interest in the presence of sampling error. Let's assume there is a true association of a certain type and degree; because of chance, a study may observe that association as weaker or stronger.

To be reasonably sure that our study will detect the association, the study must be large enough for the sampling error to be controlled.


In general terms, large studies are powerful and small studies are weak. The concept of "small study bias" illustrates the importance of understanding statistical power when interpreting epidemiological research.

To help the student of research methodology understand the relationship between sample size and statistical power, we use the 'Program for epidemiological analysis of tabulated data' EPIDAT version 3.0 to calculate sample sizes with different parameters (confidence level, and margin of error or absolute precision) for the same population and expected proportion. Suppose we want to carry out a research study on the use of emergency services during 2011 in a neighborhood of the city of Sucre with a population of 680 people, knowing that in another neighborhood of the city 10% of the population used this service:

Calculation   N     Confidence level   d (margin of error)   P     Sample size (n)
1             680   90%                5%                    10%   86
2             680   99%                5%                    10%   177
3             680   90%                1%                    10%   532
4             680   99%                1%                    10%   611

From these sample size determinations we can conclude that with a lower confidence level (90%) and a higher margin of error (5%), as in calculation 1, the sample size is only 86 people (not advisable), since the research will have very little statistical power.

However, if we raise the confidence level to 99% and lower the margin of error to only 1%, as in calculation 4, we achieve the greatest statistical power, and the sample size increases to 611 people.

The larger the sample size, the greater the statistical power; the smaller the sample size, the lower the statistical power. Under this concept, if we wanted a confidence level of 100%, we would have to conduct the research on the total population.
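The four calculations can be reproduced with a short script. This is a sketch under stated assumptions: it uses the finite-population formula with d²(N - 1) in the denominator, obtains Z from the standard library's `statistics.NormalDist`, and rounds up to the next whole person. These choices reproduce the four values quoted above, but they are not taken from EPIDAT's documentation, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def z_for_confidence(confidence_pct: float) -> float:
    """Two-sided Z value for a confidence level given as a percentage."""
    alpha = 1 - confidence_pct / 100
    return NormalDist().inv_cdf(1 - alpha / 2)


def sample_size_finite(N: int, confidence_pct: float, p: float, d: float) -> int:
    """Finite-population sample size; p and d are percentages."""
    z = z_for_confidence(confidence_pct)
    q = 100 - p
    n = N * z ** 2 * p * q / (d ** 2 * (N - 1) + z ** 2 * p * q)
    return math.ceil(n)  # assumption: round up to the next whole person


# Same population (680) and expected proportion (10%), varying the other inputs
results = {
    (conf, d): sample_size_finite(N=680, confidence_pct=conf, p=10, d=d)
    for conf in (90, 99)
    for d in (5, 1)
}
for (conf, d), n in results.items():
    print(f"confidence {conf}%, margin of error {d}%: n = {n}")
```

Running the loop over other confidence levels or margins of error shows how quickly the required sample approaches the whole population as the demands on precision increase.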


11 B.3 Basic notions of Normal Distribution

11.1 Introduction

In statistics and probability, the 'normal distribution', also called the Gauss distribution or Gaussian distribution, is one of the continuous probability distributions that appears most frequently in real phenomena.

The graph of its density function is bell-shaped and symmetrical about a certain parameter (the mean). This curve is known as the bell curve.

Next, we observe an example of the normal distribution of triglycerides in students from the Faculty of Medicine:

[Figure: Distribution of triglycerides (mg/dl) in students of the Medicine degree]

A common problem in the field of medicine is being able to know whether an individual is healthy or sick, based on observations of healthy individuals.


For example: we consider it normal for an adult to have a systolic blood pressure of
130 mm of mercury and abnormal to have a systolic pressure of 210 mm of mercury.

To establish the boundaries between what is normal and pathological, it is necessary to know the
distribution of the variable under study in normal individuals.

11.2 Characteristics of the normal distribution

The graph used to represent a frequency distribution is the histogram; joining the upper points of its bars forms the bell.

The graph of the normal distribution resembles a symmetrical bell. The mean, the median and the mode of the distribution have the same value. The distribution is completely defined by the mean and the standard deviation.

[Figure: normal curve with the mean X̄ at the center and 1, 2 and 3 standard deviations (S) marked on each side]

Between X̄ - 1S and X̄ + 1S lies 68.27% of the area

Between X̄ - 2S and X̄ + 2S lies 95.45% of the area

Between X̄ - 3S and X̄ + 3S lies 99.73% of the area

Example: Suppose the ages of a population follow a normal distribution in which the mean is 42 years and the standard deviation is 5 years. Based on this information we can state that approximately 68% of individuals are between 37 and 47 years old, that is, within X̄ ± 1S = 42 ± 5.

1S, 2S and 3S mean adding or subtracting (±) the value of the standard deviation multiplied by 1, 2 or 3.
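The three percentages, and the age example, can be checked with `NormalDist` from Python's standard `statistics` module. A minimal sketch; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Age example: mean 42 years, standard deviation 5 years
ages = NormalDist(mu=42, sigma=5)


def share_within(k: int) -> float:
    """Proportion of the distribution between mean - k*S and mean + k*S."""
    low = ages.mean - k * ages.stdev
    high = ages.mean + k * ages.stdev
    return ages.cdf(high) - ages.cdf(low)


for k in (1, 2, 3):
    low, high = ages.mean - k * ages.stdev, ages.mean + k * ages.stdev
    print(f"between {low:.0f} and {high:.0f} years: {share_within(k):.2%}")
```

The proportions depend only on k, not on the mean or standard deviation chosen, which is why the same three percentages apply to any normal distribution.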

11.3 Calculation of areas

To calculate the area under the normal curve for a given value of the variable 'x', tables of areas of the normal distribution have been constructed with the following characteristics:

1. The total area under the normal curve is equal to 1 (which is equivalent to 100%).

2. Because of symmetry, the area from zero (the standardized mean) to the right and the area from zero to the left are each equal to 0.5 (50% of the area); together they equal 1, or 100%.

3. In the table, the 1st column contains the integer part and first decimal of Z; the second decimal is found at the top (1st row).

4. The values in the 1st column and the 1st row represent the values of Z, while the values in the body of the table represent the probabilities.

AREA CALCULATION

Examples of application of the normal curve table (area to the right of Z):
Z = 0.90: area = 0.1841 = 18.41%
Z = 1.53: area = 0.0630 = 6.30%
Z = 2.99: area = 0.0014 = 0.14%
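The three table lookups above correspond to the area to the right of Z under the standard normal curve. As a hedged sketch, the same areas can be computed instead of read from the table; `upper_tail_area` is an illustrative name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def upper_tail_area(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 1 - standard_normal.cdf(z)


for z in (0.90, 1.53, 2.99):
    area = upper_tail_area(z)
    print(f"Z = {z:.2f}: area = {area:.4f} ({area:.2%})")
```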

To calculate the area under the normal curve from a certain value of the variable 'x', the original variable must be transformed so that its mean and standard deviation take the values on which the table is built (mean 0 and standard deviation 1). This transformed variable is called the standard normal variable and is symbolized by 'Z':

$$ Z = \frac{X - \bar{X}}{S} $$

Where:
Z = Number of standard deviations from the mean
X = Some value of interest
X̄ = Arithmetic mean of the normal distribution
S = Standard deviation of the normal distribution.

Example: Suppose that, given a blood hematocrit determination, we have to decide whether the value is normal or not. We accept that the hematocrit has a normal distribution with a mean of 48% and a standard deviation of 4%. Suppose that a value of 56% is found in a patient. What is the probability of this happening in a healthy person?

X̄ = 48
S = 4
X = 56

$$ Z = \frac{X - \bar{X}}{S} = \frac{56 - 48}{4} = \frac{8}{4} = 2.00 $$

This means that a hematocrit of 56% is located 2 standard deviations above the mean.


In the normal distribution table, the area found at the intersection of the row corresponding to 2.0 in the first column and the column corresponding to 0.00 in the first row is 0.0228.

This means that, according to the normal distribution model, the probability of finding hematocrit values equal to or greater than 56% is 0.0228, or 2.28% when multiplied by 100. In other words, about 2.28% of healthy individuals are expected to have hematocrit values of 56% or more.
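The hematocrit example can be verified numerically. The sketch below assumes, as the text does, a normal distribution with mean 48 and standard deviation 4; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 48, 4      # hematocrit (%) in healthy individuals, as assumed in the text
x = 56                # value found in the patient

z = (x - mean) / sd   # number of standard deviations from the mean
p_at_least = 1 - NormalDist(mean, sd).cdf(x)  # P(hematocrit >= 56)

print(f"Z = {z:.2f}")
print(f"P(X >= {x}) = {p_at_least:.4f}")
```

The computed probability agrees with the table value of 0.0228 to four decimal places.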

Similarly, the table allows other probabilities to be calculated, such as that of finding values within a certain interval of the variable 'x', for which it must be kept in mind that the total area equals 1.
For example: we would like to know the probability of finding hematocrit values between 45% and 50%. We find 'Z' for both values:
X̄ = 48
S = 4
X = 45 and 50


$$ Z_1 = \frac{45 - 48}{4} = \frac{-3}{4} = -0.75 \qquad P_1 = 0.2266 $$

$$ Z_2 = \frac{50 - 48}{4} = \frac{2}{4} = 0.50 \qquad P_2 = 0.3085 $$

Adding the two tail areas P1 and P2 and subtracting the result from the total area (1), we find the probability sought:
P1 + P2 = 0.2266 + 0.3085 = 0.5351
1 - 0.5351 = 0.4649
So the probability of finding hematocrit values between 45% and 50% is 0.4649; in other words, 46.49% of healthy individuals have a hematocrit between 45% and 50%.
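The interval calculation can be checked the same way. In this sketch, `p1` is the area below 45 and `p2` the area above 50, matching the two tail areas used above; the small difference from 0.4649 comes only from rounding the table values to four decimals.

```python
from statistics import NormalDist

hematocrit = NormalDist(mu=48, sigma=4)  # distribution assumed in the text

p1 = hematocrit.cdf(45)       # lower tail: area below 45 (Z = -0.75)
p2 = 1 - hematocrit.cdf(50)   # upper tail: area above 50 (Z = 0.50)
p_between = 1 - (p1 + p2)     # area between 45 and 50

print(f"P1 = {p1:.4f}, P2 = {p2:.4f}, P(45 <= X <= 50) = {p_between:.4f}")
```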


12 B.4 Basic Concepts of Probability

12.1 Introduction

Probability measures the frequency with which a result (or set of results) is obtained when carrying out a random experiment, of which all the possible results are known, under sufficiently stable conditions. Probability theory is used extensively in areas such as statistics, physics, mathematics, science and philosophy to draw conclusions about the likelihood of potential events and the underlying mechanics of complex systems.

The probability of an event is equal to the quotient of the number of favorable cases and the number of equally possible cases.

Example: On each roll of a die, each of the 6 numbers has a 1/6 chance of coming up.

6 possible outcomes: 1, 2, 3, 4, 5, 6
1/6 + 1/6 + 1/6 + 1/6 + 1/6 + 1/6 = 1


Faced with a random phenomenon, the result is not known in advance.

Is it possible to have a number that measures the likelihood that each of the events occurs?

Frequencies and Law of Chance

Let's consider the random phenomenon of 'tossing a coin in the air':

50% chance of getting heads and 50% chance of getting tails
0.5 + 0.5 = 1
50% + 50% = 100%


Properties of probability

The sum of the probabilities of all the elementary events of a phenomenon is equal to one.
For example, consider the experiment of tossing a coin in the air and observing the result. The two possible results are the two elementary events "observe heads" and "observe tails". If the coin tossed is fair (not rigged), the probability associated with each of the two elementary events is 0.5. Thus, the sum of the probabilities of the two elementary events of the phenomenon is:

1 = 0.5 (prob. heads) + 0.5 (prob. tails)

Frequencies and Law of Chance

1st experiment

Number of heads = 6; relative frequency (heads) = 0.6
Number of tails = 4; relative frequency (tails) = 0.4
0.6 + 0.4 = 1

2nd experiment

Number of heads = 4; relative frequency (heads) = 0.4
Number of tails = 6; relative frequency (tails) = 0.6
0.4 + 0.6 = 1


It has been observed that, as a fair coin is tossed an increasing number of times, the relative frequency of heads stabilizes around a fixed number (0.5).

[Figure: graphic illustration of the law of chance. Relative frequency of heads (0 to 1) against number of tosses (1 to 10000); the frequency settles at 0.5, the probability of heads, and the probability of tails is likewise 0.5]

Law of chance:

In a long series of trials, the relative frequency of an event tends to stabilize around a fixed number called the probability of the event.
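The law of chance can be illustrated with a small simulation. This is only a sketch: the coin is simulated with Python's pseudo-random generator, the seed 2015 is an arbitrary choice made so that the run is repeatable, and the function name is illustrative.

```python
import random


def relative_frequency_of_heads(n_tosses: int, rng: random.Random) -> float:
    """Toss a simulated fair coin n_tosses times and return the share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


rng = random.Random(2015)  # fixed seed: an arbitrary value, only for repeatability
for n in (10, 100, 1000, 10000, 100000):
    print(f"{n:>6} tosses: relative frequency of heads = "
          f"{relative_frequency_of_heads(n, rng):.4f}")
```

With few tosses the relative frequency can be far from 0.5, as in the two 10-toss experiments above; as the number of tosses grows, it settles ever closer to 0.5.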

Properties of probability:

The probability of an event A, P(A), is ALWAYS a number between 0 and 1.
For every event A, P(A) ≥ 0 and P(A) ≤ 1.

The probability of an event is equal to 1 minus the probability of the opposite event.
For example: if the probability of being male in a population is 0.49, then the probability of not being male (that is, of being female) is 1 - 0.49 = 0.51.


CONCEPT OF PROBABILITY AND PROPERTIES
(Reviewing the initial concept)

Laplace's rule

The probability of an event is equal to the quotient of the number of cases favorable to the event and the number of possible cases of the phenomenon:

$$ P(A) = \frac{h}{n} $$

P(A) = Probability that an event occurs

h = Cases favorable to the event. Example: die 1, coin 1

n = Possible cases. Example: die 6, coin 2


Contrary probability of an event:

It is the quotient of the number of cases unfavorable to the event and the number of equally possible cases:

$$ P(\bar{A}) = \frac{d}{n} $$

P(Ā) = Probability that the event does not occur

d = Cases unfavorable to the event. Example: die 5, coin 1

n = Possible cases. Example: die 6, coin 2

P(A) + P(Ā) = 1

Examples:

1. When flipping a coin, the probability of getting heads is:

n = 2, h = 1, d = 1
P(A) = 1/2 = 0.5 or 50%
P(Ā) = 1/2 = 0.5 or 50%
0.5 + 0.5 = 1

2. What is the probability of getting a 5 when rolling a die?

n = 6, h = 1, d = 5
P(A) = 1/6 = 0.17 or 17%
P(Ā) = 5/6 = 0.83 or 83%
0.17 + 0.83 = 1

3. In a group formed by 7 patients with arterial hypertension and 3 with diabetes, 2 people are chosen at random. What is the probability that a person chosen has diabetes?

n = 10 people, h = 3 with diabetes, d = 7 with hypertension
P(A) = 3/10 = 0.30 or 30%
P(Ā) = 7/10 = 0.70 or 70%
0.30 + 0.70 = 1

4. In a group made up of 3 tuberculosis patients and 9 healthy people, 4 people are chosen at random. What is the probability that a person chosen has tuberculosis?

n = 12 people, h = 3 with tuberculosis, d = 9 healthy
P(A) = 3/12 = 0.25 or 25%
P(Ā) = 9/12 = 0.75 or 75%
0.25 + 0.75 = 1
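The four examples follow the same pattern, favorable cases over possible cases, and can be expressed with exact fractions. A minimal sketch; the function names `laplace` and `complement` are illustrative, and each probability refers to a single person or outcome drawn at random, as in the worked answers above.

```python
from fractions import Fraction


def laplace(favorable: int, possible: int) -> Fraction:
    """Laplace's rule: favorable cases (h) divided by possible cases (n)."""
    return Fraction(favorable, possible)


def complement(p: Fraction) -> Fraction:
    """Probability of the opposite event."""
    return 1 - p


# (h, n) pairs taken from the four examples above
examples = {
    "coin shows heads": (1, 2),
    "die shows a 5": (1, 6),
    "person has diabetes": (3, 10),
    "person has tuberculosis": (3, 12),
}
for label, (h, n) in examples.items():
    p = laplace(h, n)
    print(f"{label}: P(A) = {p} = {float(p):.2f}, "
          f"P(not A) = {complement(p)} = {float(complement(p)):.2f}")
```

Working with `Fraction` keeps the sum P(A) + P(Ā) exactly equal to 1, whereas rounded decimals such as 0.17 and 0.83 only add up to 1 approximately.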


13 B.5 Basic Notions of Correlation
13.1 Introduction

In the analysis of clinical-epidemiological studies, the need often arises to determine the relationship between two quantitative variables in a group of subjects. The objectives of this analysis are usually:

a) To determine whether the two variables are correlated, that is, whether the values of one variable tend to be higher or lower for higher or lower values of the other variable.

b) To be able to predict the value of one variable given a specified value of the other variable.

c) To assess the level of agreement between the values of the two variables.

Correlation is the study of the association between two quantitative variables. Its calculation is the first step in determining the relationship between the variables.

The strength of the linear relationship between two quantitative variables is quantified by calculating the Pearson correlation coefficient. This coefficient ranges between -1 and +1. A value of +1 indicates a perfect positive linear relationship and a value of -1 a perfect negative linear relationship. A correlation close to zero indicates that there is no linear relationship between the two variables.

Plotting the data is essential for relating the value of the correlation coefficient to the shape of the graph, since non-linear relationships exist.

13.2 Pearson Correlation Coefficient

The Pearson correlation coefficient (r) can be calculated for any data set; however, the validity of the hypothesis test on the correlation between the variables strictly requires: a) that the two variables come from a random sample of individuals; b) that at least one of the variables has a normal distribution in the population from which the sample is drawn. For the confidence interval of the correlation coefficient r to be valid, both variables must have a normal distribution. If the data do not have a normal distribution, one or both variables can be transformed (logarithmic transformation); otherwise, a non-parametric correlation coefficient (Spearman's correlation coefficient) would be calculated, which has the same meaning as the Pearson correlation coefficient and is calculated using the ranks of the observations.

The calculation of the correlation coefficient (r) between the weight and height of 20 boys is shown in the attached table. The covariance, which in this example is expressed as the product of weight (kg) and height (cm), is divided by the standard deviation of X (height) and the standard deviation of Y (weight) so that it has no dimension and becomes a coefficient. This gives the Pearson correlation coefficient, which in this case is 0.885 and indicates a strong correlation between the two variables. Of course, a strong correlation does not imply causation. If we square the correlation coefficient we obtain the coefficient of determination (r² = 0.783), which indicates that 78.3% of the variability in weight is explained by the child's height. Therefore, there are other variables that modify and explain the variability in the weight of these children. Introducing more variables through multivariate analysis techniques will allow us to identify the importance that other variables may have on weight.

Table 1. Calculation of the Pearson correlation coefficient between the height and weight variables of 20 male children

Y = Weight (kg)   X = Height (cm)   X - X̄     Y - Ȳ   (X - X̄)(Y - Ȳ)
 9                72                  5.65     1.4       7.91
10                76                  9.65     2.4      23.16
 6                59                 -7.35    -1.6      11.76
 8                68                  1.65     0.4       0.66
10                60                 -6.35     2.4     -15.24
 5                58                 -8.35    -2.6      21.71
 8                70                  3.65     0.4       1.46
 7                65                 -1.35    -0.6       0.81
 4                54                -12.35    -3.6      44.46
11                83                 16.65     3.4      56.61
 7                64                 -2.35    -0.6       1.41
 7                66                 -0.35    -0.6       0.21
 6                61                 -5.35    -1.6       8.56
 8                66                 -0.35     0.4      -0.14
 5                57                 -9.35    -2.6      24.31
11                81                 14.65     3.4      49.81
 5                59                 -7.35    -2.6      19.11
 9                71                  4.65     1.4       6.51
 6                62                 -4.35    -1.6       6.96
10                75                  8.65     2.4      20.76


Sx = Standard deviation of X = 8.087

Sy = Standard deviation of Y = 2.137
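The values in Table 1 can be reproduced from the raw data. The sketch below assumes the sample (n - 1) form of the covariance and standard deviations, which is the form that yields the Sx and Sy quoted above; the function name `pearson_r` is illustrative.

```python
import math

# Data from Table 1: weight (kg) and height (cm) of the 20 boys
weight_kg = [9, 10, 6, 8, 10, 5, 8, 7, 4, 11, 7, 7, 6, 8, 5, 11, 5, 9, 6, 10]
height_cm = [72, 76, 59, 68, 60, 58, 70, 65, 54, 83,
             64, 66, 61, 66, 57, 81, 59, 71, 62, 75]


def pearson_r(x, y):
    """Return (r, s_x, s_y) using the n - 1 denominator throughout."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    covariance = sum_xy / (n - 1)
    s_x = math.sqrt(sum_xx / (n - 1))
    s_y = math.sqrt(sum_yy / (n - 1))
    return covariance / (s_x * s_y), s_x, s_y


r, s_x, s_y = pearson_r(height_cm, weight_kg)
print(f"Sx = {s_x:.3f}, Sy = {s_y:.3f}, r = {r:.3f}, r^2 = {r * r:.3f}")
```

Squaring the unrounded r gives a coefficient of determination that differs slightly in the third decimal from the 0.783 obtained by squaring the rounded value 0.885.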

The correlation coefficient, as previously indicated, ranges from -1 to +1, with the value 0 in the middle indicating that there is no linear association between the two variables under study. A low coefficient does not necessarily mean that there is no correlation, since the variables may have a non-linear relationship, as may be the case between newborn weight and gestation time. In this case, r underestimates the association because it measures it linearly. Non-parametric methods would be better suited in this case to show whether the variables tend to rise together or move in different directions.

Another example of the application of Pearson's correlation coefficient (r):

[Figure: Maternal mortality ratio and Human Development Index. Scatter plot of the maternal mortality ratio (vertical axis) against the Human Development Index (horizontal axis, 0 to 0.8); n = 28; r = -0.628; p < 0.01]

The higher the Human Development Index, the lower the maternal mortality ratio tends to be.
Pearson's correlation test (r = -0.628) confirmed, for the population under study, a significant inverse relationship at the 0.01 level.
Kendall's Tau-b correlation coefficient: -0.484, significant at the 0.01 level.
Spearman's Rho correlation coefficient: -0.654, significant at the 0.01 level.


14 B.6 Chi-Square - χ²

Salvador Pita Fernández (1), Sonia Pértega Díaz (2)

(1) Family Doctor. Health Center of Cambre (A Coruña).
(2) Clinical Epidemiology and Biostatistics Unit. Juan Canalejo Hospital Complex (A Coruña).

In biomedical research, we often encounter qualitative data or variables, through which a group of individuals is classified into two or more mutually exclusive categories. Proportions are a common way of expressing frequencies when the variable under study has two possible responses, such as presenting or not presenting an event of interest (disease, death, cure, etc.). When the aim is to compare two or more groups of subjects with respect to a categorical variable, the results are often presented in double-entry tables called contingency tables. The simplest situation for comparing two qualitative variables is the one in which both have only two possible response options (that is, dichotomous variables). In this situation the contingency table reduces to a two-by-two table like the one shown in Table 1.

Suppose we want to study the possible association between smoking during pregnancy and low birth weight. The aim is therefore to see whether the probability of low birth weight differs between pregnant women who smoke and those who do not. To answer this question, a follow-up study is conducted on a cohort of 2000 pregnant women, who are asked about their smoking habits during pregnancy and whose newborns are weighed. The results of this study are shown in Table 2.


In Table 1, a, b, c and d are the observed frequencies of the event, which in our study example are 43, 207, 105 and 1645, with n (2000) being the total number of cases studied, and a+b, c+d, a+c and b+d the marginal totals. In the example, a+b = 250 is the total number of women who smoked during pregnancy, c+d = 1750 the total number of non-smoking women, a+c = 148 the number of children with low birth weight and b+d = 1852 the number of children with normal birth weight.

Given a contingency table like the one above, we can pose different questions. First, we want to determine whether there is a statistically significant relationship between the variables studied. Second, we are interested in quantifying that relationship and studying its clinical relevance.

This last question can be resolved through the so-called measures of association or effect (relative risk (RR), odds ratio (OR), absolute risk reduction (ARR)), which have already been addressed in other works. To answer the first question, on the other hand, the methodology for analyzing contingency tables depends on several aspects: the number of categories of the variables being compared, whether or not the categories are ordered, the number of independent groups of subjects being considered, and the question to be answered.

There are different statistical procedures for the analysis of contingency tables, such as the χ² test, Fisher's exact test, McNemar's test and Cochran's Q test, among others.

This article presents the calculation and interpretation of the χ² test as the standard method of analysis for independent groups.

The χ² test in the test of independence of qualitative random variables

The χ² test allows us to determine whether two qualitative variables are associated or not. If at the end of the study we conclude that the variables are not related, we can say, with a previously established level of confidence, that they are independent. To compute it, the expected frequencies (those that should have been observed if the independence hypothesis were true) must be calculated and compared with the frequencies actually observed. In general, for an r x k table (r rows and k columns), the value of the χ² statistic is calculated as follows:

$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (1) $$

where:

O_ij denotes the observed frequencies: the number of observed cases classified in row i and column j.

E_ij denotes the expected or theoretical frequencies: the number of expected cases corresponding to each row and column. It can be defined as the frequency that would be observed if both variables were independent.

Thus, the χ² statistic measures the difference between the values that would result if the two variables were independent and those actually observed. The greater that difference (and, therefore, the value of the statistic), the stronger the relationship between the two variables. The differences between observed and expected values are squared so that every difference becomes positive. The χ² test is therefore a non-directional test (two-tailed test) that indicates whether or not there is a relationship between two factors, but not the direction of that association.

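The expected values E are obtained as the product of the marginal totals divided by the total number of cases (n). For the simplest case of a 2x2 table like Table 1, they are:

$$ E_{11} = \frac{(a+b)(a+c)}{n} \qquad E_{12} = \frac{(a+b)(b+d)}{n} \qquad E_{21} = \frac{(c+d)(a+c)}{n} \qquad E_{22} = \frac{(c+d)(b+d)}{n} $$

For the example data in Table 2, the expected values are calculated as follows:

$$ E_{11} = \frac{250 \times 148}{2000} = 18.5 \qquad E_{12} = \frac{250 \times 1852}{2000} = 231.5 \qquad E_{21} = \frac{1750 \times 148}{2000} = 129.5 \qquad E_{22} = \frac{1750 \times 1852}{2000} = 1620.5 $$

So the observed and expected values for the example data are those shown in Table 3.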
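The expected frequencies can be computed for a table of any size from its marginal totals. A minimal sketch: the observed counts take d = 1645, the value consistent with the marginal totals quoted in the text (c+d = 1750, b+d = 1852, n = 2000), and the function name is illustrative.

```python
# Observed frequencies: rows = smokers / non-smokers,
# columns = low birth weight / normal birth weight
observed = [
    [43, 207],
    [105, 1645],
]


def expected_frequencies(table):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_frequencies(observed)
for obs_row, exp_row in zip(observed, expected):
    print(obs_row, exp_row)
```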


The value of the χ² statistic for this specific example is then given by:

$$ \chi^2 = \frac{(43 - 18.5)^2}{18.5} + \frac{(207 - 231.5)^2}{231.5} + \frac{(105 - 129.5)^2}{129.5} + \frac{(1645 - 1620.5)^2}{1620.5} = 40.04 $$

In light of this result, what we need to do now is set up a hypothesis test between the null hypothesis:

H0: There is no association between the variables (in the example, low birth weight and smoking during pregnancy are independent, not associated).

And the alternative hypothesis:

Ha: There is an association between the variables, that is, low birth weight and smoking during pregnancy are associated.

Under the null hypothesis of independence, it is known that the values of the statistic χ2 are
they are distributed according to a known distribution called chi-squared, which depends on a
parameter called degrees of freedom (g.l.). For the case of a contingency table.
of r rows and k columns, the degrees of freedom are equal to the product of the number of rows minus 1 (r-1) by
the number of columns minus 1 (k-1). Thus, in the case where the relationship is studied
between two dichotomous variables (2x2 table) the degrees of freedom are 1.

If the null hypothesis is true, the obtained value should fall within the range of highest
probability of the corresponding chi-squared distribution. The p-value that most
statistical packages report is simply the probability of obtaining, under that
distribution, a value at least as extreme as the one given by the test; that is, the
probability of a result like the observed one, or a more extreme one, if the independence
hypothesis were true. If the p-value is very small (usually p<0.05 is used), the data are
unlikely under the null hypothesis, and it should be rejected.
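
For the 2x2 case (one degree of freedom) the p-value can be sketched with the standard
library alone, because a chi-squared variable with 1 d.f. is the square of a standard
normal variable. The function name is illustrative, and the shortcut below does not apply
to tables with more degrees of freedom.

```python
import math


def p_value_chi2_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    With 1 d.f., chi-squared is the square of a standard normal Z, so
    P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


p_at_384 = p_value_chi2_1df(3.84)    # statistic near the usual 5% threshold
p_at_4004 = p_value_chi2_1df(40.04)  # a much larger statistic
print(round(p_at_384, 3))
print(p_at_4004 < 0.005)
```

A statistic of 3.84 gives a p-value of about 0.05, while a statistic of 40.04 gives a
p-value far below 0.005.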

In Table 4, the degrees of freedom are located in the first column and the value of α in
the first row. The number at their intersection is the corresponding critical value. Thus,
if the obtained χ² statistic is greater than this value, the difference is said to be
significant. For a confidence level of 95% (α = 0.05), the critical value of a chi-squared
distribution with one degree of freedom is 3.84. For α = 0.01 it is 6.63, and for
α = 0.005 it is 7.88. Since the χ² calculated in the example was 40.04, which exceeds the
value for α = 0.005, we can conclude that the two variables are not independent but
associated (p<0.005). Therefore, in light of the results, we reject the null hypothesis
(H0) and accept the alternative hypothesis (Ha) as likely true.
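
The decision rule can be sketched as a simple lookup. The critical values are the ones
quoted in the text for one degree of freedom; the dictionary and function names are
illustrative only.

```python
# Critical values of chi-squared with 1 degree of freedom, as quoted in the text.
CRITICAL_1DF = {0.05: 3.84, 0.01: 6.63, 0.005: 7.88}


def degrees_of_freedom(rows, cols):
    """Degrees of freedom of an r x k contingency table: (r - 1) * (k - 1)."""
    return (rows - 1) * (cols - 1)


def is_significant(statistic, alpha):
    """True when the statistic exceeds the 1-d.f. critical value for alpha."""
    return statistic > CRITICAL_1DF[alpha]


df = degrees_of_freedom(2, 2)
decisions = {alpha: is_significant(40.04, alpha) for alpha in CRITICAL_1DF}
print(df)
print(decisions)
```

With the statistic of 40.04 from the example, the null hypothesis is rejected at all three
significance levels.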

For a 2x2 table, expression (1) of the χ² statistic can be simplified and
obtained as:
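
A common way to write this shortcut, assuming the cells are labelled a and b in the first
row and c and d in the second row (labels chosen here for illustration; Table 1 may use
others), with n = a + b + c + d:

```latex
\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)\,(c+d)\,(a+c)\,(b+d)}
```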

When the sample size is small, using the chi-squared distribution to approximate the
frequencies may introduce some bias in the calculations, so that the value of the χ²
statistic tends to be larger. To eliminate this bias a correction is sometimes used which,
in the case of 2x2 tables, is known as Yates' correction:
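
In its usual form, Yates' correction subtracts 0.5 from each absolute difference between
observed and expected values. With the cells of the 2x2 table labelled a, b, c, d (an
assumed labelling) and n = a + b + c + d, this can be sketched as:

```latex
\chi^2_Y = \sum \frac{\left(|O - E| - 0.5\right)^2}{E}
         = \frac{n\,\left(|ad - bc| - \tfrac{n}{2}\right)^2}{(a+b)\,(c+d)\,(a+c)\,(b+d)}
```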

In the previous example, calculating the χ² statistic with Yates' correction would give
a value of χ²Y = 38.43 (p<0.01) instead of χ² = 40.04. There is no consensus in the
literature on whether to use this conservative correction, which makes it harder to
reject the null hypothesis with small samples, although its effect is practically
imperceptible with larger samples.
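
The effect of the correction can be illustrated with a short sketch. The counts are
hypothetical (not the data of the example), the cell labels a, b, c, d and the function
name are assumptions, and the max(..., 0) guard is a common convention added here rather
than something stated in the text.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-squared for the 2x2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction; never let the corrected difference go negative.
        diff = max(diff - n / 2.0, 0.0)
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * diff ** 2 / denominator


plain = chi_square_2x2(20, 30, 10, 40)
corrected = chi_square_2x2(20, 30, 10, 40, yates=True)
print(round(plain, 2), round(corrected, 2))
```

For this small hypothetical table the statistic drops from about 4.76 to about 3.86, so
the corrected value is the more conservative of the two.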

However, it is worth mentioning that using Yates' correction does not waive certain
requirements on the sample size needed to use the χ² statistic. As a general rule, 80% of
the cells in a contingency table are required to have expected values greater than 5.
Thus, in a 2x2 table all cells must meet this condition, although in practice one of them
is usually allowed to have an expected frequency slightly below this value. When this
requirement is not met, there is a test proposed by R.A. Fisher that can be used as an
alternative to the χ² test, known as Fisher's exact test. The procedure consists of
evaluating the probability associated with all the 2x2 tables that can be formed with the
same marginal totals as the observed data, under the assumption of independence. The
calculations, although elementary, are somewhat cumbersome, so they are not included in
this work; there are multiple references that can be consulted in this regard.
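
For readers who want to see the idea in concrete terms, the following is a minimal sketch
of the procedure described above. The counts are hypothetical, the function name is
illustrative, and the two-sided p-value is computed under one common convention (summing
the tables that are no more probable than the observed one); other conventions exist.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all marginal totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = prob(a)
    # Add up every table that is no more probable than the observed one
    # (small tolerance to absorb floating-point rounding).
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


p_fisher = fisher_exact_2x2(3, 1, 1, 3)  # hypothetical small sample
print(round(p_fisher, 4))
```

With only eight cases in this hypothetical table, the p-value is about 0.49, so
independence would not be rejected.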

To conclude, it is important to emphasize that other statistical methods exist for
analyzing the relationship between qualitative variables, and they complement the
information obtained from the χ² statistic. On the one hand, the analysis of standardized
residuals makes it possible to verify the direction in which the relationship between the
studied variables occurs. There are also other measures of association, many of which are
especially useful when one of the variables is measured on a nominal or ordinal scale,
that quantify the degree of relationship between both factors.


B.7 Confidence Interval


15
When a health researcher conducts a study on a sample of the total population, for
example measuring the arithmetic mean of the heart rate in 1000 people, and obtains an
average of 70 beats per minute, that result comes from only a part of the population. It
cannot be generalized to the entire population by saying that people with less or more
than 70 beats per minute have some heart problem. To generalize to the whole population,
the researcher must construct a confidence interval (for example, 60 to 80 beats per
minute), a range that, with 95% confidence (α = 0.05), is expected to contain the true
population value.

[Figure: diagram relating Population, Sample, Experiment/Measurement, Results and
Estimation through the confidence interval (C.I.)]

I. Concept of Confidence Interval (CI).

In the context of estimating a population parameter, a confidence interval is a range of
values (calculated from a sample) in which the true value of the parameter is found with a
given probability.

The confidence interval describes the variability between the measurement obtained in a
study (sample) and the actual measure in the population (the real value). It corresponds
to a range of values, obtained under the assumption of a normal distribution, in which the
real value of a certain variable is found with high probability. This "high probability"
has been set by consensus at 95%. Thus, a 95% confidence interval indicates that the value
of the parameter lies within the given range with 95% certainty.

The probability that the true value of the parameter lies in the constructed interval is
called the confidence level and is denoted 1 - α. The probability of being wrong is called
the significance level and is denoted α. Generally, confidence intervals are constructed
with 1 - α = 95% (that is, significance α = 5%). Intervals with α = 10% or α = 1% are less
frequent.
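
Each confidence level corresponds to a multiplier taken from the standard normal
distribution, the Z that appears in the interval formulas. As a minimal sketch (the
function name is illustrative, and the normal approximation is assumed), it can be
obtained from the Python standard library:

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Two-sided standard normal multiplier for a confidence level 1 - alpha."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


z90 = z_for_confidence(0.90)
z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)
print(round(z90, 3), round(z95, 3), round(z99, 3))
```

The 95% level gives the familiar multiplier of about 1.96; the 90% and 99% levels give
about 1.645 and 2.576.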

II. Confidence interval for a mean or average:

When a sample of a population is studied and the arithmetic mean or average of the
studied values is obtained, that mean is only a measure of central tendency, and it
carries a margin of error when it is inferred to the entire population from which the
sample was drawn. To obtain a range of values that contains the population mean with 95%
confidence, we compute a confidence interval (IC95).

To obtain an IC95 we first need to calculate the arithmetic mean and the standard
deviation, and to know the number of people studied.

Example:

The following data are the hematocrit values (in percent) obtained for 30 second-year
students of the Faculty of Medicine aged between 18 and 20 years:

38 39 39 40 41 41 43 45 45 45

45 45 45 46 46 46 47 47 47 47

47 48 48 48 49 50 50 51 51 51

The following results were obtained from these data:

Arithmetic mean: X̄ = 45.7
Variance: S2 = 14
Standard deviation: S = 3.7
Total number: n = 30
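
These summary values can be checked with a short sketch using the 30 values listed above.
It assumes the sample formulas with n - 1 in the denominator, which the text does not
state explicitly; the variable names are illustrative.

```python
import statistics

# Hematocrit values (%) of the 30 students, as listed in the example.
hematocrit = [38, 39, 39, 40, 41, 41, 43, 45, 45, 45,
              45, 45, 45, 46, 46, 46, 47, 47, 47, 47,
              47, 48, 48, 48, 49, 50, 50, 51, 51, 51]

n = len(hematocrit)
mean = statistics.mean(hematocrit)
variance = statistics.variance(hematocrit)  # sample variance (n - 1 denominator)
sd = statistics.stdev(hematocrit)           # sample standard deviation

print(n, round(mean, 1), round(variance, 1), round(sd, 1))
```

The mean and standard deviation round to 45.7 and 3.7; the variance is about 13.5, which
rounds to 14 as a whole number.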

To construct a 95% confidence interval for the mean hematocrit from the sample of 30
students, we use the following formula:


Formula:

$$ IC_{95} = \bar{X} \pm Z \times \frac{S}{\sqrt{n}} $$

As Z for 95% is equivalent to 1.96:

$$ IC_{95} = \bar{X} \pm 1.96 \times \frac{S}{\sqrt{n}} $$

Breaking down the formula, the lower and upper limits of the confidence interval are:

$$ \bar{X} - 1.96 \times \frac{S}{\sqrt{n}} \quad \text{and} \quad \bar{X} + 1.96 \times \frac{S}{\sqrt{n}} $$

Replacing the found values we have:

$$ IC_{95} = 45.7 \pm 1.96 \times \frac{3.7}{\sqrt{30}} $$

$$ IC_{95} = 45.7 \pm 1.96 \times \frac{3.7}{5.48} $$

$$ IC_{95} = 45.7 \pm 1.96 \times 0.675 $$

$$ IC_{95} = 45.7 \pm 1.32 $$

$$ IC_{95} = 44.38 \text{ to } 47.02 $$

Therefore, with 95% confidence, the mean hematocrit of the studied population is between
44.38% and 47.02%.
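
The same calculation can be sketched in a few lines, using the summary values of the
example (mean 45.7, standard deviation 3.7, n = 30). The function name is illustrative and
the normal multiplier 1.96 is taken as given. Hand calculations may differ in the second
decimal depending on how the intermediate steps are rounded.

```python
import math


def ci_mean(mean, sd, n, z=1.96):
    """Confidence interval for a mean: mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


low, high = ci_mean(45.7, 3.7, 30)
print(round(low, 2), round(high, 2))
```

Without intermediate rounding, the limits are approximately 44.38 and 47.02.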

III. Confidence Interval for a Proportion ( p ).

In this case, the aim is to construct a confidence interval for a population proportion
or percentage (for example, the percentage of obese individuals, smokers, etc.).

Proceeding analogously to the confidence interval for a mean, we can build a 95%
confidence interval for the population proportion; this proportion is represented by the
symbol "p̂".

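Formula:

$$ IC_{95} = \hat{p} \pm 1.96 \times \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} $$

Therefore, in disaggregated form, the lower and upper limits are:

$$ \hat{p} - 1.96 \times \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} \quad \text{and} \quad \hat{p} + 1.96 \times \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} $$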

Example:

In a prevalence study of risk factors in a cohort of 825 women of childbearing age in the
city of Sucre, 26% were found to be obese. The 95% confidence interval (IC95) for the
proportion of obese women in the city of Sucre is given by:

Total number: n = 825

Percentage of obese women: % = 26 (to obtain the proportion, divide by 100:
26/100 = 0.26)

Proportion of obese women: p̂ = 0.26

Z for IC95: Z = 1.96

$$ IC_{95} = 0.26 \pm 1.96 \times \sqrt{\frac{0.26 \times (1 - 0.26)}{825}} $$

$$ IC_{95} = 0.26 \pm 1.96 \times \sqrt{\frac{0.26 \times 0.74}{825}} $$

$$ IC_{95} = 0.26 \pm 1.96 \times \sqrt{\frac{0.1924}{825}} $$

$$ IC_{95} = 0.26 \pm 1.96 \times \sqrt{0.0002332} $$

$$ IC_{95} = 0.26 \pm 1.96 \times 0.01527 $$

$$ IC_{95} = 0.26 \pm 0.03 $$

$$ IC_{95} = 0.23 \text{ to } 0.29 $$


Multiplying by 100 to express this as the percentage of obese women of childbearing age in
the study population, we find, with 95% confidence, that it is between 23% and 29%.

IC95 = 23% to 29%
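
As a minimal sketch, the interval for the proportion can be reproduced with the values of
the example (proportion 0.26, n = 825). The function name is illustrative, and the normal
approximation with the multiplier 1.96 is assumed.

```python
import math


def ci_proportion(p_hat, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p * (1 - p) / n)."""
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


low, high = ci_proportion(0.26, 825)
print(round(low, 2), round(high, 2))
```

Rounded to two decimals, the limits are 0.23 and 0.29, that is, 23% and 29%.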

IV. Use of Confidence Intervals to Verify Hypotheses.

Confidence intervals allow the verification of hypotheses posed about population
parameters.

For example, suppose the hypothesis is raised that the average birth length of girls in
the city of Sucre is equal to the national average of 52 centimeters.

Taking a sample of 30 female newborns from the city of Sucre, the following was obtained:

X̄ = 50 centimeters
s = 2
n = 30


When constructing a 95% confidence interval for the population mean, one obtains:

Formula:

$$ IC_{95} = \bar{X} \pm Z \times \frac{S}{\sqrt{n}} $$

As Z for 95% is equivalent to 1.96:

$$ IC_{95} = \bar{X} \pm 1.96 \times \frac{S}{\sqrt{n}} $$

Replacing the values:

$$ IC_{95} = 50 \pm 1.96 \times \frac{2}{\sqrt{30}} $$

$$ IC_{95} = 50 \pm 1.96 \times \frac{2}{5.48} $$

$$ IC_{95} = 50 \pm 0.72 $$

$$ IC_{95} = 49.28 \text{ to } 50.72 $$

Therefore, with 95% confidence, the mean birth length of girls from Sucre lies between
49.28 and 50.72 centimeters.

As the interval does not include the value of 52 centimeters posed in the hypothesis, the
hypothesis is rejected with 95% confidence (or with a p-value less than 0.05).
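
This check can be sketched as follows, using the values of the example (mean 50, s = 2,
n = 30, hypothesized value 52). The function name is illustrative, and the normal
multiplier 1.96 is assumed, as in the formulas above.

```python
import math


def ci_contains(hypothesized, mean, sd, n, z=1.96):
    """Return whether the hypothesized value lies inside the CI, and the CI limits."""
    margin = z * sd / math.sqrt(n)
    low, high = mean - margin, mean + margin
    return low <= hypothesized <= high, (low, high)


inside, (low, high) = ci_contains(52, 50, 2, 30)
print(inside, round(low, 2), round(high, 2))
```

The margin is about 0.72, so the limits are about 49.28 and 50.72; the value 52 falls
outside them and the hypothesis is rejected.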
