Applied Biostatistics 2020 - 01 Basics, Centrality and Dispersion

This document provides an overview of an elective module on applied biostatistics taught by Alexandr Parlesak from February 3rd to March 13th, 2020. The module consists of 72 hours of face-to-face learning, 84 hours of directed learning, and 84 hours of autonomous learning, totaling 10 ECTS points. Key topics that will be covered include basics, centrality, dispersion, probability, statistical inference, and how to perform statistical analyses and interpret results using the R software package. The overall objectives are for students to understand fundamental statistical principles and analyses and develop skills to evaluate quantitative data and research claims.


Global Nutrition & Health

Elective Module “Applied Biostatistics”

Teacher: Alexandr Parlesak


February 3rd – March 13th, 2020

University College
Copenhagen

72 h Face-to-face learning;
84 h Directed learning;
84 h Autonomous learning

10 ECTS Points
Applied Biostatistics 2020 A. Parlesak 1

Chapter 1: Basics, Centrality and Dispersion


Literature:
• Dalgaard, P.: Introductory Statistics with R. 2nd ed. 2008
• Urdan, T.C.: Statistics in Plain English. 3rd ed. Routledge, New York
• Kirkwood, B.R.; Sterne, J.A.C.: Essential Medical Statistics. 2nd ed. Blackwell, Malden, MA


About Me

• I’m not a formally trained theoretical statistician.
• My competence in statistics grew out of the need to evaluate more than 150 clinical/experimental studies, more than 45 of them published. The urge to understand statistical evaluation made me close the theoretical gaps.
• Accordingly, the focus of this course will be on the practical application of statistical procedures.
• Nevertheless, the most important chunks of statistical theory will be built into the teaching so that you gain a basic understanding of the principles of statistical testing.
Why Should You be Here?

• You have an interest in statistical reasoning.
• You have a desire to learn to use statistics properly in data analysis.
• You want to evaluate the (quantitative) data of your bachelor thesis with your own skills.
• You want to develop your ability to critically assess scientific arguments and reveal pseudo-scientific abuse of statistics.
• You want to have a common basis of communication with professional statisticians when bringing up problems of your work (statistical literacy).


Why Do Statistics at All?
When searching for e.g. clinical studies, the abstract may look like this:

• In quantitative science, statements on differences, correlations, etc. ALWAYS need to be backed up by an appropriate statistical test.
• This test usually gives you key indicators such as the effect size and the probability that the finding arose by chance alone.
• There is NO evidence without statistical proof.
Objectives

• Understand the fundamental principles of statistical inference.
• Understand the general principles underlying the most common tests.
• Know the assumptions of common tests and understand the impact of violations.
• Be able to perform standard statistical analyses with “R”.
• Develop skills to extrapolate bivariate tests to multivariate formats.
[Image: Gerhard Richter, 1024 Farben]
Competencies You Will Have After the Course (1)

1. Describe the roles biostatistics serves in the discipline of public health.
2. Distinguish among the different measurement scales (e.g., categorical, ordinal and interval) and the implications for selection of statistical methods to be used based on these distinctions.
3. Apply descriptive techniques commonly used to summarize public health data including data display (tables and figures) and measures of distribution shape, central tendency, variability, correlation, and risk assessment.
4. Understand key concepts of probability, random variation, and commonly used statistical probability distributions such as “normal” and “F” that shape the practice of biostatistics.
Competencies You Will Have After the Course (2)

5. Apply common statistical methods for inference, including: estimation, confidence intervals, and hypothesis testing.
6. Specify preferred methodological alternatives to commonly used statistical methods when assumptions are not met.
7. Apply descriptive and inferential methodologies (consisting of: sample selection, hypothesis development and testing, decision errors, power, and sample size) according to the type of study design (e.g., cross-sectional) for answering a particular research question.
8. Interpret results of statistical analyses found in public health studies including assessing the assumptions, quality of data (objectivity, reliability, and validity), appropriateness of statistical methods, and validity and utility of conclusions.
Questions About This Class
(You might want to ask – or not)

• Is this class going to be hard?
- No. The concept is easy and the procedure is clear.
• Why do we spend time on theoretical stuff?
- Helpful to understand the applications and their potential
• Do we need to know all the stuff?
- You may not need all, but be prepared

“It was my understanding that there would be no math”
- Chevy Chase, ‘Spies Like Us’
Class Preparation

• Read the appropriate chapter(s) in the books (as given in the corresponding Wiki) and on Intrapol beforehand and bring questions to class.
• There is no such thing as a stupid question!
• If you’ve got a question, ask it immediately!
Is Biostatistics Hard to Study?

Factors that make it hard for some students to learn biostatistics:
• The terminology is deceptive. You have to
understand that the meaning of statistical
terms such as significant, error and hypothesis
is distinct from ordinary use of these words.
• Statistics requires mastering abstract concepts. Theoretical
concepts such as skewness of populations, probability
distributions, and null hypotheses are not easy to handle.
• There are always 2 sides of learning: the rationalistic
(knowledge) and the emotional (self-confidence, motivation).
Not understanding these abstract concepts might drive you
into feelings of despair.
(“I will never understand this, so why should I go on?”).

Is Biostatistics Hard to Study? NO!

Reasons why you should be optimistic about your future statistical skills:
• The derivation of most statistical tests involves difficult math. However, you can learn to use statistical tests and interpret the results even if you do not fully understand how they work. You only need to know enough about how the tools work so that you can use them in appropriate situations. The type of data determines the right test.
• Basically, you can calculate statistical tests and interpret results even if you don’t understand how the equations were derived, as long as you know how and when to use the statistical tests appropriately.
Why Choose “R” as Software for
Statistical Evaluations?

o Most statistical software packages are expensive. R is a widely accepted open-source software with built-in statistical procedures.
o Key advantages:
- Freely available (http://www.r-project.org/)
- Obtainable on every computer connected to the internet, all over the world
- High creative potential
- Constantly updated with routine commands for modern statistical applications
o Key disadvantages: computer language-like commands that need to be typed in (no “simple click” solutions)


Principles of Statistics are
Independent of the Software Applied

o When working in (quantitative) science, you will be faced with specific questions on differences, correlations, prevalence, etc.
o For each of these questions, a specific statistical test is available, which usually has a name (frequently after its inventor, such as Spearman, Kendall, Kolmogorov-Smirnov, etc.)
o Independently of the software applied, both the statistical procedures and their names are identical.
o Having learned standard procedures of statistical testing in one program (here: R), you will easily find your way through other software packages (if you can afford them)


ACTIVITY 1:

1. Install “R” ([R]) on your computer.
(http://cran.r-project.org/bin/windows/base/)
2. Install “RStudio” on your computer.
(http://www.rstudio.com/products/rstudio/download/)
3. Create a shortcut for “RStudio” and check whether the programs are running.
What Statistics Can and Can’t Do

Can:
• Deliver a basis for hypothesis generation (explorative research; help to detect patterns in messy data)
• Provide objective criteria for evaluating hypotheses and raising arguments
• Condense cluttered information (not without information loss, so keep your raw data!)
• Translate data into statements (e.g. “Increased intake of folic acid significantly reduces incidence of spina bifida.”)
• Help you critically evaluate arguments of others

Can’t:
• Tell the truth (probabilistic conclusions only!)
• Compensate for poor design (Sir Ronald Fisher)
• Indicate biological/social significance: statistical significance does not mean biological/social significance, nor vice versa!


Some Opinions on Statistics

“There are three types of lies: lies, damn lies, and statistics!”
Benjamin Disraeli, former British Prime Minister; source: Mark Twain

“If your experiment needs statistics, you should have done a better experiment.”
Ernest Rutherford, physicist, Nobel prize winner

“To call in a statistician after the experiment is done may be no more than asking him to perform a postmortem examination; he may be able to say what the experiment died of.”
Sir Ronald Fisher, inventor of ANOVA
Biostatistics vs. Statistics

• The tools of statistics are employed in many fields, e.g. economics, business, education, and psychology.
• When the data being analyzed are derived from life
science, we use the term biostatistics (also
applying to social life) to distinguish this particular
application of statistical tools and concepts.
• In the current course, most of the statistical
procedures will refer to health issues and nutrition.
However, you will also learn how to extrapolate
your knowledge to other fields of science (e.g.
quantitative social science).



Biostatistics (cont.)

“A habit of basing convictions upon evidence, and of giving to them only that degree of certainty which the evidence warrants, would, if it became general, cure most of the ills from which the world suffers.”
Bertrand Russell - British philosopher, mathematician, social reformer

All biostatistics begins with description. Before you do anything else, you look at the data and summarize the data.
Hence, our first goal:
- to show how to get a first look at the data and
- to get ready to do more elaborate procedures.
A statistic is just a numerical summary of the data, like the largest number in the data set or its average (mean).
The Framework We (and ALL Other
Scientists) Will Stay In: INFERENCE
Hypothesis formulation and data collection

In (bio)statistics, you can make a GENERAL statement about the population ONLY if you infer this statement from a sample that is (more or less) representative of the population. This happens in the following steps:
• From qualitative science (e.g. interviews with focus groups) you form a hypothesis (e.g. you note that overweight persons describing their daily life quite frequently mention “watching TV” => hypothesis: “Prolonged TV watching is associated with overweight.”).
• Then you collect data from a SAMPLE to test this hypothesis.
• By appropriate data evaluation, the hypothesis is ACCEPTED or REJECTED.
• Based on the outcome of the statistical evaluation, you predict with a certain PROBABILITY a general rule for the POPULATION.
Hence: without statistical evidence, NO generalized STATEMENT, NO general RULE, NO general RECOMMENDATION.
Sample and Population

• Population: all items/persons that have something in common (e.g. disease, sex, education)
• Sample: collection of observations taken from the study population
• Representative sample: part of the population with characteristics as close as possible to those of the population.
• Random sample: each element of the population has some chance of being selected for the sample.
• Simple random sample: the chance of being selected is the same for each element of the population.
[Figure: population and sample; labels: Estimation, Prediction]
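A simple random sample as defined above can be sketched in R. This is only an illustration: the population of 1000 person IDs, the sample size of 50 and the seed are assumptions made for the example, not part of the course material.

```r
# Hypothetical study population: 1000 person IDs (an assumption for illustration)
population_ids <- 1:1000

set.seed(42)  # fixes the random number generator so the draw can be repeated

# Simple random sample of 50 IDs without replacement:
# every ID has the same chance of being selected
srs <- sample(population_ids, size = 50)

length(srs)       # number of sampled elements
head(sort(srs))   # first few sampled IDs in ascending order
```

Because sample() draws without replacement by default, no person can appear twice in the sample.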
GROUP ACTIVITY 1:
Qualitative and Quantitative Science
In groups of 2-3,
1. Formulate definitions of “Qualitative Science” and “Quantitative Science” so that you can explain the main goals and the difference in an oral exam;
2. State whether the hypothesis is part of qualitative or quantitative
science;
3. Explain why qualitative science is useless without quantitative science
and why quantitative science is useless without qualitative science;
4. Explicate whether you need statistics in qualitative science. If so, which
one?
5. Indicate how results from quantitative science foster new qualitative
science approaches.

Introductory Concepts –
Descriptive Techniques
Chapter 1:
• Types of Data
Sample, population, variable,
scales of measurement
• Descriptive Measures
Centrality, dispersion, skewness
Chapter 2:
• Probability and Distributions
• Estimation Techniques
Confidence intervals
Types of Data Dependencies

• If you measure the change of a dependent variable (outcome) in response to a single independent variable (exposure) only, you apply univariate statistics.
• If you measure 2 exposures, you follow a bivariate experimental design.
• And if you have multiple exposures, it is called a multivariate design.


Variables, Data Sets and Parameters
• Variable: an attribute that varies from one element of the sample to the next (e.g. weight of preschool children, sex of bosses, education of parents, income of teachers, etc.)

            Age [y]   Sex   Years of education   …   Variable m
Subject 1   21        m     7.8                  …   x1j
Subject 2   34        f     8.8                  …   x2j
Subject 3   18        f     7.3                  …   x3j
Subject 4   37        m     9.8                  …   x4j
…           …         …     …                    …   …
Subject n   xi1       xi2   xi3                  …   xij

• Data set (= data frame in R): each column is headed by the variable name and contains in each cell the value of this variable measured in that particular case. The position of the value in the data set is crucial for correct evaluation. The data set without its headers is a matrix.
• Parameter: descriptive measure of a variable based on probability distributions, derived from a population, estimated from the sample. For example, the sample mean (x̄) can be used as an estimate of the mean parameter (μ) of the population from which the sample was drawn.
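The layout described above might look as follows in R. This is a minimal sketch that takes the four subjects from the table; the column names (age, sex, education) and object names are chosen here for illustration only.

```r
# Data set (data frame): one column per variable, one row per subject
subjects <- data.frame(
  age       = c(21, 34, 18, 37),        # years
  sex       = c("m", "f", "f", "m"),
  education = c(7.8, 8.8, 7.3, 9.8)     # years of education
)

str(subjects)   # shows each variable with its type

# The numeric columns without their headers form a matrix
values <- as.matrix(subjects[, c("age", "education")])

# Sample mean of age, used as an estimate of the population mean
mean_age <- mean(subjects$age)
mean_age
```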
Principal Steps in a
Statistical Analysis Process

There are basically three steps in the statistical analysis process, starting once you have your data.

1. Data description
(averages, standard deviation, distribution, etc.)
2. Data analysis
(significance of differences, correlations, etc.)
3. Data interpretation
(meaning of the differences/correlations found -
or not found – linking statistics to the hypothesis).



Step I: Data Description

The goal of data description is to describe the empirical frequency distribution of the variable.

The manner of the description will depend on the class of the variable.


Scales of Measurement

• Scales of measurement: different types of variables.
• The scale of measurement [of the investigated/interacting variable(s)] defines the rules for the further statistical procedure.
• They are commonly broken down into four types (mnemonic: NOIR):
– Nominal (categorical) N
– Ordinal (categorical) O
– Interval (numerical) I
– Ratio (numerical) R
Categorical Nominal Variables

• A nominal variable is measured on a categorical scale: its values can be placed in categories that do NOT possess a natural hierarchy.
• It is characterized by its incapability to be ordered or measured on a continuous scale.
Examples:
• Place (e.g. city, country)
• Disease type (lung cancer, cervix cancer, colon cancer, ...)
• Ethnicity (Hispanic, Caucasian, ...)
• Sex (male/female) (dichotomous)
• Smoker (yes/no) (dichotomous)
Categorical Ordinal Variables
• An ordinal variable is measured on a categorical scale: its values can be placed in categories that DO possess a natural hierarchy (inherent order).
• It is characterized by its capability to be ordered but NOT to be measured on a continuous scale.
Examples:
• Severity of disease (healthy, stages 0, I, II, III, IV)
• Weight status (underweight, normal, overweight, obesity)
• Pain level (none, mild, moderate, severe, unbearable)
• Likert scale levels (strongly agree, agree, indifferent, disagree, strongly disagree)
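In R, the difference between nominal and ordinal variables can be expressed with unordered and ordered factors. The sketch below uses the smoker and pain-level examples from the slides; the individual observations are invented for illustration.

```r
# Nominal variable: categories without an inherent order
smoker <- factor(c("yes", "no", "no", "yes"))

# Ordinal variable: categories with an inherent order (levels listed from low to high)
pain <- factor(c("mild", "severe", "none", "moderate"),
               levels  = c("none", "mild", "moderate", "severe", "unbearable"),
               ordered = TRUE)

is.ordered(smoker)   # is the factor ordered?
is.ordered(pain)

# Comparisons such as "less than" are only meaningful for ordered factors
pain < "severe"
```

Listing the levels explicitly matters: without the levels argument, R would sort the categories alphabetically rather than by severity.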
Numerical Interval Variables
• An interval variable is measured on a numerical scale: its values can be placed in categories that DO possess a natural hierarchy (inherent order).
• It is characterized by its capability to be ordered; WITHIN the intervals, the variable can be measured on a continuous scale.
• Note: Single intervals may be missing and the width of the intervals may differ.
Examples:
• Age group (0-0.99; 1-5.99; 12-18; …)
• Weight range (14-18.49; 18.5-24.99; 25-29.99; 30-34.99; 35+)
• Yearly income level, US$ (<12 999, 13 000-20 000, …)
Numerical Ratio Variables (1)

• A ratio variable is measured on a numerical scale (quantitative observation).
• It is characterized by its direct comparability and has a “zero” value; the values can be added, subtracted, multiplied and divided. It can take a limited number of (discrete) values (in R: integer) or an infinite number of values between any two other values (continuous).
Examples:
• Weight (12.3 kg, 14.6 kg, …)
• Distance (14.6 km, 23.5 km, …)
• Energy content of food (2043.5 kcal, …)
• Number of admissions (21, 136, 13 453, …)
Numerical Ratio Variables (2)
• There are ratio variables that do not meet the criterion of having a “true zero” value (e.g. temperature in degrees Celsius) but which are treated in statistics as ratios anyway.
• In the literature, these variables are also frequently called “interval variables”.
Examples:
• BMI (12.3 kg/m², 14.6 kg/m², …)
• Time (1.2 min, 23.5 h, …)
• Temperature (Kelvin) (0 K (= -273.15 °C), 298 K, …)
Levels of Measurement
• Higher level variables (ratio variables) can always be expressed as lower level variables (other types) but this never works the other way around.
• Lowering the level of variables is always associated with RELEVANT INFORMATION LOSS.
• Therefore, in any study, you should always record data at the highest possible level (ratios).

Variable type   Variable                                                Unit
Ratio           Weight, height                                          kg, m
Interval        BMI                                                     kg/m²
Ordinal         Being underweight, normal weight, overweight or obese   none
Nominal         Overweight or not                                       none

(Conversion down the table, from ratio to nominal: YES. Conversion up the table, from nominal to ratio: NO.)
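Lowering the level of a variable could be sketched in R as follows. The BMI values are hypothetical, and the cut-offs (18.5, 25, 30) follow the weight ranges shown on the interval-variable slide.

```r
bmi <- c(17.2, 22.4, 27.9, 31.5)   # hypothetical BMI values in kg/m^2

# Numerical -> ordinal: four ordered weight-status categories
status <- cut(bmi,
              breaks = c(-Inf, 18.5, 25, 30, Inf),
              labels = c("underweight", "normal", "overweight", "obese"),
              right  = FALSE,            # intervals are closed on the left: [18.5, 25)
              ordered_result = TRUE)
status

# Ordinal -> nominal (dichotomous): overweight or not
overweight <- bmi >= 25
overweight
```

The reverse direction is not possible: from "overweight" alone, the original BMI of 27.9 cannot be recovered, which is the information loss the slide warns about.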
Criteria of a Satisfactory Scale

A satisfactory scale meets the following requirements:


• Appropriate
• Practicable
• Powerful
• Clearly defined
• Sufficient number of categories
• Collectively exhaustive
• Mutually exclusive



Appropriateness of a Satisfactory Scale

Keep in mind the conceptual definition of the variable and the objectives of the study. Ask yourself: does the variable represent the parameter necessary to answer the research question?
Occupations, for instance, may be classified in different
ways, depending on whether the purpose is to use it as
• measure of social class,
• habitual physical activity,
• exposure to specific physical and chemical hazards.



Practicability of a Satisfactory Scale

The practicability of a satisfactory scale is linked to the methods that will be used in collecting the data.
• E.g. in a food frequency questionnaire, asking for an estimate of calorie intake in kJ might be unacceptable because the study participants are not familiar with this unit.
Solution: provide pictures with standardized picture sizes.
• Usually, high precision is linked to high effort. Always ask yourself: is it worth it?
(e.g. Detailed data on income, or broad income categories? Weighing in tenths of kilograms, or whole kilograms?)
Powerfulness of a Satisfactory Scale

The powerfulness of a satisfactory scale is linked to the type of scale:
If there is a choice,
• an ordinal scale should be used rather than a nominal one
• and a numerical scale rather than an ordinal one.

E.g., an analysis using the whole spectrum of weight is more informative than one using only two categories, such as “below 70 kg” and “more than 70 kg”.
Clarity of Definition of a Satisfactory Scale

Operational definitions are obligatory for variables and categories.
• Nominal and ordinal scales
E.g. cases of malnutrition, “present” or “absent”: this depends on the applied definition!
• Numerical measurements
- Number of decimal places to be used: realistic, e.g. body weight to ± 0.1 kg, NOT to ± 0.1 g
- Your calculated value has ONLY the precision of the least precise measurement used, e.g. body mass: 67 kg, height: 178.8 cm => BMI = 21 and NOT 20.96 or 21.0
- Rule: calculate with one digit more than your measurement allows (e.g. weight in whole kg), but indicate the final result with the least available precision
- Values to be rounded: downwards, upwards, or to the nearest number (most preferable)
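The BMI example above can be retraced in R. The sketch assumes that the body mass of 67 kg carries two significant digits, which is why the result is reported with two significant digits; the object names are chosen for illustration.

```r
body_mass <- 67      # kg, measured to whole kilograms (two significant digits)
height    <- 1.788   # m (178.8 cm)

bmi_raw <- body_mass / height^2   # R returns more digits than the data support
bmi_raw

# Report the result with the precision of the least precise measurement
bmi <- signif(bmi_raw, 2)   # two significant digits
bmi

round(bmi_raw, 2)           # two decimal places: suggests a precision that is not there
```

signif() limits the number of significant digits, whereas round() limits the number of decimal places; for reporting measurement precision, the former is usually the relevant one.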
Recap Exercise: Clarity of Definition of a
Satisfactory Scale

If the concept of PRECISION INDICATION of RESULTS does not seem familiar to you, please go through the exercise sheets “Precision and significant digits 1” and “Precision and significant digits 2” in the shared folders on Intrapol.

In the final exam, points will be deducted for every result that has been indicated with the wrong number of digits!


Sufficiency of Categories of a Satisfactory Scale

Avoid compression of data into too few categories: this may lead to a loss of useful information!
E.g., if immigrants from North Africa have a particularly high rate of mortality from a disease, this fact may be completely masked if they are included in a broader category (immigrants).
Hence: Collect data in a detailed form and decide later whether to use the full scale or a “collapsed” one/both.

Collective Exhaustiveness of a Satisfactory Scale

Provide sufficient categories to place every subject. This may necessitate the inclusion of one or more of the following categories: “other” or “not applicable”.
E.g. duration of marriage (also for the case “not married”):

“under 5 years”,
“5-9.9 years”,
“10-19.9 years”,
“20-29.9 years”,
“30-39.9 years”,
“40 years and more”,
“not applicable”.


Mutual Exclusiveness of a Satisfactory Scale

• Each item of information fits in only one place along the scale. E.g., an age scale including both “70 to 80” and “80 to 90” is unacceptable, as “80” fits into either of these categories.

• A scale for measuring the conditions producing disability that covers “blindness” and “deafness” needs categories that do not overlap:
“not blind, not deaf”,
“blind, not deaf”,
“deaf, not blind”,
“blind and deaf”.


Description of Qualitative Variables

The empirical frequency distribution of a nominal or an ordinal variable is usually summarized as a list of:
• Frequencies (counts)
• Proportions
• Percentages

These numbers are also used in the description of incidence and prevalence (epidemiology).
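The three summaries can be obtained in R with table() and prop.table(). The eight weight-status observations below are invented for illustration.

```r
# Hypothetical sample of an ordinal variable (weight status of 8 persons)
status <- c("normal", "overweight", "normal", "obese",
            "underweight", "normal", "overweight", "normal")

counts      <- table(status)               # frequencies (counts)
proportions <- prop.table(counts)          # proportions (sum to 1)
percentages <- round(100 * proportions, 1) # percentages

counts
proportions
percentages
```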


Descriptive Statistics (DS)

• A way to summarize data from one or more samples or populations.
• Reducing the data of the sample to a small number of summary measures is called statistics.
• DS illustrate the central tendency, variability, and shape of one or more sets of data.
• DS should be clear and easily interpreted. They should not mislead you about the data they are summarizing and should retain sufficient information to allow characterization.
Summations
$$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 \qquad \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$$

$$\sum_{i=1}^{3} x_i^2 = x_1^2 + x_2^2 + x_3^2$$

$$\sum_{i=1}^{n} c\,x_i = c x_1 + c x_2 + c x_3 + \dots + c x_n = c \sum_{i=1}^{n} x_i$$

Sum of the numbers 1 to 50:
1 + 50 = 51
2 + 49 = 51
3 + 48 = 51
…
25 + 26 = 51
25 * 51 = 1275

Hence:
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$
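These summation rules can be checked numerically in R. The vector x and the constant below are arbitrary values chosen for illustration.

```r
x <- c(2, 4, 6)    # hypothetical values x1, x2, x3
const <- 3         # hypothetical constant c

sum(x)             # x1 + x2 + x3
sum(x^2)           # x1^2 + x2^2 + x3^2

# A constant factor can be moved in front of the sum
sum(const * x) == const * sum(x)

# Sum of the first n integers equals n(n + 1)/2
n <- 50
sum(1:n)
n * (n + 1) / 2
```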


Descriptive Measures

After we have recorded our data, we want to be able to characterize it quantitatively:

Measures of Central Tendency (Measures of Location) give you a POINT estimate:
(Arithmetic) Mean, Median, Mode
Measures of Variability give you a RANGE estimate:
Range, Variance, Standard Deviation
Measures of Relative Standing:
Z-Scores, Percentiles, Quartiles


Vectors and Variables

In statistics, a vector is a one-dimensional array:

(18, 32, 56, 23, 32, 27, 40, 38, 23)

Usually, such a vector represents a variable (e.g. age, body mass, marital status, sex, etc.)

The vector has a name (variable name) and occasionally a unit (if ordinal/continuous, e.g. kg, years, m, etc.)
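As a rough sketch, the example vector and the descriptive measures named in this chapter could be entered in R as shown below. It assumes the numbers represent ages in years; all functions are part of base R.

```r
age <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)   # the example vector, assumed to be ages in years

# Central tendency (point estimates)
mean(age)
median(age)
table(age)   # frequency of each value; base R has no built-in function for the statistical mode

# Variability
range(age)   # smallest and largest value
var(age)     # sample variance
sd(age)      # sample standard deviation

# Relative standing
quantile(age, probs = c(0.25, 0.5, 0.75))   # quartiles
z <- (age - mean(age)) / sd(age)            # z-scores
z
```

Note that mode() exists in R but returns the storage mode of an object, not the most frequent value, which is why table() is used here instead.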


(Arithmetic) Mean – The ”Average”
• Suppose we have n measurements (x1, x2, x3, …, xn) of a particular variable in a sample. We denote these n measurements as a vector and can calculate the average (mean).
• Definition:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

(also written as ΣX / N)
• More accurately called the arithmetic mean, it is defined as the sum of the measures observed divided by the number of observations.
• We can use the arithmetic mean of the empirical frequency distribution to estimate the mean value of the variable in the study population.
• [R: mean(vector)]
[R] Exercise: Data Type

• Open “RStudio”
• In the Console, type each of the following commands and press ENTER after each one:
typeof(3.4)
typeof(TRUE)
typeof("Sky")
typeof("Bambi")
typeof(cars)
list(cars)


[R] Exercise: Calculation of the
Arithmetic Mean (Average)
• Open RStudio
• Type age<-c(18,32,56,23,32,27,40,38,23) and ENTER
(‘c’ stands for concatenate and combines the single values into one vector; ‘<-’ assigns the vector on its right to the name on its left)

• Type age and ENTER

• Type mean(age) and ENTER

- what is the appropriate precision (number of decimal places) for reporting the mean of ‘age’?
• Try the same with the vectors
weight<-c(60,72,57,90,95,72) and
height<-c(1.75,1.80,1.65,1.90,1.74,1.91)
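The exercise can be sketched as follows. The rounding shown (one decimal more than the raw data) is only a common reporting convention assumed here, not a rule stated on the slide:

```r
age    <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)

mean_age <- mean(age)     # 289 / 9 = 32.111...
round(mean_age, 1)        # 32.1: R prints more digits than the data justify
round(mean(weight), 1)    # 74.3
round(mean(height), 2)    # 1.79
```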



[R] Exercise: Mathematical Operations
with Values and Vectors
• In [R], you can perform any mathematical operation such as
> 2+5
> 3-9
> 8*7
> 372/56
> log(1000,10) (Note: with log, the argument comes first and then the base. Omitting the base automatically implies Euler’s number (e ≈ 2.71828) as the base, i.e. the natural logarithm.)
• Each mathematical operation is completed with ENTER. The return is the result.
• Accordingly, you can perform mathematical operations with vectors (if they have the same length). Try
> BMI<-weight/height^2 ENTER. How do you interpret the resulting vector?
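A minimal sketch of the vector operation, assuming weight is recorded in kg and height in m (the slides give the values without units): the division is carried out element by element, so the result holds one BMI per person.

```r
weight <- c(60, 72, 57, 90, 95, 72)                # body mass, assumed to be in kg
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)    # body height, assumed to be in m

# Element-wise: each weight is divided by the square of the matching height
BMI <- weight / height^2
round(BMI, 1)    # 19.6 22.2 20.9 24.9 31.4 19.7 (kg/m^2)
```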



The Arithmetic Mean – Sensitive to Outliers

Sample: $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is an estimate of $\mu$ (population)

The (arithmetic) mean is extremely sensitive to outliers:

E.g. let n=3; x1=6, x2=7, x3=5

$\bar{x} = \frac{1}{3}\sum_{i=1}^{3} x_i = \frac{1}{3}(6 + 7 + 5) = \frac{1}{3} \cdot 18 = 6$

E.g. let n=4; x1=1, x2=6, x3=2, x4=91

$\bar{x} = \frac{1}{4}\sum_{i=1}^{4} x_i = \frac{1}{4}(1 + 6 + 2 + 91) = 25$

[R: sum(vector); length(vector); mean(vector)]
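The second example can be reproduced in R as a small sketch; dropping the value 91 is done here only to show how much a single outlier moves the mean:

```r
x <- c(1, 6, 2, 91)

sum(x)              # 100
length(x)           # 4
mean(x)             # 25: far above three of the four values because of the outlier 91
mean(x[x != 91])    # 3: mean of the remaining values 1, 6, 2
```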
Median – The “Middle Value” (1)
Frequently used if there are single extreme values in a distribution
Definition
Value that divides the ‘ordered array’ into two equal parts
If there is an odd number of observations, the median is the ((n+1)/2)th observation
E.g., the median of 11 observations is the 6th observation
If there is an even number of observations, the median is the midpoint between the two central values.
E.g., the median of 12 observations is the midpoint (average) between the 6th and 7th observation
[R: median(vector)]



Median – The “Middle Value” (2)

• Suppose there are n observations
• Order the data from the smallest to the largest observation
• The sample median is the observation in the center of the ordered observations:
1. the observation in the center, i.e. the ((n+1)/2)th largest, for n odd
2. the average of the two most central observations, i.e. the (n/2)th and the (n/2 + 1)th largest, for n even

[Figure: two ordered samples. 11 values: the median is the 6th value. 12 values: the median is the average of the 6th and the 7th value.]



Median – The “Middle Value”: Examples

Example 1: odd number of observations:
Data: 3, 8, 15, 1, 9, 4, 6, 20, 14 (n=9)
Ordered: 1, 3, 4, 6, 8, 9, 14, 15, 20
The median is the ((n+1)/2)th largest value, (n+1)/2 = 5
Md (median) = 8

Example 2: even number of observations:
Data: 3, 8, 15, 1, 9, 4, 6, 20, 14, 10 (n=10)
Ordered: 1, 3, 4, 6, 8, 9, 10, 14, 15, 20
The median is the average of the (n/2)th and (n/2+1)th largest values,
n/2 = 5, n/2+1 = 6
Md = (8+9)/2 = 8.5
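Both examples can be checked in R. The sketch below computes the median once by hand from the sorted vector and once with median(); the variable names `odd` and `even` are chosen here for illustration:

```r
odd  <- c(3, 8, 15, 1, 9, 4, 6, 20, 14)        # n = 9
even <- c(3, 8, 15, 1, 9, 4, 6, 20, 14, 10)    # n = 10

# Odd n: take the ((n + 1) / 2)th value of the ordered data
sort(odd)[(length(odd) + 1) / 2]    # 8
median(odd)                         # 8

# Even n: average the (n / 2)th and (n / 2 + 1)th ordered values
n <- length(even)
by_hand <- mean(sort(even)[c(n / 2, n / 2 + 1)])    # (8 + 9) / 2 = 8.5
median(even)                                        # 8.5
```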



[R] Exercise: Calculation of the
Median

• Type
> median(age), median(weight), median(height), and median(BMI) and compare these values with the mean values of the corresponding vectors.

• How do you explain the difference between the mean and the median?

• Going by your gut feeling: does the mean or the median represent the central value of the vectors better?
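One possible way to put means and medians side by side is sketched below. It uses sapply() to apply a small helper to each vector; the list name `vectors` and the matrix name `comparison` are illustrative choices, not part of the exercise:

```r
age    <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)
weight <- c(60, 72, 57, 90, 95, 72)
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
BMI    <- weight / height^2

vectors <- list(age = age, weight = weight, height = height, BMI = BMI)

# One column per vector, one row per statistic
comparison <- sapply(vectors, function(v) c(mean = mean(v), median = median(v)))
round(comparison, 2)
```

For age, the mean (about 32.1) and the median (32) nearly coincide; for weight, the mean (about 74.3) lies above the median (72) because of the two heaviest persons.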



Choosing Between Mean and Median
Robustness: The robustness of a statistic is related to the statistic’s resistance to being affected by extreme values. The arithmetic mean is a non-robust statistic while the median is a robust statistic. If the empirical distribution is skewed or extreme values are present, the median will provide a better measure of central location than the arithmetic mean.
Strength of median: insensitive to outliers
Weakness of median: completely guided by central values.
[Figure: plot marking the arithmetic mean and the median of a sample; y-axis from 0 to 180, x-axis from 0.9 to 1.1]
Summarizing Capability: The arithmetic mean is a more appropriate statistic if the data can be described by a particular mathematical model such as the normal (Gaussian) distribution.
Mode – The “Most Frequent Observation(s)”
Definition
Value that occurs most frequently in a data set

Example:
Question to pupils in a class: “How frequently are you visiting a doctor each year?”
Mode (Mo): “4”

• If all values are different, there is no mode
• There may be more than one mode (bimodal, multimodal)
• The mode is a poor measure of central tendency
• Not used frequently in practice

[Figure: bar chart of the pupils' answers with the mode marked; density curves of a bimodal and a trimodal distribution, x-axis from -10 to 10]

[R: names(sort(-table(vector)))[1]]
[R] Exercise: Calculation of the Mode

• A physician wants to know the most frequent yearly number of visits per patient. For this purpose, she recorded the number of visits for her 19 patients:
> docfreq<-c(2,3,4,3,4,3,5,3,4,2,3,5,4,3,7,6,3,4,5)
> names(sort(-table(docfreq)))[1]

• How do you interpret the result?

• Type > mode(docfreq) and interpret the result.
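A commented sketch of the exercise follows. The extra line that lists all modes in case of ties is an addition for illustration and not part of the original exercise:

```r
docfreq <- c(2, 3, 4, 3, 4, 3, 5, 3, 4, 2, 3, 5, 4, 3, 7, 6, 3, 4, 5)

counts <- table(docfreq)    # frequency of each distinct value
counts

# Sorting the negated counts puts the most frequent value first
names(sort(-counts))[1]     # "3": seven of the 19 recorded values are 3

# All values that share the highest frequency (relevant for bimodal data)
names(counts)[as.vector(counts) == max(counts)]

# mode() is NOT the statistical mode: it reports how R stores the object
mode(docfreq)               # "numeric"
```

Note that the statistical mode is returned as a character string, because it is taken from the names of the frequency table.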



Statistics Derived from Sample: Percentiles (1)

• The Pth percentile of a sample of n observations is the value of the variable that has ordered rank (P/100)(1+n). As an example, for P=20 and n=100, (20/100)(1+100) = 20.2, so the 20th percentile lies between the 20th and the 21st smallest value of the variable (close to the 20th).

• The 50th percentile of a sample of n observations is referred to as the sample median.

• We can use the median of the empirical frequency distribution to estimate the median value of the variable in the study population.
[R: quantile(vector,percentile_value)]
percentile_value: 0≤value≤1 - e.g. 0.25 for the upper limit of the lowest quartile
Statistics Derived from Sample: Percentiles (2)

• The 25th percentile of a sample of n observations is referred to as the lower quartile, and the 75th percentile of a sample of n observations is referred to as the upper quartile.
- Example: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
- 1st Quartile = 50, 3rd Quartile = 80

• The difference between the upper quartile value and the lower quartile value is referred to as the interquartile range of the empirical frequency distribution.

• We can use the interquartile range (IQR) of the empirical frequency distribution to estimate the interquartile range of the distribution of the values of the variable in the study population.
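Quartiles are not uniquely defined for small samples, and R offers several conventions. The sketch below applies three of them to the example scores; that the quartiles 50 and 80 quoted in the example correspond to Tukey's hinges (medians of the lower and upper half) is an inference from the numbers, not something the slides state:

```r
scores <- c(20, 30, 50, 60, 67, 67, 70, 80, 90, 95)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
fivenum(scores)                              # 20 50 67 80 95

# R's default quantile definition (type = 7) interpolates differently
quantile(scores, c(0.25, 0.75))              # 52.5 77.5

# type = 6 uses the ordered rank (P/100)(n + 1)
quantile(scores, c(0.25, 0.75), type = 6)    # 45.0 82.5

# IQR() is based on the default quantiles
IQR(scores)                                  # 25
```

When reporting quartiles, it is therefore worth stating which definition was used.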



Visualizing Data from 1 Sample/Population
- Bar Charts and Box-and-Whiskers-Plots -
Data set: 22 values, mean: 9.5, SD: 1.77, SEM: 0.377
Values: 6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 13
[Figure: bar charts showing mean ± SD and mean ± SEM, next to a box-and-whiskers plot with non-parametric indicators]
Source: http://www.physics.csbsju.edu/stats/simple.box.defs.gif
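The summary values quoted for this data set can be recomputed as sketched below. The five-number summary is shown because it is what a box-and-whiskers plot is drawn from; the plotting call itself is left as a comment:

```r
x <- c(6, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 13)

n   <- length(x)       # 22
m   <- mean(x)         # 9.5
s   <- sd(x)           # about 1.77
sem <- s / sqrt(n)     # about 0.377

# Minimum, lower hinge, median, upper hinge, maximum
fivenum(x)             # 6 8 9.5 11 13

# boxplot(x) would draw the corresponding box-and-whiskers plot
```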
Measures of central tendency (cont.)

• Each of the three methods of measuring central tendency has certain advantages and disadvantages
• Which method should be used?
• It depends on the type of data that is being analyzed, mainly on their DISTRIBUTION and SKEWNESS

[Figure source: http://herdingcats.typepad.com/photos/uncategorized/statistics.jpg]


Measurement Scales and Indicators of Centrality

Measurement scale | Permissible mathematical operations | Best measure of central tendency
Nominal | Counting | Mode
Ordinal | Greater or less than operations | Median
Interval | Addition and subtraction | (Symmetrical – Mean), Skewed – Median
Ratio | Addition, subtraction, multiplication and division | Symmetrical – Mean, Skewed – Median



Measure of Distribution: Dispersion

• RANGE
What is the highest, what is the lowest value?
• STANDARD DEVIATION
How closely do values cluster around the mean value?
• SKEWNESS
How symmetrical is the curve of distribution?



Descriptive Measures
After we have recorded our data, we want to be able to characterize it quantitatively:

Measures of Central Tendency (=Measures of Location)
(Arithmetic) Mean, Median, Mode
Measures of Variability (Dispersion)
Range, Variance, Standard Deviation
Measures of Relative Standing
Z-Scores, Percentiles, Quartiles



Measure of Distribution: Range

Range is the difference between the largest and smallest values in the data set
R=Max(Xi)-Min(Xi)
[R: max(vector)-min(vector); note that range(vector) returns the minimum and the maximum, not their difference]
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10

R1=200; R2=20

The range depends only on the two most extreme values and ignores the rest of the distribution.
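A short sketch with the two data sets; the names `set1` and `set2` are chosen here for illustration:

```r
set1 <- c(100, 30, 20, 7, -20, -30, -100)
set2 <- c(10, 3, 2, 7, -2, -3, -10)

range(set1)              # -100 100: the minimum and the maximum
max(set1) - min(set1)    # 200
diff(range(set2))        # 20: diff() turns the pair into a single number
```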



Measure of Dispersion: Example

Look at these two data sets:
Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10

If we calculate the mean:
Set 1. n = 7, $\bar{x}$ = 1
Set 2. n = 7, $\bar{x}$ = 1

How to measure dispersion (spread, variability)?



Variance and Standard Deviation (Population)
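Suppose we have N measurements of a particular variable in a population: X1, X2, X3, …, XN.

The mean is $\mu$. As $\sum_{i=1}^{N} (X_i - \mu) = 0$, the deviations are squared before averaging, and we define:

$\sigma^2 = \frac{1}{N}(X_1 - \mu)^2 + \frac{1}{N}(X_2 - \mu)^2 + \dots + \frac{1}{N}(X_N - \mu)^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

as variance, unit is (X unit)^2, and

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

as standard deviation, unit is X unit.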
Variance and Standard Deviation (Sample)

Suppose we have n measurements of a particular variable in a sample: X1, X2, X3, …, Xn.
The mean is $\bar{x}$, and we define:

$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$

as sample variance (s2) and

$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}}$

as standard deviation (s) of the sample (note “n-1” in the denominator!)
[R: var(vector), sd(vector)]
[Figure: population ($\mu \pm \sigma$) versus sample ($\bar{x} \pm s$). Source: http://www.six-sigma-material.com/images/PopSamples.GIF]
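To make the role of the denominator concrete, the sketch below computes the variance by hand for Set 2 of the dispersion example and compares it with var() and sd(), which use n - 1. The population-style value (division by N) is shown only for contrast; the variable names are illustrative:

```r
x <- c(10, 3, 2, 7, -2, -3, -10)    # Set 2 from the dispersion example
n <- length(x)

ss      <- sum((x - mean(x))^2)     # sum of squared deviations: 268
s2      <- ss / (n - 1)             # sample variance, about 44.67
pop_var <- ss / n                   # population formula, about 38.29

all.equal(s2, var(x))               # TRUE: var() divides by n - 1
all.equal(sqrt(s2), sd(x))          # TRUE: sd() is the square root of var()
```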
Standard Error of the Mean (Sample)

• The standard error of the mean (SEM) is the standard deviation (SD), divided by the square root of the number of observed values:

$SEM = \frac{SD}{\sqrt{n}}$

• Hence, the SEM estimates the standard deviation of the sample mean, i.e. how precisely the sample mean estimates the population mean, whereas the SD describes the spread of the individual values in the sample. When the sample size increases, the SEM decreases, while the SD merely stabilizes (it fluctuates most with small n).

• [R: sd(vector)/sqrt(length(vector))]
When to Use SEM, When SD?

• The standard deviation (SD) is used when describing a sample: it quantifies the variation around the mean of the sample.
The SD is an important statistic when determining if two samples likely originated from the same underlying population.
• Central limit theorem: “sample means are (approximately) normally distributed” for sufficiently large samples
• The standard error (of the mean) (SEM or SE) is used when estimating the mean of the underlying population (from which the sample originated).
The SEM is the important statistic for calculating the confidence interval of your sample statistic (sample mean); it is determined by both the SD of the sample and the sample size.


Coefficient of Variation

The Coefficient of Variation (COV or CV, “unit”: %) can be understood as the relative standard deviation:

Definition of CV: $CV = \frac{SD}{\bar{x}} \times 100$

Useful in comparing variation between two distributions

Used particularly in comparing laboratory measures to identify those methods that give you a lower variation (higher precision)

[R: sd(vector)/mean(vector)*100]



Example: Mean, Variance, Standard Deviation,
and Coefficient of Variation

Set 1: 100, 30, 20, 7, -20, -30, -100
Set 2: 10, 3, 2, 7, -2, -3, -10

Calculate $\bar{x}$, s2, s and CV (e.g. in “R”):

Set | $\bar{x}$ | s2 | s | CV [%]
1 | 1 | 3773.7 | 61.4 | 6143.0
2 | 1 | 44.7 | 6.7 | 668.3
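The table can be recomputed with a small helper, sketched below; the function name `describe` is hypothetical. Because the CV is calculated from the unrounded standard deviation, it comes out at about 6143 % and 668 %:

```r
set1 <- c(100, 30, 20, 7, -20, -30, -100)
set2 <- c(10, 3, 2, 7, -2, -3, -10)

# Hypothetical helper: mean, sample variance, sample SD and CV in percent
describe <- function(v) {
  m <- mean(v)
  s <- sd(v)
  c(mean = m, s2 = var(v), s = s, CV = s / m * 100)
}

res <- rbind(set1 = describe(set1), set2 = describe(set2))
round(res, 1)
```

With a mean this close to zero, the CV becomes very large and is of limited practical use; it is most informative for strictly positive measurements.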



Which measure to use?

• Range. It’s not often used because it’s very sensitive to outliers.
• Interquartile range. It’s pretty robust to outliers. It’s used a lot in combination with the median, e.g. when using non-parametric tests.
• Variance. It’s hard to interpret directly because it is expressed in squared units rather than in the units of the data. It’s almost never reported on its own, but it is an important mathematical tool.
• Standard deviation. This is the square root of the variance. It’s expressed in the same units as the data. The standard deviation is often used in the situation where the mean is the measure of central tendency, along with parametric tests.



Converting to Standard Normal
Frequently in science, you have to compare distributions that change over time.
Example: BMI of children. Problem: how to compare the BMIs of children from different classes (age groups)? =>
Convert N(µ,σ2) to a standard normal (Z-Score):

$Z = \frac{X - \mu}{\sigma}$

[Figure: BMI-for-age chart with the bands obese, overweight, normal, underweight, extremely underweight]

Child 1: 5 years, BMI: 18.8, mean (5 ys): 15.45 kg/m2; s=2.05 => Z = (18.8 - 15.45)/2.05 = +1.63

Child 2: 16 years, BMI: 19.0, mean (16 ys): 20.48 kg/m2; s=1.25 => Z = (19.0 - 20.48)/1.25 = -1.18

So: what is the call for z-scores in R?
[R: (vector-mean(vector))/sd(vector)]
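A minimal sketch of the z-score calculation, using the BMI, mean and s values as stated for the two children and applying Z = (X - µ)/σ directly. The helper name `z_score` is hypothetical; scale() is R's built-in function for standardizing a vector:

```r
# Hypothetical helper implementing Z = (X - mu) / sigma
z_score <- function(x, mu, sigma) (x - mu) / sigma

z1 <- z_score(18.8, mu = 15.45, sigma = 2.05)    # about  1.63
z2 <- z_score(19.0, mu = 20.48, sigma = 1.25)    # about -1.18

# Standardizing a whole vector against its own mean and SD
age <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)
z   <- (age - mean(age)) / sd(age)

# scale() performs the same calculation (it returns a one-column matrix)
all.equal(z, as.numeric(scale(age)))             # TRUE
```

After standardization, the vector has mean 0 and standard deviation 1.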
Recap: Centrality and Dispersion

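Measure | Population | Sample
Mean | $\mu = \frac{\sum X_i}{N}$ | $\bar{x} = \frac{\sum X_i}{n}$
Variance | $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$ | $s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$
SD | $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$ | $s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$
CV | | $CV = \frac{s}{\bar{x}} \times 100$
SEM | | $SEM = \frac{s}{\sqrt{n}}$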
Tools in [R] that Make Your Life Easier (1)

• mean(V). Reports the average value (=arithmetic mean) of the vector’s elements
• median(V). Reports the median value of the vector’s elements
• var(V). Reports the (sample) variance of the vector’s elements
• sd(V). Reports the standard deviation of the vector’s elements
• quantile(V). Reports the minimum, the maximum and the 0.25, 0.50 and 0.75 quantiles (=quartiles).
• IQR(V). Reports the interquartile range of the vector (difference between the 0.75 and the 0.25 quantile)
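Applied to the age vector from the earlier exercises (named `V` here to match the list), the functions behave as sketched below:

```r
V <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)

mean(V)        # 289 / 9, about 32.11
median(V)      # 32
var(V)         # sample variance (denominator n - 1)
sd(V)          # square root of var(V)
quantile(V)    # 0%: 18, 25%: 23, 50%: 32, 75%: 38, 100%: 56
IQR(V)         # 38 - 23 = 15
```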



Tools in [R] that Make Your Life Easier (2)

• summary(DF). Summary of a data frame: the function summary() is automatically applied to each column. The format of the result depends on the type of the data contained in the column. For example:
• If the column is a numeric variable, mean, median, min, max and quartiles are returned.
• If the column is a factor variable, the number of observations in each group is returned.
• You can also apply the summary() function to single vectors.
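A small sketch with a made-up data frame (the values and the column names `age` and `sex` are invented purely for illustration) shows both behaviours:

```r
# Invented example data: one numeric column and one factor column
DF <- data.frame(
  age = c(18, 32, 56, 23, 32, 27),
  sex = factor(c("f", "m", "f", "f", "m", "m"))
)

summary(DF)                  # numeric column: min, quartiles, median, mean, max;
                             # factor column: number of observations per level

age_summary <- summary(DF$age)    # summary of a single numeric vector
sex_summary <- summary(DF$sex)    # counts per level: f = 3, m = 3
```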



Tools in [R] that Make Your Life Easier (3)

• sapply() function: used to repeatedly apply a particular function over a list or data frame. You can use it to compute, for each vector (column) in a data frame, the mean, sd, var, min, quantile, …

# Compute the mean of each column
sapply(my_data, mean)
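As a sketch, `my_data` is assumed here to be a data frame built from the weight and height vectors of the earlier exercises; the slides do not define its contents:

```r
# Assumed content of my_data, for illustration only
my_data <- data.frame(
  weight = c(60, 72, 57, 90, 95, 72),
  height = c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
)

col_means <- sapply(my_data, mean)        # named vector: one mean per column
col_sds   <- sapply(my_data, sd)          # one standard deviation per column
col_quant <- sapply(my_data, quantile)    # matrix: one column of quantiles per variable
```

When the applied function returns a single value, sapply() gives a named vector; when it returns several values, as quantile() does, the result is a matrix.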



Tools in [R] that Make Your Life Easier (4)
stat.desc(DF): the function stat.desc() [in the pastecs package] provides further useful statistics, including:
• the median
• the mean
• the standard error on the mean (SE.mean)
• the confidence interval of the mean (CI.mean) at the p level (default is 0.95)
• the variance (var)
• the standard deviation (std.dev)
• and the variation coefficient (coef.var), defined as the standard deviation divided by the mean

Install the pastecs package: install.packages("pastecs")

Use the function stat.desc() to compute descriptive statistics:
# Compute descriptive statistics
library(pastecs)
res <- stat.desc(my_data)
round(res, 2)
Visualization of Centrality, Variation, and
Skewness of Data: Histograms

• Histograms are used to visually depict frequency distributions of continuous data.
• A histogram is a type of bar chart without spaces between the bars
• The height of each column gives the number of observations within the category.
• The variables on the X-axis (abscissa) are either “categorical ordinal” (rarely) or “numerical interval” (in most cases), without gaps.
Example: age groups of pupils in a class
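A minimal sketch of how R bins data for a histogram, using the age vector from the earlier exercises and 10-year classes chosen here for illustration. With plot = FALSE, hist() only returns the counts per class instead of drawing:

```r
age <- c(18, 32, 56, 23, 32, 27, 40, 38, 23)

# Classes (10,20], (20,30], ..., (50,60]; by default the upper limit belongs to the class
h <- hist(age, breaks = seq(10, 60, by = 10), plot = FALSE)

h$breaks    # 10 20 30 40 50 60
h$counts    # 1 3 4 0 1

# hist(age, breaks = seq(10, 60, by = 10)) would draw the histogram
```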



Visualizing Data from 1 Sample/Population
- Histograms -
Histogram of 100 randomly generated values, µ=0, σ=0.0316:

[Figure: histogram. X-axis (abscissa): categorical ordinal variables or numerical interval variables]
Source: http://grants.hhp.coe.uh.edu/doconnor/PEP6305/Topic%20005%20Normal%20Distribution_files/SEM%20histogram.jpg
Did I get it?
Yes, if you can …
o Define the terms population, sample, random sample, variable, parameter, data set
o List the 3 principal levels of statistical data evaluation
o Explain the 4 types of scales that can be used in statistics
o Define the criteria for a satisfactory scale in an investigation
o Work with the summation symbol
o Define the terms centrality, dispersion, and skewness on a statistical basis
o Explain the terms mean, median, and mode, and calculate them
o Explain the terms range, standard deviation, standard error of the mean, coefficient of variation, standard normal, and histogram
