Applied Biostatistics 2020 - 01 Basics, Centrality and Dispersion
University College
Copenhagen
72 h Face-to-face learning;
84 h Directed learning;
84 h Autonomous learning
10 ECTS Points
Applied Biostatistics 2020 A. Parlesak 1
Global Nutrition & Health
Literature:
Peter Dalgaard:
Introductory Statistics with R
Second Edition 2008
Urdan, T.C.:
Statistics in Plain English
3rd ed. Routledge, New York
Why Should You be Here?
• This test usually gives you key indicators such as the effect size and the
probability of obtaining such a finding by chance alone.
Competencies You Will Have After the Course (2)
Questions About This Class
(You might want to ask – or not)
Class Preparation
Is Biostatistics Hard to Study?
Is Biostatistics Hard to Study? NO!
What Statistics Can and Can’t Do
Can:
• Deliver a basis for hypothesis generation (explorative research;
help to detect patterns in messy data)
• Provide objective criteria for evaluating hypotheses and
argument raising
• Condense cluttered information (not without information loss, so keep
your raw data!)
• Translate data into statements
(e.g. “Increased intake of folic acid significantly
reduces incidence of spina bifida.”)
• Help you critically evaluate arguments of others
Can’t:
• Tell the truth (probabilistic conclusions only!)
• Compensate for poor design (Sir Ronald Fisher)
• Indicate biological/social significance: statistical significance does not
mean biological/social significance, nor vice versa!
Biostatistics vs. Statistics
The Framework We (and ALL Other
Scientists) Will Stay In: INFERENCE
Hypothesis formulation and data collection
Sample and Population
Introductory Concepts –
Descriptive Techniques
Chapter 1:
• Types of Data
Sample, population, variable,
scales of measurement
• Descriptive Measures
Centrality, dispersion, skewness
Chapter 2:
• Probability and Distributions
• Estimation Techniques
Confidence intervals
Types of Data Dependencies
1. Data description
(averages, standard deviation, distribution, etc.)
2. Data analysis
(significance of differences, correlations, etc.)
3. Data interpretation
(meaning of the differences/correlations found -
or not found – linking statistics to the hypothesis).
Mnemonic
Categorical Nominal Variables
Numerical Ratio Variables (1)
Numerical Ratio Variables (2)
• There are numerical variables that do not meet the criterion of
having a “true zero” value (e.g. temperature in degrees Celsius)
but which are treated in statistics like ratio variables anyway
• In the literature, these variables are also frequently called “interval
variables”
Examples:
• BMI
(12.3 kg/m2, 14.6 kg/m2, …)
• Time
(1.2 min, 23.5 h, …)
• Temperature (Kelvin)
(0 K (= -273.15 °C), 298 K, …)
Levels of Measurement
• Higher level variables (ratio variables) can always be expressed as
lower level variables (other types) but this never works the other
way around
• Lowering the level of variables is always associated with
RELEVANT INFORMATION LOSS
• Therefore, in any study, you always should record data at the
highest possible level (ratios)
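Variable type   Variable                                                 Unit
Ratio           Weight, height                                           kg, m
Interval        BMI                                                      kg/m2
Ordinal         Being underweight, normal weight, overweight or obese    none
Nominal         Overweight or not                                        none
Conversion from top to bottom (ratio towards nominal): YES
Conversion from bottom to top (nominal towards ratio): NO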
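The information loss when lowering the level of measurement can be sketched in R. The weights and heights below are hypothetical, and the cut-offs 18.5, 25 and 30 kg/m2 are the commonly used adult BMI limits, assumed here for illustration rather than taken from the slide.

```r
# Hypothetical measurements for four adults (illustration only)
weight <- c(50, 68, 85, 110)         # kg (ratio scale)
height <- c(1.75, 1.70, 1.78, 1.80)  # m  (ratio scale)

bmi <- weight / height^2             # kg/m2

# Ordinal level: assumed adult cut-offs 18.5, 25 and 30 kg/m2
bmi_class <- cut(bmi,
                 breaks = c(-Inf, 18.5, 25, 30, Inf),
                 labels = c("underweight", "normal weight",
                            "overweight", "obese"),
                 right = FALSE, ordered_result = TRUE)

# Nominal level: overweight or not
overweight <- bmi >= 25

data.frame(weight, height, bmi = round(bmi, 1), bmi_class, overweight)
```

Going from `weight` and `height` to `overweight` is a one-way street: the original values cannot be recovered from the TRUE/FALSE column.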
Criteria of a Satisfactory Scale
Clarity of Definition of a Satisfactory Scale
“under 5 years”,
“5-9.9 years”,
“10-19.9 years”,
“20-29.9 years”,
“30-39.9 years”,
“40 years and more”,
“not applicable”.
• Frequencies (counts)
• Proportions
• Percentages
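As a minimal sketch, the three summaries listed above can be obtained in R with `table()` and `prop.table()`. The answers in `age_group` are invented for illustration.

```r
# Hypothetical answers of ten participants (illustration only)
age_group <- c("5-9.9", "10-19.9", "10-19.9", "20-29.9", "5-9.9",
               "10-19.9", "20-29.9", "10-19.9", "5-9.9", "10-19.9")

freq <- table(age_group)   # frequencies (counts)
prop <- prop.table(freq)   # proportions (sum to 1)
perc <- 100 * prop         # percentages (sum to 100)

freq
round(perc, 1)
```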
Summations
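$\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5$
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$
$\sum_{i=1}^{3} x_i^2 = x_1^2 + x_2^2 + x_3^2$
$\sum_{i=1}^{n} c x_i = c x_1 + c x_2 + c x_3 + \dots + c x_n = c \sum_{i=1}^{n} x_i$
Sum of the integers from 1 to 50:
1 + 50 = 51
2 + 49 = 51
3 + 48 = 51
… … …
25 + 26 = 51
$\sum_{i=1}^{50} i = 25 \times 51 = 1275$
Hence: $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$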
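The summation rules above can be checked quickly in R. The vector `x` and the constant `c_const` are arbitrary values chosen for illustration.

```r
x <- c(2, 5, 7)   # hypothetical values x1, x2, x3
c_const <- 3      # hypothetical constant c

sum(x)            # x1 + x2 + x3 = 14
sum(x^2)          # 4 + 25 + 49 = 78

# A constant factor can be pulled out of the sum
sum(c_const * x) == c_const * sum(x)   # TRUE

# Sum of the integers 1..n compared with n(n + 1)/2
n <- 50
sum(1:n)          # 1275
n * (n + 1) / 2   # 1275
```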
Arithmetic mean:
• Definition: $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{\sum X}{N}$
• More accurately called the arithmetic mean, it is defined
as the sum of measures observed divided by the number
of observations.
• We can use the arithmetic mean of the empirical
frequency distribution to estimate the mean value of the
variable in the study population.
• [R: mean(vector)]
[R] Exercise: Data Type
• Open RStudio and type the following calls, pressing ENTER after each one:
> typeof(TRUE)
> typeof("Sky")
> typeof("Bambi")
> typeof(cars)
> list(cars)
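For orientation, a short sketch of what `typeof()` reports for a few common kinds of values. The extra values `2L` and `2.5` are additions for illustration; `cars` is a data set that ships with R.

```r
typeof(TRUE)    # "logical"
typeof("Sky")   # "character"
typeof(2L)      # "integer"
typeof(2.5)     # "double"

# A data frame is stored internally as a list of columns
typeof(cars)    # "list"
class(cars)     # "data.frame"
```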
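Sample: $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is an estimate of µ (population)
$\bar{x} = \frac{1}{3} \sum_{i=1}^{3} x_i = \frac{1}{3}(6 + 7 + 5) = \frac{1}{3} \cdot 18 = 6$
E.g. let n=4; x1=1, x2=6, x3=2, x4=91
$\bar{x} = \frac{1}{4} \sum_{i=1}^{4} x_i = \frac{1}{4}(1 + 6 + 2 + 91) = 25$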
[R: sum(vector); length(vector); mean(vector)]
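The two worked examples can be reproduced with these calls. The vector names `x` and `y` are chosen here for illustration.

```r
x <- c(6, 7, 5)
sum(x) / length(x)   # 18 / 3 = 6
mean(x)              # 6

y <- c(1, 6, 2, 91)
sum(y)               # 100
length(y)            # 4
mean(y)              # 25
```

In the second example the single value 91 pulls the mean far above the other three observations.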
Median –The “Middle Value” (1)
Frequently used if there are single extreme values in a distribution
Definition
Value that divides the ‘ordered array’ into two equal parts
If an odd number of observations, the median will be the
(n+1)/2th observation
E.g., the median of 11 observations is the 6th observation
In the case of an even number of observations, the median is the
midpoint between the two central values.
E.g., the median of 12 observations is the midpoint (average)
between the 6th and 7th observation
[R: median(vector)]
• Type
> median(age), median(weight), median(height), and
median(BMI) and compare these values with the mean
values of the corresponding vectors.
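A minimal sketch of the odd and even cases, reusing the small data sets from the mean examples (the names `odd` and `even` are chosen here for illustration):

```r
odd  <- c(6, 7, 5)       # n = 3: median is the (3 + 1)/2 = 2nd ordered value
even <- c(1, 6, 2, 91)   # n = 4: midpoint of the 2nd and 3rd ordered values

sort(odd)      # 5 6 7
median(odd)    # 6

sort(even)     # 1 2 6 91
median(even)   # (2 + 6) / 2 = 4
mean(even)     # 25
```

With the extreme value 91 in the data, the median (4) stays close to the bulk of the observations while the mean (25) does not.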
Example:
Question to pupils in a class:
“How frequently are you
visiting a doctor each year?”
• May be more than one mode (bimodal, multimodal)
• Mode is poor measure of central tendency
• Not used frequently in practice
[Figure: two example distributions, x-axis from -10 to 10]
[R: names(sort(-table(vector)))[1]]
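Base R has no dedicated function for the statistical mode, so it is usually derived from a frequency table. The sketch below uses invented answers in `visits` and a variant that returns all modes, which matters for bimodal data.

```r
# Hypothetical answers: number of doctor visits per year
visits <- c(0, 1, 1, 2, 2, 2, 3, 5)

# One-liner from the slide: most frequent value (as a character string)
names(sort(-table(visits)))[1]                 # "2"

# Variant returning every value that reaches the maximum count
tab <- table(visits)
modes <- names(tab)[as.vector(tab) == max(tab)]
modes                                          # "2"

# Bimodal example: two values share the highest count
tab2 <- table(c(1, 1, 2, 2, 3))
modes2 <- names(tab2)[as.vector(tab2) == max(tab2)]
modes2                                         # "1" "2"
```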
[R] Exercise: Calculation of the Mode
(Symmetrical – Mean)
Interval Addition and subtraction
Skewed – Median
• RANGE
What is the highest, what is the lowest value?
• STANDARD DEVIATION
How closely do values cluster around the mean value?
• SKEWNESS
How symmetrical is the curve of distribution?
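Range and skewness can be sketched as follows. The data are hypothetical, and since base R has no skewness function, the helper `skewness()` below uses a simple moment-based definition that is an assumption here, not necessarily the one used in the course.

```r
x <- c(1, 2, 3, 4, 100)   # hypothetical data with one extreme value

range(x)                  # lowest and highest value: 1 100
diff(range(x))            # width of the range: 99

# Moment-based skewness (assumed definition, not part of base R)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skewness(x)                  # positive: long tail to the right
skewness(c(1, 2, 3, 4, 5))   # 0: symmetrical
```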
R1=200; R2=20
If we calculate mean:
Set 1. n = 7, $\bar{x} = 1$
Set 2. n = 7, $\bar{x} = 1$
$\sigma^2 = \frac{1}{N}(X_1 - \mu)^2 + \frac{1}{N}(X_2 - \mu)^2 + \dots + \frac{1}{N}(X_N - \mu)^2 = \frac{\sum (X_i - \mu)^2}{N}$
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$
as standard deviation,
unit is X unit.
Variance and Standard Deviation (Sample)
$s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$
as sample variance (s2) and
$s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$
as standard deviation (s)
(of the sample: note “n-1” in the denominator!)
Population: $\mu \pm \sigma$; sample: $\bar{x} \pm s$
[R: var(vector), sd(vector)]
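The difference between dividing by N and by n-1 can be made visible with a small hypothetical data set. Note that `var()` and `sd()` in R always use the sample formulas with n-1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data, mean = 5
n <- length(x)

ss <- sum((x - mean(x))^2)       # sum of squared deviations: 32

# Population formulas (divide by N)
ss / n                           # 4
sqrt(ss / n)                     # 2

# Sample formulas (divide by n - 1); identical to var(x) and sd(x)
ss / (n - 1)
var(x)
sqrt(ss / (n - 1))
sd(x)
```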
Standard Error of the Mean (Sample)
• [R: sd(vector)/sqrt(length(vector))]
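A brief sketch of the call above on hypothetical data. Repeating the same values four times is an artificial way to show that the standard error shrinks as n grows while the standard deviation stays in the same region.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)     # hypothetical sample, n = 8
sem <- sd(x) / sqrt(length(x))

x_large <- rep(x, 4)               # same values, n = 32 (illustration only)
sem_large <- sd(x_large) / sqrt(length(x_large))

c(sd = sd(x), sem = sem)
c(sd = sd(x_large), sem = sem_large)
```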
When to Use SEM, When SD?
Definition of CV: $CV = \frac{SD}{\bar{x}} \times 100$
[R: sd(vector)/mean(vector)*100]
Set   Mean   s2       s      CV [%]
1     1      3773.7   61.4   6140.0
2     1      44.7     6.7    670.0
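The CV column of the table follows directly from the mean and standard deviation columns. The sketch below recomputes it from the tabulated values; the helper `cv()` is a hypothetical convenience function wrapping the call shown above.

```r
xbar <- c(1, 1)        # means of set 1 and set 2 (from the table)
s    <- c(61.4, 6.7)   # standard deviations (from the table)

s / xbar * 100         # CV in percent for set 1 and set 2

# Hypothetical helper for raw data vectors
cv <- function(v) sd(v) / mean(v) * 100
cv(c(2, 4, 4, 4, 5, 5, 7, 9))
```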
[Figure labels: normal, underweight, extremely underweight]
Child 1: 5 years, BMI: 18.8, mean (5 ys): 15.45 kg/m2; s=2.05 => Z= +1.64
Child 2: 16 years, BMI: 19.0, mean (16 ys): 20.48 kg/m2; s=1.25 => Z= -1.63
So: what is the call for z-scores in R?
[R: (vector-mean(vector))/sd(vector)]
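A minimal sketch of the z-score call on hypothetical data. After standardisation the values have mean 0 and standard deviation 1, and the built-in `scale()` gives the same result.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical data

z <- (x - mean(x)) / sd(x)       # z-scores
round(z, 2)

mean(z)                          # 0 (up to rounding error)
sd(z)                            # 1

# scale() centres and standardises in one step (returns a matrix)
z_scale <- as.vector(scale(x))
```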
Recap: Centrality and Dispersion
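Measure    Population                                        Sample
Mean       $\mu = \frac{\sum X_i}{N}$                        $\bar{x} = \frac{\sum X_i}{n}$
Variance   $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$         $s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$
SD         $\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$    $s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$
CV                                                           $CV = \frac{s}{\bar{x}} \times 100$
SEM                                                          $SE_m = \frac{s}{\sqrt{n}}$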
Tools in [R] that Make Your Life Easier (1)
X-axis (abscissa):
Categorical ordinal variables or numerical interval variables
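A histogram can be sketched with `hist()`. The data and the class limits in `breaks` are hypothetical; with `plot = FALSE` the function only returns the counts per class, and leaving that argument out draws the plot.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)   # hypothetical data

# Classes of width 1 centred on the integers (assumed class limits)
h <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)

h$mids     # class midpoints shown on the x-axis: 1 2 3 4 5
h$counts   # frequencies shown on the y-axis:     1 2 3 2 1
```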
Did I get it?
Yes, if you can …
o Define the terms population, sample, random sample, variable,
parameter, data set
o List the 3 principal levels of statistical data evaluation
o Explain the 4 types of scales that can be used in statistics
o Define the criteria for a satisfactory scale in an investigation
o Work with the summation symbol
o Define the terms centrality, dispersion, and skewness on a
statistical basis
o Explain the terms mean, median, and mode, and calculate each of
them
o Explicate the terms range, standard deviation, standard error of
the mean, coefficient of variation, standard normal, and
histogram