Lecture 3 Numerical Measures of Data
Feng-An Yang
Fall Semester
Outline
Measures of Location
Mean
Median
Mode
Shape of a distribution
Measures of Variation
Range
Variance and Standard Deviation
Coefficient of Variation
Grouped Data
Measures of Position
Percentile
Location of Percentile
Quartile and Decile
Box plot
Measures of Location
Numerical measures used to describe the central tendency of the data.
Common measures of location:
- Mean
- Median
- Mode
Mean
A numerical average of a set of numbers. Common types:
- Arithmetic mean
- Weighted mean
- Geometric mean
Example
- The mean height of AGEC students is 172 cm.
- The mean weight of AGEC students is 55.3 kg.
Arithmetic Mean
The arithmetic mean is the simplest and most widely used type of mean: the sum of all the numbers in a dataset divided by the number of observations in that dataset.
Population Mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example
{90,77,94,89,119,112,91,110,92,100,113,83}
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{90 + 77 + \cdots + 83}{12} = \frac{1{,}170}{12} = 97.5$$
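The sample mean calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names `data` and `x_bar` are chosen here for illustration.

```python
from statistics import mean

# The twelve observations from the sample mean example
data = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

# Sample mean: sum of the observations divided by the number of observations
x_bar = sum(data) / len(data)

print(sum(data), len(data), x_bar)  # 1170 12 97.5
print(mean(data))                   # same result from the standard library
```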
Example
The deviations from the mean always sum to zero. For {3,7,5}, x̄ = 5:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = (3 - 5) + (7 - 5) + (5 - 5) = 0$$
Example
The mean is sensitive to extreme values:
- A={1,2,3,4,5}, x̄A = 3
- B={1,2,3,4,100}, x̄B = 22
Median
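The midpoint of the values in a dataset once they are sorted: the middle value when the number of observations is odd, and the average of the two middle values when it is even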
- The median is less sensitive to extreme values than the mean
- The median is unique
Example
- A={1,2,3,4,5}, x̄A = 3, median=3
- B={1,2,3,4,100}, x̄B = 22, median=3
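The comparison between A and B can be reproduced as follows. This is an illustrative sketch using Python's standard `statistics` module; the names `A` and `B` mirror the example.

```python
from statistics import mean, median

A = [1, 2, 3, 4, 5]
B = [1, 2, 3, 4, 100]  # same as A except for one extreme value

# The extreme value pulls the mean up but leaves the median unchanged
for name, values in (("A", A), ("B", B)):
    print(name, "mean =", mean(values), "median =", median(values))
```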
Mode
The value that appears most often in a dataset
- The mode is less sensitive to extreme values than the mean
- There may be multiple modes
Example
{4,4,4,3,100,3,1,3,5,2,2,5,6,1,2,2,3,7,
1,3,7,8,1,4,7,5,2,2,5,1,1,3,3,1,2}
Value    Frequency
1        7
2        7
3        7
4        4
5        4
6        1
7        3
8        1
100      1
The values 1, 2, and 3 each appear 7 times, so this dataset has three modes.
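A frequency count like the one in this example can be produced directly from the listed data. The sketch below uses `collections.Counter` from the standard library; the names `freq`, `top`, and `modes` are chosen here for illustration.

```python
from collections import Counter

data = [4, 4, 4, 3, 100, 3, 1, 3, 5, 2, 2, 5, 6, 1, 2, 2, 3, 7,
        1, 3, 7, 8, 1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2]

# Count how often each value occurs
freq = Counter(data)

# The mode(s) are the value(s) with the highest frequency
top = max(freq.values())
modes = sorted(value for value, count in freq.items() if count == top)

for value, count in sorted(freq.items()):
    print(value, count)
print("modes:", modes)
```

Collecting every value that reaches the highest count, rather than only the first one found, reflects the point that a dataset may have several modes.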
Shape of a distribution
Skewness
Skewness is a measure of the asymmetry of a data distribution
[Figure: three distributions with the mean marked. (a) Left-skewed: Mean < Median; (b) Symmetric: Mean = Median; (c) Right-skewed: Mean > Median]
Measures of Variation
Numerical measures used to describe the spread of data.
Common measures of variation:
- Range
- Variance and Standard Deviation
- Coefficient of Variation
Range
The difference between the largest and the smallest values in a
dataset
Range = Maximum value - Minimum value
Example
{7,8,13,15,27,30}, Range=30-7=23
Issues
- It can be affected by extreme values
  - {7,8,13,15,27,30}, Range=30-7=23
  - {7,8,13,15,27,130}, Range=130-7=123
- It tells nothing about how data are distributed
Variance
The arithmetic mean of the squared deviations from the mean
Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
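$$
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \\
&= \frac{\sum_{i=1}^{n} \left(x_i^2 - 2 x_i \bar{x} + \bar{x}^2\right)}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2}{n-1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}
\end{aligned}
$$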
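Example
         x      x²      x − x̄    (x − x̄)²
         12     144     -5       25
         20     400     3        9
         16     256     -1       1
         18     324     1        1
         19     361     2        4
Total    85     1485    0        40
With n = 5 and x̄ = 85/5 = 17, both forms of the formula give the same result:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{40}{5-1} = 10$$
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} = \frac{1485 - 5 \times 17^2}{5-1} = 10$$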
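The two ways of computing the sample variance in this example can be compared numerically. This is a minimal sketch; the names `s2_def` and `s2_short` are illustrative labels for the definitional and shortcut formulas, and `statistics.variance` is the standard library's sample variance (divisor n − 1).

```python
from statistics import variance

x = [12, 20, 16, 18, 19]
n = len(x)
x_bar = sum(x) / n  # 17.0

# Definitional formula: sum of squared deviations divided by n - 1
s2_def = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Shortcut formula: (sum of squares - n * mean^2) divided by n - 1
s2_short = (sum(xi ** 2 for xi in x) - n * x_bar ** 2) / (n - 1)

print(s2_def, s2_short, variance(x))
```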
Properties of Variance
- Variance and standard deviation can never be negative
- Variance and standard deviation do not depend on the location of data
- The more concentrated the data are, the smaller the variance and standard deviation
- What if there is no variation in the data, i.e., all values are the same?
Empirical Rule
For a symmetrical, bell-shaped distribution, approximately 68%, 95%, and 99.7% of the observations lie within plus and minus one, two, and three standard deviations of the mean, respectively
- Pr(µ − σ ≤ X ≤ µ + σ) ≈ 68%
- Pr(µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
- Pr(µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99.7%
[Figure: bell-shaped curve with the central 68%, 95%, and 99.7% regions marked]
Chebyshev’s Theorem
For any set of observations (sample or population), the proportion of values that lie within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any value greater than 1
Example
The average height of AGEC students is 170 cm and the corresponding standard deviation is 10 cm. At least what percentage of students have a height within 3 standard deviations of the mean?
$$1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} \approx 0.89$$
So at least about 89% of students are between 140 cm and 200 cm tall.
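The Chebyshev bound for the height example can be evaluated with a small helper. This is a sketch; the function name `chebyshev_lower_bound` and the variable names are hypothetical, chosen here for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

# Height example: mean 170 cm, standard deviation 10 cm, k = 3
mean_height, sd_height, k = 170, 10, 3
bound = chebyshev_lower_bound(k)
low, high = mean_height - k * sd_height, mean_height + k * sd_height

print(f"At least {bound:.1%} of heights lie between {low} and {high} cm")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why it is weaker (about 89% instead of 99.7% for k = 3).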
Coefficient of Variation
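The standard deviation expressed relative to the mean, usually as a percentage, which makes it possible to compare the spread of datasets measured in different units or on different scales
$$CV = \frac{s}{\bar{x}} \times 100\%$$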
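Mean of Grouped Data
$$\bar{x} = \frac{\sum f \times M}{n}$$
- f is the frequency in each class
- M is the midpoint in each class
- n is the total number of observations (the sum of the frequencies); the sum runs over the classes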
Example
Point    Frequency (f)    Midpoint (M)    f × M
0-10     5                5               25
10-20    1                15              15
20-30    3                25              75
30-40    4                35              140
40-50    2                45              90
Total    15                               345
$$\bar{x} = \frac{\sum f \times M}{n} = \frac{345}{15} = 23$$
Standard Deviation of Grouped data
Standard Deviation
$$s = \sqrt{\frac{\sum f (M - \bar{x})^2}{n-1}}$$
Example
Point    Frequency (f)    Midpoint (M)    f × M    M − x̄    (M − x̄)²    f (M − x̄)²
0-10     5                5               25       -18      324         1620
10-20    1                15              15       -8       64          64
20-30    3                25              75       2        4           12
30-40    4                35              140      12       144         576
40-50    2                45              90       22       484         968
Total    15                               345                           3240
$$s = \sqrt{\frac{\sum f (M - \bar{x})^2}{n-1}} = \sqrt{\frac{3240}{14}} \approx 15.21$$
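The grouped-data calculations can be reproduced from the frequencies and midpoints alone. The following is a minimal sketch; the list `classes` and the variable names are chosen here for illustration.

```python
from math import sqrt

# One (frequency f, midpoint M) pair per class, taken from the table
classes = [(5, 5), (1, 15), (3, 25), (4, 35), (2, 45)]

n = sum(f for f, _ in classes)              # total number of observations
x_bar = sum(f * M for f, M in classes) / n  # grouped mean

# Sum of frequency-weighted squared deviations of the midpoints from the mean
ss = sum(f * (M - x_bar) ** 2 for f, M in classes)
s = sqrt(ss / (n - 1))                      # grouped sample standard deviation

print(n, x_bar, ss, round(s, 2))
```

Because every observation in a class is replaced by the class midpoint, these values approximate the mean and standard deviation of the underlying raw data.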
Measures of Position
Numerical measures used to divide data into equal parts.
Common measures of position:
- Quartile
- Decile
- Percentile
Percentile
A percentile is a value below which a given percentage of the observations in a dataset fall
Example
- The 87th percentile is 90, which indicates that 87% of observations are below 90
Location of Percentile
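To locate the p-th percentile in n ordered observations, compute the index i = p/100 × n
- If i is a whole number, the percentile is the simple average of the i-th and (i+1)-th values
- If i is not a whole number, round it up; the percentile is the value at that position
Note
There are other ways to determine a percentile, such as the nearest-rank method and the linear interpolation method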
Example
{43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87,
88, 89, 93, 95, 96, 98, 99, 99}
- Suppose we want to find the 60th percentile. Index i = 60/100 × 25 = 15
- Since the index is a whole number, the 60th percentile is the simple average of the 15th and 16th values
- P60 = (79 + 85)/2 = 82
Example
{34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87}
- Suppose we want to find the 80th percentile. Index i = 80/100 × 12 = 9.6
- Since the index is not a whole number, we round it up to 10. The 80th percentile is then the value at the 10th position in the ordered data
- P80 = 85
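The location rule used in the two percentile examples can be sketched as a small function. The name `percentile` and the restriction to whole-number p between 1 and 99 are assumptions made here for illustration; library routines such as those in NumPy use interpolation by default and may return different values.

```python
def percentile(values, p):
    """p-th percentile by the index rule i = p/100 * n (p a whole number, 1..99)."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    data = sorted(values)
    n = len(data)
    # Integer arithmetic avoids floating-point error when testing for a whole index
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Whole-number index: average the i-th and (i+1)-th ordered values
        return (data[whole - 1] + data[whole]) / 2
    # Otherwise round the index up and take the value at that position
    return data[whole]

scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87,
          88, 89, 93, 95, 96, 98, 99, 99]
second = [34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87]

print(percentile(scores, 60))  # 82.0
print(percentile(second, 80))  # 85
```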
Quartile and Decile
Quartiles
- The first quartile, Q1, is equal to the 25th percentile, indicating that 25% of observations are below it
- The second quartile, Q2, is equal to the 50th percentile. It is also the median, which splits the data in half
- The third quartile, Q3, is equal to the 75th percentile, indicating that 75% of observations are below it
- Interquartile range = Q3 − Q1
Deciles
In a similar fashion to quartiles, deciles are nine values that divide the data into ten equal parts
Box plot
- A box plot is a graphical representation of the distribution of a data set
- It displays the median, quartiles, and potential outliers of the data, providing a visual summary of its central tendency and spread
- Also known as a box-and-whisker plot
Components of a Box Plot
- Box
  - The central box represents the interquartile range (IQR), which includes the middle 50% of the data
  - The edges of the box are the first quartile (Q1) and the third quartile (Q3)
- Median Line
  - A line inside the box represents the median (the 50th percentile), which divides the data into two equal halves
- Whiskers
  - Whiskers extend from the edges of the box to the minimum and maximum values within a defined range, typically 1.5 times the IQR from Q1 and Q3
  - They show the spread of the data outside the middle 50%
- Outliers
  - Data points that fall outside the whiskers are considered outliers and are often marked with individual points or symbols
Min and Max as the boundary
[Figure: box plot with whiskers extending to the minimum and maximum; values marked: 55, 70, 80, 90, 100]
1.5 IQR as the boundary
- Data: 30,50,51,53,53,54,54,58,59,60,61,62,62,64,65,67,68,69,80,90
- Summaries
  - Minimum: 30
  - Q1 (First Quartile): 53.5
  - Median (Q2): 60.5
  - Q3 (Third Quartile): 66
  - Maximum: 90
- Lower and upper bound
  - Interquartile Range (IQR) = Q3 - Q1 = 66 - 53.5 = 12.5
  - Lower Bound = 53.5 - 1.5 × 12.5 = 34.75
  - Upper Bound = 66 + 1.5 × 12.5 = 84.75
- Outliers: 30 and 90
[Figure: box plot with whiskers limited to the 1.5 × IQR bounds and outliers marked individually]
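The 1.5 × IQR rule can be applied to the listed data in a few lines. This sketch takes Q1 = 53.5 and Q3 = 66 from the summary as given rather than recomputing them; the variable names are chosen here for illustration.

```python
data = [30, 50, 51, 53, 53, 54, 54, 58, 59, 60,
        61, 62, 62, 64, 65, 67, 68, 69, 80, 90]

# Quartiles as listed in the summary (assumed, not recomputed here)
q1, q3 = 53.5, 66

iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Observations outside [lower, upper] are flagged as potential outliers
outliers = [x for x in data if x < lower or x > upper]

print("IQR:", iqr)
print("bounds:", lower, upper)
print("outliers:", outliers)
```

The whiskers of the box plot then end at the smallest and largest observations that still lie inside the two bounds.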