Lecture 3 Numerical Measures of Data

Lecture 3.

Numerical Measures of Data


AGEC 2001 Statistics I

Feng-An Yang1

1 Department of Agricultural Economics


National Taiwan University

Fall Semester

1/36
Outline
Measures of Location
Mean
Median
Mode
Shape of a distribution
Measures of Variation
Range
Variance and Standard Deviation
Coefficient of Variation
Grouped Data
Measures of Position
Percentile
Location of Percentile
Quartile and Decile
Box plot
2/36
Measures of Location

Measures of Location
Numerical measures used to describe the central tendency of the
data
I Common measures of location
I Mean
I Median
I Mode

3/36
Mean

Mean
A numerical average of a set of numbers
I Arithmetic Mean
I Weighted Mean
I Geometric Mean

Example
I The mean height of AGEC students is 172 cm.
I The mean weight of AGEC students is 55.3 kg.

4/36
Arithmetic Mean

Arithmetic Mean
The arithmetic mean is the simplest and most widely used mean. It is
the sum of all the values in a dataset divided by the number of
observations in that dataset

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

I µ is the population mean
I N is the number of observations in the population
I xi is the value of the i-th observation

5/36
Arithmetic Mean

Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

I x̄ is the sample mean


I n is the number of observations in the sample

Example
{90,77,94,89,119,112,91,110,92,100,113,83}
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{90 + 77 + \cdots + 83}{12} = \frac{1{,}170}{12} = 97.5$$
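
The sample mean in this example can be reproduced in a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the lecture.

```python
from statistics import mean


def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


scores = [90, 77, 94, 89, 119, 112, 91, 110, 92, 100, 113, 83]

x_bar = arithmetic_mean(scores)
print(x_bar)         # 97.5

# The standard library's statistics.mean gives the same result.
print(mean(scores))  # 97.5
```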

6/36
Arithmetic Mean

Properties of the Arithmetic Mean


I All values in the dataset are used in the calculation of the mean
I The mean is unique
I The sum of the deviations from the mean is zero
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

Example
{3,7,5}, x̄ = 5
$$\sum_{i=1}^{n} (x_i - \bar{x}) = (3 - 5) + (7 - 5) + (5 - 5) = 0$$
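
A quick numerical check of the zero-sum property, as a sketch. The helper name `deviations_from_mean` is illustrative. With data whose mean is not exactly representable in floating point, the sum may differ from zero by a tiny rounding error, so `math.isclose` is the safer comparison in general.

```python
import math


def deviations_from_mean(values):
    """Return each value minus the mean of the values."""
    x_bar = sum(values) / len(values)
    return [x - x_bar for x in values]


devs = deviations_from_mean([3, 7, 5])
print(devs)       # [-2.0, 2.0, 0.0]
print(sum(devs))  # 0.0

# For arbitrary data, compare against zero with a tolerance.
other = deviations_from_mean([0.1, 0.2, 0.4])
print(math.isclose(sum(other), 0.0, abs_tol=1e-12))  # True
```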

7/36
Arithmetic Mean

Properties of the Arithmetic Mean (cont’d)


I The mean can be affected by extreme values

Example
I A={1,2,3,4,5}, x̄A = 3
I B={1,2,3,4,100}, x̄B = 22

8/36
Median

Median
The middle value of a dataset once its values are sorted

Steps for finding the median


I Sort the data in ascending (or descending) order
I If the number of observations n is odd, the median is the value at
position (n + 1)/2
I Example: {11, 17, 25, 38, 60}. The median is 25
I If the number of observations is even, the median is the simple
average of the two middle values
I Example: {11, 17, 25, 38, 60, 65}. The median is (25 + 38)/2 = 31.5
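
The steps above can be sketched as a short Python function. The name `find_median` is illustrative; Python lists are 0-based, so the 1-based position (n + 1)/2 corresponds to index n // 2.

```python
def find_median(values):
    """Median by sorting and picking (or averaging) the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1)/2.
        return ordered[mid]
    # Even n: simple average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(find_median([11, 17, 25, 38, 60]))      # 25
print(find_median([11, 17, 25, 38, 60, 65]))  # 31.5
```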

9/36
Median

Median
I The median is less sensitive to extreme values than the mean
I The median is unique

Example
I A={1,2,3,4,5}, x̄A = 3, median=3
I B={1,2,3,4,100}, x̄B = 22, median=3
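
The comparison of A and B can be reproduced with the standard `statistics` module. A small sketch:

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]  # same as a, except for one extreme value

# The extreme value pulls the mean up but leaves the median unchanged.
print(mean(a), median(a))  # 3 3
print(mean(b), median(b))  # 22 3
```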

10/36
Mode

Mode
The value that appears most often in a dataset
I The mode is less sensitive to extreme values than the mean
I There may be multiple modes

Steps for finding the mode


I Organize the data and make a frequency table
I The mode is the value(s) with highest frequency

11/36
Mode

Example
{4,4,4,3,100,3,1,3,5,2,2,5,6,1,2,2,3,7,
1,3,7,8,1,4,7,5,2,2,5,1,1,3,3,1,2}

Value Frequency
1 7
2 7
3 7
4 4
5 4
6 1
7 3
8 1
100 1

I The modes are 1, 2, and 3
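
The frequency-table approach can be sketched with `collections.Counter`. The function name `find_modes` is illustrative. The standard library also offers `statistics.multimode` (Python 3.8+), which returns the modes in order of first appearance rather than sorted.

```python
from collections import Counter


def find_modes(values):
    """Return every value that attains the highest frequency, sorted."""
    counts = Counter(values)          # frequency table: value -> count
    top = max(counts.values())        # highest frequency
    return sorted(v for v, c in counts.items() if c == top)


data = [4, 4, 4, 3, 100, 3, 1, 3, 5, 2, 2, 5, 6, 1, 2, 2, 3, 7,
        1, 3, 7, 8, 1, 4, 7, 5, 2, 2, 5, 1, 1, 3, 3, 1, 2]

print(find_modes(data))  # [1, 2, 3]
```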

12/36
Shape of a distribution

Skewness
Skewness is a measure of the asymmetry of a data distribution

[Figure: three density curves, each marking the mean, median, and mode. (a) Left-skewed: Mean < Median; (b) Symmetric: Mean = Median; (c) Right-skewed: Mean > Median]

13/36
Measures of Variation
Measures of Variation
Numerical measures used to describe the spread of data
I Common measures of variation
I Range
I Variance and Standard Deviation
I Coefficient of Variation

Why study dispersion?


Measures of location describe the central tendency of data, but they
tell nothing about the variability of the data. Two data distributions
can have the same central tendency but quite different variability
[Figure: distribution curves plotted against x, for x from 0 to 10]
14/36
Range

Range
The difference between the largest and the smallest values in a
dataset
Range = Maximum value - Minimum value

Example
{7,8,13,15,27,30}, Range=30-7=23

Issues
I It can be affected by extreme values
I {7,8,13,15,27,30}, Range=30-7=23
I {7,8,13,15,27,130}, Range=130-7=123
I It tells nothing about how data are distributed
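
A minimal sketch of the range and its sensitivity to a single extreme value. The function name `data_range` is illustrative.

```python
def data_range(values):
    """Maximum value minus minimum value."""
    return max(values) - min(values)


print(data_range([7, 8, 13, 15, 27, 30]))   # 23
print(data_range([7, 8, 13, 15, 27, 130]))  # 123
```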

15/36
Variance

Variance
The arithmetic mean of the squared deviations from the mean

Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

I σ² is the population variance


I xi is the value of i-th observation
I µ is the population mean
I N is the number of observations in the population

16/36
Variance

Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

I s² is the sample variance


I x̄ is the sample mean

Sample Standard Deviation


$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

17/36
Variance

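$$
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} \left( x_i^2 - 2 x_i \bar{x} + \bar{x}^2 \right)}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2}{n - 1} \\
&= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}
\end{aligned}
$$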
18/36
Variance
Example

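         x     x²   x − x̄   (x − x̄)²
        12    144     -5       25
        20    400      3        9
        16    256     -1        1
        18    324      1        1
        19    361      2        4
Total   85   1485      0       40

With x̄ = 85/5 = 17:

$$
\begin{aligned}
s^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{40}{5 - 1} = 10 \\
&= \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{1485 - 5 \times 17^2}{5 - 1} = 10
\end{aligned}
$$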
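
Both forms of the sample variance can be checked on this example. A sketch with illustrative function names; `statistics.variance` is the standard-library equivalent. In floating point the shortcut form can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer default.

```python
import math
from statistics import variance


def sample_variance(values):
    """Definitional form: sum of squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut form: (sum of squares - n * mean^2) over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return (sum(x * x for x in values) - n * x_bar ** 2) / (n - 1)


data = [12, 20, 16, 18, 19]

print(sample_variance(data))             # 10.0
print(sample_variance_shortcut(data))    # 10.0
print(variance(data))                    # 10
print(math.sqrt(sample_variance(data)))  # sample standard deviation, about 3.16
```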
19/36
Variance

Properties of Variance
I Variance and standard deviation can never be negative
I Variance and standard deviation do not depend on the
location of data
I The more concentrated the data are, the smaller the variance
and standard deviation
I What if there is no variation in the data, i.e., all values are the
same?

[Figure: distribution curves plotted against x, for x from −2 to 12]
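
The properties of variance listed above can be checked numerically. A small sketch using the standard `statistics` module; the shift of +100 and the constant dataset are made-up illustrations, not data from the lecture.

```python
from statistics import variance

data = [12, 20, 16, 18, 19]
shifted = [x + 100 for x in data]  # same spread, different location
constant = [7, 7, 7, 7]            # no variation at all

# Shifting every value by a constant leaves the variance unchanged.
print(variance(data), variance(shifted))  # 10 10

# If all values are the same, every deviation is zero, so the variance is zero.
print(variance(constant))                 # 0
```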
20/36
Empirical Rule

Empirical Rule
For a symmetrical, bell-shaped distribution, approximately 68%,
95%, and 99.7% of the observations lie within one, two, and three
standard deviations of the mean, respectively
I Pr(µ − σ ≤ X ≤ µ + σ) ≈ 68%
I Pr(µ − 2σ ≤ X ≤ µ + 2σ) ≈ 95%
I Pr(µ − 3σ ≤ X ≤ µ + 3σ) ≈ 99.7%
[Figure: bell curve centered at µ with 68%, 95%, and 99.7% of the area within ±1σ, ±2σ, and ±3σ]
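
For an exactly normal distribution, the three percentages can be computed from the error function, since P(µ − kσ ≤ X ≤ µ + kσ) = erf(k/√2). The sketch below assumes normality; for real data that are only roughly bell-shaped, the rule is an approximation. The function name `normal_within` is illustrative.

```python
import math


def normal_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```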


21/36
Chebyshev’s Theorem

Chebyshev’s Theorem
For any set of observations (sample or population), the proportion
of values that lie within k standard deviations of the mean is at
least 1 − 1/k², where k is any value greater than 1

Example
The average height of AGEC students is 170 cm and the
corresponding standard deviation is 10 cm. At least what percentage
of students have heights within 3 standard deviations of the
mean?
$$1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} \approx 0.89$$
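
A sketch of the Chebyshev bound, together with a check against a deliberately skewed dataset (set B from the earlier mean example). Both function names are illustrative. The observed share can be far above the bound, because the theorem only guarantees a minimum.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def share_within(values, k):
    """Observed share of values within k population standard deviations of the mean."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5
    return sum(abs(x - mu) <= k * sigma for x in values) / n


print(round(chebyshev_lower_bound(3), 2))  # 0.89

b = [1, 2, 3, 4, 100]
# The bound holds even for this skewed data.
print(share_within(b, 1.5) >= chebyshev_lower_bound(1.5))  # True
```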

22/36
Coefficient of Variation

Coefficient of Variation (CV)


The coefficient of variation is a standardized measure of dispersion
of a data distribution, expressed as a percentage
I CV = (s / x̄) × 100%, where s is the sample standard deviation and x̄
is the sample mean
I It quantifies the variability relative to the mean and facilitates
the comparison of variability among data distributions with
different units or significantly different means

23/36
Coefficient of Variation
Example

Pollutant  Mean       Standard Deviation  CV
PM2.5      100 µg/m³  10 µg/m³            10%
Ozone      50 ppm     10 ppm              20%

Relative to its mean, the ozone concentration is more variable than
the PM2.5 concentration

Example

Company  Mean Production  Standard Deviation  CV
A        10000            10                  0.1%
B        50               10                  20%

Companies A and B have the same standard deviation of production, but
company B's production is more variable relative to its mean
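
The CV values in both tables follow directly from the formula. A minimal sketch; the function name is illustrative, and it takes the mean and standard deviation as given rather than computing them from raw data.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


print(coefficient_of_variation(10, 100))    # PM2.5: 10.0
print(coefficient_of_variation(10, 50))     # Ozone: 20.0
print(coefficient_of_variation(10, 10000))  # Company A: 0.1
print(coefficient_of_variation(10, 50))     # Company B: 20.0
```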
24/36
Arithmetic Mean of Grouped data

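Mean
$$\bar{x} = \frac{\sum f M}{n}$$
I f is the frequency of each class
I M is the midpoint of each class
I n is the total number of observations (the sum of the frequencies)
I The sum runs over the classes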

Example
Point  Frequency (f)  Midpoint (M)  f × M
0-10   5              5             25
10-20  1              15            15
20-30  3              25            75
30-40  4              35            140
40-50  2              45            90
Total  15                           345

$$\bar{x} = \frac{\sum f M}{n} = \frac{345}{15} = 23$$

25/36
Standard Deviation of Grouped data

Standard Deviation
$$s = \sqrt{\frac{\sum f (M - \bar{x})^2}{n - 1}}$$

Example
Point  Frequency (f)  Midpoint (M)  f × M  (M − x̄)  (M − x̄)²  f (M − x̄)²
0-10   5              5             25     -18      324       1620
10-20  1              15            15     -8       64        64
20-30  3              25            75     2        4         12
30-40  4              35            140    12       144       576
40-50  2              45            90     22       484       968
Total  15                           345                       3240

$$s = \sqrt{\frac{\sum f (M - \bar{x})^2}{n - 1}} = \sqrt{\frac{3240}{14}} \approx 15.21$$
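
The grouped-data calculations can be sketched as follows. Each class is written as a hypothetical `(lower, upper, frequency)` tuple, and the function name is illustrative. Every observation in a class is treated as if it sat at the class midpoint, so the results approximate the values that raw data would give.

```python
import math


def grouped_mean_and_std(classes):
    """Mean and sample standard deviation from (lower, upper, frequency) classes."""
    freqs = [f for _, _, f in classes]
    midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
    n = sum(freqs)  # total number of observations
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    squares = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return x_bar, math.sqrt(squares / (n - 1))


classes = [(0, 10, 5), (10, 20, 1), (20, 30, 3), (30, 40, 4), (40, 50, 2)]

x_bar, s = grouped_mean_and_std(classes)
print(x_bar)        # 23.0
print(round(s, 2))  # 15.21
```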

26/36
Measures of Position

Measures of Position
Numerical measures used to divide data into equal parts
I Common measures of Position
I Quartile
I Decile
I Percentile

27/36
Percentile

Percentile
A percentile is a value below which a given percentage of the
observations in a dataset fall

Example
I The 87th percentile is 90 and it indicates that 87% of
observations are below 90

28/36
Location of Percentile

Steps for finding the pth percentile


I 1. Sort the data in ascending order
I 2. Multiply p percent by the number of observations n in the
data. Call the result the index i, so i = (p/100) × n
I 3. Check the index from Step 2.
I If i is a whole number, the pth percentile is the simple
average of the ith and (i + 1)th values in the
ordered data
I Otherwise, round i up to the next whole number ⌈i⌉.
The pth percentile is the ⌈i⌉th value in the ordered data

Note
There are other ways to determine a percentile, such as the
nearest-rank method and the linear interpolation method

29/36
Location of Percentile

Example
{43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87,
88, 89, 93, 95, 96, 98, 99, 99}
I Suppose we want to find the 60th percentile. Index
i = 60/100 × 25 = 15
I Since the index is a whole number, the 60th percentile is the
simple average of the 15th and 16th values
I P60 = (79 + 85)/2 = 82

30/36
Location of Percentile

Example
{34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87}
I Suppose we want to find the 80th percentile. Index
i = 80/100 × 12 = 9.6
I Since the index is not a whole number, we round it up to 10.
Then the 80th percentile is at the 10th position in the
ordered data
I P80 = 85
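
The index rule used in these two examples can be sketched in Python. The function name `percentile` is illustrative, and the sketch is restricted to whole-number p between 1 and 99 so that the index can be computed exactly with integer arithmetic. Library routines such as `statistics.quantiles` use interpolation rules and can return different values.

```python
def percentile(values, p):
    """pth percentile (integer p, 0 < p < 100) using the index rule i = p/100 * n."""
    if not 0 < p < 100:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    # whole is the integer part of i; remainder is zero exactly when i is whole.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # Average the i-th and (i + 1)-th values (1-based positions).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Round i up: the ceil(i)-th value (1-based) sits at 0-based index `whole`.
    return ordered[whole]


scores_25 = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79,
             85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
scores_12 = [34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87]

print(percentile(scores_25, 60))  # 82.0
print(percentile(scores_12, 80))  # 85
```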

31/36
Quartile and Decile

Quartiles
I The first quartile is called Q1 and it is equal to the 25th
percentile, indicating that 25% of observations are below it
I The second quartile is called Q2 and it is equal to the 50th
percentile. It is also simply the median that splits the data in
half
I The third quartile is called Q3 and it is equal to the 75th
percentile, indicating that 75% of observations are below it
I Interquartile range = Q3 − Q1

Deciles
In a similar fashion to Quartiles, Deciles are nine values that divide
the data into ten equal parts

32/36
Box plot

Box plot
I A box plot is a graphical representation of the distribution of
a data set
I It displays the median, quartiles, and potential outliers of the
data, providing a visual summary of its central tendency and
spread
I Also known as a box-and-whisker plot

33/36
Box plot
Components of a Box Plot
I Box
I The central box represents the interquartile range (IQR), which
includes the middle 50% of the data
I The edges of the box are the first quartile (Q1) and the third
quartile (Q3)
I Median Line
I A line inside the box represents the median (the 50th
percentile), which divides the data into two equal halves
I Whiskers
I Whiskers extend from the edges of the box to the minimum
and maximum values within a defined range, typically 1.5
times the IQR from Q1 and Q3
I They show the spread of the data outside the middle 50%
I Outliers
I Data points that fall outside the whiskers are considered
outliers and are often marked with individual points or symbols
34/36
Box plot
Min and Max as the boundary

I Let’s consider an example where we have exam scores for a group of students
I 55, 60, 65, 70, 72, 75, 78, 80, 83, 85, 88, 90, 92, 95, 100
I Summaries
I Minimum: 55
I Q1 (First Quartile): 70
I Median (Q2): 80
I Q3 (Third Quartile): 90
I Maximum: 100

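[Figure: box plot with whiskers at the minimum 55 and maximum 100, box from Q1 = 70 to Q3 = 90, and median line at 80]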

35/36
Box plot
1.5 IQR as the boundary

I 30,50,51,53,53,54,54,58,59,60,61,62,62,64,65,67,68,69,80,90
I Summaries
I Minimum: 30
I Q1 (First Quartile): 53.5
I Median (Q2): 60.5
I Q3 (Third Quartile): 66
I Maximum: 90
I Lower and upper bound
I Interquartile Range (IQR) = Q3 - Q1 = 66 - 53.5 = 12.5
I Lower Bound = 53.5 - 1.5 × 12.5 = 34.75
I Upper Bound = 66 + 1.5 × 12.5 = 84.75
I Outliers: 30 and 90

[Figure: box plot with whiskers limited to 1.5 × IQR beyond the quartiles and outliers drawn as individual points]
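
The box plot summaries can be computed from the listed data with the percentile index rule from the earlier slides (average two neighbours when the index is whole, otherwise round up). This is a sketch under that assumption; the function names and the dictionary keys are illustrative, and plotting libraries often use other quartile rules and so may draw slightly different boxes.

```python
def percentile(values, p):
    """pth percentile (integer p, 0 < p < 100) using the index rule i = p/100 * n."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def box_plot_summary(values, whisker=1.5):
    """Quartiles, 1.5 x IQR bounds, and the points that fall outside the bounds."""
    q1, q2, q3 = (percentile(values, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [x for x in sorted(values) if x < lower or x > upper]
    return {"min": min(values), "q1": q1, "median": q2, "q3": q3,
            "max": max(values), "iqr": iqr, "lower": lower, "upper": upper,
            "outliers": outliers}


exam = [55, 60, 65, 70, 72, 75, 78, 80, 83, 85, 88, 90, 92, 95, 100]
data = [30, 50, 51, 53, 53, 54, 54, 58, 59, 60, 61, 62, 62, 64, 65,
        67, 68, 69, 80, 90]

summary_exam = box_plot_summary(exam)
summary_data = box_plot_summary(data)
print(summary_exam)
print(summary_data)
```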
36/36
