03_WEEK2_Statistics_Part2 (2)

The document covers statistical measures of location, including mean, median, and mode, as well as their calculations for both sample and population data. It also introduces percentiles, quartiles, and exploratory data analysis techniques such as the five-number summary and box plots. Examples are provided to illustrate the computation of these statistics using apartment rent data.

Uploaded by

Alma Cseh
Copyright: © All Rights Reserved

Statistics

Week 2
Part 2

Unit 8: Measures of Location

Unit 9: Percentiles and Exploratory Data Analysis
Unit 8 Measures of Location

■ Mean
■ Weighted Mean
■ Median
■ Mode
Measures of Location

■ Measures of location indicate the numerical values at which certain characteristic points of the distribution are located.
■ If the measures are computed using data from a sample, they are called sample statistics.
■ If the measures are computed using data for a population, they are called population parameters.
■ A sample statistic is referred to as the point estimator of the corresponding population parameter.
Mean

■ The mean of a data set is the measure commonly referred to as the average of all the data values.
■ The sample mean ȳ is the point estimator of the population mean μ.
Sample Mean ȳ

$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$

Numerator: sum of the values of the n observations
Denominator: n, the number of observations in the sample
Population Mean μ

$$\mu = \frac{\sum_{i=1}^{N} y_i}{N}$$

Numerator: sum of the values of the N observations
Denominator: N, the number of observations in the population
Sample Mean

■ Example: Apartment Rents

Seventy apartments were randomly sampled in a small university town. The monthly rents for these apartments are listed in ascending order on the next slide.
Sample Mean

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Sample Mean

$$\bar{y} = \frac{\sum y_i}{n} = \frac{34{,}356}{70} = 490.80$$
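
As a quick check of the arithmetic above, the sample mean can be recomputed directly from the 70 listed rents. This is a minimal Python sketch; the variable names (`rents`, `total`, `sample_mean`) are illustrative and do not come from the slides.

```python
# Monthly rents (in euros) for the 70 sampled apartments, copied from the slide.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

n = len(rents)            # number of observations in the sample
total = sum(rents)        # sum of the values of the n observations
sample_mean = total / n   # sample mean: total divided by n

print(n, total, round(sample_mean, 2))  # 70 34356 490.8
```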

The Weighted Mean and
Working with Grouped Data
■ Weighted Mean
■ Mean for Grouped Data
Weighted Mean

■ When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.
■ When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.
■ For example, in the computation of a grade point average (GPA) in U.S. universities, the weights are the number of credit hours earned for each grade.
Weighted Mean

$$\bar{y} = \frac{\sum_i W_i y_i}{\sum_i W_i} = \sum_i w_i y_i$$

where:
y_i = value of observation i
W_i = weight for observation i
w_i = relative weight for observation i, with $w_i = \frac{W_i}{\sum_j W_j}$
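
The two equivalent forms of the weighted mean (absolute weights W_i and relative weights w_i) can be sketched in Python using the GPA case mentioned earlier. The grade points and credit hours below are made-up illustration values, not data from the slides, and the function name `weighted_mean` is likewise an assumption.

```python
def weighted_mean(values, weights):
    """Return sum(W_i * y_i) / sum(W_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * y for w, y in zip(weights, values)) / total_weight


# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)

# Same result via relative weights w_i = W_i / sum_j W_j.
relative = [w / sum(credit_hours) for w in credit_hours]
gpa_relative = sum(w * y for w, y in zip(relative, grade_points))

print(round(gpa, 3), round(gpa_relative, 3))  # 3.111 3.111
```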

Grouped Data

■ The weighted mean computation can be used to obtain approximations of the mean, variance, and standard deviation for grouped data.
■ To compute the weighted mean, we treat the midpoint of each class as though it were the mean of all items in the class.
■ We compute a weighted mean of the class midpoints using the class frequencies as weights.
■ Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.
Mean for Grouped Data

■ Sample Data

$$\bar{y} \approx \frac{\sum_{i=1}^{k} f_i M_i}{n}$$

■ Population Data

$$\mu \approx \frac{\sum_{i=1}^{k} f_i M_i}{N}$$

where:
k = number of classes
f_i = frequency of class i
M_i = midpoint of class i
Sample Mean for Grouped Data

Given below is the previous sample of monthly rents for 70 apartments, presented here as grouped data in the form of a frequency distribution.

Rent (€) Frequency
420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6
Sample Mean for Grouped Data

Rent (€)   f_i   M_i   f_i M_i
420-439      8   430     3440
440-459     17   450     7650
460-479     12   470     5640
480-499      8   490     3920
500-519      7   510     3570
520-539      4   530     2120
540-559      2   550     1100
560-579      4   570     2280
580-599      2   590     1180
600-619      6   610     3660
Total       70          34560

$$\bar{y} \approx \frac{34{,}560}{70} = 493.71$$

This approximation differs by €2.91 from the actual sample mean of €490.80.
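
The grouped-data calculation in the table can be reproduced with a short Python sketch. The midpoints and frequencies are the ones shown in the table; the variable names are illustrative assumptions.

```python
# (class midpoint M_i, class frequency f_i) pairs taken from the table.
grouped = [
    (430, 8), (450, 17), (470, 12), (490, 8), (510, 7),
    (530, 4), (550, 2), (570, 4), (590, 2), (610, 6),
]

n = sum(f for _, f in grouped)                 # total number of observations
weighted_sum = sum(f * m for m, f in grouped)  # sum of f_i * M_i
grouped_mean = weighted_sum / n                # approximate sample mean

actual_mean = 490.80  # sample mean computed from the ungrouped data
print(n, weighted_sum, round(grouped_mean, 2))  # 70 34560 493.71
print(round(grouped_mean - actual_mean, 2))     # 2.91
```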
Median

■ The median of a data set is the value in the middle when the data items are arranged in ascending order.
■ Whenever a data set has extreme values, the median is the preferred measure of central location.
■ The median is the measure of location most often reported for annual income and property value data.
■ A few extremely large incomes or property values can inflate the mean.
Median

■ For an odd number of observations:

• Position of the median: i = (n+1)/2

• Value of the median: Me = y_i

Example with 7 observations: 26 18 27 12 14 27 19

In ascending order: 12 14 18 19 26 27 27

Here i = (7+1)/2 = 4, so the median is the middle (4th) value.

Median = 19
Median

■ For an even number of observations:

• Position of the median: i = (n+1)/2

• Value of the median: Me = (y_{i-0.5} + y_{i+0.5})/2

Example with 8 observations: 26 18 27 12 14 27 30 19

In ascending order: 12 14 18 19 26 27 27 30

Here i = (8+1)/2 = 4.5, so the median is the average of the middle two (4th and 5th) values.

Median = (19 + 26)/2 = 22.5


Median

For the n = 70 apartment rents, i = (70+1)/2 = 35.5, so the median is the average of the 35th and 36th data values:

Median = (475 + 475)/2 = 475
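
The odd and even cases above can be captured in one small Python function. This is a sketch under the position rule i = (n+1)/2 given on the slides; the function name `median` is illustrative.

```python
def median(values):
    """Median: middle value for odd n, average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n+1)/2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: average the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(median([26, 18, 27, 12, 14, 27, 19]))      # 19
print(median([26, 18, 27, 12, 14, 27, 30, 19]))  # 22.5
print(median(rents))                             # 475.0
```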
Mode

■ The mode of a data set is the value that occurs with greatest frequency.
■ The greatest frequency can occur at two or more different values.
■ If the data have exactly two modes, the data are bimodal.
■ If the data have more than two modes, the data are multimodal.
Mode

450 occurred most frequently (7 times)


Mode = 450
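
The mode can be found by counting how often each value occurs. The sketch below returns every value that shares the greatest frequency, so it also covers the bimodal and multimodal cases; the function name `modes` is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return (sorted list of modes, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(modes(rents))         # ([450], 7)
print(modes([1, 1, 2, 2]))  # ([1, 2], 2), a bimodal example
```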

End of Unit 8
Unit 9 Percentiles and
Exploratory Data Analysis
■ Percentiles
■ Five-Number Summary
■ Box Plot
Percentiles

■ Percentiles (or quantiles) of a data set are cut-off values that separate the lower p% of the data from the upper (100-p)%.
■ The pth percentile, Q_{p%}, is a value such that at least p% of the data items take on a value less than or equal to Q_{p%} and at least (100 - p)% of the data items take on a value greater than or equal to Q_{p%}.
■ A well-chosen set of percentiles provides information about how the data are spread over the interval from the smallest to the largest value.
■ Admission test scores for colleges and universities are frequently reported in terms of percentiles.
Percentiles

Arrange the data in ascending order.

Compute index i, the position of the pth percentile.

i = (p/100)n

If i is not an integer, round it up. The pth percentile is the value in position ⌈i⌉.

$$Q_{p\%} = y_{\lceil i \rceil}$$

If i is an integer, the pth percentile is the average of the values in positions i and i+1.

$$Q_{p\%} = \frac{y_i + y_{i+1}}{2}$$
90th Percentile

i = (p/100)n = (90/100)×70 = 63
Averaging the 63rd and 64th data values:
90th Percentile = (580 + 590)/2 = 585
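
The percentile rule from the previous slide can be sketched in Python as follows. The function name `percentile` is illustrative, and the sketch assumes p is a whole number strictly between 0 and 100 so that the integer test can use exact arithmetic. Statistical libraries often use other conventions (for example, linear interpolation), so their results may differ from this rule.

```python
def percentile(values, p):
    """pth percentile using the slide rule: i = (p/100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    # q is the integer part of i; r == 0 exactly when i is an integer.
    q, r = divmod(p * len(ordered), 100)
    if r != 0:
        # i is not an integer: round up to 1-based position q + 1 (index q).
        return ordered[q]
    # i is an integer: average 1-based positions i and i + 1.
    return (ordered[q - 1] + ordered[q]) / 2


rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(percentile(rents, 90))  # 585.0 (i = 63, average of 63rd and 64th values)
```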

90th Percentile

“At least 90% of the items take on a value of 585 or less.” 63/70 = 0.9 or 90%

“At least 10% of the items take on a value of 585 or more.” 7/70 = 0.1 or 10%

Quartiles

■ Quartiles are specific percentiles.


■ First Quartile = 25th Percentile
■ Second Quartile = 50th Percentile = Median
■ Third Quartile = 75th Percentile
Third Quartile

Third quartile = 75th percentile

i = (p/100)n = (75/100)×70 = 52.5, rounded up to 53
Third quartile = value in position 53 = 525

Exploratory Data Analysis

■ Five-Number Summary
■ Box Plot
Five-Number Summary

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value
Five-Number Summary

Smallest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
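
The five-number summary for the rent data can be reproduced with the percentile rule from Unit 9. This is a sketch only; the helper names `percentile` and `five_number_summary` are illustrative, and p is assumed to be a whole number strictly between 0 and 100.

```python
def percentile(values, p):
    """pth percentile using the slide rule: i = (p/100) * n."""
    ordered = sorted(values)
    q, r = divmod(p * len(ordered), 100)
    if r != 0:
        return ordered[q]                     # i not an integer: round up
    return (ordered[q - 1] + ordered[q]) / 2  # i an integer: average i and i+1


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )


rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(five_number_summary(rents))  # (425, 445, 475.0, 525, 615)
```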
Box Plot

■ A box is drawn with its ends located at the first and third quartiles.
■ A vertical line is drawn in the box at the location of the median (= second quartile).

[Figure: box plot on an axis from 375 to 625, with the box running from Q1 = 445 to Q3 = 525 and a vertical line at Q2 = 475.]
Box Plot

■ The interquartile range (IQR) is calculated as the difference between the third and the first quartiles.

IQR = Q3 - Q1 = 525 - 445 = 80

■ Limits are located (not drawn) using the interquartile range (IQR).
■ The normal range is defined as the interval between the lower and the upper limits.
■ Data outside these limits are considered outliers.
Box Plot

■ The lower limit is located 1.5×IQR below Q1.

Lower Limit: Q1 - 1.5×IQR = 445 - 1.5×80 = 325

■ The upper limit is located 1.5×IQR above Q3.

Upper Limit: Q3 + 1.5×IQR = 525 + 1.5×80 = 645

■ The normal range is the interval between the lower and the upper limits: [325, 645].
■ There are no outliers (values outside the normal range) in the apartment rent data.
Box Plot

■ Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values within the normal range.
■ The location of each outlier is shown by a suitable symbol, e.g. an asterisk (*).

[Figure: box plot on an axis from 375 to 625, with whiskers extending to the smallest value in the normal range (425) and the largest value in the normal range (615).]
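
The box plot quantities (IQR, limits, outliers, and whisker ends) can be checked numerically. The sketch below takes Q1 = 445 and Q3 = 525 from the five-number summary on the slides instead of recomputing them; all variable names are illustrative.

```python
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

q1, q3 = 445, 525  # quartiles as stated in the five-number summary

iqr = q3 - q1                   # interquartile range
lower_limit = q1 - 1.5 * iqr    # 1.5 x IQR below Q1
upper_limit = q3 + 1.5 * iqr    # 1.5 x IQR above Q3

# Values outside the normal range [lower_limit, upper_limit] are outliers.
outliers = [y for y in rents if y < lower_limit or y > upper_limit]

# Whiskers end at the smallest and largest values inside the normal range.
in_range = [y for y in rents if lower_limit <= y <= upper_limit]
whisker_low, whisker_high = min(in_range), max(in_range)

print(iqr, lower_limit, upper_limit)        # 80 325.0 645.0
print(outliers, whisker_low, whisker_high)  # [] 425 615
```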
End of Unit 9
