Numerical Descriptive Measures
Summary Definitions
The central tendency (location) is the extent to which the values of a
numerical variable group around a typical or central value. It is the
central value around which the data tend to cluster.
The variation is the amount of dispersion or scattering away from
a central value that the values of a numerical variable show. It
measures the spread of the data.
The shape is the pattern of the distribution of values from the
lowest value to the highest value.
Numerical Descriptive Techniques
Measure of Central Tendency: Mean
Mean (average)
The sum of all the data entries divided by the number of entries.
Sigma notation: Σx = add all of the data entries (x) in the data set.
Population mean: $\mu = \dfrac{\sum x}{N}$
Sample mean: $\bar{x} = \dfrac{\sum x}{n}$
Population mean µ
The population mean is the sum of the values in the population
divided by the population size, N.
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Where μ = population mean
N = population size
Xi = ith value of the variable X
Sample mean
The arithmetic mean (often just called the “mean”)
is the most common measure of central
tendency.
For a sample of size n:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Where X̄ (pronounced "x-bar") = sample mean
n = sample size
Xi = ith value (the observed values)
Measures of Central Tendency : Mean
Advantages of using mean:
• easy to calculate,
• provides good description for data on height, grades etc.
Disadvantages of using mean:
• is sensitive to extreme values.
[Figure: two dot plots on an axis from 11 to 20, illustrating the effect of an extreme value; Mean = 13 in the first and Mean = 14 in the second.]
Measures of Central Tendency : Median
• The median is the value that divides the data into two parts: 50% of the observations
have values less than the median and 50% of the observations have values
greater than the median.
• The median is calculated by placing all the observations in order; the
observation that falls in the middle is the median.
• The location of the median when the values are in numerical order
(smallest to largest):
Median position = (n + 1)/2 position in the ordered data
• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two
middle numbers.
• Note that (n + 1)/2 is not the value of the median, only the position of
the median in the ranked data.
Measures of Central Tendency: Median
In an ordered array, the median is the “middle” number
(50% above, 50% below).
[Figure: two dot plots on an axis from 11 to 20; Median = 13 in both.]
Less sensitive than the mean to extreme values
There are as many values above the median as below it in
the data array.
The sample and population medians are computed in the
same way.
EXAMPLES - Mean and Median
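A sample of 10 adults was asked to report the number of hours they spent on the internet the
previous month. The results are listed here. Calculate the sample mean and median.
0 7 12 5 33 14 8 0 9 22
The sample mean is the sum of the observations divided by the sample size: 110/10 = 11.0.
For the median, first order the observations: 0 0 5 7 8 9 12 14 22 33.
Because n = 10 is even, the median is the average of the fifth and sixth observations (the middle two), which
are 8 and 9, respectively. Thus, the median is 8.5.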
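The calculation in this example can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
from statistics import mean, median

# Hours spent on the internet last month, as reported by the 10 adults.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Sample mean: sum of the entries divided by the number of entries.
sample_mean = mean(hours)

# Sample median: with an even n, the average of the two middle ordered values.
sample_median = median(hours)

print("mean:", sample_mean)
print("median:", sample_median)
```

Note that `median` sorts the data internally, so the observations do not need to be ordered first.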
Measures of Central Tendency : The Mode
Value that occurs most often.
Not affected by extreme values.
Used mainly for nominal data.
There may be no mode.
There may be several modes (for example, bimodal data have two modes).
The sample and population modes are computed in the same
way.
[Figure: two dot plots; in the first (axis from 0 to 14) Mode = 9, and in the second (axis from 0 to 6) there is no mode.]
Copyright © 2017 Pearson Education, Ltd.
Who wins between Mean, Median, and Mode?
Out of the three measures to choose from, which one should we use?
• The mean is generally our first selection. However, there are several
circumstances when the median is better.
• The mode is seldom the best measure of central location.
• One advantage the median holds is that it is not as sensitive to
extreme values as the mean is.
Find the mode for the data in the internet example:
0 7 12 5 33 14 8 0 9 22
All observations except 0 occur once. There are two 0s. Thus, the
mode is 0. As you can see, this is a poor measure of central location. It
is nowhere near the center of the data. Compare this with the mean
(11.0) and the median (8.5), and we can see that the mean and median are
superior measures here.
Activity
The prices (in dollars) for a sample of roundtrip flights from Chicago, Illinois to
Cancun, Mexico are listed. What is the mean, median, mode price of the
flights?
1872 432 397 427 388 482 397 358 432
Which measure of central tendency best describes these data?
Ordered data: 358, 388, 397, 397, 427, 432, 432, 482, 1872
Mean = 5185/9 = 576.11
Median = value in the 5th position = 427
Mode = 397 and 432 (each occurs twice)
Because of the extreme value (1872), the median is the most appropriate measure.
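The three measures for the flight prices can be reproduced as follows. This is a sketch using Python's standard `statistics` module (`multimode` requires Python 3.8 or later); the variable names are illustrative.

```python
from statistics import mean, median, multimode

# Roundtrip flight prices (in dollars) from the activity.
prices = [1872, 432, 397, 427, 388, 482, 397, 358, 432]

mean_price = mean(prices)      # pulled upward by the extreme value 1872
median_price = median(prices)  # middle (5th) value of the 9 ordered prices
modes = multimode(prices)      # every value tied for the highest frequency

print("mean:", round(mean_price, 2))
print("median:", median_price)
print("modes:", sorted(modes))
```

Comparing the mean with the median shows how strongly a single extreme price shifts the mean while leaving the median unchanged.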
Summary
• Compute the Mean to
Describe the central location of a single set of interval data.
• Compute the Median to
Describe the central location of a single set of interval or ordinal data
(with extreme observations)
• Compute the Mode to
Describe a single set of nominal or interval data
Dispersion and Variation
Why Study Dispersion?
– A measure of location, such as the mean or the median,
only describes the center of the data. It is valuable from
that standpoint, but it does not tell us anything about the
spread of the data.
– For example, if your nature guide told you that the river
ahead averaged 3 feet in depth, would you want to
wade across on foot without additional information?
Probably not. You would want to know something about
the variation in the depth.
– A second reason for studying the dispersion in a set of
data is to compare the spread in two or more
distributions.
Measures of Variation
Measures of variation covered here: Range, Variance, Standard Deviation, Coefficient of Variation.
Measures of variation give information on the spread or variability of
the data values, which measures of location fail to provide.
[Figure: two distributions with the same centre but different variation.]
Measures of Variation: The Range
Simplest measure of variation.
Difference between the largest and the smallest
values:
Range = Xlargest – Xsmallest
Example (values plotted on an axis from 0 to 14; smallest value 2, largest value 14):
Range = 14 - 2 = 12
Potential problem with Range?
Once again let us think about the following example on grades.
Grades of course 1: {4, 4, 4, 4, 50}.
Grades of course 2: {4, 8, 15, 24, 39, 50}.
The range is 46 in both courses (50 - 4), but the two courses have very
different distributions.
• The major advantage of the range is the ease with which it can be computed.
• Its major shortcoming is its failure to provide information on the
dispersion of the observations between the two end points.
• Hence we need a measure of variability that incorporates all the
data and not just two observations.
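The grades example can be checked numerically. The sketch below computes the range for both courses and, as one possible measure that uses every observation, the standard deviation defined later in these notes. Treating each course as a full population (`pstdev`) is an assumption made here for illustration.

```python
from statistics import pstdev

course_1 = [4, 4, 4, 4, 50]
course_2 = [4, 8, 15, 24, 39, 50]

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

range_1 = value_range(course_1)
range_2 = value_range(course_2)

# Population standard deviation uses all observations, not only the two extremes.
sd_1 = pstdev(course_1)
sd_2 = pstdev(course_2)

print("ranges:", range_1, range_2)
print("standard deviations:", round(sd_1, 2), round(sd_2, 2))
```

The two ranges coincide even though the standard deviations differ, which is the shortcoming described above.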
Deviation, Variance, and Standard Deviation
Deviation
The difference between the data entry, x, and the mean of the
data set.
It gives a rough estimate of the typical distance of a data value
from the mean.
Population data set:
Deviation of x = x – μ
Sample data set:
Deviation of x = x – x̄
Numerical Descriptive Measures for a Population:
Variance σ²
Average of squared deviations of values from the mean.
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Where
μ = population mean, N = population size
Xi = ith value of the variable X
Numerical Descriptive Measures for a Population: Standard
Deviation σ
Most commonly used measure of variation.
Shows average variation about the mean.
Is the square root of the population variance.
Has the same units as the original data.
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Measures of Variation: Sample Variance
Average (approximately) of squared deviations of values from the
mean.
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Where
X̄ = arithmetic mean
n = sample size
Xi = ith value of the variable X
Measures of Variation: Sample Standard Deviation
Most commonly used measure of variation.
Shows average variation about the mean.
Is the square root of the variance.
Has the same units as the original data.
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
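The population and sample formulas differ only in the divisor (N versus n - 1). The sketch below implements both directly from the definitions and applies them to the internet-hours data from the earlier example. Treating the same ten values as both a sample and a population is done purely to show the effect of the divisor; the function name and its `sample` flag are illustrative.

```python
from math import sqrt

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by
    n - 1 (sample variance) or by N (population variance)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return squared_deviations / divisor

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

sample_variance = variance(hours)                    # divides by n - 1
sample_sd = sqrt(sample_variance)                    # same units as the data
population_variance = variance(hours, sample=False)  # divides by N
population_sd = sqrt(population_variance)

print("sample:", round(sample_variance, 2), round(sample_sd, 2))
print("population:", round(population_variance, 2), round(population_sd, 2))
```

Python's standard `statistics` module offers `variance`, `stdev`, `pvariance`, and `pstdev` for the same calculations.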
Interpreting Standard Deviation
Standard deviation is a measure of the typical amount an entry
deviates from the mean.
The more the entries are spread out, the greater the
standard deviation.
Measures of Variation: Comparing Standard Deviations
[Figure: two data sets, one with a smaller standard deviation and one with a larger standard deviation.]
Measure of Variability: Standard Deviation -
Interpretation
• Suppose that the mean and standard deviation of last year’s midterm test
marks are 70 and 5, respectively.
• What can you say about the distribution of grades if the histogram is bell-shaped?
• By the Empirical Rule, approximately 68% of the marks fell between 65 and 75
(within one standard deviation of the mean), approximately 95% fell between 60 and 80
(within two), and approximately 99.7% fell between 55 and 85 (within three).
• What can you say about the distribution of grades if the shape of the
histogram is not known?
• By Chebyshev’s theorem, at least 1 − 1/k² of the observations lie within k standard
deviations of the mean for any k > 1, whatever the shape. So at least 75%
of the marks fell between 60 and 80 (k = 2), and at least 88.9% of the marks fell
between 55 and 85 (k = 3).
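The intervals and percentages in this example follow from the mean, the standard deviation, and the number of standard deviations k. The sketch below reproduces them; the function names are illustrative, and the bound 1 − 1/k² is the standard Chebyshev bound for k > 1.

```python
def interval(mean, sd, k):
    """Interval from k standard deviations below the mean to k above it."""
    return (mean - k * sd, mean + k * sd)

def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations
    of the mean, for any distribution shape (requires k > 1)."""
    return 1 - 1 / k ** 2

mean_mark, sd_mark = 70, 5

# Approximate proportions for a bell-shaped histogram (Empirical Rule).
empirical = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (1, 2, 3):
    low, high = interval(mean_mark, sd_mark, k)
    print(f"k={k}: {low} to {high}, bell-shaped: about {empirical[k]:.1%}")

for k in (2, 3):
    print(f"k={k}: any shape: at least {chebyshev_lower_bound(k):.1%}")
```

The Chebyshev bound is always weaker than the Empirical Rule figure because it makes no assumption about the shape of the distribution.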
The Coefficient of Variation (CV)
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Is the standard deviation divided by the mean, multiplied by 100%
Comparing Coefficients of Variation
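Stock A:
Average price last year = $50
Standard deviation = $5
$$CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$
Stock B:
Average price last year = $100
Standard deviation = $5
$$CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$
Both stocks have the same standard deviation, but Stock B is less variable relative to its mean price.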
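The comparison of the two stocks can be sketched in a few lines. The function name is illustrative; the inputs are the standard deviation and mean price used in the CV calculations for Stock A ($5, $50) and Stock B ($5, $100).

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_a = coefficient_of_variation(sd=5, mean=50)   # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)  # Stock B

print(f"CV of stock A: {cv_a:.0f}%")
print(f"CV of stock B: {cv_b:.0f}%")
```

Because the CV is unit free, it can also compare variability between data sets measured in different units.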
Measures of Variation:
Summary Characteristics
The more the data are spread out, the greater the range, variance,
and standard deviation.
The more the data are concentrated, the smaller the range, variance,
and standard deviation.
If the values are all the same (no variation), all these measures will be
zero.
None of these measures are ever negative.
These measures of variability apply to interval data; for ordinal data, use the
interquartile range (IQR).
Measure of Relative Standing
• Measures of relative standing are designed to
provide information about the position of particular
values relative to the entire data set.
• Percentile: the Pth percentile is the value for which P% of the
observations are less than that value and (100 − P)% of the
observations are greater than that value.
• Suppose you scored in the 60th percentile on some
exam. That means 60% of the other scores were
below yours, while 40% of the scores were above yours.
Quartile Measures
Quartiles describe the spread of values by dividing the ordered distribution
into four groups of roughly equal size.
Three quartiles mark the boundaries between these groups:
First quartile, Q1: About one quarter of the data fall on or below Q1.
Second quartile, Q2: About one half of the data fall on or below Q2
(median).
Third quartile, Q3: About three quarters of the data fall on or below
Q3.
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
Quartiles are used to calculate the interquartile range, which is a
measure of variability around the median.
Quartile Measures:
Locating Quartiles
Find a quartile by determining the value in the appropriate position
in the ranked data, where:
First quartile position: Q1 = (n+1)/4 ranked value.
Second quartile position: Q2 = (n+1)/2 ranked value or Median.
Third quartile position: Q3 = 3(n+1)/4 ranked value.
where n is the number of observed values.
The number of nuclear power plants in the top 15 nuclear power-producing
countries in the world are listed. Find the first, second, and third quartiles of
the data set.
7 18 11 6 59 17 18 54 104 20 31 8 10 15 19
Solution:
• Order the data from smallest to largest (n = 15):
6 7 8 10 11 15 17 18 18 19 20 31 54 59 104
• Q1 is at position (15 + 1)/4 = 4, so Q1 = 10.
• Q2 is at position 2(15 + 1)/4 = 8, so Q2 = 18. Q2 divides the data set into a lower half and an upper half.
• Q3 is at position 3(15 + 1)/4 = 12, so Q3 = 31.
Q1 tells us that about 25% of the countries have 10 or fewer nuclear plants, Q2
tells us that about 50% have 18 or fewer, and Q3 reveals that about 75%
have 31 or fewer plants.
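The position rule for quartiles can be sketched as follows. The helper name is illustrative, and the linear interpolation used for fractional positions is an assumption of this sketch (the example in these notes only needs whole-number positions). The helper expects positions between 1 and n.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in ordered data.
    Fractional positions are linearly interpolated (an assumption here)."""
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]
ordered = sorted(plants)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)      # 4th ranked value
q2 = value_at_position(ordered, (n + 1) / 2)      # 8th ranked value (median)
q3 = value_at_position(ordered, 3 * (n + 1) / 4)  # 12th ranked value

print("Q1, Q2, Q3:", q1, q2, q3)
```

Other textbooks and software use slightly different position rules, so quartiles for small data sets may differ between tools.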
Measure of Relative Standing: Commonly
used Percentiles
Measure of Relative
Standing: Location of
Percentiles
Interquartile Range(IQR)
Measures the range of the middle 50% of the data that shows how
spread out the data is.
The difference between the third and first quartiles.
IQR = Q3 – Q1
Large values of this statistic mean that the 1st and 3rd quartiles are
far apart indicating a high level of variability.
Find the interquartile range of the data set. Recall Q1 = 10, Q2 = 18,
and Q3 = 31
Solution:
• IQR = Q3 – Q1 = 31 – 10 = 21
The numbers of power plants in the middle 50% of the data set differ by at
most 21.
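As a cross-check, the same IQR can be obtained from Python's standard library. This sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the quartiles at the (n + 1)/4, 2(n + 1)/4 and 3(n + 1)/4 positions used in these notes.

```python
from statistics import quantiles

plants = [7, 18, 11, 6, 59, 17, 18, 54, 104, 20, 31, 8, 10, 15, 19]

# n=4 requests the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(plants, n=4, method="exclusive")

iqr = q3 - q1  # spread of the middle 50% of the data

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so the extreme value 104 does not affect it.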
Describing Relationship between Two
Variables
One graphical technique we use to show the
relationship between 2 variables is called a scatter
diagram.
To draw a scatter diagram we need two variables. We
scale one variable along the horizontal axis (X-axis)
of a graph and the other variable along the vertical
axis (Y-axis).
Describing Relationship between Two
Variables – Scatter Diagram Examples
Two Measures of the Relationship
Between Two Numerical Variables
Scatter plots allow you to visually examine the
relationship between two numerical variables.
We now discuss two quantitative measures of such
relationships:
The Covariance
The Coefficient of Correlation
The Covariance
The covariance measures how two numerical variables (X and Y) vary
together linearly; its sign gives the direction of the linear relationship.
Numerical Illustration
Interpreting Covariance
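Covariance between two variables:
When the two variables tend to move in the same direction, the covariance is positive.
When they tend to move in opposite directions, the covariance is negative.
When there is no particular pattern, the covariance is close to zero.
The covariance has a major flaw:
It is not possible to determine the relative strength of the
relationship from the size of the covariance, because its size depends on the units of X and Y.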
Coefficient of Correlation
Measures the relative strength of the linear
relationship between two numerical variables
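Sample coefficient of correlation:
$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$
where
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$$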
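The sample covariance and coefficient of correlation can be computed directly from their definitions. In the sketch below, the function names and the small data set are hypothetical and chosen only for illustration (Y is exactly twice X, a perfect positive linear relationship); they do not come from these notes.

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation; equals the square root of cov(X, X)."""
    return sqrt(sample_covariance(values, values))

def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = sample_covariance(x, y)
r = correlation(x, y)

print("covariance:", cov_xy)
print("correlation:", round(r, 4))
```

Rescaling Y (for example, changing its units) changes the covariance but leaves r unchanged, which is why r is used to judge relative strength.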
Features of the
Coefficient of Correlation
The population coefficient of correlation is referred to as ρ.
The sample coefficient of correlation is referred to as r.
Both ρ and r have the following features:
Unit free
Range between –1 and 1
The closer to –1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship (0 indicates no linear relationship)
Scatter Plots of Sample Data with
Various Coefficients of Correlation
Coefficient of Correlation
Because we’ve already calculated the covariances we need to compute only the standard deviations of X
and Y.
For Set 1: Strong positive linear relationship
For Set 2: Strong negative linear relationship
For Set 3: Weak negative linear relationship