Chapter 3
Measures of Dispersion
Notes:
Measures of dispersion (also called measures of variability) are statistical tools used to describe
how spread out or scattered the values in a dataset are. They provide insight into the degree of
variation or inconsistency in the data.
Measures of dispersion:
1. Range:
- Definition: The difference between the maximum and minimum values in a dataset.
- Formula: Range = Maximum - Minimum
- Use: Simple and gives a quick sense of how spread out the data is. However, it is sensitive to
outliers.
2. Interquartile Range (IQR):
- Definition: The range of the middle 50% of the data, or the difference between the 75th
percentile (Q3) and the 25th percentile (Q1).
- Formula: IQR = Q3 - Q1
- Use: More resilient than the range because it focuses on the middle of the data and is less
influenced by outliers.
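As a quick sketch (standard library only, with made-up numbers), the range and IQR can be computed like this:

```python
# Range and IQR for a small numeric sample (illustrative data).
import statistics

data = [4, 8, 15, 16, 23, 42]

value_range = max(data) - min(data)   # Range = Maximum - Minimum

# statistics.quantiles with n=4 returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # IQR = Q3 - Q1

print("Range:", value_range)          # 38
print("IQR:", iqr)
```

Note that different quartile conventions (here, the `statistics` module's default "exclusive" method) can give slightly different IQR values for small samples.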
3. Variance:
- Definition: The average of the squared differences between each data point and the mean.
- Formula: σ² = Σ(xi − μ)² / N
- Use: Measures how much the data points deviate from the mean but in squared units, making
it less intuitive for interpretation.
4. Standard Deviation (SD):
- Definition: The square root of the variance. It indicates the average amount by which the
data points differ from the mean.
- Formula: SD = √[Σ(xi − μ)² / N]
- Use: Commonly used because it is in the same units as the original data and provides a clear
measure of spread, especially for normally distributed data.
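A minimal sketch of the variance and SD definitions above, cross-checked against the standard library (the data are made up):

```python
# Population variance (average squared deviation) and SD "by hand".
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)                               # 5.0

variance = sum((x - mean) ** 2 for x in data) / len(data)  # squared units
sd = math.sqrt(variance)                                   # back to original units

assert variance == statistics.pvariance(data)              # 4.0
assert sd == statistics.pstdev(data)                       # 2.0
```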
5. Mean Absolute Deviation (MAD):
- Definition: The average of the absolute differences between each data point and the mean.
- Formula: MAD = Σ|xi − μ| / N
- Use: Shows the average distance of each data point from the mean, less sensitive to extreme
values compared to variance and SD.
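The mean absolute deviation follows directly from the definition (same made-up data as could be used for the variance):

```python
# Mean Absolute Deviation: average distance of each point from the mean.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)                         # 5.0
mad = sum(abs(x - mean) for x in data) / len(data)   # 1.5
```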
6. Coefficient of Variation (CV):
- Definition: The ratio of the standard deviation to the mean, expressed as a percentage.
- Formula: CV = (SD / mean) × 100%
- Use: Useful for comparing variability between datasets with different units or scales.
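Since the CV divides the SD by the mean, it is unit-free, which is what makes cross-dataset comparison possible. A sketch with illustrative data:

```python
# Coefficient of Variation: SD as a percentage of the mean (unit-free).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)                       # 5.0
cv = statistics.pstdev(data) / mean * 100          # 40.0 (%)
```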
7. Index of Qualitative Variation (IQV):
- Definition: A measure used to assess the dispersion of nominal (categorical) data.
- Formula: IQV = K(N² − ∑nj²) / [N²(K − 1)]
- Use: It indicates how evenly distributed observations are across categories in a nominal
dataset.
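A sketch of the IQV computation, using the standard formula IQV = K(N² − ∑nj²) / [N²(K − 1)] on made-up category counts:

```python
# Index of Qualitative Variation for nominal (categorical) counts.
counts = {"A": 5, "B": 3, "C": 2}   # nj = number of cases in each category
N = sum(counts.values())            # total number of cases (10)
K = len(counts)                     # number of categories (3)

iqv = K * (N**2 - sum(n**2 for n in counts.values())) / (N**2 * (K - 1))
# 0 = all cases in one category; 1 = cases spread evenly across categories
```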
8. Inter-Decile Range (IDR):
- Definition: The range between the 90th percentile and the 10th percentile.
- Formula: IDR = P90 − P10
- Use: Like IQR, it focuses on the central portion of the data but covers a broader range (80%
of the data).
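A sketch of the IDR, using the standard library's decile cut points (illustrative data; decile conventions vary slightly between implementations):

```python
# Interdecile Range: spread of the middle 80% of the data.
import statistics

data = list(range(1, 21))                    # 1..20, made up for illustration
deciles = statistics.quantiles(data, n=10)   # nine cut points D1..D9
idr = deciles[8] - deciles[0]                # D9 - D1, i.e. P90 - P10
```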
9. Entropy:
- Definition: A measure of uncertainty or disorder, often used for categorical or nominal data.
- Formula: H = −∑ pi [log2(pi)]
- Use: Captures the unpredictability in a dataset, especially for nominal data.
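A sketch of Shannon entropy on nominal counts (the categories and counts are made up). An even spread gives the maximum entropy log2(K); concentration in one category drives entropy toward 0:

```python
# Shannon entropy H = -sum(pi * log2(pi)) for categorical counts.
import math

def shannon_entropy(counts):
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

even = {"A": 1, "B": 1, "C": 1, "D": 1}    # maximally spread: H = log2(4) = 2 bits
skewed = {"A": 7, "B": 1}                  # concentrated: H well below 1 bit
```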
Review Questions:
1. When might we prefer to use an entropy measure of dispersion rather than an IQV? Rather
than a standard deviation?
Entropy (like Shannon entropy) measures uncertainty or disorder in categorical data,
where there may be many categories with varying probabilities. Use entropy when you
want to capture the unpredictability or diversity in a dataset, especially for nominal data.
Index of Qualitative Variation (IQV) is also designed for nominal data, but it gauges
dispersion relative to the maximum possible when cases are spread evenly across
categories, a benchmark entropy does not require.
Standard deviation is appropriate for interval and ratio data, where numerical distances
between values matter. It would not be suited for nominal data because nominal variables
lack meaningful distances between categories.
2. In the formula IQV = K(N² − ∑nj²) / [N²(K − 1)], what do N, nj, and K stand for? What
does this formula give us?
In this formula, N is the total number of cases, nj is the number of cases in category j, and K
is the number of categories. The formula for the Index of Qualitative Variation (IQV)
essentially tells us how spread out or varied cases are across different categories. The IQV
ranges from 0 to 1:
0 means no variation (all cases fall into one category).
1 means maximum variation (cases are evenly spread across all categories).
The IQV is a rescaled version of the Index of Diversity, where the denominator is set to ensure
this 0-1 range. This makes it easier to interpret as a measure of diversity or variation.
In short, the formula gives a straightforward way to quantify how diverse the distribution of
cases is across categories, where higher values indicate greater diversity.
3. Suppose we see the formula −∑ pi [log2(pi)]. What do pi and log2 stand for? If we calculate
this quantity, what will it tell us?
Here, pi stands for the proportion of cases falling in category i, and log2 is the logarithm
to base 2. The formula calculates the Shannon entropy, which measures the uncertainty or disorder
in a dataset. High entropy means the data is more spread out across different categories,
making it unpredictable. On the other hand, low entropy indicates that the data is more
concentrated in fewer categories, making it more predictable.
4. Suppose someone said to you that there is no measure of dispersion for nominal (or ordinal)
variables because dispersion is meaningless when we cannot tell how far apart categories are.
What might you say in reply?
We can reply that there are measures of dispersion for nominal and ordinal variables,
even without knowing the distances between categories. For nominal data, measures like
the Index of Diversity, Index of Qualitative Variation (IQV), and Entropy show how
spread out the data is across categories. For ordinal data, while precise distances can't be
specified, we can use the Interquartile Range (IQR), Interdecile Range (IDR), and
Median Absolute Deviation (MAD) to show how much of the data falls between
specific points. The IQR and IDR are commonly used, representing the range between the
middle 50% and 80% of the data, respectively.
Explanation:
We might reply that there are, in fact, measures that do not require knowing the distances
between categories, but still give valuable information about how spread out or concentrated
the data is across these categories. For example, dispersion in nominal variables can be
measured through the Index of Diversity, which tells us how likely it is that two cases,
drawn at random, will come from different categories, Index of Qualitative Variation
(IQV), or through Entropy (a measure of the absolute extent of diversity that is present).
With ordinal data, although we cannot specify precise distances between categories, we can
say how much of the sample lies between particular values. For these purposes, we could use
the Interquartile Range (IQR), the Interdecile Range (IDR), and the Median Absolute
Deviation (MAD). The most widely used measures for this purpose are the Interquartile and
the Interdecile Ranges. Respectively, these give us the range between the upper and lower
quartiles, and between the upper and lower deciles. Quartiles, unsurprisingly, are points that
divide an ordered distribution into quarters. Deciles are points dividing an ordered set of
cases into tenths.
5. In what way are the IQV and the entropy measure complementary?
Unlike the IQV, the entropy measure does not assess dispersion in relation to a maximum.
Instead, it measures the absolute extent of diversity present. Because of this difference,
the two are complementary and can be used together to highlight different aspects of the
data. Although entropy is calculated differently from the IQV or Index of Diversity, all
three measure the dispersion of nominal variables, and entropy is often well correlated
with the others.
Explanation:
The Index of Qualitative Variation (IQV) and the entropy measure are complementary because
they each highlight different aspects of diversity in categorical data.
- IQV gives us the amount of dispersion relative to the maximum possible variation.
It tells us how close a distribution is to being perfectly diverse or homogenous.
- Entropy measures the absolute extent of diversity without comparing it to a
theoretical maximum. It focuses on how much uncertainty or unpredictability exists within the
distribution.
Because the IQV focuses on relative dispersion and entropy measures absolute diversity, using
both together provides a fuller picture of how diverse or varied the data is. While they are
calculated differently, they often correlate well and can reinforce each other in showing the
degree of variation present.
In other words, IQV is about how far a distribution is from being maximally diverse, while
entropy quantifies the diversity in absolute terms, making them useful together for a more
complete analysis.
6. What measures of dispersion are commonly suggested for ordinal variables? Why, for truly
ordinal variables, may it be safer just to report key percentiles?
With ordinal data, while exact distances between categories cannot be measured, we can
still describe how much of the sample falls between certain values. The Interquartile
Range (IQR) and Interdecile Range (IDR) are widely used to capture the spread of the
middle 50% and 80% of the data, respectively. Another option sometimes suggested for
ordinal data is the MAD (the Median Absolute Deviation). Since ordinal data does not
assume equal distances between categories, reporting key percentiles often provides a
clearer and more accurate picture.
Explanation:
Interquartile Range (IQR) and Interdecile Range (IDR) are commonly suggested
because they focus on the middle part of the distribution, which is suitable for ordinal
data. Another option sometimes suggested for ordinal data is the MAD (the Median
Absolute Deviation).
For truly ordinal variables, reporting key percentiles (e.g., the 25th, 50th, and 75th) might
be safer because they give a clear sense of how the values are distributed across ordered
categories without assuming equal intervals between ranks.
7. Suppose that the IDR for the final grades in a course in social statistics was found to be 21.
What would this tell us about the distribution of grades? If the MAD was 10, what would this tell
us?
If the Interdecile Range (IDR) for the final grades is 21, it suggests that the middle 80%
of students' grades are spread across a 21-point range. This indicates moderate variability
in the distribution of grades.
If the Median Absolute Deviation (MAD) is 10, it means that the typical deviation from
the median grade is 10 points. This means that most grades cluster around the median
with some moderate spread. Combining these two measures shows both the overall range
and the central concentration of grades.
8. What measure of dispersion is typically suggested for ratio variables, and when is it liable to
be misleading?
For (interval or) ratio variables, the most widely used measure of dispersion is the
standard deviation (SD). It leverages the defined intervals between observations by calculating
the average distance of each value from the mean. The SD is expressed in the original units of
the variable and has four key advantages: it accounts for the precise intervals between data
points, uses information from all cases, is comparable across different samples, and is in
meaningful units. However, because the SD squares deviations from the mean, outliers can have
a significant impact, making it unstable when extreme values are present.
9. What is the formula for a standard deviation? Why is there a square root sign in the formula?
The standard deviation is SD = √[Σ(xi − μ)² / N]. The square root is there because the
deviations from the mean are squared before being averaged; taking the square root of the
result returns the measure to the original units of the variable.
10. Explain the meaning of the symbols in the formula for the SD.
In the formula, xi stands for each individual data point, μ is the mean of all the values, N is
the number of cases, and Σ indicates that the squared deviations are summed across all cases.
11. Briefly state the advantages of the standard deviation.
The standard deviation:
accounts for the precise intervals between data points
uses information from all cases
is comparable across different samples
is in meaningful units.
12. When might we prefer to use an IQR rather than a standard deviation?
The IQR is preferred when the data are skewed or contain outliers, as it focuses on the
middle 50% of the data and is less sensitive to extreme values.
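This outlier sensitivity is easy to demonstrate: adding one extreme value inflates the SD dramatically while leaving the IQR untouched (made-up data; the "inclusive" quartile method is used here so both samples share the same cut points):

```python
# SD vs IQR under a single outlier.
import statistics

no_outlier = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 60]   # same data, one extreme value

def iqr(data):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

# The SD jumps from roughly 1.7 to roughly 21.3;
# the IQR stays at 2.5 for both samples.
```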
13. Why do many researchers prefer the SD to the IQR or IDR (or MAD) when they have ordinal
data?
One reason to use the SD with ordinal data is that measures often recommended, the IQR,
IDR, and MAD, may show major changes when the shifts in the data are modest, or no changes
when the shifts in the data are major. The SD responds much more smoothly to changes in the
data than the IQR, the IDR, and the MAD. Many use the SD with ordinal data for that reason, as
long as they can accept that the distances between categories are not too seriously uneven.
14. What are the mean and SD of a z-score? What are two ways z-scores can be helpful?
A z-score is a standardized form of a variable that allows for easy comparison across
different datasets. The mean of a z-score distribution is zero, and its standard deviation (SD) is
one.
Z-scores can be helpful because:
- they allow us to compare the shape of two distributions without being distracted by
differing means and SDs.
- they let us see how far a given case lies from the mean by expressing the distance in
standard deviations.
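Standardizing any dataset this way yields a mean of zero and an SD of one, which a short sketch can verify (illustrative data, population SD):

```python
# z-scores: z = (x - mean) / SD; the result has mean 0 and SD 1.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)        # 5.0
sd = statistics.pstdev(data)        # 2.0

z = [(x - mean) / sd for x in data]
# e.g. the value 9 has z = 2.0: it lies two SDs above the mean
```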
15. If a variable is normally distributed, what percentage of observations will lie within
approximately two SDs of the mean? Within one SD?
If a variable is normally distributed, approximately 68% of cases will lie within one
standard deviation (SD) of the mean, and 95% will lie within 1.96 SDs of the mean. Often, a
two-SD approximation is used for simplicity when discussing the 95% range.
16. What is the “empirical rule”?
The empirical rule states that in a normal distribution (in practice, even if not normal,
many distributions tend to follow the same rule), approximately:
68% of data falls within 1 SD of the mean.
95% falls within 2 SDs.
99.7% falls within 3 SDs.
Explanation:
The empirical rule states that many distributions encountered in practice, even if they are
not perfectly normal, tend to follow similar patterns regarding the spread of data. Specifically,
about 95% of the data typically lies within two standard deviations (SDs) of the mean, and
around 99.7% lies within three SDs of the mean. This rule provides a useful guideline for
understanding the distribution of data in various contexts.
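For a truly normal distribution, the exact fraction of observations within k SDs of the mean is erf(k/√2), so the empirical-rule percentages can be checked directly with the standard library's error function:

```python
# Exact normal-distribution coverage behind the empirical rule.
import math

def within_k_sds(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

# within_k_sds(1) ~ 0.6827, within_k_sds(2) ~ 0.9545, within_k_sds(3) ~ 0.9973
```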
17. If a variable is strictly continuous and unimodal, what percentage of observations will lie
within two SDs of the mean?
If a distribution is strictly continuous and unimodal, with a definable standard deviation,
then no more than about 11.1% of the observations can lie further than two standard deviations
(SDs) from the mean. Conversely, this means that approximately 88.9% of the observations will
fall within that range. This principle helps to quantify the spread of data in continuous unimodal
distributions.
Explanation:
In a strictly continuous and unimodal distribution—which means the distribution has a single
peak and no gaps—there are certain expectations about where most of the data will fall. If the
distribution has a definable standard deviation (SD), it helps us understand how spread out the
data is around the mean (the average value).
1. Within Two Standard Deviations: About 88.9% of the data points will lie within two
standard deviations from the mean. This means that if you look at the range from two
SDs below the mean to two SDs above the mean, you will find that most of the data
points (almost 9 out of 10) fall within this range.
2. Outside Two Standard Deviations: The remaining 11.1% of data points will lie outside
this range—meaning they are either much lower than two SDs below the mean or much
higher than two SDs above the mean.
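The 11.1% figure quoted above matches the Vysochanskij–Petunin inequality, which bounds the mass lying more than k SDs from the mean of any unimodal distribution by 4/(9k²) (valid for k > √(8/3)). A small check at k = 2:

```python
# Vysochanskij-Petunin tail bound for unimodal distributions.
def vp_tail_bound(k):
    # At most 4/(9*k^2) of a unimodal distribution lies more than
    # k SDs from the mean (valid for k > sqrt(8/3) ~ 1.63).
    return 4 / (9 * k**2)

outside = vp_tail_bound(2)   # 1/9, i.e. "no more than about 11.1%"
inside = 1 - outside         # 8/9, i.e. "approximately 88.9%"
```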