Chapter 2: Summarizing data
OpenIntro Statistics, 4th Edition
Slides developed by Mine Çetinkaya-Rundel of OpenIntro.
The slides may be copied, edited, and/or shared via the CC BY-SA license.
Some images may be included under fair use guidelines (educational purposes).
Examining numerical data
Scatterplot
Scatterplots are useful for visualizing the relationship between two
numerical variables.
Do life expectancy and total fertility appear to be associated or independent?
Was the relationship the same throughout the years, or did it change?
http://www.gapminder.org/world
Dot plots
Useful for visualizing one numerical variable. Darker colors
represent areas where there are more observations.
[Figure: dot plot of GPA, horizontal axis from 2.5 to 4.0]
How would you describe the distribution of GPAs in this data set?
Make sure to say something about the center, shape, and spread of
the distribution.
Dot plots & mean
[Figure: dot plot of GPA, horizontal axis from 2.5 to 4.0, with the mean marked by a triangle]
• The mean, also called the average (marked with a triangle in
the above plot), is one way to measure the center of a
distribution of data.
• The mean GPA is 3.59.
Mean
• The sample mean, denoted as x̄, can be calculated as
$$ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}, $$
where $x_1, x_2, \dots, x_n$ represent the n observed values.
• The population mean is also computed the same way but is
denoted as µ. It is often not possible to calculate µ since
population data are rarely available.
• The sample mean is a sample statistic, and serves as a point
estimate of the population mean. This estimate may not be
perfect, but if the sample is good (representative of the
population), it is usually a pretty good estimate.
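As a minimal sketch of the sample mean formula in R, the snippet below uses a handful of made-up GPA values (hypothetical, not the data behind the dot plots) and compares the by-hand calculation with the built-in mean():

```r
# Hypothetical GPA values, for illustration only (not the data set in the plots)
gpa <- c(3.2, 3.9, 3.6, 2.8, 3.7, 4.0)

# Sample mean from the definition: sum of the observations divided by n
n <- length(gpa)
x_bar <- sum(gpa) / n

# The built-in mean() computes the same quantity
x_bar
mean(gpa)
```

Both lines should print the same value, since mean() implements the same sum-divided-by-n calculation.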
Stacked dot plot
Taller stacks represent areas where there are more observations, which
makes it a little easier to judge the center and the shape of the
distribution.
[Figure: stacked dot plot of GPA, horizontal axis from 2.6 to 4.0]
Histograms - Extracurricular hours
• Histograms provide a view of the data density. Higher bars
represent where the data are relatively more common.
• Histograms are especially convenient for describing the shape
of the data distribution.
• The chosen bin width can alter the story the histogram is
telling.
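To see how bin width changes the picture, here is a small sketch in R. The data are simulated and purely hypothetical (they are not the extracurricular hours survey), and the three bin widths are arbitrary choices for illustration:

```r
# Simulated, hypothetical "hours per week" values (right skewed); not the survey data
set.seed(1)
hours <- round(rexp(500, rate = 1 / 8))

# Same data, three bin widths; plot = FALSE returns the bin counts without drawing
upper <- max(hours)
wide   <- hist(hours, breaks = seq(0, upper + 20, by = 20), plot = FALSE)
medium <- hist(hours, breaks = seq(0, upper + 5, by = 5), plot = FALSE)
narrow <- hist(hours, breaks = seq(-0.5, upper + 0.5, by = 1), plot = FALSE)

# Number of bins used by each version
c(wide = length(wide$counts),
  medium = length(medium$counts),
  narrow = length(narrow$counts))
```

Dropping plot = FALSE draws each histogram, which makes it easier to judge which bin width hides too much and which reveals too much.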
[Figure: histogram of hours / week spent on extracurricular activities, horizontal axis from 0 to 70]
Bin width
Which one(s) of these histograms are useful? Which reveal too
much about the data? Which hide too much?
[Figure: four histograms of hours / week spent on extracurricular activities, each drawn with a different bin width]
Shape of a distribution: modality
Does the histogram have a single prominent peak (unimodal),
several prominent peaks (bimodal/multimodal), or no apparent
peaks (uniform)?
[Figure: four histograms illustrating different modalities]
Note: To determine modality, step back and imagine a smooth curve over
the histogram. Picture the bars as wooden blocks with a limp piece of
spaghetti dropped over them; the shape the spaghetti takes can be viewed
as a smooth curve.
Shape of a distribution: skewness
Is the histogram right skewed, left skewed, or symmetric?
[Figure: three histograms illustrating different types of skewness]
Note: Histograms are said to be skewed to the side of the long tail.
Shape of a distribution: unusual observations
Are there any unusual observations or potential outliers?
[Figure: two histograms to inspect for unusual observations]
Extracurricular activities
How would you describe the shape of the distribution of hours per
week students spend on extracurricular activities?
[Figure: histogram of hours / week spent on extracurricular activities, horizontal axis from 0 to 70]
Commonly observed shapes of distributions
• modality: uniform, unimodal, bimodal, multimodal
• skewness: right skew, left skew, symmetric
Practice
Which of these variables do you expect to be uniformly distributed?
(a) weights of adult females
(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)
Are you typical?
http://www.youtube.com/watch?v=4B2xOvKFFz4
How useful are centers alone for conveying the true characteristics
of a distribution?
Variance
Variance is roughly the average squared deviation from the mean.
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
• The sample mean is x̄ = 6.71, and the sample size is n = 217.
• The variance of amount of sleep students get per night can be
calculated as:
$$ s^2 = \frac{(5 - 6.71)^2 + (9 - 6.71)^2 + \cdots + (7 - 6.71)^2}{217 - 1} = 4.11 \text{ hours}^2 $$
[Figure: histogram of hours of sleep / night, horizontal axis from 2 to 12]
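The variance formula can be sketched in R as follows. The sleep values here are a small made-up vector (hypothetical, not the 217 observations summarized above), chosen so the arithmetic is easy to follow by hand:

```r
# Hypothetical hours of sleep for a few students (illustration only)
sleep <- c(5, 9, 7, 6, 8, 7)

n <- length(sleep)
x_bar <- mean(sleep)

# Squared deviations from the mean, summed and divided by n - 1
s2 <- sum((sleep - x_bar)^2) / (n - 1)

s2          # variance from the definition
var(sleep)  # built-in var() also divides by n - 1
```

For these six values the mean is 7, the squared deviations sum to 10, and the variance is 10 / 5 = 2 hours squared.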
Variance (cont.)
Why do we use the squared deviation in the calculation of variance?
Standard deviation
The standard deviation is the square root of the variance, and has
the same units as the data.
$$ s = \sqrt{s^2} $$
• The standard deviation of amount of sleep students get per night
can be calculated as:
$$ s = \sqrt{4.11} = 2.03 \text{ hours} $$
• We can see that all of the data are within 3 standard
deviations of the mean.
[Figure: histogram of hours of sleep / night, horizontal axis from 2 to 12]
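A short R sketch of this calculation, using only the summary values quoted in the slides (mean 6.71, variance 4.11). The interval at the end is simply the mean plus or minus 3 standard deviations, to compare against the range of the histogram:

```r
# Values taken from the slides: sample mean and variance of hours of sleep
x_bar <- 6.71
s2 <- 4.11

# Standard deviation is the square root of the variance
s <- sqrt(s2)
round(s, 2)

# Interval covering 3 standard deviations on either side of the mean
c(lower = x_bar - 3 * s, upper = x_bar + 3 * s)
```

The interval runs from a little under 1 hour to a little under 13 hours.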
Median
• The median is the value that splits the data in half when
ordered in ascending order.
0, 1, 2, 3, 4 → median = 2
• If there are an even number of observations, then the median
is the average of the two values in the middle.
0, 1, 2, 3, 4, 5 → median = (2 + 3) / 2 = 2.5
• Since the median is the midpoint of the data, 50% of the
values are below it. Hence, it is also the 50th percentile.
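The two small examples above can be checked in R with the built-in median():

```r
# Odd number of observations: the middle value
median(c(0, 1, 2, 3, 4))     # 2

# Even number of observations: average of the two middle values
x <- c(0, 1, 2, 3, 4, 5)
median(x)                    # 2.5

# Share of values strictly below the median
mean(x < median(x))          # 0.5
```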
Q1, Q3, and IQR
• The 25th percentile is also called the first quartile, Q1.
• The 50th percentile is also called the median.
• The 75th percentile is also called the third quartile, Q3.
• Between Q1 and Q3 is the middle 50% of the data. The range
these data span is called the interquartile range, or the IQR.
IQR = Q3 − Q1
How to calculate the sample quartiles?
Example
Let’s compute the quartiles of the following data set representing
the heights (in cm) of 12 plants:
12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3
Step 1: Order the Data and Compute Positions
First, we order the data from smallest to largest:
2.5, 4.1, 6.3, 7.8, 8.9, 9.1, 10.6, 12.3, 14.7, 15.4, 16.2, 17.6
Then, we calculate the positions for Q1 and Q3 as
0.25(12 + 1) = 3.25 and 0.75(12 + 1) = 9.75 respectively.
Step 2: Compute the Quartiles
Position:  1    2    3    4    5    6    7     8     9     10    11    12
Value:     2.5  4.1  6.3  7.8  8.9  9.1  10.6  12.3  14.7  15.4  16.2  17.6
We interpolate to find the values at these positions:
• For Q1, position 3.25 lies between positions 3 and 4.
The values at these positions are 6.3 and 7.8. So, Q1 =
6.3 + 0.25 × (7.8 − 6.3) = 6.675.
• For Q3, position 9.75 lies between positions 9 and 10.
The values at these positions are 14.7 and 15.4. So, Q3
= 14.7 + 0.75 × (15.4 − 14.7) = 15.225.
The median Q2 is the average of the values at positions 12/2 = 6
and 12/2 + 1 = 7, which are 9.1 and 10.6 in our case. So, Q2 =
(9.1 + 10.6)/2 = 9.85.
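The interpolation steps above can be sketched in R as follows. The helper name quartile_by_position is made up for this illustration (it is not a built-in function), and it assumes the position p(n + 1) falls between 1 and n, which holds for the quartiles of these 12 values:

```r
heights <- c(12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3)

# Hypothetical helper: interpolate at position p * (n + 1) in the sorted data.
# Assumes 1 <= p * (n + 1) <= n, so both neighboring positions exist.
quartile_by_position <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)   # e.g. 0.25 * 13 = 3.25
  lo <- floor(pos)             # position just below
  hi <- ceiling(pos)           # position just above
  frac <- pos - lo             # how far to move from lo toward hi
  x[lo] + frac * (x[hi] - x[lo])
}

q1 <- quartile_by_position(heights, 0.25)
q2 <- quartile_by_position(heights, 0.50)
q3 <- quartile_by_position(heights, 0.75)
c(Q1 = q1, Q2 = q2, Q3 = q3)
```

For p = 0.5 the position is 6.5, so the helper averages the 6th and 7th values, matching the median rule for an even number of observations.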
Result
So, the quartiles of our data set are:
• Q1 = 6.675
• Q2 = 9.85
• Q3 = 15.225
Data Set
Here’s the data set we are working with:
12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3
# Store the data in a vector
data <- c(12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8,
9.1, 15.4, 17.6, 6.3)
Computing Mean and Median in R
Here’s how you can compute the mean and median for the data
set using R:
# Compute the mean
mean_value <- mean(data)
mean_value
# Compute the median
median_value <- median(data)
median_value
Computing Quartiles in R (Type 6)
Here’s how you can compute the quartiles for the data set using R
with type = 6, which places the quartiles at the same p(n + 1)
positions used in the hand calculation:
# Compute the quartiles using type 6
quartiles <- quantile(data, type = 6)
quartiles
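As a hedged aside, quantile() uses type = 7 when no type is given, which places the quartiles at different positions than the p(n + 1) rule and so can return different values of Q1 and Q3 for the same data. A short comparison on the plant heights, with the IQR computed under type 6:

```r
data <- c(12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8,
          9.1, 15.4, 17.6, 6.3)
probs <- c(0.25, 0.50, 0.75)

# Type 6 matches the hand calculation: 6.675, 9.85, 15.225
type6 <- quantile(data, probs = probs, type = 6)

# Default (type 7) interpolates at position 1 + (n - 1) * p instead
type7 <- quantile(data, probs = probs)

type6
type7

# IQR = Q3 - Q1, using the same type as the hand calculation
IQR(data, type = 6)
```

The median agrees under both types here; only Q1 and Q3 differ.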
Box plot
The box in a box plot represents the middle 50% of the data, and
the thick line in the box is the median.
[Figure: box plot of # of study hours / week]
Anatomy of a box plot
[Figure: annotated box plot of # of study hours / week, labeling the lower whisker, Q1 (first quartile), median, Q3 (third quartile), max whisker reach & upper whisker, and suspected outliers]
Whiskers and outliers
• Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.
max upper whisker reach = Q3 + 1.5 × IQR
max lower whisker reach = Q1 − 1.5 × IQR
IQR : 20 − 10 = 10
max upper whisker reach = 20 + 1.5 × 10 = 35
max lower whisker reach = 10 − 1.5 × 10 = −5
• A potential outlier is defined as an observation beyond the
maximum reach of the whiskers. It is an observation that
appears extreme relative to the rest of the data.
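The whisker arithmetic above can be sketched in R. Q1 = 10 and Q3 = 20 come from the slide; the study_hours vector is a hypothetical set of observations added only to show how points beyond the reach get flagged:

```r
# Quartiles from the slide
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Maximum reach of the whiskers
upper_reach <- q3 + 1.5 * iqr   # 35
lower_reach <- q1 - 1.5 * iqr   # -5

# Hypothetical observations (illustration only)
study_hours <- c(2, 8, 12, 15, 20, 33, 40, 70)

# Potential outliers: observations beyond the maximum whisker reach
outliers <- study_hours[study_hours > upper_reach | study_hours < lower_reach]
outliers                        # 40 70
```

Note that a lower reach of −5 is only a cutoff; the lower whisker itself stops at the smallest observation within reach.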
Outliers (cont.)
Why is it important to look for outliers?
Extreme observations
How would sample statistics such as mean, median, SD, and IQR
of household income be affected if the largest value was replaced
with $10 million? What if the smallest value was replaced with $10
million?
[Figure: stacked dot plot of annual household income, horizontal axis from $0 to $1,000,000]
Robust statistics
[Figure: stacked dot plot of annual household income, horizontal axis from $0 to $1,000,000]
                                robust           not robust
scenario                        median   IQR     x̄       s
original data                   190K     200K    245K    226K
move largest to $10 million     190K     200K    309K    853K
move smallest to $10 million    200K     200K    316K    854K
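The same experiment can be sketched in R on a small made-up income vector (hypothetical values in thousands of dollars, not the household income data in the table). The helper name describe is also just an illustrative choice:

```r
# Hypothetical annual household incomes, in thousands of dollars
income <- c(40, 60, 85, 120, 190, 250, 300, 420, 1000)

# Hypothetical helper collecting the four summary statistics
describe <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

# Replace the largest, then the smallest, value with $10 million (10000K)
largest_moved  <- replace(income, which.max(income), 10000)
smallest_moved <- replace(income, which.min(income), 10000)

rbind(original       = describe(income),
      largest_moved  = describe(largest_moved),
      smallest_moved = describe(smallest_moved))
```

Moving the largest value leaves the median untouched while the mean and SD jump, which is the pattern the table illustrates.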
Robust statistics
Median and IQR are more robust to skewness and outliers than
mean and SD. Therefore,
• for skewed distributions it is often more helpful to use median
and IQR to describe the center and spread
• for symmetric distributions it is often more helpful to use the
mean and SD to describe the center and spread
If you would like to estimate the typical household income for a
student, would you be more interested in the mean or median income?
Mean vs. median
• If the distribution is symmetric, center is often defined as the
mean: mean ≈ median
• If the distribution is skewed or has extreme outliers, center is
often defined as the median
• Right-skewed: mean > median
• Left-skewed: mean < median
[Figure: three density curves (symmetric, right-skewed, left-skewed), each with the mean and median marked]
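A tiny R sketch of the skew relationships above, using made-up values (hypothetical, chosen so the long tail is obvious):

```r
# Hypothetical right-skewed values: one large observation forms the long right tail
right_skewed <- c(1, 2, 2, 3, 3, 3, 4, 20)
mean(right_skewed)     # 4.75
median(right_skewed)   # 3

# Mirror the values to get a left-skewed version
left_skewed <- 21 - right_skewed
mean(left_skewed)      # 16.25
median(left_skewed)    # 18
```

In both cases the mean is pulled toward the long tail while the median stays near the bulk of the data.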
Practice
Which is most likely true for the distribution of percentage of time actually
spent taking notes in class versus on Facebook, Twitter, etc.?
[Figure: histogram of % of time in class spent taking notes, horizontal axis from 0 to 100]
(a) mean > median
(b) mean < median
(c) mean ≈ median
(d) impossible to tell
Extremely skewed data
When data are extremely skewed, transforming them might make
modeling easier. A common transformation is the log
transformation.
The histogram on the left shows the distribution of number of
basketball games attended by students. The histogram on the right
shows the distribution of log of number of games attended.
[Figure: two histograms, # of basketball games attended (left) and log of # of basketball games attended (right)]
Pros and cons of transformations
• Skewed data are easier to model when they are
transformed because outliers tend to become far less
prominent after an appropriate transformation.
# of games         70     50     25     ···
log(# of games)    4.25   3.91   3.22   ···
• However, results of an analysis in log units of the measured
variable might be difficult to interpret.
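The log values in the table can be reproduced in R, where log() is the natural logarithm. The exp() call at the end is one way to move results back to the original units, which can help with interpretation:

```r
# Number of games from the table above
games <- c(70, 50, 25)

# Natural log transformation
log_games <- log(games)
round(log_games, 2)    # 4.25 3.91 3.22

# Back-transform to the original scale
exp(log_games)
```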
What other variables would you expect to be extremely skewed?
Intensity maps
What patterns are apparent in the change in population between
2000 and 2010?
http://projects.nytimes.com/census/2010/map
Considering categorical data
Contingency tables
A table that summarizes data for two categorical variables is called
a contingency table.
The contingency table below shows the distribution of survival and
ages of passengers on the Titanic.
                Survival
                Died    Survived    Total
Age    Adult    1438    654         2092
       Child    52      57          109
       Total    1490    711         2201
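One way to hold this contingency table in R is as a matrix of counts with named dimensions; addmargins() then appends the totals (labeled "Sum" rather than "Total"). The object name titanic is an illustrative choice:

```r
# Counts from the contingency table above (rows: Age, columns: Survival)
titanic <- matrix(c(1438, 654,
                    52,   57),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Age = c("Adult", "Child"),
                                  Survival = c("Died", "Survived")))

# Add row and column totals
with_totals <- addmargins(as.table(titanic))
with_totals
```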
Bar plots
A bar plot is a common way to display a single categorical variable.
A bar plot where proportions instead of frequencies are shown is
called a relative frequency bar plot.
[Figure: two bar plots of Survival (Died, Survived), one showing frequency and one showing relative frequency]
How are bar plots different than histograms?
Choosing the appropriate proportion
Does there appear to be a relationship between age and survival
for passengers on the Titanic?
                Survival
                Died    Survived    Total
Age    Adult    1438    654         2092
       Child    52      57          109
       Total    1490    711         2201
To answer this question we examine the row proportions:
• Proportion of adults who survived: 654 / 2092 ≈ 0.31
• Proportion of children who survived: 57 / 109 ≈ 0.52
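The row proportions can be computed in R with prop.table(), where margin = 1 divides each count by its row total. The object names are illustrative:

```r
# Counts from the contingency table (rows: Age, columns: Survival)
titanic <- matrix(c(1438, 654,
                    52,   57),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Age = c("Adult", "Child"),
                                  Survival = c("Died", "Survived")))

# Row proportions: each row sums to 1
row_props <- prop.table(titanic, margin = 1)
round(row_props, 2)
```

Using margin = 2 instead would give column proportions (for example, the share of survivors who were children), which answers a different question.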
Bar plots with two variables
• Stacked bar plot: Graphical display of contingency table
information, for counts.
• Side-by-side bar plot: Displays the same information by
placing bars next to, instead of on top of, each other.
• Standardized stacked bar plot: Graphical display of
contingency table information, for proportions.
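A rough sketch of the three bar plot types in base R, using the Titanic counts from the contingency table. This uses barplot() from base graphics, which is an assumption of this sketch; the plots in the slides may have been drawn with a different tool:

```r
# Counts from the contingency table (rows: Age, columns: Survival)
titanic <- matrix(c(1438, 654,
                    52,   57),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Age = c("Adult", "Child"),
                                  Survival = c("Died", "Survived")))

# barplot() draws one bar per column, so transpose to get one bar per age group
counts <- t(titanic)
props  <- t(prop.table(titanic, margin = 1))   # each age group sums to 1

# Stacked bar plot (counts)
barplot(counts, legend.text = TRUE, xlab = "Age", ylab = "Frequency")

# Side-by-side bar plot (counts)
barplot(counts, beside = TRUE, legend.text = TRUE, xlab = "Age", ylab = "Frequency")

# Standardized stacked bar plot (proportions)
barplot(props, legend.text = TRUE, xlab = "Age", ylab = "Relative frequency")
```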
What are the differences between the three visualizations shown
below?
[Figure: three bar plots of Survival (Died, Survived) by Age (Adult, Child): two showing frequency and one showing relative frequency]
Mosaic plots
What is the difference between the two visualizations shown below?
[Figure: two plots of Survival (Died, Survived) by Age (Adult, Child): a bar plot of relative frequency and a mosaic plot]
Pie chart vs bar plot
Side-by-side box plots
Does there appear to be a relationship between class year and
number of clubs students are in?
[Figure: side-by-side box plots of number of clubs by class year (First-year, Sophomore, Junior, Senior)]
Bar plot (bar graph) versus histogram