
Chapter 2

Chapter 2 of OpenIntro Statistics discusses various methods for summarizing numerical data, including scatterplots, dot plots, histograms, and box plots. It explains key concepts such as mean, median, variance, and standard deviation, as well as how to identify the shape and modality of distributions. The chapter also provides practical examples and R code for calculating these statistics.

Chapter 2: Summarizing data

OpenIntro Statistics, 4th Edition

Slides developed by Mine Çetinkaya-Rundel of OpenIntro.


The slides may be copied, edited, and/or shared via the CC BY-SA license.
Some images may be included under fair use guidelines (educational purposes).
Examining numerical data
Scatterplot

Scatterplots are useful for visualizing the relationship between two
numerical variables.

Do life expectancy and total fertility appear to be associated or independent?
Was the relationship the same throughout the years, or did it change?

http://www.gapminder.org/world
Dot plots

Useful for visualizing one numerical variable. Darker colors
represent areas where there are more observations.

[Figure: dot plot of GPA; axis from 2.5 to 4.0]

How would you describe the distribution of GPAs in this data set?
Make sure to say something about the center, shape, and spread of
the distribution.
Dot plots & mean

[Figure: dot plot of GPA with the mean marked by a triangle; axis from 2.5 to 4.0]

• The mean, also called the average (marked with a triangle in
the above plot), is one way to measure the center of a
distribution of data.
• The mean GPA is 3.59.
Mean

• The sample mean, denoted as x̄, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n},$$

where $x_1, x_2, \cdots, x_n$ represent the n observed values.
• The population mean is also computed the same way but is
denoted as µ. It is often not possible to calculate µ since
population data are rarely available.
• The sample mean is a sample statistic, and serves as a point
estimate of the population mean. This estimate may not be
perfect, but if the sample is good (representative of the
population), it is usually a pretty good estimate.
Stacked dot plot

Higher bars represent areas where there are more observations,
which makes it a little easier to judge the center and the shape of the
distribution.

[Figure: stacked dot plot of GPA; axis from 2.6 to 4.0]
Histograms - Extracurricular hours

• Histograms provide a view of the data density. Higher bars
represent where the data are relatively more common.
• Histograms are especially convenient for describing the shape
of the data distribution.
• The chosen bin width can alter the story the histogram is
telling.

[Figure: histogram of hours / week spent on extracurricular activities; x-axis from 0 to 70, y-axis from 0 to 150]
Bin width

Which one(s) of these histograms are useful? Which reveal too
much about the data? Which hide too much?

[Figure: four histograms of hours / week spent on extracurricular activities, each drawn with a different bin width]
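The effect of bin width can be tried out in R. The sketch below is only illustrative: the `hours` vector is simulated (the survey data behind the slides are not reproduced here), the helper `bin_counts` is a hypothetical name, and the bin widths of 1, 5, and 35 are arbitrary choices.

```r
# Simulated, hypothetical stand-in for hours / week (NOT the survey data)
set.seed(1)
hours <- pmin(rexp(500, rate = 1 / 10), 70)  # right-skewed, capped at 70

# Count observations per bin for a given bin width, without drawing the plot
bin_counts <- function(x, width) {
  hist(x, breaks = seq(0, 70, by = width), plot = FALSE)$counts
}

narrow <- bin_counts(hours, 1)   # 70 bins: reveals a lot of noise
medium <- bin_counts(hours, 5)   # 14 bins
wide   <- bin_counts(hours, 35)  # 2 bins: hides most of the shape

# Number of bins produced by each width
c(narrow = length(narrow), medium = length(medium), wide = length(wide))
```

Calling `hist(hours, breaks = seq(0, 70, by = 5))` directly draws the histogram instead of returning only the counts.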
Shape of a distribution: modality

Does the histogram have a single prominent peak (unimodal),
several prominent peaks (bimodal/multimodal), or no apparent
peaks (uniform)?

[Figure: four histograms illustrating different modalities]

Note: In order to determine modality, step back and imagine a smooth curve over
the histogram. Imagine that the bars are wooden blocks and you drop a piece of limp
spaghetti over them; the shape the spaghetti would take could be viewed as a
smooth curve.
Shape of a distribution: skewness

Is the histogram right skewed, left skewed, or symmetric?

[Figure: three histograms illustrating different kinds of skewness]

Note: Histograms are said to be skewed to the side of the long tail.
Shape of a distribution: unusual observations

Are there any unusual observations or potential outliers?

[Figure: two histograms, one with x-axis from 0 to 20 and one with x-axis from 20 to 100]
Extracurricular activities

How would you describe the shape of the distribution of hours per
week students spend on extracurricular activities?

[Figure: histogram of hours / week spent on extracurricular activities; x-axis from 0 to 70, y-axis from 0 to 150]
Commonly observed shapes of distributions

• modality: unimodal, bimodal, multimodal, uniform
• skewness: right skew, left skew, symmetric
Practice

Which of these variables do you expect to be uniformly distributed?

(a) weights of adult females


(b) salaries of a random sample of people from North Carolina
(c) house prices
(d) birthdays of classmates (day of the month)

Are you typical?

http://www.youtube.com/watch?v=4B2xOvKFFz4

How useful are centers alone for conveying the true characteristics
of a distribution?
Variance

Variance is roughly the average squared deviation from the mean.

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

• The sample mean is x̄ = 6.71, and the sample size is n = 217.
• The variance of amount of sleep students get per night
can be calculated as:

$$s^2 = \frac{(5 - 6.71)^2 + (9 - 6.71)^2 + \cdots + (7 - 6.71)^2}{217 - 1} = 4.11 \text{ hours}^2$$

[Figure: histogram of hours of sleep / night; x-axis from 2 to 12, y-axis from 0 to 80]
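The variance formula can be checked numerically. The sketch below uses a small made-up `sleep` vector of six values (the full 217-student sample is not available here) and compares the formula, written out by hand, with R's built-in `var()`, which also divides by n − 1.

```r
# Hypothetical hours of sleep for six students (NOT the 217-student sample)
sleep <- c(5, 9, 7, 6, 8, 7)

n     <- length(sleep)
x_bar <- mean(sleep)

# Sum of squared deviations from the mean, divided by n - 1
s2_manual <- sum((sleep - x_bar)^2) / (n - 1)

# Built-in sample variance, which uses the same n - 1 denominator
s2_builtin <- var(sleep)

s2_manual
s2_builtin
```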
Variance (cont.)

Why do we use the squared deviation in the calculation of variance?

Standard deviation

The standard deviation is the square root of the variance, and has
the same units as the data.

$$s = \sqrt{s^2}$$

• The standard deviation of amount of sleep students
get per night can be calculated as:

$$s = \sqrt{4.11} = 2.03 \text{ hours}$$

• We can see that all of the data are within 3 standard
deviations of the mean.

[Figure: histogram of hours of sleep / night; x-axis from 2 to 12, y-axis from 0 to 80]
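As a quick check of the numbers on this slide, the sketch below starts from the quoted mean (6.71) and variance (4.11) and computes the standard deviation and the interval that lies within 3 standard deviations of the mean. The variable names are illustrative only.

```r
# Summary values quoted on the slide
x_bar <- 6.71   # sample mean, in hours
s2    <- 4.11   # sample variance, in hours^2

# Standard deviation: square root of the variance, back in hours
s <- sqrt(s2)
round(s, 2)

# Interval covering 3 standard deviations on either side of the mean
c(lower = x_bar - 3 * s, upper = x_bar + 3 * s)
```

The interval runs from below 2 to above 12 hours, which covers the full range of the histogram's horizontal axis.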
Median

• The median is the value that splits the data in half when
ordered in ascending order.

$$0, 1, 2, 3, 4 \rightarrow 2$$

• If there are an even number of observations, then the median
is the average of the two values in the middle.

$$0, 1, 2, 3, 4, 5 \rightarrow \frac{2 + 3}{2} = 2.5$$

• Since the median is the midpoint of the data, 50% of the
values are below it. Hence, it is also the 50th percentile.
Q1, Q3, and IQR

• The 25th percentile is also called the first quartile, Q1.


• The 50th percentile is also called the median.
• The 75th percentile is also called the third quartile, Q3.
• Between Q1 and Q3 is the middle 50% of the data. The range
these data span is called the interquartile range, or the IQR.

IQR = Q3 − Q1

How to calculate the sample quartiles?

Example

Let’s compute the quartiles of the following data set representing
the heights (in cm) of 12 plants:

12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3
Step 1: Order the Data and Compute Positions

First, we order the data from smallest to largest:

2.5, 4.1, 6.3, 7.8, 8.9, 9.1, 10.6, 12.3, 14.7, 15.4, 16.2, 17.6

Then, we calculate the positions for Q1 and Q3 as
0.25(12 + 1) = 3.25 and 0.75(12 + 1) = 9.75, respectively.
Step 2: Compute the Quartiles

Position:  1    2    3    4    5    6    7     8     9     10    11    12
Value:     2.5  4.1  6.3  7.8  8.9  9.1  10.6  12.3  14.7  15.4  16.2  17.6

We interpolate to find the values at these positions:

• For Q1, position 3.25 lies between positions 3 and 4.
The values at these positions are 6.3 and 7.8. So, Q1 =
6.3 + 0.25 × (7.8 − 6.3) = 6.675.
• For Q3, position 9.75 lies between positions 9 and 10.
The values at these positions are 14.7 and 15.4. So, Q3
= 14.7 + 0.75 × (15.4 − 14.7) = 15.225.

The median Q2 is the average of the values at positions 12/2 = 6
and 12/2 + 1 = 7, which are 9.1 and 10.6 in our case. So, Q2 =
(9.1 + 10.6)/2 = 9.85.
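The two steps above can be sketched as a small R function. The function name `quartile_by_position` and the variable `heights` are illustrative, not part of the slides, and the function assumes the position p(n + 1) falls between 1 and n, which holds for the quartiles of this data set.

```r
# Plant heights (in cm) from the example
heights <- c(12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3)

# Hypothetical helper: interpolate at position p * (n + 1) in the sorted data.
# Assumes 1 <= p * (n + 1) < n, so both neighbouring positions exist.
quartile_by_position <- function(x, p) {
  x     <- sort(x)
  pos   <- p * (length(x) + 1)   # e.g. 0.25 * 13 = 3.25
  lower <- floor(pos)            # integer position just below
  frac  <- pos - lower           # fractional part used for interpolation
  x[lower] + frac * (x[lower + 1] - x[lower])
}

q1 <- quartile_by_position(heights, 0.25)
q2 <- quartile_by_position(heights, 0.50)
q3 <- quartile_by_position(heights, 0.75)
c(Q1 = q1, Q2 = q2, Q3 = q3)
```

For p = 0.5 the position is 6.5, so the interpolation reduces to the average of the 6th and 7th values, matching the median rule above.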
Result

So, the quartiles of our data set are:

• Q1 = 6.675
• Q2 = 9.85
• Q3 = 15.225

Data Set

Here’s the data set we are working with:

12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8, 9.1, 15.4, 17.6, 6.3

# Store the data in a vector
data <- c(12.3, 4.1, 16.2, 8.9, 14.7, 2.5, 10.6, 7.8,
          9.1, 15.4, 17.6, 6.3)
Computing Mean and Median in R

Here’s how you can compute the mean and median for the data
set using R:

# Compute the mean
mean_value <- mean(data)
mean_value

# Compute the median
median_value <- median(data)
median_value
Computing Quartiles in R (Type 6)

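Here’s how you can compute the quartiles for the data set using R
with type = 6, which places the quartiles at positions p(n + 1) and so
matches the hand calculation above:

# Compute the quartiles using type 6
# (returns the minimum, Q1, median, Q3, and maximum)
quartiles <- quantile(data, type = 6)
quartiles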
Box plot

The box in a box plot represents the middle 50% of the data, and
the thick line in the box is the median.

[Figure: box plot of # of study hours / week; axis from 10 to 70]
Anatomy of a box plot

[Figure: annotated box plot of # of study hours / week; axis from 0 to 70. Labels, from top to bottom: suspected outliers; max whisker reach & upper whisker; Q3 (third quartile); median; Q1 (first quartile); lower whisker]
Whiskers and outliers

• Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles.

max upper whisker reach = Q3 + 1.5 × IQR
max lower whisker reach = Q1 − 1.5 × IQR

IQR: 20 − 10 = 10
max upper whisker reach = 20 + 1.5 × 10 = 35
max lower whisker reach = 10 − 1.5 × 10 = −5

• A potential outlier is defined as an observation beyond the
maximum reach of the whiskers. It is an observation that
appears extreme relative to the rest of the data.
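The whisker rule can be sketched in R as follows. Q1 = 10 and Q3 = 20 are the values from the slide; the `study_hours` vector is made up purely to show how observations are flagged and is not the data behind the box plot.

```r
# Quartiles quoted on the slide
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# Maximum reach of each whisker
upper_reach <- q3 + 1.5 * iqr
lower_reach <- q1 - 1.5 * iqr
c(lower = lower_reach, upper = upper_reach)

# Hypothetical observations (NOT the study-hours data from the slides)
study_hours <- c(2, 8, 10, 12, 15, 20, 22, 30, 40, 69)

# Potential outliers: anything beyond the maximum whisker reach
outliers <- study_hours[study_hours < lower_reach | study_hours > upper_reach]
outliers
```

The whiskers themselves stop at the most extreme observation inside the reach, so the lower whisker would end at 2 here rather than at −5.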
Outliers (cont.)

Why is it important to look for outliers?

Extreme observations

How would sample statistics such as mean, median, SD, and IQR
of household income be affected if the largest value was replaced
with $10 million? What if the smallest value was replaced with $10
million?




[Figure: stacked dot plot of annual household income; axis from 0 to 1,000,000]
Robust statistics




[Figure: stacked dot plot of annual household income; axis from 0 to 1,000,000]

                                  robust           not robust
scenario                          median   IQR     x̄       s
original data                     190K     200K    245K    226K
move largest to $10 million       190K     200K    309K    853K
move smallest to $10 million      200K     200K    316K    854K
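The same experiment can be repeated in R. The `income` vector below is invented for illustration (it is not the household income data in the table), so the numbers differ from the table, but the pattern is the same: replacing the largest value with $10 million leaves the median and IQR unchanged while the mean and SD jump.

```r
# Hypothetical annual household incomes (NOT the data from the slides)
income <- c(40, 55, 60, 75, 90, 120, 150, 200, 260, 400) * 1000

# Replace the largest value with $10 million
income_extreme <- income
income_extreme[which.max(income_extreme)] <- 1e7

# Collect the four summary statistics for a vector
summarize <- function(x) {
  c(median = median(x), IQR = IQR(x), mean = mean(x), sd = sd(x))
}

before <- summarize(income)
after  <- summarize(income_extreme)
rbind(before, after)
```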
Robust statistics

Median and IQR are more robust to skewness and outliers than
mean and SD. Therefore,

• for skewed distributions it is often more helpful to use median
and IQR to describe the center and spread
• for symmetric distributions it is often more helpful to use the
mean and SD to describe the center and spread

If you would like to estimate the typical household income for a student,
would you be more interested in the mean or median income?
Mean vs. median

• If the distribution is symmetric, center is often defined as the
mean: mean ≈ median
• If the distribution is skewed or has extreme outliers, center is
often defined as the median
  • Right-skewed: mean > median
  • Left-skewed: mean < median

[Figure: three density curves (symmetric, right-skewed, left-skewed), each with the mean and median marked]
Practice

Which is most likely true for the distribution of percentage of time actually
spent taking notes in class versus on Facebook, Twitter, etc.?
[Figure: histogram of % of time in class spent taking notes; x-axis from 0 to 100, y-axis from 0 to 50]

(a) mean > median
(b) mean < median
(c) mean ≈ median
(d) impossible to tell
Extremely skewed data

When data are extremely skewed, transforming them might make
modeling easier. A common transformation is the log
transformation.

The histogram on the left shows the distribution of number of
basketball games attended by students. The histogram on the right
shows the distribution of log of number of games attended.

[Figure: two histograms. Left: # of basketball games attended, x-axis from 0 to 70. Right: log of # of basketball games attended, x-axis from 0 to 4]
Pros and cons of transformations

• Skewed data are easier to model when they are
transformed because outliers tend to become far less
prominent after an appropriate transformation.

# of games         70     50     25     ···
log(# of games)    4.25   3.91   3.22   ···

• However, results of an analysis in log units of the measured
variable might be difficult to interpret.

What other variables would you expect to be extremely skewed?
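The values in the table can be reproduced in R, where `log()` is the natural logarithm by default. The sketch below uses only the three counts shown on the slide and also shows how `exp()` undoes the transformation, which is one way to bring results back to the original units.

```r
# Numbers of games shown in the table
games <- c(70, 50, 25)

# Natural log transformation (R's log() uses base e by default)
log_games <- log(games)
round(log_games, 2)

# exp() reverses the transformation and recovers the original counts
exp(log_games)
```

The gap between 70 and 25 games shrinks to roughly one unit on the log scale, which is why large values look less extreme after transforming.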
Intensity maps

What patterns are apparent in the change in population between
2000 and 2010?

http://projects.nytimes.com/census/2010/map
Considering categorical data
Contingency tables

A table that summarizes data for two categorical variables is called
a contingency table.

The contingency table below shows the distribution of survival and
ages of passengers on the Titanic.

              Survival
Age      Died    Survived    Total
Adult    1438    654         2092
Child    52      57          109
Total    1490    711         2201
Bar plots

A bar plot is a common way to display a single categorical variable.
A bar plot where proportions instead of frequencies are shown is
called a relative frequency bar plot.

[Figure: two bar plots of Survival (Died, Survived). Left: frequency, y-axis from 0 to 1500. Right: relative frequency, y-axis from 0.0% to 60.0%]

How are bar plots different from histograms?
Choosing the appropriate proportion

Does there appear to be a relationship between age and survival
for passengers on the Titanic?

              Survival
Age      Died    Survived    Total
Adult    1438    654         2092
Child    52      57          109
Total    1490    711         2201

To answer this question we examine the row proportions:

• Proportion of adults who survived: 654 / 2092 ≈ 0.31
• Proportion of children who survived: 57 / 109 ≈ 0.52
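The row proportions can be computed in R with `prop.table()`. The sketch below enters the counts from the contingency table by hand; the object name `titanic` is just an illustrative choice.

```r
# Counts from the contingency table (rows: Age, columns: Survival)
titanic <- matrix(c(1438, 654,
                      52,  57),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Age = c("Adult", "Child"),
                                  Survival = c("Died", "Survived")))

# margin = 1 divides each cell by its row total, giving row proportions
row_props <- prop.table(titanic, margin = 1)
round(row_props, 2)
```

Using `margin = 2` instead would give column proportions, which answer a different question: among those who died (or survived), what share were adults or children.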
Bar plots with two variables

• Stacked bar plot: Graphical display of contingency table
information, for counts.
• Side-by-side bar plot: Displays the same information by
placing bars next to, instead of on top of, each other.
• Standardized stacked bar plot: Graphical display of
contingency table information, for proportions.
What are the differences between the three visualizations shown
below?

[Figure: three bar plots of Survival (Died, Survived) by Age (Adult, Child): a stacked bar plot of frequencies, a side-by-side bar plot of frequencies, and a standardized stacked bar plot of relative frequencies]
Mosaic plots

What is the difference between the two visualizations shown below?

[Figure: a standardized stacked bar plot of Survival (Died, Survived) by Age (Adult, Child), with relative frequency from 0.00 to 1.00, next to a mosaic plot of the same two variables]
Pie chart vs bar plot

Side-by-side box plots

Does there appear to be a relationship between class year and
number of clubs students are in?

[Figure: side-by-side box plots of number of clubs by class year (First-year, Sophomore, Junior, Senior)]
Bar plot (bar graph) versus histogram
