DESCRIBING DATA DISTRIBUTIONS
In this chapter, you will learn:
➢ To explore data, develop and interpret tables and charts for categorical data and
numerical data
➢ To describe the properties of the measures of center and spread
➢ To calculate and interpret descriptive summary measures effectively
TOPIC 1: DISPLAYING AND DESCRIBING DISTRIBUTIONS
In our everyday lives, we acquire data without considering that these data
may be useful in improving one's living. Some people collect data merely to fulfill
requirements and others for money, then leave the data to rot in an office or stockroom.
Fortunately, there are people who recognize the value behind the data. Groups
such as private businesses, government agencies and non-government organizations
(NGOs) have collected considerable quantities of data. These
data are a huge potential source of essential information, especially in decision making.
However, the data are practically useless unless they are properly organized and
then presented in a simple, easy-to-understand form. The presentation should also readily
suggest the important information and results at a glance. If correctly presented, the data
provide meaningful insights which then yield conclusions that can greatly influence both daily
decisions and policies in community and government activities. There are three ways of
presenting organized data: textual presentation, tabular presentation and graphical
presentation.
I. Textual Presentation is a narrative way of describing the characteristics of the
population based on the data collected and organized from the sample. It uses both text
and figures to convey relevant information and is usually found in publications like the
NSO's annual/quarterly reports and in newspapers like the Philippine Daily Inquirer.
Illustrations:
▪ A total of 22.4 million children aged 5-17 years old in 9.6 million households were
estimated from the 1995 National Survey of Working Children (NSWC).
▪ Sixteen percent (16%) or 3.6 million children were reported engaged in economic
activities at any time in 1995. Boys were more likely to work than girls with a national
sex ratio of working children of 187.
II. Tabular Presentation is one of the most commonly used presentation methods. It
makes use of tables, in which the data are tallied into appropriate row and/or column
categories. It could be in the form of a cross tabulation table, a contingency table, or a
frequency distribution table (FDT).
B.1 Cross Tabulation Table
▪ Cross tabulation table is readily used if the data are expressed in categories
since it is appropriate to present the results in a systematic manner which
arranges data in rows and columns.
Example:
Table 1. Numbers of patients falling into each
smoking and lung cancer combination
              Lung Cancer
Smoker    Present    Absent
Yes         688        650
No           21         59
B.2 Contingency tables
▪ Contingency tables are usually used to record and analyze the relationship
between two or more qualitative (categorical) variables. They are discussed
further in the later part of this workbook.
Example:
▪ Suppose that we have two variables, sex (male or female) and handedness
(right- or left-handed). We observe the values of both variables in a random
sample of 100 people. Then a contingency table can be used to express the
relationship between these two variables, as follows:
Table 2. Distribution of 100 persons by gender and handedness
                   Handedness
Gender    Right-handed    Left-handed    Total
Male           43               9          52
Female         44               4          48
Total          87              13         100
The row and column totals (52, 48, 87 and 13) are called marginal totals and the
figure in the lower right corner (100) is the grand total.
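The marginal totals and the grand total can be recomputed from the four cell counts. The following is a minimal Python sketch, assuming the counts are typed in by hand from Table 2; the variable names are illustrative only.

```python
# Cell counts copied from Table 2 (rows: gender, columns: handedness).
counts = {
    "Male": {"Right-handed": 43, "Left-handed": 9},
    "Female": {"Right-handed": 44, "Left-handed": 4},
}

# Row marginal totals: add across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}

# Column marginal totals: add down the rows of each column.
col_totals = {}
for cells in counts.values():
    for col, value in cells.items():
        col_totals[col] = col_totals.get(col, 0) + value

# The grand total is the sum of either set of marginal totals.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

Both sets of marginal totals must add up to the same grand total, which is a quick check that the table was tallied correctly.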
B.3 Frequency Distribution Table
In the real world, the information gathered is sometimes numerical in nature, such as
the age of a respondent, the score in an admission entrance examination, the GPA of a
student and so on. If the data set gathered is large ( n ≥ 50 ), then it is
advantageous to group the data into a number of classes or intervals so as to get a
better look at the overall picture.
Table 3. Scores in a Stat 211 Midterm Exam
31 28 15 10 47
18 32 29 58 48
37 49 26 54 56
21 24 28 32 28
43 12 23 29 61
16 42 40 32 26
48 36 39 22 40
20 63 54 30 17
18 30 23 26 36
47 19 25 38 35
Frequency Distribution Table
▪ The frequency distribution table is one easy method of organizing the data. It is a
grouping of all observations into intervals or classes together with a count of the
number of observations that fall in each interval or class.
▪ Data in Table 3 are called raw data, and in such form they are difficult to read and
analyze. The frequency distribution table presents the data in a more compact,
usable manner and is a meaningful guide for statistical analysis. However, this
process brings about some loss of detail.
Steps in Constructing a Frequency Distribution Table
1. From the data set, identify the highest value and lowest value. Compute the range R
as
R = highest value – lowest value
2. Estimate the number of classes, k, as
k = √n, where n is the total number of observations
Note: If the result is fractional, round it up to the next higher integer,
NOT to the usual nearest integer. Rounding off to the nearest integer will often
yield a number of intervals that cannot accommodate all the observations.
3. Estimate the width c of the interval by dividing the range R by the number of classes
k, that is,
c = R / k
Round off the estimated c to the same number of decimal places as the original
data set.
No. of decimal places of the raw data    Precision
0                                        1
1                                        0.1
2                                        0.01
3                                        0.001
4. List the lower and upper class limits of the first interval. This interval should contain
the smallest observation in the data set. The starting lower limit could be the lowest
value. List all the class limits by adding the class width to the limits of the previous
interval. The last class should contain the largest observation in the data set.
5. Tally the frequencies for each class.
6. Compute the class marks and the class boundaries.
▪ The class mark is the midpoint of an interval. That is,
CM = (LL + UL) / 2, where CM – class mark
LL – lower limit
UL – upper limit
▪ It is important to know the unit of precision of the raw data to find the true
class boundaries (e.g. the midterm exam scores are precise to the ones unit,
a value reported as 1.2 m is precise to the tenths unit and a GPA of 1.77 is
precise to the hundredths unit).
Lower true class boundary, LTCB, is given as
LTCB = LL − 0.5 * precision
Upper true class boundary, UTCB, is given as
UTCB = UL + 0.5 * precision
7. We may add some columns to obtain useful information about the distributional
characteristics of the data. Among these are:
▪ Relative frequency ( RF ) – frequency of a class expressed in proportion to or
percentage of the total number of observations. That is,
RF = fi / n, where fi is the frequency in each interval and
n is the total number of observations
▪ Cumulative frequency (CF). This is the accumulated frequency of a class.
There are two types:
✓ Less than CF (<CF) of a class
• It is the number of observations whose values are less than or
equal to the upper limit of the class interval if the data are discrete
• It is the number of observations whose values are less than the
upper true class boundary if the data are continuous.
✓ Greater than CF (>CF) of a class
• It is the number of observations whose values are greater than or
equal to the lower limit of the class interval if the data is discrete
• It is the number of observations whose values are greater than the
lower true class boundary if the data is continuous.
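Steps 1 to 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name class_limits is hypothetical, the width c is rounded half up to the precision of the data, and the Decimal type is used so that sums such as 39.3 + 2.5 − 0.1 stay exact. The data are the Stat 211 scores of Table 3.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

def class_limits(values, precision):
    """Steps 1 to 4: return the range R, number of classes k, width c and class limits."""
    data = [Decimal(str(v)) for v in values]
    p = Decimal(str(precision))
    lowest, highest = min(data), max(data)
    R = highest - lowest                             # Step 1: range
    k = math.ceil(math.sqrt(len(data)))              # Step 2: always round UP
    c = (R / k).quantize(p, rounding=ROUND_HALF_UP)  # Step 3: width at the data precision
    # Step 4: start at the lowest value; each upper limit is LL + c - p.
    limits = [(lowest + i * c, lowest + i * c + c - p) for i in range(k)]
    return R, k, c, limits

scores = [31, 28, 15, 10, 47, 18, 32, 29, 58, 48,
          37, 49, 26, 54, 56, 21, 24, 28, 32, 28,
          43, 12, 23, 29, 61, 16, 42, 40, 32, 26,
          48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
          18, 30, 23, 26, 36, 47, 19, 25, 38, 35]

R, k, c, limits = class_limits(scores, 1)
print(R, k, c)
for LL, UL in limits:
    print(f"{LL}-{UL}")
```

After running it, it is still worth checking that the last upper limit is at least as large as the highest observation, as Step 4 requires.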
Let us illustrate the steps listed above for the Stat 211 midterm scores in Table 3.
Step 1: Compute the range: R = 63 − 10 = 53
Step 2: Estimate the number of classes: k = √50 ≈ 7.07, rounded up to 8
Step 3: Estimate c, the width of the interval: c = 53 / 8 = 6.625 ≈ 7
Note: c is rounded off to a whole number since our data values have zero decimal place.
Hence, the precision is one (1). In general, the class width (c) will be rounded off
depending on the precision of the given data set.
Step 4: List the lower and upper limits of the first interval. You may use the lowest value as
the first lower limit (LL1) and the first upper limit is UL1 = LL1 + c − p , where p is
the precision of the data. So, for this example LL1 = 10 and UL1 = 10 + 7 − 1 = 16
List the succeeding intervals by adding the class width c to the previous limits.
Since c = 7 , then
LL2 = LL1 + c = 10 + 7 = 17 and UL2 = UL1 + c = 16 + 7 = 23
LL3 = LL2 + c = 17 + 7 = 24 and UL3 = UL2 + c = 23 + 7 = 30
LL4 = LL3 + c = 24 + 7 = 31 and UL4 = UL3 + c = 30 + 7 = 37
LL5 = LL4 + c = 31 + 7 = 38 and UL5 = UL4 + c = 37 + 7 = 44
LL6 = LL5 + c = 38 + 7 = 45 and UL6 = UL5 + c = 44 + 7 = 51
LL7 = LL6 + c = 45 + 7 = 52 and UL7 = UL6 + c = 51 + 7 = 58
LL8 = LL7 + c = 52 + 7 = 59 and UL8 = UL7 + c = 58 + 7 = 65
Step 5: Tally the frequencies.
Step 6: Compute the class mark (CM) of each interval, that is,
CM1 = (10 + 16) / 2 = 13
CM2 = (17 + 23) / 2 = 20
CM3 = (24 + 30) / 2 = 27
CM4 = (31 + 37) / 2 = 34
CM5 = (38 + 44) / 2 = 41
CM6 = (45 + 51) / 2 = 48
CM7 = (52 + 58) / 2 = 55
CM8 = (59 + 65) / 2 = 62
Warning: Do not round off the class marks !
Step 7: Compute the relative frequency (RF), that is,
RF1 (%) = (4 / 50) × 100% = 8%
RF2 (%) = (9 / 50) × 100% = 18%
RF3 (%) = (12 / 50) × 100% = 24%
RF4 (%) = (8 / 50) × 100% = 16%
RF5 (%) = (6 / 50) × 100% = 12%
RF6 (%) = (5 / 50) × 100% = 10%
RF7 (%) = (4 / 50) × 100% = 8%
RF8 (%) = (2 / 50) × 100% = 4%
Step 8: Compute the <CF and >CF.
Table 4. The Frequency Distribution of the Midterm Scores of Stat 211
Class Intervals Tally Frequency Class mark RF(%) <CF >CF
10-16 IIII 4 13 8 4 50
17-23 IIII-IIII 9 20 18 13 46
24-30 IIII-IIII-II 12 27 24 25 37
31-37 IIII-III 8 34 16 33 25
38-44 IIII-I 6 41 12 39 17
45-51 IIII 5 48 10 44 11
52-58 IIII 4 55 8 48 6
59-65 II 2 62 4 50 2
NOTE: If the data are continuous, the true class boundaries may be added.
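The frequency, class mark, RF, <CF and >CF columns of the midterm score table can be checked with a short Python sketch. It assumes the class limits found in Step 4 (10-16 up to 59-65) and the raw scores of Table 3; the variable names are illustrative.

```python
from itertools import accumulate

scores = [31, 28, 15, 10, 47, 18, 32, 29, 58, 48,
          37, 49, 26, 54, 56, 21, 24, 28, 32, 28,
          43, 12, 23, 29, 61, 16, 42, 40, 32, 26,
          48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
          18, 30, 23, 26, 36, 47, 19, 25, 38, 35]
limits = [(10, 16), (17, 23), (24, 30), (31, 37),
          (38, 44), (45, 51), (52, 58), (59, 65)]

n = len(scores)
# Step 5: count the observations falling in each class.
freq = [sum(1 for x in scores if lo <= x <= hi) for lo, hi in limits]
# Step 6: class mark = midpoint of the class limits.
class_marks = [(lo + hi) / 2 for lo, hi in limits]
# Step 7: relative frequency in percent.
rf_percent = [100 * f / n for f in freq]
# Step 8: accumulate from the top for <CF and from the bottom for >CF.
less_cf = list(accumulate(freq))
greater_cf = list(accumulate(reversed(freq)))[::-1]

for row in zip(limits, freq, class_marks, rf_percent, less_cf, greater_cf):
    print(row)
```

The last <CF and the first >CF must both equal n, which gives a simple check on the tally.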
Example 2
Table 4. Weights in kilograms of 50 freshmen students.
57.5 39.3 52.4 52.5 43.4 50.4 53.9 42.4 58.4 55.4
58.3 50.6 53.8 50.9 49.8 45.4 49.0 51.4 44.4 54.9
49.9 57.0 55.0 59.0 45.3 50.0 45.2 51.7 54.3 58.3
53.0 49.4 52.1 51.4 41.3 52.4 40.6 44.8 49.4 45.4
43.3 47.7 47.8 43.5 51.3 55.8 55.8 46.4 54.3 41.4
Step 1: Compute the range: R = 59 − 39.3 = 19.7
Step 2: Estimate the number of classes: k = √50 ≈ 7.07, rounded up to 8
Step 3: Estimate c, the width of the interval: c = 19.7 / 8 ≈ 2.46 ≈ 2.5
Note: c is rounded off to one decimal place since our data set has one decimal place.
Hence, the precision is 0.1.
Step 4: List the lower limits (LL) and upper limits (UL) of the first interval. So, for this
example LL1 = 39.3 and UL1 = 39.3 + 2.5 − 0.1 = 41.7
Step 5: Tally the frequencies.
Step 6: Compute the Classmark (CM)
Warning: Do not round off the class marks !
Since the data are continuous, we can compute the True Class Boundaries, that is,
LTCB1 = LL1 − 0.5 * p = 39.3 − (0.5 * 0.1) = 39.25
and
UTCB1 = UL1 + 0.5 * p = 41.7 + (0.5 * 0.1) = 41.75
Step 7: Compute the relative frequency
Step 8: Compute the <CF and >CF.
Table 5. Frequency Distribution of the Weights of 50 freshmen students.
Class Intervals Frequency RF(%) True Class Boundaries <CF >CF Class mark
39.3 – 41.7 4 8 39.25 – 41.75 4 50 40.5
41.8 – 44.2 4 8 41.75 – 44.25 8 46 43.0
44.3 – 46.7 7 14 44.25 – 46.75 15 42 45.5
46.8 – 49.2 3 6 46.75 – 49.25 18 35 48.0
49.3 – 51.7 12 24 49.25 – 51.75 30 32 50.5
51.8 – 54.2 7 14 51.75 – 54.25 37 20 53.0
54.3 – 56.7 7 14 54.25 – 56.75 44 13 55.5
56.8 – 59.2 6 12 56.75 – 59.25 50 6 58.0
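For data with decimal places, the class limits, true class boundaries and frequencies can be sketched in Python as below. The Decimal type is an assumption of this sketch, chosen because ordinary floating-point arithmetic may not reproduce limits such as 41.7 exactly; c = 2.5 and p = 0.1 are the values found in Steps 3 and 4 of Example 2.

```python
from decimal import Decimal

weights = """
57.5 39.3 52.4 52.5 43.4 50.4 53.9 42.4 58.4 55.4
58.3 50.6 53.8 50.9 49.8 45.4 49.0 51.4 44.4 54.9
49.9 57.0 55.0 59.0 45.3 50.0 45.2 51.7 54.3 58.3
53.0 49.4 52.1 51.4 41.3 52.4 40.6 44.8 49.4 45.4
43.3 47.7 47.8 43.5 51.3 55.8 55.8 46.4 54.3 41.4
""".split()
data = [Decimal(w) for w in weights]

p = Decimal("0.1")   # precision of the raw data
c = Decimal("2.5")   # class width from Step 3
first_LL = min(data)  # lowest value, 39.3

rows = []
for i in range(8):
    LL = first_LL + i * c
    UL = LL + c - p
    LTCB = LL - p / 2   # lower true class boundary
    UTCB = UL + p / 2   # upper true class boundary
    f = sum(1 for x in data if LL <= x <= UL)
    rows.append((LL, UL, LTCB, UTCB, f))

for LL, UL, LTCB, UTCB, f in rows:
    print(f"{LL} - {UL}   {LTCB} - {UTCB}   {f}")
```

Each upper true class boundary coincides with the lower true class boundary of the next class, so the classes leave no gaps along the number line.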
EXERCISE 3.1a
Name:____________________________ Score:____________
Yr/Crs/Sec:_______________________ Date:_____________
Instruction: Using the data below, construct the Frequency Distribution Table (FDT). Show
your complete solution and fill in the correct entries in the table given. Provide
an appropriate title.
Table 3. Weights in kilograms of 35 Middle-Aged Municipal Workers
in Antipas Municipal Hall
57.1 63.2 55.1 54.2 70.7 54.1 55.6
62.4 50.6 50.5 53.3 60.9 72.5 45.5
55.5 47.8 53.8 65.6 66.2 45.3 50.1
48.2 53.2 65.9 54.7 54.4 79.1 55.5
57.6 57.5 69.4 57.1 52.1 51.2 65.3
Table 3.1
Class Intervals   Tally   Frequency   Rf(%)   True Class Boundaries   <CF   >CF   Class mark
SOLUTION:
III. Graphical Presentation
Graphs bring out essential features not immediately seen in tabular presentations,
like the shape of the distribution and the “spread” of the data. Data may be presented
graphically in a frequency histogram, a frequency polygon, a frequency ogive, a
stem-and-leaf plot, a dot plot or a box-and-whiskers plot.
1. Frequency Histogram
A histogram is a graph that closely resembles a bar graph. It is constructed
from a frequency table, hence the name "frequency histogram". The intervals from the
table are placed on the x-axis and the frequencies are represented on the y-axis.
Each frequency is depicted by the height of a rectangular
bar located directly above the corresponding interval. The basic difference is that a bar
chart uses class limits for the horizontal axis while the histogram employs the true class
boundaries. Using the class boundaries eliminates the spaces between the rectangles,
giving the histogram a solid appearance.
Fig.1. Histogram of the weights of 50 freshmen students (horizontal axis: true class boundaries, 39.25 to 59.25)
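As a rough illustration of how the histogram is built, the Python sketch below draws a text version of it from the true class boundaries and frequencies of the weights table. The bars are drawn sideways with the "#" character, which is a simplification of this sketch; a real histogram would use vertical, touching rectangles.

```python
# True class boundaries and frequencies copied from the weights table.
boundaries = [39.25, 41.75, 44.25, 46.75, 49.25, 51.75, 54.25, 56.75, 59.25]
frequencies = [4, 4, 7, 3, 12, 7, 7, 6]

lines = []
for i, f in enumerate(frequencies):
    # Each bar spans one pair of adjacent true class boundaries.
    label = f"{boundaries[i]:.2f}-{boundaries[i + 1]:.2f}"
    # The bar length is the class frequency.
    lines.append(f"{label} | {'#' * f} ({f})")

print("\n".join(lines))
```

Because adjacent bars share a boundary, there are always one more boundary than there are bars.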
2. Frequency Polygon
A frequency polygon is a variation of a histogram. It is constructed by plotting the
class marks against the frequencies. The set of (x, y) points formed by the class marks and
their corresponding frequencies are connected by straight lines. To “close” the polygon,
an additional class mark with zero frequency is added at the beginning and at the end of
the distribution (the class width is deducted from the first class mark and added to the
last class mark).
Fig.2. Frequency polygon of the weights of 50 freshmen students (horizontal axis: class marks, 38.0 to 60.5)
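The points of the frequency polygon, including the two closing points, can be listed with a short Python sketch. It assumes the class marks and frequencies of the weights table and the class width c = 2.5.

```python
# Class marks and frequencies copied from the weights table.
class_marks = [40.5, 43.0, 45.5, 48.0, 50.5, 53.0, 55.5, 58.0]
frequencies = [4, 4, 7, 3, 12, 7, 7, 6]
c = 2.5  # class width

# Add one extra class mark with zero frequency at each end to close the polygon.
xs = [class_marks[0] - c] + class_marks + [class_marks[-1] + c]
ys = [0] + frequencies + [0]
points = list(zip(xs, ys))

for x, y in points:
    print(x, y)
```

Joining these ten points with straight lines, in order, gives the polygon.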
3. Frequency Ogive
A cumulative frequency distribution can be represented graphically by a
frequency ogive. An ogive is obtained by plotting the class boundaries on the horizontal
scale and the cumulative frequency (or the cumulative relative frequency) in the vertical
scale. If the ogive is constructed using the cumulative relative frequency, it is a good way
to estimate the percentiles. Some important percentiles are the median (50th percentile),
lower quartile (25th percentile) and upper quartile (75th percentile). The 25th percentile
is the value below which 25% of the data falls.
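Reading a percentile off the less than ogive amounts to interpolating along the straight segments that join its points. The Python sketch below does this numerically for the weights data; the function name percentile_from_ogive and the use of linear interpolation are assumptions of this sketch. Cumulative frequencies are used instead of cumulative relative frequencies, which gives the same result.

```python
def percentile_from_ogive(boundaries, less_cf, percent):
    """Estimate a percentile by linear interpolation along the less-than ogive."""
    n = less_cf[-1]
    target = percent / 100 * n
    # Ogive points: (lowest boundary, 0), then (upper boundary, <CF) for each class.
    cum = [0] + list(less_cf)
    for i in range(1, len(boundaries)):
        if cum[i] >= target and cum[i] > cum[i - 1]:
            fraction = (target - cum[i - 1]) / (cum[i] - cum[i - 1])
            return boundaries[i - 1] + fraction * (boundaries[i] - boundaries[i - 1])
    return boundaries[-1]

# True class boundaries and <CF values copied from the weights table.
boundaries = [39.25, 41.75, 44.25, 46.75, 49.25, 51.75, 54.25, 56.75, 59.25]
less_cf = [4, 8, 15, 18, 30, 37, 44, 50]

q1 = percentile_from_ogive(boundaries, less_cf, 25)
median = percentile_from_ogive(boundaries, less_cf, 50)
q3 = percentile_from_ogive(boundaries, less_cf, 75)
print(round(q1, 2), round(median, 2), round(q3, 2))
```

For these data the estimated median is about 50.7 kg. The values are estimates from grouped data and need not equal the percentiles computed from the raw observations.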
Fig.3. Less than ogive of the weights of 50 freshmen students
Fig.4. Greater than ogive of the weights of 50 freshmen students
4. Stem-and-Leaf Plot
A special plot that may be presented for a quantitative variable is the stem-and-
leaf plot. It is a quick way to illustrate the shape of a distribution while including the
actual numerical values in the graph. It is best recommended for small numbers of
observations that are all greater than zero.
Steps in the construction of a stem-and-leaf plot:
1) Arrange the observations in increasing order.
2) Identify the leaf (the units digit) and the stem (all other digits except the last or
units digit) in each of the observations.
3) List the stems vertically in increasing order from top to bottom.
4) Draw a vertical line to the right of the stems.
5) For each stem, write its leaves to the right of the vertical line in increasing
order.
Example: Construct the stem-and-leaf plot of the data given below:
21, 23, 23, 34, 20, 18, 34, 43, 56, 18, 28, 19, 30, 42, 27, 40, 53, 45
In increasing order:
18, 18, 19, 20, 21, 23, 23, 27, 28, 30, 34, 34, 40, 42, 43, 45, 53, 56
Therefore the stem-and-leaf plot looks like this:
1 | 8 8 9
2 | 0 1 3 3 7 8
3 | 0 4 4
4 | 0 2 3 5
5 | 3 6
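The five construction steps can be sketched in Python as follows, using the same 18 observations. The sketch assumes whole-number data, so that the leaf is the units digit and the stem is everything to its left; the "|" character stands for the vertical line of step 4.

```python
from collections import defaultdict

data = [21, 23, 23, 34, 20, 18, 34, 43, 56, 18, 28, 19, 30, 42, 27, 40, 53, 45]

stems = defaultdict(list)
for x in sorted(data):           # step 1: increasing order
    stem, leaf = divmod(x, 10)   # step 2: split off the units digit
    stems[stem].append(leaf)     # leaves arrive already sorted (step 5)

lines = []
for stem in range(min(stems), max(stems) + 1):   # step 3: stems top to bottom
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem} | {leaves}".rstrip())  # step 4: vertical line

print("\n".join(lines))
```

Stems between the smallest and the largest are listed even when they have no leaves, so that gaps in the data remain visible.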
5. Dot plots
Dot plots are one of the simplest plots available, and are suitable for small to
moderate-sized data sets. A dot plot resembles a stem-and-leaf plot lying on its back,
with dots replacing the values on the leaves. It displays the shape, location, and spread
of the distribution. They are useful for highlighting clusters and gaps, as well as outliers.
Their other advantage is the conservation of numerical information.
Example 1: The number of various kinds of snakes found in a zoo is shown in the
dot plot. What is the total number of snakes in the zoo?
There are four Black Cobras, five Pythons, two Anacondas, and so on.
Rattlesnakes are the most frequent snakes in the zoo and Anacondas are the least
frequent. There are a total of 29 snakes.
Example 2. Construct a dot plot of the following data values: 5, 5, 5, 5, 5, 5, 10, 10, 10,
10, 10, 15, 15, 15, 20, 20, 20, 20, 20, 20, 20, 25, 25, 25, 25, 25, 25, 25, 25, 25, 30, 30,
30, 30, 30, 30, 30, 35, 35, 35, 35, 40, 40, 40, 40, 40, 40, 40, 40, 40, 45, 45, 45, 50, 50.
The dot plot shows that the numbers 25 and 40 are the most frequent in the
data set and 50 is the least frequent. The minimum and maximum values are 5
and 50, respectively.
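The counts behind the dot plot of Example 2 can be verified with a short Python sketch that also prints a text version of the plot. The "*" character stands in for a dot and the column width of 3 is an arbitrary choice of this sketch.

```python
from collections import Counter

values = [5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 15, 15, 15,
          20, 20, 20, 20, 20, 20, 20,
          25, 25, 25, 25, 25, 25, 25, 25, 25,
          30, 30, 30, 30, 30, 30, 30, 35, 35, 35, 35,
          40, 40, 40, 40, 40, 40, 40, 40, 40, 45, 45, 45, 50, 50]

counts = Counter(values)
levels = sorted(counts)
height = max(counts.values())

# The values with the tallest and the shortest stacks of dots.
most_frequent = [v for v in levels if counts[v] == height]
least_frequent = [v for v in levels if counts[v] == min(counts.values())]

# Build the plot from the top row down; a dot appears where the stack is tall enough.
width = 3
rows = []
for level in range(height, 0, -1):
    row = "".join(("*" if counts[v] >= level else " ").center(width) for v in levels)
    rows.append(row.rstrip())
rows.append("".join(str(v).center(width) for v in levels).rstrip())

print("\n".join(rows))
print(most_frequent, least_frequent, min(values), max(values))
```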
EXERCISES 3.1b
Name:____________________________ Score:____________
Yr/Crs/Sec:_______________________ Date:_____________
1. Using the FDT you have constructed in Exercise 3.1a, construct the frequency histogram
and frequency polygon.
2. The data below are the diastolic heart rates while the respondents are lying on a folding
bed. Values are in beats/min. Make a dot plot and a stem-and-leaf plot.
62 85 92 85 88 71
73 82 84 89 93 75
81 72 97 78 90 87
78 74 61 66 83 68
67 83 75 70 86 72