MMW Module 4
Data Management
Learning Outcomes
Lesson 1. Introduction to Data Management
Statistics plays a vital role in our society today, especially during the
COVID-19 pandemic. Everyone should be included, counted and accounted for.
No one should be left behind. Because statistics is useful in almost all
fields of endeavor, some cautions should also be considered. Impressive figures
can be blown out of proportion to their real or imagined importance.
Unscrupulous minds with vested interests make improper or unethical use of
different statistical methods. Questionable and even conflicting claims backed
up with “statistics” can be accepted as true, which leads one to believe that
anything can be proven statistically. Moreover, faulty research may be slanted
to produce a particular outcome, that is, statistical analyses are chosen to
produce such outcomes.
For these reasons, it is most important that users of statistics and
researchers clearly understand the statistical tools or techniques being
used in their research. Thus, in this module, careful attention will be given to
the role of statistics as a tool in research.
Statistics is divided into two (2) categories or branches, called
descriptive and inferential statistics. We can differentiate the two using the
definition of statistics.
Collecting, organizing and presenting data: DESCRIPTIVE STATISTICS
Analyzing and interpreting data: INFERENTIAL STATISTICS
PARAMETER
A parameter is a value, usually a numerical value, that describes a
population. It may be obtained from a single measurement, or it may be derived
from a set of measurements from the population. (μ: population mean;
σ: population standard deviation)
STATISTIC
A statistic is a value, usually a numerical value, that describes a sample.
It may be obtained from a single measurement, or it may be derived from a set
of measurements from the sample. (x̄: sample mean; s: sample standard
deviation)
VARIABLE
A variable is any information that differs from one member to another in
a population or sample. It is a characteristic of interest for the elements. The
weight (kg) in Table 1.1 served as the variable.
Table 1.1 Weights of Randomly selected Grade IV pupils in AES, 1st Quarter of 2020
Section Weights (Kg)
IV - 1 50 41 36 34 54 60 51 37
IV - 2 22 39 42 42 45 38 38 40
IV - 3 38 28 32 44 42 47 37 28
IV - 4 27 27 40 41 39 32 36 24
IV - 5 40 39 33 33 27 30 31 45
Each pupil included in the data set is called an element, an entity on
which data are collected. The measurements collected on each variable for every
element in a study provide the data. The set of measurements obtained for a
particular element is called an observation.
In Table 1.1, the measurements recorded for the first section (IV-1) are
50, 41, 36, 34, 54, 60, 51 and 37; those for the second section (IV-2) are
22, 39, 42, 42, 45, 38, 38 and 40, and so on. Because each element contributes
one observation, a data set with 40 elements contains 40 observations.
CONSTANT
A constant is a piece of information about the population or sample that is
true for all members. The value of pi, temperature conversion factors (Celsius
to Fahrenheit and vice versa), the number of days in a week, and fixed
equivalences between units of measurement, e.g. 12 inches = 1 foot, are some
examples of constants.
1. Qualitative Data
Qualitative data describes qualities or characteristics. It is mostly
non-numerical and descriptive in nature. It often, but not always, captures
emotions, feelings and subjective perceptions of something.
Qualitative method of research is characterized by the following:
2. Quantitative Data
Quantitative data deals with things that are measurable and can be
expressed in numbers and figures. It is usually expressed in numerical form
and can be mathematically computed. Quantitative data can be collected
using:
● Experiments/clinical trials
● Observing and recording well-defined objects, such as the number of
cars which participated in a motorcade.
● Administering surveys with closed-ended questions.
● Paper-pencil questionnaires
Example:
1. Number of siblings
2. Height and weight
3. Temperature in degree Celsius
For example, if you would describe a house, your description can either be
qualitative or quantitative. Here are some descriptions:
Qualitative                            Quantitative
The house is located in Baguio City.   The house is 8.5 meters high.
The house is mostly made of cement.    The house has 3 bedrooms.
The color of the house is green.       The house’s floor area is 125 square meters.
The door is made of oak.
1. Nominal data
This level of data is categorical in nature; none is greater than or less than
the other, and it is not in any particular order. Also, the categories are
exclusive and exhaustive, meaning, the response can neither be ‘both’ nor
‘neither’.
2. Ordinal data
Ordinal data must also be exclusive and exhaustive, but the difference is
that the responses are ranked, that is, they have an order. Here, you can say
that one response is higher or better than the other.
3. Interval
Here, intervals of equal length signify equal differences in the data.
Differences make sense but ratios do not. An example is temperature: 30 °C
is not twice as hot as 15 °C. Also, there is no ‘true zero’ starting point.
This means that zero does not signify the absence of the measurement. Zero
degrees Celsius does not mean that there is no temperature.
Example: Temperature
4. Ratio
At this level, both differences and ratios are meaningful. For example, 4
liters of water is twice as much as 2 liters of water. There also exists a
‘true zero’ starting point, in which zero means nothing or the absence of the
measurement. Zero liters of water means there is no water.
Example: Weight, Height, Number of children
[Diagram: Data classified as Qualitative or Quantitative]
Data can also be classified according to who collected the data. It can be
primary data or secondary data.
Primary data – These are data which were collected first hand. They are more
authentic, reliable and objective compared with secondary data. Primary data can
be obtained through experiments, surveys, questionnaires, interviews and
observations.
Secondary data – These are data collected from sources already published in any
form. The review of literature in research is based on secondary sources.
Secondary data are valuable because you do not need to go through the hassle
of collecting data that are already available and published. This saves time,
effort and money on the part of the researcher. Secondary data can be collected
from books, records, magazines, research articles, newspapers, biographies,
databases, etc.
Data Presentation
1. Textual Presentation
In textual or descriptive presentation, the data are presented using text
or paragraphs. This is usually used when the amount of data is not too large.
For example:
The population of Region I as of May 1, 2020 is 5,301,139 based on the
2020 Census of Population and Housing (2020 CPH). This accounts for about
4.86 percent of the Philippine population in 2020. The 2020 population of the
region is higher by 275,011 from the population of 5.03 million in 2015, and
552,767 more than the population of 4.75 million in 2010. Moreover, it is higher
by 1,100,661 compared with the population of 4.20 million in 2000.
(psa.gov.ph)
2. Tabular Presentation
In tabular presentation, data are presented in tables, which can hold even
a large amount of data in a form that is engaging and easier to read. The data
are arranged in rows (horizontal) and columns (vertical). Tabular presentation
avoids unnecessary details and repetitions of data. It reveals patterns which
cannot be seen when the data are presented in textual form.
3. Diagrammatical or Graphical Presentation
This type of presentation uses graphs or diagrams such as bar graph, pie
graph, line graph and scatter diagram. Diagrams give a bird’s eye view of the
data and can be easily understood just by looking at the graph.
Some of the charts or graphs which are commonly used are the following:
1. Pie chart
The following pie graph illustrates the population of Region I per province
for the year 2020 using the data in Table 2.
It can be seen in the graph that Pangasinan constitutes 60% of the total
population of Region I.
2. Bar graph
The following bar graph, shown in Figure 2, compares the populations of
the provinces in Region I in 2000, 2010, 2015 and 2020.
The bar graph shows that Pangasinan dominates the population of
Region I from year 2000 to 2020. The province with the smallest population
is Ilocos Norte.
3. Column chart
4. Line graph
The line graph illustrates that the population of the provinces from
Region I continuously increased from year 2000 to 2020.
5. Scatterplot
The scatter plot shows an almost perfect linear relationship between the
year and the population of the Philippines.
Looking at the given examples, diagrams are mostly used as visual aids. They
cannot be considered alternatives to numerical data. Diagrams and graphs are
not as accurate as tabular data. Only tabular data can be used for further analysis.
MODULE IV – Data Management
Learning Activity 1 – Introduction to Data Management
Name: ______________________________________________________
Course, Year and Section: _____________________________________
Classify the following data as qualitative or quantitative, and as
nominal, ordinal, interval or ratio.
Lesson 2. Measures of Central Tendency
A measure of central tendency is a single value that attempts to describe
a set of data by identifying the central position within that set of data. As such,
measures of central tendency are sometimes called measures of central
location. You can think of it as the tendency of data to cluster around a middle
value.
The measures of central tendency are the mean, median and mode. Each
of these measures calculates the central point using a different method.
$$ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} $$
This formula can be written in summation notation,
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} $$
Example: The scores obtained by the students during a 30-point math quiz
are as follows:
18, 23, 28, 27, 29, 27, 23, 20, 18, 10, 14, 25, 29, 30, 24, 15, 10, 24, 19, 20
What is their average score?
Solution: The average score can be determined by solving for the mean.
$$ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{18 + 23 + 28 + 27 + 29 + 27 + 23 + 20 + 18 + 10 + 14 + 25 + 29 + 30 + 24 + 15 + 10 + 24 + 19 + 20}{20} = \frac{433}{20} = 21.65 $$
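The computation above can be checked with a few lines of Python. This is only a minimal sketch; the names `scores` and `mean` are chosen here for illustration and are not part of the module.

```python
# Scores from the 30-point math quiz in the example above.
scores = [18, 23, 28, 27, 29, 27, 23, 20, 18, 10,
          14, 25, 29, 30, 24, 15, 10, 24, 19, 20]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


total = sum(scores)      # numerator of the formula (sum of all x_i)
n = len(scores)          # denominator of the formula (number of scores)
average = mean(scores)

print("sum =", total, "n =", n, "mean =", average)
```

The function simply mirrors the summation formula: add every score, then divide by the number of scores.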
The advantages of using the mean are that it is simple to understand and
easy to calculate. It also takes into account all the values in the dataset. The
only disadvantage is that it is easily affected by outliers. Outliers are
values that are unusual compared to the rest of the data by being especially
small or large in value.
Example: Below are the salaries given to the staff of a certain company.
Staff 1 2 3 4 5 6 7 8 9 10
Salary 21K 12K 10K 13K 14K 15K 12K 18K 90K 96K
Notice that most of the salaries only range from 10K to 21K. There are
two entries which are extremely high, 90K and 96K. The resulting mean salary
of these ten staff members is 30.1K, which is higher than most of the given
salaries. This is because the mean is being pulled up by the two large values.
In this situation, we might need to use another measure of central tendency.
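To see how strongly the two large salaries pull the mean, the short sketch below compares the mean with the median and with the mean of the eight ordinary salaries. It uses Python's standard `statistics` module; the variable names are illustrative assumptions.

```python
from statistics import mean, median

# Salaries of the ten staff members, in thousands (K), from the table above.
salaries_k = [21, 12, 10, 13, 14, 15, 12, 18, 90, 96]

mean_salary = mean(salaries_k)        # pulled upward by 90K and 96K
median_salary = median(salaries_k)    # middle of the ordered salaries

# Mean of the salaries after setting aside the two extreme values.
ordinary = [s for s in salaries_k if s < 90]
mean_without_outliers = mean(ordinary)

print("mean:", mean_salary)
print("median:", median_salary)
print("mean without the two outliers:", mean_without_outliers)
```

The median and the mean without the outliers both sit inside the 10K to 21K range where most salaries lie, while the overall mean does not.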
$$ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} $$
That is, take the sum of the products of each data value and its weight,
then divide by the sum of all the weights.
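The weighted mean formula can be sketched in Python as follows. The grades and unit weights below are hypothetical values invented for this illustration; they do not come from the module.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    numerator = sum(w * x for x, w in zip(values, weights))
    denominator = sum(weights)
    return numerator / denominator


# Hypothetical example: three course grades weighted by their units.
grades = [85, 90, 78]
units = [3, 2, 5]

result = weighted_mean(grades, units)
print("weighted mean:", result)
```

When all weights are equal, this reduces to the ordinary mean.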
2. MEDIAN
The median is the middle value. It is the value that splits the data in
half. To find the median, first arrange the dataset in ascending or
descending order.
If the number of items in the dataset is odd, find the middle value,
which has an equal number of data values below and above it.
For example, find the median of the following scores of students during a
math quiz:
12 25 30 13 17 28 27 23 24 11 21
Arrange the scores in ascending order and find the middle value,
11 12 13 17 21 23 24 25 27 28 30
The median is 23 because there are five scores before it and five scores
after it.
If the number of items in the dataset is even, find the two middle values
and take their average. For example, find the median of the following data:
12 25 30 13 17 28 27 23 24 11 21 21
First, arrange the scores in ascending order and find the two middle values,
11 12 13 17 21 21 23 24 25 27 28 30
$$ \frac{21 + 23}{2} = 22 $$
Therefore, the median of the dataset is 22, which means that there are
six scores lower than 22 and six which are higher.
The median can be used for data with a skewed distribution, continuous data
and ordinal data.
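The two cases (odd and even number of items) can be sketched in one small Python function. The function name and variable names are illustrative assumptions; the data are the two quiz examples above.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # step 1: arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_scores = [12, 25, 30, 13, 17, 28, 27, 23, 24, 11, 21]
even_scores = [12, 25, 30, 13, 17, 28, 27, 23, 24, 11, 21, 21]

median_odd = median(odd_scores)
median_even = median(even_scores)
print(median_odd, median_even)
```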
3. MODE
The mode is the score that appears most frequently in the dataset. If the
data have multiple values that are tied for occurring the most frequently, the
data have a multimodal distribution. If no value repeats, the data do not have
a mode.
For example, consumers are asked to rate a certain restaurant
according to its overall service. They are asked to rate it from 1 to 5, with
5 as the highest. Here are their responses:
5, 5, 3, 4, 3, 4, 3, 4, 5, 2, 3, 4, 4, 4, 2, 3, 4, 5, 4, 4, 5,
3, 3, 4, 3, 4, 4, 4, 3, 3, 4, 4, 5, 2, 5, 4, 4, 3, 4, 3, 4, 5
Rating Frequency
5 8
4 19
3 12
2 3
1 0
Since 4 has the highest frequency, the mode is 4.
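The frequency table can be built directly from the listed responses. The sketch below uses `collections.Counter` from the Python standard library; the variable names are assumptions made for this illustration.

```python
from collections import Counter

# The restaurant ratings listed above (two rows of responses).
ratings = [5, 5, 3, 4, 3, 4, 3, 4, 5, 2, 3, 4, 4, 4, 2, 3, 4, 5, 4, 4, 5,
           3, 3, 4, 3, 4, 4, 4, 3, 3, 4, 4, 5, 2, 5, 4, 4, 3, 4, 3, 4, 5]

frequency = Counter(ratings)              # rating -> number of occurrences

# Print the table from the highest rating to the lowest.
for rating in range(5, 0, -1):
    print(rating, frequency[rating])      # a missing rating counts as 0

# All ratings tied for the highest frequency (one value here, more if multimodal).
highest = max(frequency.values())
modes = sorted(r for r, count in frequency.items() if count == highest)
print("mode(s):", modes)
```

Returning a list of modes keeps the sketch usable for multimodal data, where several values share the highest frequency.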
MODULE IV – Data Management
Learning Activity 2 – Measures of Central Tendency
Name: ______________________________________________________
Course, Year and Section: _____________________________________
1. Twenty students were asked their shoe sizes. The results are given below:
7 7 6 7 4 5½ 6½ 7½ 7½ 11
4 6 4½ 9 8 8 6 5 7½ 7
2. A farmer buys 10 packets of seeds from two different companies. Each pack
contains 20 seeds and he recorded the number of plants which grow from
each pack.
Company A: 20 5 20 20 20 6 20 20 20 8
Company B: 17 18 15 16 18 18 17 15 17 18
a. Find the mean, median and mode for each company’s seeds.
b. Which company does the mode suggest is best?
c. Which company does the mean suggest is best?
b. Only six scores are to be used. Which two scores may be omitted to
leave the value of the median the same?
5. The school has to select one student to join a Math Quiz Bee. Tony and Zoro
took part in six trial quizzes to determine who will represent the school.
The following lists show their scores:
Tony: 29 25 22 28 25 26
Zoro: 33 19 16 32 34 18
Lesson 3. Measures of Dispersion
The measures of central tendency are used to determine a central value
which can represent the entire data set.
But they do not show anything about the scatter or spread of the
data. To illustrate, take the example below:
The numbers of trays sold per week by two egg producers are:
Producer 1 : 75 85 83 92 98 90 100
Producer 2 : 88 89 89 89 89 89 90
The average numbers of trays sold by the two producers are the same,
which is 89, but their data differ. With regard to the spread of the data,
the following are observed:
For Producer 1, the data are more scattered from the mean, while for Producer 2,
almost all the observations are concentrated around the mean.
Group 1: 6 9 11 13 15 21 23 28 29 35
Group 2: 15 16 16 17 18 19 20 21 23 25
The diagrams below show the points obtained by the students in each group.
The arrow points to the position of the mean.
[Figure: dot diagrams of the scores for Group 1 and Group 2]
It is clear that the scores of group 1 have more dispersion than the scores
in group 2. The scores in group 1 are more scattered away from the mean than
those in group 2.
RANGE
In symbols, let HV be the highest value and LV be the lowest value. Then,
𝑅𝑎𝑛𝑔𝑒 = 𝐻𝑉 − 𝐿𝑉
STANDARD DEVIATION
$$ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} $$
VARIANCE
Example:
Find the range, standard deviation and variance of the following scores of students in
a 20-point History quiz.
16 14 20 14 18 10 15 16 18 12 20 15
Solution:
Range: 𝑅𝑎𝑛𝑔𝑒 = 𝐻𝑉 − 𝐿𝑉 = 20 − 10 = 10
The range of the students’ score is 10.
Standard deviation:
𝑥 𝑥2
16 256
14 196
20 400
14 196
18 324
10 100
15 225
16 256
18 324
12 144
20 400
15 225
∑x = 188        ∑x² = 3046
$$ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{3046 - \frac{(188)^2}{12}}{12 - 1}} = 3.0251 $$
Variance: $s^2 = (3.0251)^2 \approx 9.1515$
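The solution above can be reproduced step by step in Python. This is a sketch of the same computational formula for the sample standard deviation; the variable names are illustrative.

```python
from math import sqrt

# Scores in the 20-point History quiz from the example above.
scores = [16, 14, 20, 14, 18, 10, 15, 16, 18, 12, 20, 15]

n = len(scores)
sum_x = sum(scores)                      # sum of x
sum_x2 = sum(x * x for x in scores)      # sum of x squared

data_range = max(scores) - min(scores)   # Range = HV - LV

# Sample variance from the computational formula, then its square root.
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
std_dev = sqrt(variance)

print("range:", data_range)
print("standard deviation:", round(std_dev, 4))
print("variance:", round(variance, 4))
```

Computing the variance first and taking its square root avoids the small rounding loss that comes from squaring an already rounded standard deviation.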
MODULE IV – Data Management
Learning Activity 3 – Measures of Dispersion
Name: ______________________________________________________
Course, Year and Section: _____________________________________
MULTIPLE CHOICE. Choose the best answer from the given choices and write
the letter of your choice before each number.
2.
A. 4M B. 4N C. 4P D. 4S
3. A set of data contains 20 numbers. The sum of the numbers is 284 and the
sum of the squares of the numbers is 4,688. Calculate the standard deviation
of the set of data.
A. 5.274 B. 5.724 C. 32.76 D. 36.27
4. The variance of a set of positive numbers p, (p-5), (p-2), (p-3), and (2p-5) is
5.84. Calculate the value of p.
A. 5 B. 6 C. 7 D. 8
6. Calculate the variance and the standard deviation of the set of the data.
Score 1 2 3 4 5
Frequency 5 6 11 10 8
C. Variance = 8.95, Standard deviation = 2.99
D. Variance = 12.2, Standard deviation = 3.49
7. The stem-and-leaf plot below shows a data set. What is the range of the data?
Stem Leaf
5 0 3 4
6 2 5 6
7 1 2 5 7
8 1 4
Sample: 6 2 means 62
A. 27 B. 30 C. 31 D. 34
10. Which one from these two graphs, (below), has the smallest dispersion?
Lesson 4. Measures of Relative Position
Measures of relative position determine the location of a value relative
to the other values in a data set. The most common measures are percentiles,
quartiles and the standard score, also known as the z-score.
PERCENTILES
1 3 4 6 8 9 10 15 17 20
Since the given data are already arranged from smallest value to largest
value, let us proceed to step 2.
$$ i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right)(10) = 8.5 $$
Step 3, because i is not an integer, round up. The position of the 85th
percentile is the next integer greater than 8.5, the 9th position.
Returning to the data, we see that the 85th percentile is the data value
in the 9th position, or 17.
For the 50th percentile, i = (50/100)(10) = 5.
Because i is an integer, step 3(b) states that the 50th percentile is the
average of the fifth and sixth data values; thus the 50th percentile is
(8+9)/2 = 8.5. Note that the 50th percentile is also the median.
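The percentile procedure described above can be sketched in Python as follows. The function name `percentile` is an assumption for this illustration, and the sketch assumes p is a whole number strictly between 0 and 100. It follows the rule used in this lesson; other textbooks and software may use slightly different percentile rules.

```python
def percentile(values, p):
    """p-th percentile using the position index i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)              # step 1: arrange in ascending order
    n = len(ordered)
    # step 2: i = p * n / 100, kept as whole part and remainder to avoid
    # floating-point comparisons
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # step 3(a): i is not an integer, so round up to position whole + 1
        # (which is index `whole` in zero-based Python lists)
        return ordered[whole]
    # step 3(b): i is an integer, so average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [1, 3, 4, 6, 8, 9, 10, 15, 17, 20]
p85 = percentile(data, 85)
p50 = percentile(data, 50)
print(p85, p50)
```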
QUARTILES
Quartiles are the values that divide a list of numbers into quarters. The three
quartiles are denoted by Q1, Q2 and Q3.
The first quartile, Q1, divides the dataset such that 25% is less than it and
75% is greater. The second quartile, Q2, is the median, which means it is at the
middle of the dataset. Lastly, the third quartile, Q3, divides the data set in such
a way that 75% is less than it and 25% is greater.
For example:
𝑄1 𝑄2 𝑄3
1 3 4 6 8 9 10 15 17 20
$$ z = \frac{X - \mu}{\sigma} $$
where z is the z-score, X is the value of the element, μ is the mean of the
population, and σ is the standard deviation.
For example:
Given: μ = 89; σ = 13; X = 95
Solution:
$$ z = \frac{X - \mu}{\sigma} = \frac{95 - 89}{13} = 0.46 $$
The value z = 0.46 means that the score of Michael, which is 95, is 0.46
standard deviations above the mean.
2. The z-score of James in a Biology test is 1.12. If the mean of the scores
in the test is 79 and the standard deviation is 5.8, what is the score of
James?
Given: μ = 79; σ = 5.8; z = 1.12
Solution:
$$ z = \frac{X - \mu}{\sigma} $$
$$ z(\sigma) = X - \mu $$
$$ X = z(\sigma) + \mu $$
Substitute the given values then solve.
$$ X = 1.12(5.8) + 79 = 85.496 \approx 85 $$
The score of James is 85.
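Both directions of the z-score formula can be sketched in Python. The function names are illustrative assumptions; the second example uses the values stated in the problem (mean 79, standard deviation 5.8, z = 1.12).

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def raw_score(z, mu, sigma):
    """Invert the z-score formula: X = z * sigma + mu."""
    return z * sigma + mu


# Example 1: a score of 95 with mean 89 and standard deviation 13.
michael_z = z_score(95, 89, 13)

# Example 2: recover the raw score from z = 1.12.
james_score = raw_score(1.12, 79, 5.8)

print("z-score:", round(michael_z, 2))
print("raw score:", round(james_score))
```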
BOX AND WHISKER PLOT
The box and whisker plot is used to show all the important values.
The lowest value is 2 and the highest value is 8. The first quartile is 4, the
second quartile is 5 and the third quartile is 7.
Second example, construct the box and whisker plot of the following data:
Step 3: Determine the quartiles, the lowest and the highest value
MODULE IV – Data Management
Learning Activity 4 – Measures of Relative Position
Name: ______________________________________________________
Course, Year and Section: _____________________________________
MULTIPLE CHOICE. Choose the best answer from the given choices and write
the letter of your choice before each number.
1. Anna and her brother are both 62 inches tall. Anna is in the 85th
percentile for height for her age and her little brother is in the 90th
percentile. Who is taller for their age?
A. They’re the same, both 62 inches.
B. Anna, she’s taller than more people.
C. Brother, he’s taller than more people.
D. You can’t tell if you don’t know the mean and the standard deviation.
2. Quintiles divide a data set into five regions. Which of the following is a true
statement about quintiles?
A. Each region contains about 5% of the data
B. Each region contains about 20% of the data
C. Each region contains about 25% of the data
D. Each region contains about 50% of the data
3. Clark scored at the 99th percentile on a test. How should he interpret this
information?
A. Clark scored better than 99% of people who took this test
B. Clark scored worse than 99% of people who took this test
C. Clark got 99% of the questions on the test right
D. Clark got 99% of the questions on the test wrong
4. Kyle receives a salary in the 70th percentile. Should he be pleased with his
salary?
A. Yes, because most of the employees receive a salary less than or equal
to his.
B. No, because the salary is not sufficient for his needs.
C. Yes, because only 30% of the employees receive a salary greater than his.
D. No, because 50% of the employees receive the same salary as he does.
5. The Yes FM 101.1 station has a low number of listeners. 75% of all country
stations in the Philippines have more listeners than this station. Let the data
set “L” consist of the number of listeners for each country station in the
Philippines. Which of the following is a true statement about the Yes FM 101.1
station’s position in L?
A. It is between the 20th and 30th percentile
B. It is below the 20th percentile
C. It is between the 70th and 80th percentile
D. It is above the 80th percentile
6. Using the table, the scores on a summative examination are presented below
in decreasing order of magnitude. A score of 63 is approximately equivalent
to a percentile rank of ____.
47 48 56 56 56 57
57 57 57 57 58 58
59 59 60 60 60 60
61 61 61 62 62 62
63 64 64 65 65 65
A. 20 B. 25 C. 63 D. 82
7. In an 80-item test, the passing mark is the 3rd quartile. What does it imply?
A. The students should answer at least 60 items correctly.
B. The students should answer at least 40 items correctly.
C. The students should answer at most 60 items correctly.
D. The students should answer at most 40 items correctly.
9, 12, 19, 10, 26, 24, 17, 15, 30, 17, 5, 9, 15, 8, 17, 12, 15, 20, 21
A. 4 B. 5 C. 6 D. 7
9. Mark Christian discovered that his grade on a recent test was at the 72nd
percentile. If 90 students took the test, then approximately how many
students received a higher grade than he did?
A. 72 B. 62 C. 25 D. 18
10. A national achievement test is administered annually to 6th graders. The
test has a mean score of 100 and a standard deviation of 15. If Cassandra’s
z-score is 1.20, what was her score on the test?
A. 82 B. 88 C. 100 D. 118
11. The average waist size for teenage males is 29 inches with a standard
deviation of 2 inches. If waist sizes are normally distributed, determine the
z-score of a teenage male with a 33-inch waist.
A. 2 B. 1 C. -2 D. -1
12. What does point C on the box plot represent? (See figure below)
A. First Quartile
B. Median
C. Third Quartile
D. Mean
13. What is the least and greatest value of this data set? (See figure below)
A. 6 and 8
B. 6 and 9
C. 8 and 15
D. 15 and 19
14. Which is NOT in the middle 50% of data values? (See figure below)
A. 21 B. 17 C. 15 D. 11
15. If 44 values were used for the data, about how many data values are greater
than the 1st quartile?
A. 11 B. 22 C. 33 D. 128
Lesson 5. Probabilities and Normal Distributions
PROBABILITY
A random event is something that may or may not occur, while a random
variable can take on any random event as its possible value. For example, when
you toss a die, the possible outcomes can be represented by X, which is called a
random variable. The die showing 1, written X = 1, is an example of a random
event.
The probability that the event 𝑋 will occur is denoted by 𝑃(𝑋). Further, the
probability that 𝑋 does not occur is 1 − 𝑃(𝑋). In other words, the probability of
the complement of 𝑋 is 1 − 𝑃(𝑋).
For example, if the weather forecast says that the probability to rain
tomorrow is 0.8, then the probability for it not to rain is 1 − 0.8 = 0.2.
NORMAL DISTRIBUTION
The shape of the normal curve depends on the mean and standard
deviation of the population for the associated random variable.
The graph above shows a selection of normal curves for various values of μ and
σ. The curve is always bell shaped, and always centered at the mean μ. Larger
standard deviations give a curve that is more spread out.
By the empirical rule, the following applies:
The standard normal curve is the normal curve with mean 𝜇 = 0 and
standard deviation 𝜎 = 1.
The area under the curve can be solved using calculus or by using the following table.
-1.90 0.0287 -0.15 0.4404 1.60 0.9452 3.35 0.9996
-1.85 0.0322 -0.10 0.4602 1.65 0.9505 3.40 0.9997
-1.80 0.0359 -0.05 0.4801 1.70 0.9554 3.45 0.9997
3.50 0.9998
In the table, the value of the variable 𝑧 is the score or the number of
standard deviations away from the mean and 𝐴(𝑧) is the area under the standard
normal curve to the left of the 𝑧 value.
For example, if 𝑧 = 1, then the area under the normal curve to the left of
1, as seen in the figure, is 0.8413.
𝑧 𝐴(𝑧)
1 0.8413
This means that, in the above example with z = 1 and A(z) = 0.8413,
the probability that Z ≤ 1 is the same as A(z), that is, P(Z ≤ 1) = 0.8413.
As another example, what is the probability that Z ≤ −1? Sketch the region
under the standard normal curve whose area is equal to P(Z ≤ −1).
What if you want to know the area to the right of a value?
The area to the right of the value is the area of its complement. Given 𝑧
and you would like to know the area to its right, it would be 1 − 𝐴(𝑧) which is
equal to 𝑃(𝑍 ≥ 𝑧).
Example: If 𝑍 is a standard normal random variable, find 𝑃(𝑍 ≥ 2). Sketch the region
under the curve whose area is equal to 𝑃(𝑍 ≥ 2).
The area between two values is represented by A(z2) − A(z1), which is the same as
P(z1 ≤ Z ≤ z2).
Example: if Z is a standard normal random variable, find P(−3 ≤ Z ≤ 3). Sketch the
region under the standard normal curve whose area is equal to P(−3 ≤ Z ≤ 3).
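The three kinds of area (to the left, to the right, and between two values) can also be computed without the table. The sketch below is one possible approach: it uses the error function `math.erf` from the Python standard library to evaluate the area to the left of z. The function name `area_left` is an assumption for this illustration.

```python
from math import erf, sqrt


def area_left(z):
    """A(z): area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


# Area to the left: P(Z <= 1) and P(Z <= -1)
p_left_1 = area_left(1)
p_left_minus_1 = area_left(-1)

# Area to the right is the complement: P(Z >= 2) = 1 - A(2)
p_right_2 = 1 - area_left(2)

# Area between two values: P(-3 <= Z <= 3) = A(3) - A(-3)
p_between = area_left(3) - area_left(-3)

print(round(p_left_1, 4), round(p_left_minus_1, 4))
print(round(p_right_2, 4), round(p_between, 4))
```

Rounded to four decimal places, these values should agree with the entries read from a standard normal table.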
The previous topic was about the standard normal random variable, where the
mean μ = 0 and the standard deviation σ = 1. Not all cases are the same. Some
may have a mean other than 0 and a standard deviation other than 1.
To solve the problem, you need to standardize – convert all relevant values
of the general normal random variable to 𝑧-scores, and then calculate the
probabilities of these 𝑧-scores from the standard normal table.
$$ z = \frac{X - \mu}{\sigma} $$
Now, to solve the previous problem, convert the scores 70 and 110 to
z-scores first.
To convert 70 to a z-score, let X = 70, μ = 90 and σ = 10:
$$ z = \frac{X - \mu}{\sigma} = \frac{70 - 90}{10} = -\frac{20}{10} = -2 $$
To convert 110 to a z-score, let X = 110, μ = 90 and σ = 10:
$$ z = \frac{X - \mu}{\sigma} = \frac{110 - 90}{10} = \frac{20}{10} = 2 $$
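The standardizing step can be sketched in Python with `statistics.NormalDist` from the standard library. The sketch assumes that the problem asks for the probability that a value falls between 70 and 110 when μ = 90 and σ = 10; that reading is an assumption based on the conversions above.

```python
from statistics import NormalDist

mu, sigma = 90, 10
low, high = 70, 110

# Step 1: standardize both values.
z_low = (low - mu) / sigma
z_high = (high - mu) / sigma

# Step 2: area between the two z-scores on the standard normal curve.
standard = NormalDist(mu=0, sigma=1)
probability = standard.cdf(z_high) - standard.cdf(z_low)

# Cross-check: the same area computed directly on the original scale.
direct = NormalDist(mu=mu, sigma=sigma)
probability_direct = direct.cdf(high) - direct.cdf(low)

print(z_low, z_high)
print(round(probability, 4), round(probability_direct, 4))
```

Both routes give the same area, which illustrates why standardizing lets a single standard normal table serve every normal distribution.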
MODULE IV – Data Management
Learning Activity 5 – Probabilities and Normal Distribution
Name: ______________________________________________________
Course, Year and Section: _____________________________________
MULTIPLE CHOICE. Choose the best answer from the given choices and
write the letter of your choice before each number.
1. A data set has a mean of 290 and a standard deviation of 20. Calculate the
z-score for 265.
A. z = 250.5 B. z = 1.25 C. z = - 1.02 D. z = - 1.25
2. A data set has a mean of 300 and a standard deviation of 40. What value
would have a z-score of z = -2?
A. 220 B. 296 C. 298 D. 380
3. Khail took his math test and scored an 88. If the class average was 78 with
a standard deviation of 5, what percent of students earned a grade that was
HIGHER THAN Khail’s grade?
A. 97.5% B. 93% C. 5% D. 2.5%
5. Use the normal graph provided. What is the value of the standard
deviation?
A. 6
B. 12
C. 18
D. 36
7. The distribution of z-scores will always have a standard deviation of 1.
A. TRUE B. FALSE
8. The mean of the z-scores will always be zero even though the mean of the
raw scores is 100.
A. TRUE B. FALSE
9. The average height of high school boys is 175 cm. with a standard deviation
of 3.5. Approximately what percent of high school students are TALLER than
180cm?
A. 1.43% B. 7.54% C. 7.64% D. 92.36%
10. What is the total area under the standard normal distribution curve?
A. 100 B. 25 C. 1 D. .5
12. According to the empirical rule, how much of the data falls within 1
standard deviations?
A. 25% B. 68% C. 95% D. 99.7%
13. Which best describes the shaded part of this normal distribution graph?
14. Use the following information and the Empirical Rule to estimate the
answer.
The ages of golfers are normally distributed, with a mean of 38 and a
standard deviation of 4. Find the percentage of golfers that are between 30
and 46 years old.
A. 68% B. 94% C. 95% D. 99.7%
15. The mean number of accidents a week at a company is 6.4 with a standard
deviation of 1.5. What proportion of weeks would you expect to have less
than 5 accidents?
A. 0.8238 B. 0.6915 C. 0.1762 D. -0.93
16. The mean GPA of students in a course at College of Arts and Sciences is 3.2
with a standard deviation of 0.3. What percent of students in the course
have a GPA between 2.9 and 3.8?
A. 95% B. 81.5% C. 68% D. 47.5%
17. The mean life of tire is 30,000 km. The standard deviation is 2,000 km.
Then, 68% of all tires will have a life between ______ km and ______ km.
A. 28,000 km and 32,000 km C. 26,000 km and 34,000 km
B. 24,000 km and 34,000 km D. 27,000 km and 31,000 km
18. In research, 30% of heavy smokers are suffering from lung cancer. If 240
heavy smokers are chosen at random, find the probability that 70 to 91
are suffering from lung cancer.
A. 0.6338 B. 0.3632 C. 0.36022 D. 0.00298
20. The 40-yards sprint times for a soccer team are found to be normally
distributed with a mean of 5.2 seconds and a standard deviation of 0.3
seconds. What is the z-score for a player who runs a time of 5.6 seconds?
A. 1.33 B. 1.02 C. 0.88 D. -1.33
Lesson 6. Simple Linear Regression and Correlation
Simple linear regression and correlation are both used when we are
investigating the relationship between two variables. In many fields of
research, we are often interested in describing the change in one variable (Y,
the dependent variable) in terms of a unit change in a second variable (X, the
independent variable). Correlation measures the strength and shows the direction
of the relationship between these two variables, while simple linear regression
describes the relationship with an equation. A simple linear regression takes
the form
Ŷ = 𝑎 + 𝑏𝑋
Figure 6.1. A scatter diagram to illustrate the linear relationship between 2 variables.
straight line that will go through all the points. The least squares line is the line
that goes through the points so that the sum of the squares of the vertical
deviations of the points from the line is minimal. Those with a knowledge of
calculus should recognize that this is a problem of finding the minimum value of
a function. That is, set the first derivatives of the sum of squared deviations
with respect to a and b to zero and solve for a and b. This procedure yields the
following formulas for a and b based on n pairs of X and Y:
$$ b = \frac{n\Sigma XY - (\Sigma X)(\Sigma Y)}{n\Sigma X^2 - (\Sigma X)^2}, \qquad a = \bar{Y} - b\bar{X} $$
If X is not a random variable, the coefficients so obtained are the best linear
unbiased estimates of the true parameters.
Looking at the diagram, Figure 6.1, the distribution of the points follows
an upward direction. An upward direction signifies a positive relationship. This
means that as one variable increases, the other also increases.
2. A relationship is non-linear when it follows a pattern but the pattern is not a straight line.
Table 6.1. Elements necessary to compute the least squares regression
for changes in % sucrose with changes in N-fertilizer.
X (lbs N per acre)   Y (mean % sucrose)   X²        XY        Ŷ (predicted % sucrose)   Y − Ŷ
0                    16.16                0         0         16.22                     -0.06
50                   15.74                2,500     787       15.78                     -0.04
100                  15.29                10,000    1,529     15.35                     -0.06
150                  15.29                22,500    2,293.5   14.92                     0.39
200                  14.36                40,000    2,872     14.48                     -0.12
250                  13.94                62,500    3,485     14.05                     -0.11
ΣX = 750             ΣY = 90.78           ΣX² = 137,500       ΣXY = 10,966.5
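The sums in Table 6.1 are what a least squares fit needs. The sketch below applies the standard least squares formulas for the slope b and the intercept a of Ŷ = a + bX to the X and Y columns of the table. The variable names are illustrative, and the predicted values it produces may differ from the table in the last decimal place because of rounding.

```python
# X: lbs of N per acre, Y: mean % sucrose (columns of Table 6.1).
x_values = [0, 50, 100, 150, 200, 250]
y_values = [16.16, 15.74, 15.29, 15.29, 14.36, 13.94]

n = len(x_values)
sum_x = sum(x_values)
sum_y = sum(y_values)
sum_x2 = sum(x * x for x in x_values)
sum_xy = sum(x * y for x, y in zip(x_values, y_values))

# Standard least squares estimates of the slope and the intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Predicted values and residuals (observed minus predicted).
predicted = [a + b * x for x in x_values]
residuals = [y - y_hat for y, y_hat in zip(y_values, predicted)]

print("a =", round(a, 4), "b =", round(b, 6))
for x, y, y_hat, e in zip(x_values, y_values, predicted, residuals):
    print(x, y, round(y_hat, 2), round(e, 2))
```

The negative slope says that, in these data, the mean % sucrose falls as the amount of N-fertilizer rises.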
Correlation
r = 1 means there is perfect positive correlation
r = -1 means there is a perfect negative correlation
$$ r = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2]\,[n\Sigma Y^2 - (\Sigma Y)^2]}} $$
Note: ^2 = squared
slightly older than their wives. This is no big surprise, but at least the data bear
out our experiences, which is not always the case. What we know of statistics,
however, tells us that what we see is not always significant. Let’s apply the
Pearson r formula and see what happens.
     HUSBANDS WIVES
Pair X Y X² Y² XY
1 36 35 1296 1225 1260
2 72 69 5184 4761 4968
3 37 34 1369 1156 1258
4 36 35 1296 1225 1260
5 51 50 2601 2500 2550
6 50 47 2500 2209 2350
7 47 47 2209 2209 2209
8 50 45 2500 2025 2250
9 37 36 1369 1296 1332
10 41 41 1681 1681 1681
Σ 457 439 22005 20287 21118
n = number of pairs = 10
Since we have all the information needed, we can plug it all into the
formula!
$$ r = \frac{n(\Sigma XY) - (\Sigma X)(\Sigma Y)}{\sqrt{[n\Sigma X^2 - (\Sigma X)^2]\,[n\Sigma Y^2 - (\Sigma Y)^2]}} = \frac{10(21118) - (457)(439)}{\sqrt{[10(22005) - (457)^2]\,[10(20287) - (439)^2]}} $$
r = 0.99
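The same substitution can be carried out in Python from the raw ages. This is a minimal sketch of the Pearson r formula; the variable names are illustrative.

```python
from math import sqrt

# Ages of the ten husband (X) and wife (Y) pairs from the table above.
husbands = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wives = [35, 69, 34, 35, 50, 47, 47, 45, 36, 41]

n = len(husbands)
sum_x = sum(husbands)
sum_y = sum(wives)
sum_x2 = sum(x * x for x in husbands)
sum_y2 = sum(y * y for y in wives)
sum_xy = sum(x * y for x, y in zip(husbands, wives))

# Pearson correlation coefficient from the computational formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

r_squared = r ** 2   # share of the variation in Y linked to the linear relationship

print("r =", round(r, 2))
print("r squared =", round(r_squared, 2))
```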
A hypothesis test for correlation will start with a null hypothesis of “zero
correlation.” Then the alternative hypothesis will be “there is a non-zero
correlation.” In the hypotheses below, the Greek letter rho (ρ) represents the
true correlation in the population from which our sample is drawn.
H₀ : ρ = 0
Hₐ : ρ ≠ 0
30 0.34900 0.44900
35 0.32500 0.41800
40 0.30400 0.39300
45 0.28800 0.37200
50 0.27300 0.35400
60 0.25000 0.32500
70 0.23200 0.30300
80 0.21700 0.28300
90 0.20500 0.26700
100 0.19500 0.25400
We originally suspected that husbands and wives would tend to be similar
in age. A scatterplot of the data suggests there is a fairly strong linear
relationship present. And the correlation of 0.99 further describes how strong
the relationship is (close to 1 = strong linear relationship). Now we can use r-
squared value of 98% of that variation can be directly attributed to the linear
relationship with the X values (the ages of the men). The remaining 2% of the
variation in the ages of the women is then due to other factors besides the ages
of the men.
Note that the closer to 100% the value of r-squared, the stronger the
relationship between X and Y. However, we cannot say that changes in x cause
the variation in Y. All we can say is there is a strong association between the
two variables. There may be some other underlying factor that actually
serves as the engine causing change. One of the most common errors made in
interpreting bivariate data is to wrongly equate causation with association.
MODULE IV – Data Management
Learning Activity 6 – Simple Linear Regression and Correlation
Name: ______________________________________________________
Course, Year and Section: _____________________________________
Perform what is being asked in the following. Show all your solutions
and encircle/box your final answers.
2. Find the correlation coefficient based on Age vs Glucose level from the
following table from a pre-diabetic study of 6 participants and interpret
results.