
MMW Module 4


Module 4
Data Management

Introduction to Data Management


Measures of Central Tendency
Measures of Dispersion
Measures of Relative Position
Probabilities and Normal Distribution
Simple Linear Regression and Correlation

Learning Outcomes

At the end of the module, the students will be able to:


1. Understand the language used in statistics;
2. Interpret statistical evidence from gathered data correctly and objectively,
and make inferences from it;
3. Convert normally distributed data into standardized form;
4. Apply the concept of the normal distribution in their fields of specialization;
5. Appreciate the value of statistical analysis, recognize its impact, and apply
it in daily life;
6. Practice and display diligence, patience, honesty, accuracy, and precision in
solving statistical problems.

Lesson 1. Introduction to Data Management

Data management deals with statistics. Statistics is the art and science of
collecting, organizing, presenting, analyzing, and interpreting data. In fields
such as medicine, agriculture, education, business, economics, politics, and
technology, information translated into data gives medical practitioners,
educators, managers, and decision makers a better understanding of their
environment and enables them to make more informed and sound decisions.

Statistics plays a vital role in our society today, especially during the
COVID-19 pandemic. Everyone should be included, counted, and accounted for;
no one should be left behind. Because statistics is useful in almost all fields
of endeavor, some cautions are in order. Impressive figures can be blown out of
proportion to their real or imagined importance. Unscrupulous people with vested
interests make improper or unethical use of statistical methods. Questionable
and even conflicting claims backed up with “statistics” can be accepted as true,
which leads one to believe that anything can be proven statistically. Moreover,
faulty research may be slanted to produce a particular outcome; that is,
statistical analyses are chosen to produce that outcome.

For these reasons, it is most important that users of statistics and
researchers clearly understand the statistical tools and techniques used in
their research. Thus, in this module, careful attention will be given to
the role of statistics as a tool in research.

Science is based on the empirical method, which consists of methods for
systematically obtaining information through observation. Observations are the
empirical “stuff” of science. Statistics, as defined above, is the art and
science of collecting, organizing, presenting, analyzing, and interpreting data.

Statistics is a set of concepts, rules, and procedures that help us to
collect, organize, and present numerical information in the form of tables,
graphs, and charts; understand and analyze the statistical techniques underlying
decisions that affect our lives and well-being; and interpret data to make
informed decisions.

Statistics is divided into two (2) categories or branches called
descriptive and inferential statistics. We can differentiate the two using the
definition of statistics.

Collecting, organizing, and presenting data: DESCRIPTIVE STATISTICS
Analyzing and interpreting data: INFERENTIAL STATISTICS

Because statistical inference draws conclusions from data, we should be very
careful with every piece of information we take and use. Many situations
require information about a large group of people, but time, cost, and other
constraints often make it impractical to study the whole group. Instead, data
can be collected from a small portion of the group. A population is the group
of elements or set of individuals of interest in a particular study. A sample is
a smaller set of individuals selected from a population, usually intended to
represent the population in a study.

PARAMETER
A parameter is a value, usually a numerical value, that describes a
population. It may be obtained from a single measurement, or it may be derived
from a set of measurements from the population. (μ: population mean;
σ: population standard deviation)

STATISTIC
A statistic is a value, usually a numerical value, that describes a sample.
It may be obtained from a single measurement, or it may be derived from a set
of measurements from the sample. (x̄: sample mean; s: sample standard
deviation)

VARIABLE
A variable is any information that differs from one member to another in
a population or sample. It is a characteristic of interest for the elements. The
weight (kg) in Table 1.1 served as the variable.

Table 1.1 Weights of Randomly selected Grade IV pupils in AES, 1st Quarter of 2020
Section Weights (Kg)
IV - 1 50 41 36 34 54 60 51 37
IV - 2 22 39 42 42 45 38 38 40
IV - 3 38 28 32 44 42 47 37 28
IV - 4 27 27 40 41 39 32 36 24
IV - 5 40 39 33 33 27 30 31 45

Each pupil included in the data set is called an element, an entity on
which data are collected. The measurements collected on each variable for every
element in a study provide the data. The set of measurements obtained for a
particular element is called an observation.
In Table 1.1, the measurements for section IV-1 are 50, 41, 36, 34, 54, 60, 51,
and 37; for section IV-2 they are 22, 39, 42, 42, 45, 38, 38, and 40; and so on.
A data set with 40 elements contains 40 observations.

CONSTANT
A constant is a piece of information about the population or sample that is
the same for all members. Examples of constants include the value of pi, the
conversion between Celsius and Fahrenheit temperatures, the number of days in a
week, and conversions between units of measurement, e.g. 12 inches = 1 foot.

Data is a collection of facts, such as numbers, words, measurements,
observations, or descriptions of things.
Data are classified into two categories: qualitative and quantitative data.

1. Qualitative Data
Qualitative data describes qualities or characteristics. It is mostly
non-numerical and descriptive in nature. It often, but not always, captures
emotions, feelings, and subjective perceptions of something.
Qualitative research methods are characterized by the following:

● They use open-ended questions that address the ‘how’ and ‘why’ of an
event, and unstructured methods of data collection to fully explore
the topic.
● They rely more heavily on interviews, with more interaction
between the researcher and the respondents.
● The findings cannot be generalized to any specific population, but
they can produce evidence that can be used to look for general
patterns across different studies on different issues.
It can be collected through:
● In-depth interview
● Observation methods
● Document review

Here are some examples:

1. color of hair, eyes and skin
2. home address and phone number
3. experiences of a person taken from diaries

2. Quantitative Data
Quantitative data deals with things that are measurable and can be
expressed in numbers and figures. It is usually expressed in numerical form
and can be mathematically computed. Quantitative data can be collected
using:

● Experiments/clinical trials
● Observing and recording well-defined objects such as number of
cars which participated in a motorcade.
● Administering surveys with closed-ended questions.
● Paper-pencil questionnaires

Example:
1. Number of siblings
2. Height and weight
3. Temperature in degree Celsius

Quantitative data can either be:

a. Discrete data: data that cannot be broken down into smaller
parts. This type of data consists of integers. The number of siblings
(1, 2, 3, …) is an example.

b. Continuous data: data that can be broken down into infinitely
smaller parts, or data that can take a decimal value. Examples are
height and weight (1.37 meters and 72.6 kilograms).

For example, if you would describe a house, your description can either be
qualitative or quantitative. Here are some descriptions:

Qualitative                              Quantitative
The house is located in Baguio City.     The house is 8.5 meters high.
The house is mostly made of cement.      The house has 3 bedrooms.
The color of the house is green.         The house’s floor area is 125 square meters.
The door is made of oak.

Data Levels of Measurement

1. Nominal data
This level of data is categorical in nature; no category is greater than or
less than another, and the categories are not in any particular order. The
categories are also exclusive and exhaustive, meaning that a response can be
neither ‘both’ nor ‘neither’.

Example: Sex (male or female), civil status (married, divorced, separated,
widowed)

2. Ordinal data
Ordinal data must also be exclusive and exhaustive, but the difference is
that the responses are ranked or ordered. Here, you can say that one
response is higher or better than another.

Example: Academic rank (Instructor, Professor), socioeconomic status (rich,
middle class, poor)

3. Interval
Here, intervals of equal length signify equal differences in the data.
Differences make sense but ratios do not. An example is temperature: 30 °C
is not twice as hot as 15 °C. Also, there is no ‘true zero’ starting point,
which means that zero does not signify the absence of the quantity being
measured. Zero degrees Celsius does not mean that there is no temperature.

Example: Temperature

4. Ratio
At this level, both differences and ratios are meaningful. For example, 4
liters of water is twice as much as 2 liters of water. There is also a
‘true zero’ starting point, in which zero means nothing, or the absence of the
quantity being measured. Zero liters of water means there is no water.

Example: Weight, Height, Number of children

Data
  Qualitative: Nominal, Ordinal
  Quantitative: Interval, Ratio

Data can also be classified according to who collected the data. It can be
primary data or secondary data.

Primary data: These are data collected first hand. They are more
authentic, reliable, and objective compared to secondary data. Primary data can
be obtained through experiments, surveys, questionnaires, interviews, and
observations.

Secondary data: These are data collected from sources already published in any
form. The review of literature in research is based on secondary sources.
Secondary data are valuable because the researcher does not need to go through
the hassle of collecting data that are already available and published, which
saves time, effort, and money. Secondary data can be collected from books,
records, magazines, research articles, newspapers, biographies, databases, etc.

Data Presentation

1. Textual Presentation
In textual or descriptive presentation, the data are presented using text
or paragraphs. This is usually used when the amount of data is not too large.
For example:
The population of Region I as of May 1, 2020 is 5,301,139 based on the
2020 Census of Population and Housing (2020 CPH). This accounts for about
4.86 percent of the Philippine population in 2020. The 2020 population of the
region is higher by 275,011 than the population of 5.03 million in 2015, and
552,767 more than the population of 4.75 million in 2010. Moreover, it is higher
by 1,100,661 compared with the population of 4.20 million in 2000.
(psa.gov.ph)

2. Tabular Presentation
In tabular presentation, data are presented using tables, which can represent
even a large amount of data in a form that is engaging and easier to read. The
data are arranged in rows (horizontal) and columns (vertical). Tabular
presentation avoids unnecessary details and repetition of data. It reveals
patterns that cannot be seen when the data are presented in textual form.

In presenting data using a table, take note of the following:


● A table must have a table number and a title.
● Subtitles are properly mentioned in the column and row headers.
● Contents of the table are defined clearly.
● Units of measurement are clearly stated whenever necessary.
● Legends for symbols/short forms and sources are indicated in the
footnote.
● The data are logically arranged in the table

Here is an example of a table presenting the population of Region I for the
census years 2000 to 2020.

Table 1. Total Population in Region I

Census Year   Census Reference Date   Total Population
2000          May 1, 2000             4,200,478
2010          May 1, 2010             4,748,372
2015          August 1, 2015          5,026,128
2020          May 1, 2020             5,301,139
Source: Philippine Statistics Authority

Table 2. Population of Region I by Province

Province 2000 2010 2015 2020


Ilocos Norte 514,241 568,017 593,081 609,588
Ilocos Sur 594,206 658,587 689,668 706,009
La Union 657,945 741,906 786,653 822,352
Pangasinan 2,434,086 2,779,862 2,956,726 3,163,190
Total 4,200,478 4,748,372 5,026,128 5,301,139
Source: Philippine Statistics Authority

3. Diagrammatical or Graphical Presentation
This type of presentation uses graphs or diagrams such as bar graph, pie
graph, line graph and scatter diagram. Diagrams give a bird’s eye view of the
data and can be easily understood just by looking at the graph.

Some of the charts or graphs which are commonly used are the following:

1. Pie chart
The following pie graph illustrates the population of Region I by province
for the year 2020 using the data in Table 2.

Figure 1. Population of Region I by province for the year 2020

It can be seen in the graph that Pangasinan constitutes about 60% of the total
population of Region I.
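
The share behind each slice of the pie can be checked directly from the 2020 column of Table 2. The following is a minimal sketch; the variable names are illustrative and not part of the module.

```python
# Population of Region I by province, 2020 (values taken from Table 2).
population_2020 = {
    "Ilocos Norte": 609_588,
    "Ilocos Sur": 706_009,
    "La Union": 822_352,
    "Pangasinan": 3_163_190,
}

total = sum(population_2020.values())

# Each slice of the pie is the province's percentage share of the regional total.
shares = {name: 100 * count / total for name, count in population_2020.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

Pangasinan's share comes out just under 60%, which agrees with the reading of the pie graph.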

2. Bar graph
The following bar graph compares the populations of the provinces in
Region I for 2000, 2010, 2015, and 2020.

Figure 2. Population by province in Region I (Bar Graph)

The bar graph shows that Pangasinan has the largest population in
Region I from 2000 to 2020. The province with the smallest population
is Ilocos Norte.

3. Column chart

This example of column graph is similar to the given bar graph.

Figure 3. Population by province in Region I (Column Graph)

4. Line graph

The following line graph compares the populations of the provinces in
Region I, using the data in Table 2. Region I is composed of four provinces,
namely Ilocos Norte, Ilocos Sur, La Union, and Pangasinan.

Figure 4. Population by Province in Region I

The line graph illustrates that the population of each province in
Region I increased continuously from 2000 to 2020.

5. Scatterplot

The following scatter plot depicts the population of the
Philippines from 1990 to 2021.

Figure 5. Philippine Population from 1990 to 2021

The scatter plot shows an almost perfect linear relationship between the
year and the population of the Philippines.

As the given examples show, diagrams are mostly used as visual aids. They
cannot be considered alternatives to numerical data. Diagrams and graphs are
not as accurate as tabular data. Only tabular data can be used for further analysis.
MODULE IV – Data Management
Learning Activity 1 – Introduction to Data Management

Name: ______________________________________________________
Course, Year and Section: _____________________________________

Classify the following data whether they are qualitative or quantitative and
nominal, ordinal, ratio or interval.

Data                Type of Data                     Level of Measurement
                    (Qualitative or Quantitative)    (Nominal, Ordinal, Ratio, Interval)
1. Test questions classified as easy,
average or difficult
2. Years of important historical events
(e.g. 1941, 1980, 2000)
3. Flavor of ice cream
4. Age of students enrolled in GECC 103
5. Amount of money in your savings
account
6. Religion
7. Contact Number
8. Home Address
9. Number of minutes allocated for
reviewing before you sleep
10. IQ

Lesson 2. Measures of Central Tendency
A measure of central tendency is a single value that attempts to describe
a set of data by identifying the central position within that set of data. As such,
measures of central tendency are sometimes called measures of central
location. You can think of it as the tendency of data to cluster around a middle
value.

The measures of central tendency are the mean, median and mode. Each
of these measures calculates the central point using a different method.

1. ARITHMETIC MEAN or MEAN


The mean is the arithmetic average and is the most popular and well-known
measure of central tendency. The mean is equal to the sum of all
values in the data set divided by the number of values in the data set.
Let $n$ be the number of values in a data set and let the values be
$x_1, x_2, x_3, \ldots, x_n$. The mean, usually denoted by $\bar{x}$ (x bar), is:

\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} \]

This formula can be written in summation notation:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Example: The scores obtained by the students during a 30-point math quiz
are as follows:
18, 23, 28, 27, 29, 27, 23, 20, 18, 10, 14, 25, 29, 30, 24, 15, 10, 24, 19, 20
What is their average score?

Solution: The average score can be determined by solving for the mean.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{18 + 23 + 28 + 27 + 29 + 27 + 23 + 20 + 18 + 10 + 14 + 25 + 29 + 30 + 24 + 15 + 10 + 24 + 19 + 20}{20} = \frac{433}{20} = 21.65 \]

The average score is 21.65.
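
The same computation can be sketched in a few lines of Python. The function name `arithmetic_mean` is an illustrative choice, not something defined in the module.

```python
# Scores from the 30-point math quiz example.
scores = [18, 23, 28, 27, 29, 27, 23, 20, 18, 10,
          14, 25, 29, 30, 24, 15, 10, 24, 19, 20]


def arithmetic_mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


mean_score = arithmetic_mean(scores)
print(f"Sum = {sum(scores)}, n = {len(scores)}, mean = {mean_score:.2f}")
```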
The advantages of using the mean are that it is simple to understand and
easy to calculate. It also takes into account all the values in the data set.
Its main disadvantage is that it is easily affected by outliers. Outliers are
values that are unusual compared to the rest of the data by being especially
small or large.
Example: Below are the salaries given to the staff of a certain company.

Staff 1 2 3 4 5 6 7 8 9 10
Salary 21K 12K 10K 13K 14K 15K 12K 18K 90K 96K

Notice that most of the salaries only range from 10K to 21K. There are
two entries which are extremely high, 90K and 96K. The resulting mean salary
of these ten staff is 30.1K which is higher than most of the given salaries.
This is because the mean is being pulled by the two large values. In this
situation, we might need to use another measure of central tendency.
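
The pull of the two large salaries can be seen by comparing the mean with and without them. This is a rough sketch using Python's standard `statistics` module; the cutoff of 21K for a "typical" salary is an assumption made only for this illustration.

```python
from statistics import mean, median

# Salaries in thousands, taken from the table above.
salaries_k = [21, 12, 10, 13, 14, 15, 12, 18, 90, 96]

# Assumption for illustration: treat salaries of 21K or less as typical.
typical_k = [s for s in salaries_k if s <= 21]

mean_all = mean(salaries_k)          # affected by 90K and 96K
mean_typical = mean(typical_k)       # mean of the remaining eight salaries
median_all = median(salaries_k)      # middle of the ordered salaries

print(f"Mean of all ten salaries: {mean_all}K")
print(f"Mean without 90K and 96K: {mean_typical}K")
print(f"Median of all ten salaries: {median_all}K")
```

The median stays close to the typical salaries even with the two outliers included, which is why it is a candidate for the other measure of central tendency mentioned above.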

The mean is best used when data are continuous and symmetrically
distributed. When the data values are assigned different weights, the
weighted mean can be computed.

Let $w_i$ be the weight corresponding to each data value $x_i$ in
$x_1, x_2, x_3, \ldots, x_n$. The weighted mean is computed using the
following formula:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

That is, take the sum of the products of each data value and its weight,
then divide by the sum of all the weights.
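
A small sketch of the weighted mean follows. The grades and unit weights below are hypothetical values invented for this illustration; they do not come from the module.

```python
def weighted_mean(values, weights):
    """Sum of value-times-weight products divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical data: three course grades and their unit weights.
grades = [85, 90, 78]
units = [3, 2, 5]

result = weighted_mean(grades, units)
print(result)  # (3*85 + 2*90 + 5*78) / (3 + 2 + 5)
```

When all weights are equal, the weighted mean reduces to the ordinary arithmetic mean.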

2. MEDIAN
The median is the middle value. It is the value that splits the data in
half. To find the median, first arrange the data set in ascending or
descending order.
If the number of items in the data set is odd, find the middle value,
which has an equal number of data values below and above it.
For example, find the median of the following scores of students in a
math quiz:

12 25 30 13 17 28 27 23 24 11 21

Arrange the scores in ascending order and find the middle value,

11 12 13 17 21 23 24 25 27 28 30

The median is 23 because there are five scores before it and five scores
after it.

If the number of items in the data set is even, find the two middle values
and take their average. For example, find the median of the following data:

12 25 30 13 17 28 27 23 24 11 21 21

First, arrange the scores in ascending order and find the two middle values,

11 12 13 17 21 21 23 24 25 27 28 30

Finally, get the mean of the two numbers:

\[ \frac{21 + 23}{2} = 22 \]

Therefore, the median of the data set is 22, which means that there are
six scores lower than 22 and six that are higher.
The median can be used for data with a skewed distribution, continuous data,
and ordinal data.
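
The odd and even cases described above can be sketched as one function. The name `find_median` is an illustrative choice, and the two lists are the quiz scores from the examples.

```python
def find_median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_scores = [12, 25, 30, 13, 17, 28, 27, 23, 24, 11, 21]
even_scores = [12, 25, 30, 13, 17, 28, 27, 23, 24, 11, 21, 21]

print(find_median(odd_scores))
print(find_median(even_scores))
```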

3. MODE
The mode is the value that appears most frequently in the data set. If
multiple values are tied for occurring most frequently, the data have a
multimodal distribution. If no value repeats, the data do not have
a mode.
For example, consumers are asked to rate a certain restaurant
according to its overall service. They are asked to rate it from 1 to 5, with
5 as the highest. Here are their responses:

5, 5, 3, 4, 3, 4, 3, 4, 5, 2, 3, 4, 4, 4, 2, 3, 4, 5, 4, 4, 5,
3, 3, 4, 3, 4, 4, 4, 3, 3, 4, 4, 5, 2, 5, 4, 4, 3, 4, 3, 4, 5

Rating   Frequency
5        8
4        19
3        12
2        3
1        0

Since 4 has the highest frequency, the mode is 4.

The mode is typically used with categorical, ordinal, and discrete data. In
fact, the mode is the only measure of central tendency that can be used on
categorical data. An example is determining that the ice cream flavor most
children love is chocolate.
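
Tallying frequencies and picking the most frequent value can be sketched with `collections.Counter` from the Python standard library. The helper name `find_modes` is illustrative; it returns a list so that multimodal data and data with no mode are both covered.

```python
from collections import Counter

# Restaurant ratings from the example above.
ratings = [5, 5, 3, 4, 3, 4, 3, 4, 5, 2, 3, 4, 4, 4, 2, 3, 4, 5, 4, 4, 5,
           3, 3, 4, 3, 4, 4, 4, 3, 3, 4, 4, 5, 2, 5, 4, 4, 3, 4, 3, 4, 5]


def find_modes(values):
    """Return every value tied for the highest frequency (empty if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


frequency = Counter(ratings)
for rating in (5, 4, 3, 2, 1):
    print(rating, frequency[rating])

print("Mode(s):", find_modes(ratings))
```

Because the function only counts occurrences, it works the same way on categorical labels such as flavor names.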

MODULE IV – Data Management
Learning Activity 2 – Measures of Central Tendency

Name: ______________________________________________________
Course, Year and Section: _____________________________________

Perform what is being asked in the following:

1. Twenty students were asked their shoe sizes. The results are given below:

7 7 6 7 4 5½ 6½ 7½ 7½ 11
4 6 4½ 9 8 8 6 5 7½ 7

Find the mean, median and mode.

2. A farmer buys 10 packets of seeds from two different companies. Each pack
contains 20 seeds and he recorded the number of plants which grow from
each pack.

Company A: 20 5 20 20 20 6 20 20 20 8
Company B: 17 18 15 16 18 18 17 15 17 18

a. Find the mean, median and mode for each company’s seeds.
b. Which company does the mode suggest is best?
c. Which company does the mean suggest is best?

3. Frankie keeps a record of the number of fish he catches over a number of
fishing trips. His records are:

1, 0, 2, 0, 0, 0, 12, 0, 12, 0, 2, 0, 0, 1, 18, 0, 2, 0, 1

a. What are the mean, median and mode of the data?
b. Why is it not advisable to use the mean to measure its central
tendency?

4. In a talent contest, the scores awarded by eight judges were:

6.8 7.5 5.8 6.9 5.7 7.2 5.9 8.7

a. Determine the mean, median and mode.
b. Only six scores are to be used. Which two scores may be omitted to
leave the value of the median the same?

5. The school has to select one student to join a Math Quiz Bee. Tony and Zoro
took part in six trial quizzes to determine who will represent the school.
The following lists show their scores:

Tony: 29 25 22 28 25 26
Zoro: 33 19 16 32 34 18

a. Calculate each mean score.
b. Which student would be chosen to represent the school? Justify your
answer.

Lesson 3. Measures of Dispersion
The measures of central tendency are used to determine a central value
that can represent the entire data set, but they do not show anything about
the scatter or spread of the data. To illustrate, consider the example below.

The numbers of trays sold per week by two egg producers are:

Producer 1 : 75 85 83 92 98 90 100

Producer 2 : 88 89 89 89 89 89 90

The average number of trays sold by each producer is the same, 89, but the
data differ. With regard to the spread of the data, the following are observed:
for Producer 1, the data are more scattered from the mean, while for
Producer 2, almost all the observations are concentrated around the mean.

This shows that a measure of central tendency alone is not sufficient to
describe a frequency distribution. A measure of scatter should also be
reported. The scatter or spread of the observations is called the dispersion.
To elaborate on dispersion, consider the following example:

Two groups of students, each with 10 members, were given an English
comprehensive test. The scores of the students are as follows:

Group 1: 6 9 11 13 15 21 23 28 29 35

Group 2: 15 16 16 17 18 19 20 21 23 25

Both of the groups have an average of 19 points.

The diagram below shows the scores obtained by the students in each group.
The arrow points to the position of the mean.

[Figure: diagrams of the scores for Group 1 and Group 2, with an arrow marking the mean]

It is clear that the scores of Group 1 have more dispersion than the scores
of Group 2. The scores in Group 1 are more scattered away from the mean than
those in Group 2.

A measure of dispersion measures the scatter of the given data about the
mean.

The measures of dispersion are the range, standard deviation, and variance.

RANGE

The range is the simplest measure of dispersion. It is defined as the
difference between the highest and lowest values in the data set.

In symbols, let HV be the highest value and LV be the lowest value. Then,

𝑅𝑎𝑛𝑔𝑒 = 𝐻𝑉 − 𝐿𝑉
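
As a quick illustration, the range formula can be applied to the two groups of test scores given earlier in this lesson. The function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


# Scores of the two groups from the test example in this lesson.
group_1 = [6, 9, 11, 13, 15, 21, 23, 28, 29, 35]
group_2 = [15, 16, 16, 17, 18, 19, 20, 21, 23, 25]

print("Group 1 range:", data_range(group_1))
print("Group 2 range:", data_range(group_2))
```

Both groups have a mean of 19, yet the range of Group 1 is almost three times that of Group 2, which matches the observation that Group 1 is more dispersed.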

STANDARD DEVIATION

The standard deviation is defined as the positive square root of the
arithmetic mean of the squared deviations of the given observations from their
arithmetic mean. The sample standard deviation is denoted by $s$, and the
population standard deviation by the Greek letter sigma ($\sigma$).

The formula for calculating the sample standard deviation is as follows:

\[ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \]

VARIANCE

The variance is the square of the standard deviation. The sample variance is
denoted by $s^2$.

\[ s^2 = \left( \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} \right)^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]

Example:

Find the range, standard deviation and variance of the following scores of students in
a 20-point History quiz.

16 14 20 14 18 10 15 16 18 12 20 15

Solution:
Range: 𝑅𝑎𝑛𝑔𝑒 = 𝐻𝑉 − 𝐿𝑉 = 20 − 10 = 10
The range of the students’ score is 10.
Standard deviation:
𝑥 𝑥2
16 256
14 196
20 400
14 196
18 324
10 100
15 225
16 256
18 324
12 144
20 400
15 225
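$\sum x = 188$    $\sum x^2 = 3046$

\[ s = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}} = \sqrt{\frac{3046 - \frac{(188)^2}{12}}{12 - 1}} = 3.0251 \]

Variance:

\[ s^2 = \frac{3046 - \frac{(188)^2}{12}}{12 - 1} = 9.1515 \]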

The standard deviation is 3.0251 while the variance is 9.1515
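
The worked example can be reproduced with a short sketch that follows the same computational formula, using the sums of $x$ and $x^2$ and dividing by $n - 1$. The function names are illustrative; the result is also compared against `statistics.stdev` from the Python standard library, which computes the sample standard deviation.

```python
from math import sqrt
from statistics import stdev

# Scores from the 20-point History quiz example.
scores = [16, 14, 20, 14, 18, 10, 15, 16, 18, 12, 20, 15]


def sample_variance(values):
    """(sum of x^2 minus (sum of x)^2 / n) divided by (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def sample_std(values):
    """Positive square root of the sample variance."""
    return sqrt(sample_variance(values))


print("Range:", max(scores) - min(scores))
print("Variance:", round(sample_variance(scores), 4))
print("Standard deviation:", round(sample_std(scores), 4))
print("statistics.stdev:", round(stdev(scores), 4))
```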

MODULE IV – Data Management
Learning Activity 3 – Measures of Dispersion

Name: ______________________________________________________
Course, Year and Section: _____________________________________

MULTIPLE CHOICE. Choose the best answer from the given choices and write
the letter of your choice before each number.

1. Which set of data has the largest range?


A. 30, 24, 40, 25, 22, 34 C. 36, 49, 55, 38, 58, 42
B. 23, 20, 44, 26, 45, 39 D. 64, 48, 50, 46, 44, 62

2. The table shows the achievement of four classes in a Mathematics test.
Which class shows the most consistent achievement in the test?

Class   Mean   SD
4M      82     2
4N      91     5
4P      77     3
4S      85     1

A. 4M B. 4N C. 4P D. 4S

3. A set of data contains 20 numbers. The sum of the numbers is 284 and the
sum of the squares of the numbers is 4,688. Calculate the standard deviation
of the set of data.
A. 5.274 B. 5.724 C. 32.76 D. 36.27

4. The variance of a set of positive numbers p, (p-5), (p-2), (p-3), and (2p-5) is
5.84. Calculate the value of p.
A. 5 B. 6 C. 7 D. 8

5. The sum of 15 numbers is 1200, while the sum of the squares of the numbers
is 38100. Calculate the mean of the numbers.
A. 77 B. 80 C. 82 D. 85

6. Calculate the variance and the standard deviation of the set of the data.
Score 1 2 3 4 5
Frequency 5 6 11 10 8

A. Variance = 1.64, Standard deviation = 1.28
B. Variance = 3.25, Standard deviation = 1.80
C. Variance = 8.95, Standard deviation = 2.99
D. Variance = 12.2, Standard deviation = 3.49

7. The stem-and-leaf plot below shows a data set. What is the range of the data?

Stem Leaf
5 0 3 4
6 2 5 6
7 1 2 5 7
8 1 4

Sample: 6 2 means 62
A. 27 B. 30 C. 31 D. 34

8. If the variance of a data set is 30.25, what is the standard deviation?


A. 5.5 B. 15.125 C. 60.5 D. 915.0625

9. The heights, in cm, of 5 students in a preschool are summarized as follows:

Σx = 515, Σx² = 53,215. Find the variance.

A. 33.0 B. 34.0 C. 34.5 D. 35.5

10. Which one from these two graphs, (below), has the smallest dispersion?

A. APRIL B. MAY c. 35th plots D. None of ABC

Lesson 4. Measures of Relative Position
Measures of relative position determine the location of a value relative
to the other values in a data set. The most common measures are percentiles,
quartiles, and the standard score, also known as the z-score.

PERCENTILES

A percentile is a measure that provides an estimate of the proportion of the
data that falls below and above a given value.

Steps in calculating the pth percentile:

1. Arrange the data in ascending order (smallest value to largest value).
2. Compute an index $i$:
\[ i = \left( \frac{p}{100} \right) n \]
where $p$ is the percentile of interest and $n$ is the number of observations.
3. a) If $i$ is not an integer, round up. The next integer greater than $i$
denotes the position of the pth percentile.
b) If $i$ is an integer, the pth percentile is the average of the values in
positions $i$ and $i + 1$.

To illustrate this procedure, let us determine the 85th percentile for
the given data:

1 3 4 6 8 9 10 15 17 20

Since the given data are already arranged from smallest value to largest
value, let us proceed to step 2.

\[ i = \left( \frac{p}{100} \right) n = \left( \frac{85}{100} \right) 10 = 8.5 \]

In step 3, because $i$ is not an integer, round up. The position of the 85th
percentile is the next integer greater than 8.5, the 9th position.

Returning to the data, we see that the 85th percentile is the data value
in the 9th position, or 17.

As another illustration of this procedure, let us consider the calculation
of the 50th percentile of the same data. Applying step 2, we obtain

\[ i = \left( \frac{p}{100} \right) n = \left( \frac{50}{100} \right) 10 = 5 \]

Because $i$ is an integer, step 3(b) states that the 50th percentile is the
average of the fifth and sixth data values; thus the 50th percentile is
(8+9)/2 = 8.5. Note that the 50th percentile is also the median.
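
The three steps can be sketched as a single function. This is a minimal version that assumes $p$ is a whole number strictly between 0 and 100; the function name `percentile` is an illustrative choice. Integer arithmetic is used for the index so that the "is $i$ an integer" check does not depend on floating-point rounding.

```python
def percentile(values, p):
    """pth percentile using the index rule i = (p / 100) * n described above.

    Assumes p is a whole number with 0 < p < 100.
    """
    ordered = sorted(values)                  # Step 1: ascending order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)     # Step 2: i = whole + remainder / 100
    if remainder != 0:
        # Step 3(a): i is not an integer, so round up to position whole + 1.
        return ordered[whole]                 # list index = position - 1
    # Step 3(b): i is an integer, so average positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


data = [1, 3, 4, 6, 8, 9, 10, 15, 17, 20]
print(percentile(data, 85))
print(percentile(data, 50))
```

Other textbooks and software packages use slightly different percentile rules, so results from library functions may not match this procedure exactly.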

QUARTILES
Quartiles are the values that divide an ordered list of numbers into quarters.
The three quartiles are denoted by $Q_1$, $Q_2$, and $Q_3$.

The first quartile, 𝑄1, divides the dataset such that 25% is less than it and
75% is greater. The second quartile, 𝑄2 , is the median, which means it is at the
middle of the dataset. Lastly, the third quartile, 𝑄3 , divides the data set in such
a way that 75% is less than it and 25% is greater.

For example:

𝑄1 𝑄2 𝑄3

1 3 4 6 8 9 10 15 17 20

STANDARD SCORES (z-SCORES)

A standard score indicates how many standard deviations an element is
from the mean. A standard score can be calculated from the following formula:

\[ z = \frac{X - \mu}{\sigma} \]

where $z$ is the z-score, $X$ is the value of the element, $\mu$ is the mean of
the population, and $\sigma$ is the standard deviation.

Here is how to interpret z-score:

● A z-score less than 0 represents an element less than the mean.
● A z-score greater than 0 represents an element greater than the
mean.
● A z-score equal to 0 represents an element equal to the mean.
● A z-score equal to 1 represents an element that is 1 standard
deviation greater than the mean; a z-score equal to 2, 2 standard
deviations greater than the mean; etc.
● A z-score equal to -1 represents an element that is 1 standard
deviation less than the mean; a z-score equal to -2, 2 standard
deviations less than the mean; etc.

For example:

1. An achievement test is administered annually to third graders.
The resulting scores have a mean of 89 and a standard deviation of 13. If
Michael had a score of 95, what is his z-score?

Given: $\mu = 89$; $\sigma = 13$; $X = 95$

Solution:

\[ z = \frac{X - \mu}{\sigma} = \frac{95 - 89}{13} = 0.46 \]

The value $z = 0.46$ means that Michael's score of 95 is 0.46
standard deviations above the mean.

2. The z-score of James in a Biology test is 1.2. If the mean of the scores
in the test is 79 and the standard deviation is 5, what is the score of
James?
Given: $\mu = 79$; $\sigma = 5$; $z = 1.2$
Solution:

\[ z = \frac{X - \mu}{\sigma} \]

Rearrange the equation to solve for $X$.

\[ z\sigma = X - \mu \]
\[ X = z\sigma + \mu \]

Substitute the given values, then solve.

\[ X = 1.2(5) + 79 = 85 \]

The score of James is 85.
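
Both directions of the z-score formula, from a raw value to $z$ and from $z$ back to a raw value, can be sketched as follows. The function names are illustrative, and the numbers are those of the achievement test example (mean 89, standard deviation 13, score 95).

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def value_from_z(z, mean, std_dev):
    """Rearranged formula: X = z * sigma + mu."""
    return z * std_dev + mean


z = z_score(95, mean=89, std_dev=13)
print(round(z, 2))

# Converting the z-score back recovers the original score.
print(value_from_z(z, mean=89, std_dev=13))
```

A positive result means the value is above the mean, and a negative result means it is below the mean.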

BOX AND WHISKER PLOT

A box-and-whisker plot shows the important values of a data set: the lowest
value, the three quartiles, and the highest value.

For example, analyze the given plot below:

The lowest value is 2 and the highest value is 8. The first quartile is 4, the
second quartile is 5 and the third quartile is 7.

Second example, construct the box and whisker plot of the following data:

12, 5, 7, 9, 10, 14, 8, 11, 12, 8, 10, 12

Step 1: Arrange the data in ascending order.

5, 7, 8, 8, 9, 10, 10, 11, 12, 12, 12, 14

Step 2: Separate them into quarters.

5, 7, 8 | 8, 9, 10 | 10, 11, 12 | 12, 12, 14

Step 3: Determine the quartiles, the lowest value, and the highest value. Each
quartile is the mean of the two values on either side of its dividing line.

𝑄1 = 8; 𝑄2 = 10; 𝑄3 = 12; Lowest value = 5; Highest value = 14

Step 4: Set-up the Box and Whisker plot
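
The values needed for the plot can also be computed with a short Python sketch. It assumes the median-of-halves convention, in which each quartile is the median of one half of the sorted data; textbooks and software differ on quartile conventions, so other tools may report slightly different quartiles. The function name `five_number_summary` is illustrative only.

```python
from statistics import median


def five_number_summary(values):
    """Return (lowest, Q1, Q2, Q3, highest) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the median position
    upper = data[half + n % 2:]  # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]


scores = [12, 5, 7, 9, 10, 14, 8, 11, 12, 8, 10, 12]
print(five_number_summary(scores))  # (5, 8.0, 10.0, 12.0, 14)
```

For the twelve values in the example, the sketch returns a lowest value of 5, quartiles of 8, 10 and 12, and a highest value of 14.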

MODULE IV – Data Management
Learning Activity 4 – Measures of Relative Position

Name: ______________________________________________________
Course, Year and Section: _____________________________________

MULTIPLE CHOICE. Choose the best answer from the given choices and write
the letter of your choice before each number.

1. Anna and her brother are both 62 inches tall. Anna is in the 85th
percentile for height for her age and her little brother is in the 90th
percentile. Who is taller for their age?
A. They’re the same, both 62 inches.
B. Anna, she’s taller than more people.
C. Brother, he’s taller than more people.
D. You can’t tell if you don’t know the mean and the standard deviation.

2. Quintiles divide a data set into five regions. Which of the following is a true
statement about quintiles?
A. Each region contains about 5% of the data
B. Each region contains about 20% of the data
C. Each region contains about 25% of the data
D. Each region contains about 50% of the data

3. Clark scored at the 99th percentile on a test. How should he interpret this
information?
A. Clark scored better than 99% of people who took this test
B. Clark scored worse than 99% of people who took this test
C. Clark got 99% of the questions on the test right
D. Clark got 99% of the questions on the test wrong

4. Kyle receives a salary in the 70th percentile. Should he be pleased with his
salary?
A. Yes, because most of the employees receive a salary less than or equal
to his.
B. No, because the salary is not sufficient for his needs.
C. Yes, because only 30% of the employees receive a salary greater than his.
D. No, because 50% of the employees receive the same salary as he does.

5. Yes FM 101.1 station has a low number of listeners. 75% of all country stations
in the Philippines have more listeners than this station. Let the data set “L”

consist of the number of listeners for each country station in the Philippines.
Which of the following is a true statement about the Yes FM 101.1 station’s
position in L?
A. It is between the 20th and 30th percentile
B. It is below the 20th percentile
C. It is between the 70th and 80th percentile
D. It is above the 80th percentile

6. The scores on a summative examination are presented in the table below
in increasing order of magnitude. A score of 63 is approximately equivalent
to a percentile rank of ____.

47 48 56 56 56 57
57 57 57 57 58 58
59 59 60 60 60 60
61 61 61 62 62 62
63 64 64 65 65 65

A. 20 B. 25 C. 63 D. 82

7. In an 80-item test, the passing mark is the 3rd quartile. What does it imply?
A. The students should answer at least 60 items correctly.
B. The students should answer at least 40 items correctly.
C. The students should answer at most 60 items correctly.
D. The students should answer at most 40 items correctly.

8. The summative scores of selected Grade 10 students are:

9, 12, 19, 10, 26, 24, 17, 15, 30, 17, 5, 9, 15, 8, 17, 12, 15, 20, 21

How many students got a score below 10?

A. 4 B. 5 C. 6 D. 7

9. Mark Christian discovered that his grade on a recent test was at the 72nd
percentile. If 90 students took the test, approximately how many
students received a higher grade than he did?

A. 72 B. 62 C. 25 D. 18

10. A national achievement test is administered annually to 6th graders. The test
has a mean score of 100 and a standard deviation of 15. If Cassandra’s
z-score is 1.20, what was her score on the test?

A. 82 B. 88 C. 100 D. 118

11. The average waist size for teenage males is 29 inches with a standard
deviation of 2 inches. If waist sizes are normally distributed, determine the
z-score of a teenage male with a 33-inch waist.

A. 2 B. 1 C. -2 D. -1

12. What does point C on the box plot represent? (See figure below)

A. First Quartile
B. Median
C. Third Quartile
D. Mean

13. What is the least and greatest value of this data set? (See figure below)
A. 6 and 8
B. 6 and 9
C. 8 and 15
D. 15 and 19

14. Which is NOT in the middle 50% of data values? (See figure below)

A. 21 B. 17 C. 15 D. 11

15. If 44 values were used for the data, about how many data values are greater
than the 1st quartile?

A. 11 B. 22 C. 33 D. 128

Lesson 5. Probabilities and Normal Distributions

PROBABILITY

A random event is something that may or may not occur, while a random
variable is a quantity whose value depends on the outcome of a random event.
For example, when you toss a die, the outcome can be represented by 𝑋, which
is called a random variable. The toss resulting in 1, written 𝑋 = 1, is an
example of a random event.

The complement of an event is its non-occurrence, or its opposite. An
example of a random event is a flipped coin coming up heads. Its
complement is the coin coming up tails.

Probability is a way of quantifying the likelihood (chance) that some
random event occurs. It is often reported as a percentage, but formally it should be
given as a proportion. For example, if there is a 25% chance of something
happening, then its probability is 0.25.

The value of a probability is between 0 and 1, where 0 means impossible
and 1 means absolutely certain.

The probability that the event 𝑋 will occur is denoted by 𝑃(𝑋). Further, the
probability that 𝑋 does not occur is 1 − 𝑃(𝑋). In other words, the probability of
the complement of 𝑋 is 1 − 𝑃(𝑋).

For example, if the weather forecast says that the probability of rain
tomorrow is 0.8, then the probability that it will not rain is 1 − 0.8 = 0.2.

NORMAL DISTRIBUTION

A random variable is normally distributed or has a normal probability
distribution if its relative frequency histogram has the shape of a normal curve.

The normal curve:

The shape of the normal curve depends on the mean and standard
deviation of the population for the associated random variable.

The graph above shows a selection of normal curves for various values of 𝜇 and
𝜎. The curve is always bell-shaped and always centered at the mean 𝜇. Larger
standard deviations give a curve that is more spread out.

Properties of a Normal Curve

1. All normal curves have the same general bell shape.
2. The curve is symmetric with respect to a vertical line that passes
through the peak of the curve.
3. The curve is centered at the mean 𝜇 which coincides with the median
and the mode and is located at the point beneath the peak of the
curve.
4. The area under the curve is always 1.
5. The curve is completely determined by the mean 𝜇 and the standard
deviation 𝜎. For the same mean, 𝜇, a smaller value of 𝜎 gives a taller
and narrower curve, whereas a larger value of 𝜎 gives a flatter curve.
6. The area under the curve to the right of the mean is 0.5 and the area
under the curve to the left of the mean is 0.5.

The empirical rule tells what percentage of the values of a normally
distributed variable fall within 1, 2, and 3 standard deviations of the mean.

By the empirical rule, the following applies:

● about 68% of the values in a normal distribution will be within one
standard deviation of the mean;
● about 95% of the values in a normal distribution will be within two
standard deviations of the mean;
● about 99.7% of the values in a normal distribution
will be within three standard deviations of the mean.
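
The three percentages in the empirical rule can be reproduced numerically. The sketch below relies on a standard identity that is not derived in this module: for a normal distribution, the share of values within k standard deviations of the mean equals erf(k/√2), where erf is the error function in Python's `math` module. The function name `share_within` is illustrative only.

```python
from math import erf, sqrt


def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The printed shares are approximately 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% quoted in the rule.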

STANDARD NORMAL CURVE

The standard normal curve is the normal curve with mean 𝜇 = 0 and
standard deviation 𝜎 = 1.

The area under the curve can be found using calculus or by using the following table.

𝑧 𝐴(𝑧) 𝑧 𝐴(𝑧) 𝑧 𝐴(𝑧) 𝑧 𝐴(𝑧)


-3.50 0.0002 -1.75 0.0401 0.00 0.5000 1.75 0.9599
-3.45 0.0003 -1.70 0.0446 0.05 0.5199 1.80 0.9641
-3.40 0.0003 -1.65 0.0495 0.10 0.5398 1.85 0.9678
-3.35 0.0004 -1.60 0.0548 0.15 0.5596 1.90 0.9713
-3.30 0.0005 -1.55 0.0606 0.20 0.5793 1.95 0.9744
-3.25 0.0006 -1.50 0.0668 0.25 0.5987 2.00 0.9772
-3.20 0.0007 -1.45 0.0735 0.30 0.6179 2.05 0.9798
-3.15 0.0008 -1.40 0.0808 0.35 0.6368 2.10 0.9821
-3.10 0.0010 -1.35 0.0885 0.40 0.6554 2.15 0.9842
-3.05 0.0011 -1.30 0.0968 0.45 0.6736 2.20 0.9861
-3.00 0.0013 -1.25 0.1056 0.50 0.6915 2.25 0.9878
-2.95 0.0016 -1.20 0.1151 0.55 0.7088 2.30 0.9893
-2.90 0.0019 -1.15 0.1251 0.60 0.7257 2.35 0.9906
-2.85 0.0022 -1.10 0.1357 0.65 0.7422 2.40 0.9918
-2.80 0.0026 -1.05 0.1469 0.70 0.7580 2.45 0.9929
-2.75 0.0030 -1.00 0.1587 0.75 0.7734 2.50 0.9938
-2.70 0.0035 -0.95 0.1711 0.80 0.7881 2.55 0.9946
-2.65 0.0040 -0.90 0.1841 0.85 0.8023 2.60 0.9953
-2.60 0.0047 -0.85 0.1977 0.90 0.8159 2.65 0.9960
-2.55 0.0054 -0.80 0.2119 0.95 0.8289 2.70 0.9965
-2.50 0.0062 -0.75 0.2266 1.00 0.8413 2.75 0.9970
-2.45 0.0071 -0.70 0.2420 1.05 0.8531 2.80 0.9974
-2.40 0.0082 -0.65 0.2578 1.10 0.8643 2.85 0.9978
-2.35 0.0094 -0.60 0.2743 1.15 0.8749 2.90 0.9981
-2.30 0.0107 -0.55 0.2912 1.20 0.8849 2.95 0.9984
-2.25 0.0122 -0.50 0.3085 1.25 0.8944 3.00 0.9987
-2.20 0.0139 -0.45 0.3264 1.30 0.9032 3.05 0.9989
-2.15 0.0158 -0.40 0.3446 1.35 0.9115 3.10 0.9990
-2.10 0.0179 -0.35 0.3632 1.40 0.9192 3.15 0.9992
-2.05 0.0202 -0.30 0.3821 1.45 0.9265 3.20 0.9993
-2.00 0.0228 -0.25 0.4013 1.50 0.9332 3.25 0.9994
-1.95 0.0256 -0.20 0.4207 1.55 0.9394 3.30 0.9995

-1.90 0.0287 -0.15 0.4404 1.60 0.9452 3.35 0.9996
-1.85 0.0322 -0.10 0.4602 1.65 0.9505 3.40 0.9997
-1.80 0.0359 -0.05 0.4801 1.70 0.9554 3.45 0.9997
3.50 0.9998

In the table, the value of the variable 𝑧 is the z-score, that is, the number of
standard deviations away from the mean, and 𝐴(𝑧) is the area under the standard
normal curve to the left of the 𝑧 value.

For example, if 𝑧 = 1, then the area under the normal curve to the left of
1, as seen in the figure, is 0.8413.

𝑧 𝐴(𝑧)
1 0.8413
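
Table entries can be checked without the printed table. The sketch below assumes the standard relation between the left-tail area and the error function, 𝐴(𝑧) = ½[1 + erf(𝑧/√2)], which is not stated in this module; the function name `area_left` is illustrative only.

```python
from math import erf, sqrt


def area_left(z):
    """Area under the standard normal curve to the left of z, i.e. A(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


for z in (-1.0, 0.0, 1.0, 2.0):
    print(f"A({z:+.2f}) = {area_left(z):.4f}")
```

Rounded to four decimal places, the results agree with the table: 𝐴(−1) = 0.1587, 𝐴(0) = 0.5000, 𝐴(1) = 0.8413 and 𝐴(2) = 0.9772.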

Since the total area under a normal curve is 1, areas under the curve can be
used to represent probabilities. A normally distributed variable can be
converted to a z-score, which can then be used to determine a probability. Take note
that the z-distribution should only be used to calculate probabilities when the
variable in question is known to be normally distributed.

This means that in the above example of 𝑧 = 1 and 𝐴(𝑧) = 0.8413,
the probability that 𝑍 ≤ 1 is the same as 𝐴(1), that is, 𝑃(𝑍 ≤ 1) = 0.8413.

Another example, what is the probability that 𝑍 ≤ −1? Sketch the region
under the standard normal curve whose area is equal to 𝑃(𝑍 ≤ −1).

Since 𝐴(−1) = 0.1587, then 𝑃(𝑍 ≤ −1) = 0.1587.

What if you want to know the area to the right of a value?

The area to the right of the value is the area of its complement. Given 𝑧
and you would like to know the area to its right, it would be 1 − 𝐴(𝑧) which is
equal to 𝑃(𝑍 ≥ 𝑧).

Example: If 𝑍 is a standard normal random variable, find 𝑃(𝑍 ≥ 2). Sketch the region
under the curve whose area is equal to 𝑃(𝑍 ≥ 2).

𝑃(𝑍 ≥ 2) = 1 − 𝐴(2) = 1 − 0.9772 = 0.0228

The area between two values is represented by 𝐴(𝑧2 ) − 𝐴(𝑧1 ) which is the same as
𝑃(𝑧1 ≤ 𝑍 ≤ 𝑧2 ).

Example: If 𝑍 is a standard normal random variable, find 𝑃(−3 ≤ 𝑍 ≤ 3). Sketch the
region under the standard normal curve whose area is equal to 𝑃(−3 ≤ 𝑍 ≤ 3).

𝑃(−3 ≤ 𝑍 ≤ 3) = 𝐴(3) − 𝐴(−3) = 0.9987 − 0.0013 = 0.9974

(Computed without rounding the table entries, the area is about 0.9973.)
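
The right-tail and between-values rules can be sketched in Python with the standard library's `statistics.NormalDist` class, whose `cdf` method returns the area to the left of a value. The helper names `area_right` and `area_between` are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def area_right(z):
    """P(Z >= z) = 1 - A(z)."""
    return 1 - Z.cdf(z)


def area_between(z1, z2):
    """P(z1 <= Z <= z2) = A(z2) - A(z1)."""
    return Z.cdf(z2) - Z.cdf(z1)


print(round(area_right(2), 4))        # 0.0228
print(round(area_between(-3, 3), 4))  # 0.9973
```

Because the code does not round 𝐴(3) and 𝐴(−3) before subtracting, its second result can differ in the last decimal place from a calculation done with four-decimal table entries.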

GENERAL NORMAL RANDOM VARIABLE

The previous topic dealt with the standard normal random variable, which has
mean 𝜇 = 0 and standard deviation 𝜎 = 1. Not all cases are like this. A normal
random variable may have a mean other than 0 and a standard deviation other than 1.

Take the following situation as an example:

The scores in an exam are normally distributed with a mean of 90
and a standard deviation of 10 points. What percentage of students got a score
between 70 and 110?

To solve the problem, you need to standardize – convert all relevant values
of the general normal random variable to 𝑧-scores, and then calculate the
probabilities of these 𝑧-scores from the standard normal table.

Recall that the formula for the 𝑧-score is

𝑧 = (𝑋 − 𝜇) / 𝜎

where 𝑧 is the 𝑧-score, 𝜇 is the mean, 𝜎 is the standard deviation and 𝑋 is
a normal random variable.

Now, to solve the previous problem, convert the scores 70 and 110 to
𝑧-scores first.

To convert 70 to a z-score, use 𝑧 = (𝑋 − 𝜇) / 𝜎 and let 𝑋 = 70, 𝜇 = 90, 𝜎 = 10:

𝑧 = (𝑋 − 𝜇) / 𝜎 = (70 − 90) / 10 = −20 / 10 = −2

To convert 110 to a z-score, use 𝑧 = (𝑋 − 𝜇) / 𝜎 and let 𝑋 = 110, 𝜇 = 90, 𝜎 = 10:

𝑧 = (𝑋 − 𝜇) / 𝜎 = (110 − 90) / 10 = 20 / 10 = 2

Thus, by letting 𝑧1 = −2 and 𝑧2 = 2,

𝑃(−2 ≤ 𝑍 ≤ 2) = 𝐴(2) − 𝐴(−2) = 0.9772 − 0.0228 = 0.9544

Therefore, the probability that a student got a score between 70 and 110
is 0.9544; that is, about 95.44% of the students scored in this range.
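
The same standardize-then-look-up procedure can be sketched in Python. The example uses `statistics.NormalDist` from the standard library in place of the printed table; the variable names are illustrative only.

```python
from statistics import NormalDist

scores = NormalDist(mu=90, sigma=10)  # exam scores from the example

# Standardize the two bounds with z = (X - mu) / sigma.
z1 = (70 - scores.mean) / scores.stdev
z2 = (110 - scores.mean) / scores.stdev

# Probability between the bounds, computed on the standard normal curve.
standard = NormalDist()
p = standard.cdf(z2) - standard.cdf(z1)

print(z1, z2)       # -2.0 2.0
print(round(p, 4))  # 0.9545
```

The code gives 0.9545 rather than 0.9544 because it does not round 𝐴(2) and 𝐴(−2) to four decimal places before subtracting. Calling `scores.cdf(110) - scores.cdf(70)` directly gives the same probability without the explicit standardizing step.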

MODULE IV – Data Management
Learning Activity 5 – Probabilities and Normal Distribution

Name: ______________________________________________________
Course, Year and Section: _____________________________________

MULTIPLE CHOICE. Choose the best answer from the given choices and
write the letter of your choice before each number.

1. A data set has a mean of 290 and a standard deviation of 20. Calculate the
z-score for 265.
A. z = 250.5 B. z = 1.25 C. z = - 1.02 D. z = - 1.25

2. A data set has a mean of 300 and a standard deviation of 40. What value
would have a z-score of z = -2?
A. 220 B. 296 C. 298 D. 380

3. Khail took his math test and scored an 88. If the class average was 78 with
a standard deviation of 5, what percent of students earned a grade that was
HIGHER THAN Khail’s grade?
A. 97.5% B. 93% C. 5% D. 2.5%

4. What does it mean to have a z-score of z = 0 on a quiz?


A. It means you did better than zero people.
B. It means you scored the same grade as the average grade.
C. It means you earned a zero on the quiz.
D. It means the average score was a zero.

5. Use the normal graph provided. What is the value of the standard
deviation?
A. 6
B. 12
C. 18
D. 36

6. What percent of scores are below a z-score of z = -1.02? (Use the table at the right)
A. 84.61%
B. 15.87%
C. 15.39%
D. 9.68%

7. The distribution of z-scores will always have a standard deviation of 1.
A. TRUE B. FALSE

8. The mean of the z-scores will always be zero, even if the mean of the raw
scores is 100.
A. TRUE B. FALSE

9. The average height of high school boys is 175 cm with a standard deviation
of 3.5. Approximately what percent of high school boys are TALLER than
180 cm?
A. 1.43% B. 7.54% C. 7.64% D. 92.36%

10. What is the total area under the standard normal distribution curve?
A. 100 B. 25 C. 1 D. .5

11. Which is NOT true about the normal distribution curve?


A. It is bell-shaped.
B. The mean, median, and mode are approximately equal
C. The smaller the standard deviation, the less spread in the curve.
D. It is asymmetrical.

12. According to the empirical rule, how much of the data falls within 1
standard deviation of the mean?
A. 25% B. 68% C. 95% D. 99.7%

13. Which best describes the shaded part of this normal distribution graph?

A. All data that is one or higher.


B. All data that is one or more standard deviations above the mean.
C. All data that is between 1 and 3
D. All data that is above the mean.

14. Use the following information and the Empirical Rule to estimate the
answer.
The ages of golfers are normally distributed, with a mean of 38 and a
standard deviation of 4. Find the percentage of golfers that are between 30
and 46 years old.

A. 68% B. 94% C. 95% D. 99.7%
15. The mean number of accidents a week at a company is 6.4 with a standard
deviation of 1.5. What proportion of weeks would you expect to have fewer
than 5 accidents?
A. 0.8238 B. 0.6915 C. 0.1762 D. -0.93

16. The mean GPA of students in a course at College of Arts and Sciences is 3.2
with a standard deviation of 0.3. What percent of students in the course
have a GPA between 2.9 and 3.8?
A. 95% B. 81.5% C. 68% D. 47.5%

17. The mean life of a tire is 30,000 km. The standard deviation is 2,000 km.
Then, 68% of all tires will have a life between ______ km and ______ km.
A. 28,000 km and 32,000 km C. 26,000 km and 34,000 km
B. 24,000 km and 34,000 km D. 27,000 km and 31,000 km

18. In research, 30% of heavy smokers are suffering from lung cancer. If 240
heavy smokers are chosen at random, find the probability that 70 to 91
are suffering from lung cancer.
A. 0.6338 B. 0.3632 C. 0.36022 D. 0.00298

19. Find a if P (Z < a) = 0.35


A. 0.6368 B. 0.3853 C. 0.3632 D. -0.3853

20. The 40-yard sprint times for a soccer team are found to be normally
distributed with a mean of 5.2 seconds and a standard deviation of 0.3
seconds. What is the z-score for a player who runs a time of 5.6 seconds?
A. 1.33 B. 1.02 C. 0.88 D. -1.33

Lesson 6. Simple Linear Regression and Correlation

Simple linear regression and correlation are both used when we are
investigating the relationship between two variables. In many fields of research,
we are often interested in describing the change in one variable (Y,
the dependent variable) in terms of a unit change in a second variable (X, the
independent variable). Correlation measures the strength and shows the direction
of the relationship between these two variables, while simple linear regression
describes the relationship with an equation. A simple linear regression takes the form

Ŷ = 𝑎 + 𝑏𝑋

where Ŷ is the predicted value of Y for a given value of X, a estimates the
intercept of the regression line with the Y axis, and b estimates the slope or rate
of change in Y for a unit change in X.

The regression coefficients, a and b, are calculated from a set of paired
values of X and Y. The problem of determining the best values of a and b involves
the principle of least squares. To illustrate the degree of relationship between
two quantitative variables, we usually start by representing it graphically using
a scatter diagram or scatterplot.

A scatter diagram is a type of diagram that uses the x- and y-coordinates
to display values for two variables. To illustrate the principle, we will use the
artificial data presented as a scatter diagram in Figure 6.1.

Figure 6.1. A scatter diagram to illustrate the linear relationship between 2 variables.

Because of the existence of experimental errors, the observations (Y) made
for a given set of independent values (X) will not permit the calculation of a single
straight line that will go through all the points. The least squares line is the line
that goes through the points so that the sum of the squares of the vertical
deviations of the points from the line is minimal. Those with a knowledge of
calculus should recognize that this is a problem of finding the minimum value of
a function. That is, set the first derivatives of the sum of squared deviations with
respect to a and b to zero and solve for a and b. This procedure yields the
formulas for a and b, based on n pairs of X and Y, that are given later in this
lesson. If X is not a random variable, the coefficients so obtained are the best
linear unbiased estimates of the true parameters.
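
For readers who want the calculus step spelled out, the minimization can be sketched as follows. The symbol S for the sum of squared vertical deviations and the subscript i for individual pairs are notation introduced here for illustration; they are not used elsewhere in this module.

```latex
\begin{align*}
S(a, b) &= \sum_{i=1}^{n} \left( Y_i - a - b X_i \right)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} \left( Y_i - a - b X_i \right) = 0
  \quad\Longrightarrow\quad \sum Y = n a + b \sum X \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} X_i \left( Y_i - a - b X_i \right) = 0
  \quad\Longrightarrow\quad \sum XY = a \sum X + b \sum X^2 \\
b &= \frac{\sum XY - \left(\sum X\right)\left(\sum Y\right)/n}{\sum X^2 - \left(\sum X\right)^2/n},
  \qquad a = \bar{Y} - b \bar{X}
\end{align*}
```

Solving the first equation for a gives a = Y̅ – bX̅; substituting that into the second equation and solving for b gives the slope formula.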

Looking at the diagram, Figure 6.1, the distribution of the points follows
an upward direction. An upward direction signifies a positive relationship. This
means that as one variable increases, the other also increases.

The types of relationship which can be identified by looking at a scatterplot
are as follows:

1. A relationship is linear when the points on the scatterplot seem to
follow a straight-line pattern. A linear relationship can be either
positive or negative. It is positive when the line tends to go up, and it
is negative when it goes down.

2. A relationship is non-linear when the points follow a pattern that is not a
straight line.

3. There is no correlation if the points on the scatterplot
follow no pattern at all.

In this lesson, we will be concentrating on variables with a linear
relationship. When two variables have a linear relationship, we can determine a
linear equation which can be used as the mathematical model of this linear
relationship. The process of finding the equation of the line that best fits the given
points is known as linear regression.

To illustrate, let’s study Figure 6.1.

b = Σ(X – X̅)(Y – Y̅) / Σ(X – X̅)² = [ΣXY – (ΣX)(ΣY)/n] / [ΣX² – (ΣX)²/n]

a = [(ΣX²)Y̅ – X̅(ΣXY)] / [ΣX² – (ΣX)²/n] = Y̅ – bX̅

Table 6.1. Elements necessary to compute the least squares regression
for changes in % sucrose with changes in N-fertilizer.

X (lb N/acre)   Y (mean % sucrose)   X²        XY        Ŷ (predicted % sucrose)   Y − Ŷ
0               16.16                0         0         16.22                     -0.06
50              15.74                2,500     787       15.78                     -0.04
100             15.29                10,000    1,529     15.35                     -0.06
150             15.29                22,500    2,293.5   14.92                     0.37
200             14.36                40,000    2,872     14.48                     -0.12
250             13.94                62,500    3,485     14.05                     -0.11

ΣX = 750; ΣY = 90.78; ΣX² = 137,500; ΣXY = 10,966.5
X̅ = 125; Y̅ = 15.13; mean of X² = 22,916.67

b = [ΣXY – (ΣX)(ΣY)/n] / [ΣX² – (ΣX)²/n] = [10,966.5 – (750)(90.78)/6] / [137,500 – (750)²/6] = -0.0087

a = Y̅ – bX̅ = 15.13 – (-0.0087)(125) = 16.22

The resulting regression equation is Ŷ = 16.22 + (−0.0087)X. This
equation says that for every additional pound of fertilizer N per acre, % sucrose
decreases by 0.0087 percentage points. Our best estimate of percent sucrose from
0 to 250 lb N/acre is determined by substituting the N rate in the regression equation
and calculating Ŷ (the Ŷ column of Table 6.1). For example, we may want to
estimate % sucrose for 140 lb N/acre; then

Ŷ = 16.22 − 0.0087(140) = 15.002
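
The fertilizer calculation can be reproduced with a short Python sketch of the least squares formulas. The function name `least_squares` and the variable names are illustrative only; the data are the X and Y columns of Table 6.1.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least squares line Y-hat = a + b*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b


nitrogen = [0, 50, 100, 150, 200, 250]                # lb N/acre
sucrose = [16.16, 15.74, 15.29, 15.29, 14.36, 13.94]  # mean % sucrose

a, b = least_squares(nitrogen, sucrose)
print(round(a, 2), round(b, 4))  # 16.22 -0.0087
print(round(a + b * 140, 2))     # 15.0
```

The prediction for 140 lb N/acre comes out at about 15.00 rather than 15.002 because the code carries the unrounded coefficients through the calculation.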

Correlation

Correlation analysis measures how two variables are related. The
correlation coefficient (r) is a statistic that tells you the STRENGTH and
DIRECTION of that relationship. It is expressed as a positive or negative number
between -1 and 1. The value of the number indicates the strength of the
relationship:

r = 0 means there is no correlation

r = 1 means there is perfect positive correlation
r = -1 means there is a perfect negative correlation

The sign of the correlation coefficient indicates whether the direction of
the relationship is positive (direct) or negative (inverse). Variables which have a
direct relationship (a positive correlation) increase and decrease together. In an
inverse relationship (a negative correlation), one variable increases while the
other decreases. While the sign indicates how one variable changes with respect
to another variable, the magnitude of the number indicates the strength of the
relationship. The absolute value of the correlation coefficient gives us the
relationship strength. The larger the absolute value, the stronger the relationship.

It is important to remember that while correlation coefficients can be
used for prediction (i.e. if we know the value for one variable, and the
correlation, we can predict what the value of the second variable will be), they
may NOT be used to establish causation (i.e. we cannot say that one variable causes
another).

There are several types of correlation coefficient formulas. One of the
most commonly used is the Pearson Product-Moment Correlation
Coefficient. Let’s look at how we can calculate the correlation coefficient using
this method, developed by Karl Pearson during the latter half of the nineteenth
century while conducting a series of studies on individual differences with Sir
Francis Galton. We now typically refer to it as Pearson’s r. The calculation
is based on the concept of the Z-score; specifically, taking the mean of the
Z-score products from the X and Y variables. However, in this lesson, we are going
to use the derived formula of Pearson’s r:

r = [n(ΣXY) – (ΣX)(ΣY)] / √{[nΣX² – (ΣX)²] [nΣY² – (ΣY)²]}

By way of our first illustration, let’s consider something with which we
are all familiar: age. Let’s begin by asking if people tend to marry other people
of about the same age. Our experience tells us “yes,” but are we confident in
this answer? One way to address the question is to look at pairs of ages for a
sample of married couples. The sample data below does just that with the ages
of 10 married couples. Going across the columns we see that, yes, husbands and
wives tend to be of about the same age, with men having a tendency to be
slightly older than their wives. This is no big surprise, but at least the data bear
out our experiences, which is not always the case. What we know of statistics,
however, tells us that what we see is not always significant. Let’s apply the
Pearson r formula and see what happens.

Husband (X) Wife (Y)


36 35
72 69
37 34
36 35
51 50
50 47
47 47
50 45
37 36
41 41

Always start an investigation of bivariate data with a graph. Notice the
scatterplot indicates a fairly strong, linear association between the ages of
husbands and wives.

HUSBANDS WIVES
Pair X Y X2 Y2 XY
1 36 35 1296 1225 1260
2 72 69 5184 4761 4968
3 37 34 1369 1156 1258
4 36 35 1296 1225 1260
5 51 50 2601 2500 2550
6 50 47 2500 2209 2350
7 47 47 2209 2209 2209
8 50 45 2500 2025 2250
9 37 36 1369 1296 1332
10 41 41 1681 1681 1681
Σ 457 439 22005 20287 21118

Another issue to consider is how many data points we have. Although we
have 10 for husbands and 10 for wives, we do not have 20 independent data
points. Instead, we have 10 pairs of data points, so n = 10. Remember, correlation
concerns the relationship between the two variables, and husbands and wives
definitely form pairs. It certainly wouldn’t make sense to treat all 20 individuals
as independent and allow them to pair up in some random manner.

n = number of pairs

Since we have all the information needed, we can plug it all into the
formula!

r = [n(ΣXY) – (ΣX)(ΣY)] / √{[nΣX² – (ΣX)²] [nΣY² – (ΣY)²]}

r = [10(21,118) – (457)(439)] / √{[10(22,005) – (457)²] [10(20,287) – (439)²]}

r = 0.99
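
The substitution can be checked with a short Python sketch of the derived formula. The function name `pearson_r` is illustrative only; the data are the ten husband and wife ages from the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


husbands = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wives = [35, 69, 34, 35, 50, 47, 47, 45, 36, 41]

print(round(pearson_r(husbands, wives), 2))  # 0.99
```

Points that lie exactly on a rising line, such as (1, 2), (2, 4), (3, 6), give r = 1, and points on a falling line give r = -1.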

So based on our data, there is a positive correlation of 0.99 between the
ages of husbands and wives. Now, is this significant? To check for a significant
Pearson r correlation, we turn to a new set of tables. The concept, however, will
remain the same. We need to compare our calculated r to a critical r value found
in the table.

A hypothesis test for correlation will start with a null hypothesis of “zero
correlation.” Then the alternative hypothesis will be “there is a non-zero
correlation.” In the hypotheses below, the Greek letter rho represents the true
correlation in the population from which our sample is drawn.

Ho : ρ = 0
Ha : ρ ≠ 0

Critical values for the Pearson r

Df = n-2 α = 0.05 α = 0.01


1 0.99700 0.99990
2 0.95000 0.99000
3 0.87800 0.95900
4 0.81100 0.91700
5 0.75400 0.87400
6 0.70700 0.83400
7 0.66600 0.79800
8 0.63200 0.76500
9 0.60200 0.73500
10 0.57600 0.70800
11 0.55300 0.68400
12 0.53200 0.66100
13 0.51400 0.64100
14 0.49700 0.62300
15 0.48200 0.60600
16 0.46800 0.59000
17 0.45600 0.57500
18 0.44400 0.56100
19 0.43300 0.54900
20 0.42300 0.53700
21 0.41300 0.52600
22 0.40400 0.51500
23 0.39600 0.50500
24 0.38800 0.49600
25 0.38100 0.48700
26 0.37400 0.47900
27 0.36700 0.47100
28 0.36100 0.46300
29 0.35500 0.45600

30 0.34900 0.44900
35 0.32500 0.41800
40 0.30400 0.39300
45 0.28800 0.37200
50 0.27300 0.35400
60 0.25000 0.32500
70 0.23200 0.30300
80 0.21700 0.28300
90 0.20500 0.26700
100 0.19500 0.25400

For a Pearson r correlation, our df = n – 2, where n is still equal to the
number of pairs. Here df = 10 – 2 = 8. So, what is our decision? Even if we use
the more conservative alpha = 0.01, our calculated r (0.99) exceeds the critical r
(from the table) of 0.765. So, our correlation is significant.

When it comes to correlations, we are faced with a problem not observed
with our previous hypothesis testing. The issue is with significance. Look at the
above table. If we had 3 pairs, what is the critical r? With alpha = 0.05 and
df = 1, our critical r would be 0.997. That means that we would need a very strong
correlation for significance. Now look at the other end. What if we had df =
100? Now our critical r becomes 0.195 at alpha = 0.05. What this tells you is
that all we need for a significant correlation is a large number of pairs.
This means that if the data set is large enough, even a very slight correlation may
appear to be significant. You should always examine a scatterplot of your data
before computing and testing correlation. If the scatterplot
doesn’t suggest a linear pattern in the data, don’t calculate the correlation.
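
The decision rule and the effect of sample size can be sketched in Python. The dictionary below copies only three rows of the critical-value table, and the names `CRITICAL_R` and `is_significant` are illustrative. Comparing the absolute value of r with the critical value is an assumption made here to match the two-sided alternative hypothesis ρ ≠ 0.

```python
# Critical values of Pearson's r copied from the table: df -> {alpha: critical r}.
CRITICAL_R = {
    1: {0.05: 0.997, 0.01: 0.9999},
    8: {0.05: 0.632, 0.01: 0.765},
    100: {0.05: 0.195, 0.01: 0.254},
}


def is_significant(r, n_pairs, alpha=0.05):
    """True when |r| exceeds the critical value for df = n_pairs - 2."""
    df = n_pairs - 2
    return abs(r) > CRITICAL_R[df][alpha]


print(is_significant(0.99, n_pairs=10, alpha=0.01))  # True
print(is_significant(0.99, n_pairs=3, alpha=0.05))   # False
print(is_significant(0.20, n_pairs=102, alpha=0.05)) # True
```

The same r of 0.99 is significant with 10 pairs but not with 3, while a weak r of 0.20 passes the test once there are 102 pairs.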

Coefficient of Determination (r²)

The coefficient of determination is another statistic of interest. The good
news: just square r. The coefficient of determination is used to establish
the proportion of the variability among the Y scores that can be accounted for
by the variability among the X scores. For the husband-and-wife data we
computed Pearson’s r to be 0.99 (remember, this means the data points follow
a positive-sloping line quite closely). If we now square r we get approximately
0.98. The coefficient of determination (often just called r-squared by
statisticians) is generally reported in percent form, so we have about
98%.

We originally suspected that husbands and wives would tend to be similar
in age. A scatterplot of the data suggests there is a fairly strong linear
relationship present. And the correlation of 0.99 further describes how strong
the relationship is (close to 1 = strong linear relationship). Now we can use the
r-squared value to say that about 98% of the variation in the ages of the women
(the Y values) can be attributed to the linear
relationship with the X values (the ages of the men). The remaining 2% of the
variation in the ages of the women is then due to other factors besides the ages
of the men.
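
The "share of variation" reading of r-squared can be illustrated in Python by fitting the least squares line and comparing the leftover (residual) variation with the total variation in Y. The identity r² = 1 − SS_residual / SS_total holds for simple linear regression but is not derived in this module; the function name `r_squared` is illustrative only.

```python
def r_squared(xs, ys):
    """Share of the variation in Y explained by the least squares line on X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total


husbands = [36, 72, 37, 36, 51, 50, 47, 50, 37, 41]
wives = [35, 69, 34, 35, 50, 47, 47, 45, 36, 41]

print(round(r_squared(husbands, wives), 2))  # 0.98
```

The result matches the value obtained by squaring Pearson's r for the same data.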

Note that the closer to 100% the value of r-squared, the stronger the
relationship between X and Y. However, we cannot say that changes in X cause
the variation in Y. All we can say is there is a strong association between the
two variables. There may be some other underlying factor that actually
serves as the engine causing change. One of the most common errors made in
interpreting bivariate data is to wrongly equate causation with association.

MODULE IV – Data Management
Learning Activity 6 – Simple Linear Regression and Correlation

Name: ______________________________________________________
Course, Year and Section: _____________________________________

Perform what is being asked in the following. Show all your solutions
and encircle/box your final answers.

1. In determining the relationship of plant growth rate (absolute ratio of dry
matter accumulation per plant) and growth rate of the plant tissue (leaf area
growth rate), the following data were obtained in the time interval from 22
to 29 days after planting for 10 soybean cultivars.

Growth Rate (g/plant/day), Y    Leaf Area Growth Rate (cm/plant/day), X
0.339 34.33
0.398 51.70
0.386 48.61
0.385 45.64
0.378 42.52
0.368 39.09
0.356 38.43
0.354 36.13
0.353 41.30
0.351 37.36

A. Find the regression line of Y on X.
B. Give some reasons why the leaf area growth rate is or is not a good
predictor of the plant growth rate for the data.

2. Find the correlation coefficient for age versus glucose level using the
following table from a pre-diabetic study of 6 participants, and interpret the
results.

Subject    Age (X)    Glucose Level (Y)
1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81

3. Two treatments were used in an experiment to determine the N-treatment
effect on growth rate of potato tubers. The conventional method used 260
Kg N/ha in four applications and 45 cm irrigations. The improved treatment
was intended to minimize leaching of nitrogen. It received 170 Kg N/ha in
ten applications and 27 cm irrigations. The following average accumulated
dry matter yields of tubers (g/plant) were obtained.

Days after emergence, X    Conventional, Y1    Improved, Y2
35 8 9
49 50 52
71 180 170
91 250 270
104 270 310

A. Find the regression equation of Y1 on X.
B. Find the regression equation of Y2 on X.
C. Compute the coefficient of correlation between X and Y1, Y1 and Y2,
and X and Y2.
D. Compute the coefficient of determination between X and Y1, and between X
and Y2. Interpret the results.
