
Note

Chapter Five discusses the concept of sampling, which involves selecting a part of a population to make inferences about the whole. It defines key terms such as population, census, sampling frame, and sampling design, and explains the importance of sample size determination and various sampling methods. The chapter emphasizes the need for representative samples to ensure valid conclusions and outlines different techniques including random sampling, stratified sampling, cluster sampling, and systematic sampling.

Uploaded by

Feyisa Danu

Chapter Five

5. Sampling
The concept of sampling
• Sampling may be defined as the selection of some part of an aggregate or totality, on the basis of which a judgment or inference about the aggregate or totality is made.
• In other words, it is the process of obtaining information about an entire population by examining only a part of it.
• In most research work and surveys, the usual approach is to generalize, or to draw inferences about the parameters of a population, from samples taken from that population.
• The researcher quite often selects only a few items from the universe for study, on the assumption that the sample data will make it possible to estimate the population parameters.
• The items so selected constitute what is technically called a sample; the process or technique used to select them is called the sample design.
• A sample should be truly representative of the population characteristics, without any bias, so that it leads to valid and reliable conclusions.
Some Fundamental Definitions
Before we talk about details and uses of sampling, it seems appropriate that we should be
familiar with some fundamental definitions concerning sampling concepts and principles.
1. Population: refers to the totality of items about which information is desired.
• The population is the target of an investigation, and the objective of the investigation is to draw conclusions about it; hence it is sometimes called the target population.
• The population or universe can be finite or infinite.
• The population is said to be finite if it consists of a fixed number of elements, so that it is possible to enumerate it in its totality. For instance, the population of a city, the number of workers in a factory, etc.
• An infinite population is one in which it is theoretically impossible to observe all the elements, e.g. the number of stars in the sky. In practice, we use the term infinite population for a population that cannot be enumerated in a reasonable period of time. In this way the theoretical concept of an infinite population serves as an approximation of a very large finite population.
Examples
• Population of trees under specified climatic conditions
• Population of animals fed a certain type of diet
• Population of farms having a certain type of natural fertility
• Population of households, etc.
2. Census: a complete enumeration of the population. In most real problems a census is not feasible, hence we take a sample.
3. Sampling frame: the group, cluster or list of items from which the sample is to be drawn. Whatever the frame may be, it should be a good representation of the population.
4. Sampling design: a definite plan for obtaining a sample from the sampling frame. It refers to the technique or procedure the researcher adopts in selecting the sampling units.
5. Statistic(s) and parameter(s): A statistic is a characteristic of a sample, whereas a parameter is a characteristic of a population. Thus, when measures such as the mean, median or mode are worked out from samples, they are called statistics, for they describe the characteristics of a sample. When such measures describe the characteristics of a population, they are known as parameters. For instance, the population mean ($\mu$) is a parameter, whereas the sample mean ($\bar{x}$) is a statistic.
6. Errors: there will always be a certain amount of inaccuracy in the information collected. This inaccuracy may be of two types, i.e. sampling error (error variance) and non-sampling error.
a) Sampling error:
• Is the discrepancy between the population value and the sample value.
• May arise from the use of inappropriate sampling techniques.
b) Non-sampling errors: are errors due to procedural bias, such as:
• Incorrect responses.
• Measurement errors.
• Errors at different stages in processing the data.
[Figure: population, sampling frame and sample, with frame error, chance error and response error labelled.]
Sampling error = Frame error + chance error + response error.
(If we add measurement error or the non-sampling error to sampling error, we get total error.)

• The more homogeneous the universe, the smaller the sampling error.
• Sampling error is inversely related to the size of the sample, i.e. sampling error decreases as the sample size increases, and vice versa.
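
The inverse relation between sampling error and sample size can be illustrated with a small sketch. It assumes that sampling error is measured by the standard error of the sample mean, σ/√n, for a simple random sample; the function name and the value of σ are hypothetical and chosen only for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean, sigma / sqrt(n) (assumed error measure)."""
    return sigma / math.sqrt(n)


# Hypothetical population standard deviation, used only for illustration.
sigma = 10.0
for n in (25, 100, 400):
    # Quadrupling the sample size halves the standard error.
    print(n, standard_error(sigma, n))
```

Under this assumption the error shrinks with the square root of the sample size, so halving the error requires four times as many sampling units.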
7. Sampling distribution: the set of all values of a particular statistic, say the mean, together with their relative frequencies.
8. Sampling: the process or method of selecting a sample from the population.
9. Sampling unit: the ultimate unit to be sampled, i.e. the elements of the population to be sampled.
5.1. Why is sampling needed?

- Reduced cost: sampling can save time and money. A sample study is usually less expensive than a census study and produces results faster.
- Greater speed
- Greater accuracy: sampling may enable more accurate measurements, because a sample study is generally conducted by trained and experienced investigators.
- Greater scope
- Avoids destructive testing
- It is the only option when the population is infinite (or very large).
Sometimes taking a census makes more sense than using a sample, for example when a sample is unlikely to be representative or when detailed information on every unit is needed.
5.2 Sample Size Determination

Sample Size refers to the number of sampling units selected from the population for
investigation.

The size of the sample should be determined by a researcher keeping in view the following
points:
(i) Nature of universe (population): The universe may be either homogeneous or heterogeneous in nature. If the items of the universe are homogeneous, a small sample can serve the purpose. But if the items are heterogeneous, a large sample is required. Technically, this can be termed the dispersion factor.
(ii) Number of classes proposed: If many class-groups (groups and sub-groups) are to be formed, a large sample is required, because a small sample might not give a reasonable number of items in each class-group.
(iii) Type of sampling: The sampling technique plays an important part in determining the size of the sample. A small random sample is apt to be much superior to a larger but badly selected sample.
(iv) Standard of accuracy and acceptable confidence level: If the standard of accuracy or the level of precision is to be kept high, a relatively larger sample is required.
(v) Availability of finance: In practice, the size of the sample depends on the amount of money available for the study. This factor should be kept in view while determining the size of the sample, because large samples increase the cost of sampling estimates.
(vi) Other considerations: Nature of units, size of the population, size of questionnaire,
availability of trained investigators, the conditions under which the sample is being conducted,
the time available for completion of the study are a few other considerations to which a
researcher must pay attention while selecting the size of the sample.
Process of determining Sample Size

• Precision is the range within which the answer may vary and still be acceptable; the confidence level indicates the likelihood that the answer will fall within that range, and the significance level indicates the likelihood that the answer will fall outside that range.
• If the confidence level is 95%, then the significance level is (100 - 95), i.e. 5%; if the confidence level is 99%, the significance level is (100 - 99), i.e. 1%, and so on.
• The area of the normal curve within the precision limits for the specified confidence level constitutes the acceptance region, and the area of the curve outside these limits in either direction constitutes the rejection regions.
• In other words, a confidence interval is the specific interval estimate of a parameter, determined using the data obtained from a sample and the specified confidence level of the estimate.
• The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter. Three confidence levels are commonly used: 90%, 95% and 99%.
If a specific sample mean $\bar{x}$ is selected, there is a 95% probability that it falls within the range $\mu \pm 1.96\,\frac{\sigma}{\sqrt{n}}$. Likewise, there is a 95% probability that the interval specified by $\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$ will contain $\mu$, i.e.
$$\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}$$
Hence one can be 95% confident that the population mean is contained in the interval when the values of the variable are normally distributed.
E.g. a teacher wishes to estimate the average age of the students enrolled. From past studies the standard deviation is known to be 2 years. A sample of 50 students is selected and the mean is found to be 23.2 years. Find the 95% confidence interval of the population mean.
Since a 95% confidence interval is desired, the value of $z_{\alpha/2}$ is 1.96.
$$\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
$$23.2 - 1.96\,\frac{2}{\sqrt{50}} < \mu < 23.2 + 1.96\,\frac{2}{\sqrt{50}}$$
$$23.2 - 0.6 < \mu < 23.2 + 0.6 \text{ years, i.e. } 23.2 \pm 0.6 \text{ years}$$
(Here $1.96 \times 2/\sqrt{50} \approx 0.55$, which is rounded to 0.6.)

The teacher can say with 95% confidence that the average age of the students is between 22.6 and 23.8 years, based on 50 students. I.e. there is a 95% probability that a confidence interval built around a specific sample mean would contain the population mean.
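
The arithmetic of this example can be checked with a short sketch. It assumes a known population standard deviation and a z value supplied by the reader; the function name `confidence_interval` is illustrative and not part of the original notes.

```python
import math


def confidence_interval(mean: float, sigma: float, n: int, z: float) -> tuple[float, float]:
    """Interval mean +/- z * sigma / sqrt(n) for a known population sigma."""
    margin = z * sigma / math.sqrt(n)  # maximum error of estimate
    return mean - margin, mean + margin


# Values from the teacher example: mean 23.2, sigma 2, n 50, z 1.96 for 95%.
lower, upper = confidence_interval(mean=23.2, sigma=2.0, n=50, z=1.96)
print(round(lower, 1), round(upper, 1))
```

The unrounded margin is about 0.55 years; the text reports the limits to one decimal place.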
• $\alpha$ (alpha) represents the total area in both tails of the normal curve.
• $\alpha/2$ represents the area in each one of the tails.
• The relationship between $\alpha$ and the confidence level is: confidence level $= 1 - \alpha$, and $\alpha = 1 -$ confidence level.
E.g. when a 95% confidence interval is to be found, $\alpha = 0.05$, since $1 - 0.05 = 0.95$ or 95%. When $\alpha = 0.01$, $1 - 0.01 = 0.99$, i.e. 99%.
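The formula for the confidence interval is as follows.
$$\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
For 95%, $z_{\alpha/2} = 1.96$; for 99%, $z_{\alpha/2} = 2.58$. If $n \ge 30$, the sample standard deviation $s$ can be used in place of $\sigma$ when $\sigma$ is unknown.

$z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ is called the maximum error of estimate.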
Sample size: it depends on the maximum error of estimate, the population standard deviation and the degree of confidence.
- The population standard deviation is assumed to be known (it has been estimated from previous studies).
The formula for sample size is derived from the maximum error of estimate (e) formula,
$$e = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
which is solved for n as follows:
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2$$
where $\sigma$ = standard deviation of the population (to be estimated from past experience),
$z_{\alpha/2}$ = the value of the standard variate at a given confidence level (to be read from the table or given); it is 1.96 for a 95% confidence level,
n = size of the sample,
e = acceptable error (the precision).
N.B.: if n comes out as a fraction, use the next whole number as the sample size.
- The above formula is applicable when the population is large. In the case of a small (finite) population of size N, the formula for determining the sample size becomes
$$n = \frac{z_{\alpha/2}^2\, N\, \sigma^2}{(N-1)\,e^2 + z_{\alpha/2}^2\, \sigma^2}$$
Example: the president of the university asks the statistics instructors to estimate the average age of students in the university. How large a sample is necessary? The statistics instructors decide that the estimate should be accurate within 1 year with 99% confidence. From a previous study, the standard deviation of the students' ages is known to be 3 years.
Solution:
$\alpha = 0.01$ (i.e. $1 - 0.99$), $z_{\alpha/2} = 2.58$, $e = 1$, and $\sigma = 3$.
$$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2 = \left(\frac{2.58 \times 3}{1}\right)^2 = 59.9$$
Since n is a fraction, the next whole number is used: a sample of 60 students is needed.
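
As a rough sketch, both sample size formulas from this section can be written as small functions that round up to the next whole number. The function names and the population size N = 1000 used for the finite case are hypothetical; only z = 2.58, σ = 3 and e = 1 come from the example.

```python
import math


def sample_size(z: float, sigma: float, e: float) -> int:
    """n = (z * sigma / e)^2 for a large population, rounded up."""
    return math.ceil((z * sigma / e) ** 2)


def sample_size_finite(z: float, sigma: float, e: float, N: int) -> int:
    """n = z^2 N sigma^2 / ((N - 1) e^2 + z^2 sigma^2) for a population of size N."""
    n = (z**2 * N * sigma**2) / ((N - 1) * e**2 + z**2 * sigma**2)
    return math.ceil(n)


# University example: 99% confidence, sigma = 3 years, error within 1 year.
print(sample_size(z=2.58, sigma=3, e=1))
# Same requirements with a hypothetical population of 1000 students.
print(sample_size_finite(z=2.58, sigma=3, e=1, N=1000))
```

With a finite population the required sample is slightly smaller than in the large-population case, because part of the population's variability is already exhausted by the sample.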
5.4 Sampling Methods (Techniques)
There are two types of sampling techniques: random sampling (probability sampling) and non-probability sampling.
5.4.1. Probability sampling

Is a method of sampling in which every element in the population has a pre-assigned non-zero probability of being included in the sample.
Examples:
• Simple random sampling
• Stratified random sampling
• Cluster sampling
• Systematic sampling
1. Simple Random Sampling:
• Is a method of selecting items from a population such that every possible sample of a specific size has an equal chance of being selected.
• All elements in the population have the same pre-assigned non-zero probability of being included in the sample.
• Simple random sampling can be done using the lottery method.
2. Stratified Random Sampling:
• The population is divided into non-overlapping groups called strata.
• Random selection is carried out within each sub-group. The randomly selected representatives of the sub-groups together form the stratified sample.
• The random selection can be done in proportion to the size of each sub-group in the population. This is called proportional allocation. It requires information about the relative sizes of the strata in the population, i.e. the exact population numbers, or good estimates of them, should be available.
• Elements within the same stratum should be more or less homogeneous, while elements in different strata should differ.
• It is applied if the population is heterogeneous.
• Some of the criteria for dividing a population into strata are: Sex (male, female); Age (under 18, 18 to 28, 29 to 39); Occupation (blue-collar, professional, other).
3. Cluster Sampling:
• This is a method of sampling that selects naturally occurring groups of individuals rather than individuals.
• In other words, a cluster sample is one in which the characteristics of research interest have been identified, the areas or zones in which these characteristics exist have also been identified, and the sample is constituted at random from the identified zones. The population is divided into non-overlapping groups called clusters.
• A simple random sample of clusters is chosen, and all the sampling units in the selected clusters are surveyed.
• Clusters are formed so that the elements within a cluster are heterogeneous, i.e. observations in each cluster should be more or less dissimilar.
• Cluster sampling is useful when it is difficult or costly to generate a simple random sample. For example, to estimate the average annual household income in a large city we use cluster sampling, because simple random sampling needs a complete list of households in the city from which to sample. Stratified random sampling would again need the list of households. A less expensive way is to let each block within the city represent a cluster. A sample of clusters can then be randomly selected, and every household within these clusters interviewed to find the average annual household income.
4. Systematic Sampling:
• A complete list of all elements within the population (sampling frame) is required.
• The procedure starts by determining the first element to be included in the sample.
• Then the technique is to take every kth item from the sampling frame.
Let N = population size, n = sample size, and k = N/n = sampling interval.
• Choose any number between 1 and k; suppose it is j (1 ≤ j ≤ k).
• The jth unit is selected first, and then the (j+k)th, (j+2k)th, etc., until the required sample size is reached.
• For instance, if N = 1000 and n = 100, then k = N/n = 10. We can randomly pick any number from 1 to 10, and this choice determines the entire sample. Example: if we pick 2, then 2, 12, 22, 32, 42, etc. automatically become members of the sample.
• The main advantage is that it requires less work. The disadvantage is that if the listing of the population is not in random order, periodicity can be introduced. Periodicity means a situation where every kth member of the population has some characteristic peculiar or unique to only those members.
5.4.2. Non-probability sampling
• It is a sampling technique in which the choice of individuals for a sample is based on convenience, personal choice or interest.
Examples: • Judgment sampling
• Convenience sampling
• No-rule sampling
1. Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which items are selected for the sample.
2. Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner that is relatively easy and convenient.
3. No-rule sampling: we take a sample without any rule; the sample is representative if the population is homogeneous and there is no selection bias.
Chapter Six
6. SIMPLE CORRELATION AND LINEAR REGRESSION
• Linear regression and correlation analysis study and measure the linear relationship among two or more variables.
• When only two variables are involved, the analysis is referred to as simple correlation and simple linear regression analysis.
• When there are more than two variables, the terms multiple regression and partial correlation are used.
• Regression analysis: is a statistical technique that can be used to develop a mathematical equation showing how variables are related.
• Correlation analysis: deals with measuring the closeness of the relationship that is described by the regression equation.
• We say there is correlation when the two series of items vary together, directly or inversely.
Simple Correlation
Suppose we have two variables X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn). When higher values of X are associated with higher values of Y and lower values of X are associated with lower values of Y, then the correlation is said to be positive or direct.
Examples:
- Income and expenditure
- Number of hours spent in studying and the score obtained
When higher values of X are associated with lower values of Y and lower values of X are
associated with higher values of Y, then the correlation is said to be negative or inverse.
Examples:
- Demand and supply
The correlation between X and Y may be one of the following
1. Perfect positive (slope=1)
2. Positive (slope between 0 and 1)
3. No correlation (slope=0)
4. Negative (slope between -1 and 0)
5. Perfect negative (slope=-1)
The presence of correlation between two variables may be due to three reasons:
1. One variable being the cause of the other. The cause is called “subject” or “independent”
variable, while the effect is called “dependent” variable.
2. Both variables being the result of a common cause. That is, the correlation that exists between
two variables is due to their being related to some third force.
Example:
Let X1 be the HEEE result,
Y1 be the rate of surviving in the university, and
Y2 be the rate of getting a scholarship.
Both X1 & Y1 and X1 & Y2 have high positive correlation. Likewise, Y1 & Y2 have positive correlation; they are not directly related, but are related to each other via X1.

3. Chance:
The correlation that arises by chance is called spurious correlation.
Examples:
• Weight of individuals in Ethiopia and income of individuals in Kenya.
Therefore, while interpreting correlation coefficient, it is necessary to see if there is any
likelihood of any relationship existing between variables under study.
The correlation coefficient is the measure used to determine the strength of the relationship between two variables. There are several types of correlation coefficients. One of the common types is the Pearson Product Moment Correlation Coefficient (PPMC). The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ (Greek letter rho).
There are several ways to compute the value of the correlation coefficient. One method is Pearson's Product-moment Correlation Coefficient.
6.1.1 Pearson's Product-moment Correlation Coefficient
This measure considers the magnitudes of the observations rather than their ranks. The formula and procedure are applicable only to quantitative data. The coefficient of correlation (r) is a measure of the strength of the relationship between two variables. It requires interval- or ratio-scaled data (variables).

If the coefficient (r) has a value (in absolute terms):

• Under 0.20: very weak correlation
• 0.21 - 0.40: weak correlation
• 0.41 - 0.70: moderate correlation
• 0.71 - 0.91: strong correlation
• Over 0.91: very strong correlation
The formula is given as follows.
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \,\sum (Y_i - \bar{Y})^2}}$$
where n is the number of data pairs and X and Y are the variables, i.e. the dependent variable (Y) is the variable that is being predicted or estimated, and the independent variable (X) is the variable that provides the basis for estimation (the predictor variable).
The short-cut formula is
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
Remark:
r always lies between -1 and 1 inclusive.
Interpretation of r:
1. Perfect positive linear relationship (if r= 1)
2. Some Positive linear relationship (if r is between 0 and 1)
3. No linear relationship (if r= 0)
4. Some Negative linear relationship (if r is between 0 and -1)
5. Perfect negative linear relationship (if r= -1)
Example 8.1: The data below show the age and average daily income of six farmers. Compute the value of the correlation coefficient.
Solution: Make a table, find the values of XY, X² and Y², and place these values in the corresponding columns of the table.
Farmer code   Age (X)   Average daily income (Y)   XY      X²      Y²
A             43        128                        5504    1849    16384
B             48        120                        5760    2304    14400
C             56        135                        7560    3136    18225
D             61        143                        8723    3721    20449
E             67        141                        9447    4489    19881
F             70        152                        10640   4900    23104
Total         345       819                        47634   20399   112443
∑X = 345, ∑Y = 819, ∑XY = 47634, ∑X² = 20399, ∑Y² = 112443

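Substitute in the formula and solve for r.
$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
$$r = \frac{6(47634) - (345)(819)}{\sqrt{\left[6(20399) - (345)^2\right]\left[6(112443) - (819)^2\right]}}$$
$$r = \frac{285804 - 282555}{\sqrt{[122394 - 119025][674658 - 670761]}}$$
$$r = \frac{3249}{\sqrt{13128993}} = 0.897$$

The correlation coefficient suggests a strong positive relationship between age and average daily income of farmers.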
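
The short-cut formula can be checked against the farmer data with a small sketch. The function name `pearson_r` is illustrative; the data are the six (age, income) pairs from the table.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation via the short-cut (computational) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator


age = [43, 48, 56, 61, 67, 70]
income = [128, 120, 135, 143, 141, 152]
r = pearson_r(age, income)
print(round(r, 3))
```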
Coefficient of Determination
The coefficient of determination (r²) is the proportion of the total variation of the dependent variable Y that is explained by the variation in the independent variable X. It is the square of the coefficient of correlation (r) and ranges from 0 to 1.
From the above example, r = 0.897.
r² = (0.897)² = 0.80. This is a proportion or a percent. We can say that 80 percent of the variation in average daily income is explained by the variation in age.
6.1.2 Spearman's Rank Order Correlation Coefficient (or rank correlation)
Is the technique of determining the degree of correlation between two variables in the case of ordinal data, where ranks are given to the different values of the variables. The main objective of this coefficient is to determine the extent to which the two sets of rankings are similar or dissimilar.
Spearman's coefficient of correlation ($r_s$) is given by:
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)}$$
where $r_s$ = coefficient of rank correlation,
D = the difference between paired ranks,
n = the number of pairs.

Rank correlation is a non-parametric technique for measuring the relationship between paired observations of two variables when the data are in ranked form.
Example:
Aster and Almaz were asked to rank 7 different types of lipstick. See if there is correlation between the tastes of the two ladies.
Lipstick      A    B    C    D    E    F    G
Aster         2    1    4    3    5    7    6
Almaz         1    3    2    4    5    6    7
Solution:
Lipstick      A    B    C    D    E    F    G    Total
RX            2    1    4    3    5    7    6
RY            1    3    2    4    5    6    7
D = RX - RY   1   -2    2   -1    0    1   -1
D²            1    4    4    1    0    1    1    12
$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(12)}{7(48)} = 0.786$$
There is positive correlation.
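
When the data are already ranks, the coefficient reduces to a few lines of arithmetic. The following sketch uses the Aster and Almaz rankings; the function name is illustrative.

```python
def spearman_from_ranks(rank_x: list[float], rank_y: list[float]) -> float:
    """Spearman's r_s = 1 - 6 * sum(D^2) / (n (n^2 - 1)) from paired ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


aster = [2, 1, 4, 3, 5, 7, 6]
almaz = [1, 3, 2, 4, 5, 6, 7]
rs = spearman_from_ranks(aster, almaz)
print(round(rs, 3))
```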
Example 2
Eight nations report the following data on their infant mortality rate and general mortality rate. Rank the data. Does there appear to be a relationship between the two mortality rates?
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   8.1      10.5    6.4      9.6    10.0     6.2     50      9.6
Mortality          7.0      9.0     11.0     11.0   10.7     6.0     8       8.1
Step 1: Rank the data from lowest to highest. The lowest score should be ranked 1 and the highest score 8. Be sure to use the mean rank for values that tie. For example, Sweden and the U.K tie for the worst general mortality rate. Since they share the seventh and eighth positions, both are assigned the rank 7.5, i.e. (7 + 8)/2. Rewrite the ranked data.
                   Canada   U.S.A   Sweden   U.K    France   Japan   China   Spain
Infant mortality   3        7       2        4.5    6        1       8       4.5
Mortality          2        5       7.5      7.5    6        1       3       4
Step 2: Rearranging the data in columns, calculate the Spearman rank correlation coefficient.

          Infant mortality (x)   Mortality (y)   D = x - y   D²
Canada    3                      2               1.0         1.00
U.S.A     7                      5               2.0         4.00
Sweden    2                      7.5             -5.5        30.25
U.K       4.5                    7.5             -3.0        9.00
France    6                      6               0.0         0.00
Japan     1                      1               0.0         0.00
China     8                      3               5.0         25.00
Spain     4.5                    4               0.5         0.25
Total                                                        69.50

$$r_s = 1 - \frac{6\sum D_i^2}{n(n^2 - 1)} = 1 - \frac{6(69.5)}{8(8^2 - 1)} = 1 - \frac{417}{504} = 0.173$$

Interpret the results: A correlation of 0.173 suggests there is little correlation between the rankings of these nations' infant mortality rates and general mortality rates. The small correlation that does exist is positive, which suggests that as a nation's infant mortality ranking increases, so does its general mortality ranking.
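
Steps 1 and 2 can be reproduced with a short sketch that assigns mean ranks to tied values and then applies the same rank-difference formula. The helper names are illustrative. Note that the simple formula is used here exactly as in the example, even though ties are present; with ties it is an approximation.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank from lowest (1) to highest, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2  # positions are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's r_s computed from mean ranks with the rank-difference formula."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))


# Canada, U.S.A, Sweden, U.K, France, Japan, China, Spain
infant = [8.1, 10.5, 6.4, 9.6, 10.0, 6.2, 50, 9.6]
general = [7.0, 9.0, 11.0, 11.0, 10.7, 6.0, 8, 8.1]
rs = spearman(infant, general)
print(average_ranks(infant), average_ranks(general), round(rs, 3))
```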

6.2. Simple Linear Regression


- Simple linear regression refers to the linear relationship between two variables.
- It is used to predict the value of a single continuous dependent variable (DV, which we will call Y) from a single continuous independent variable (IV, which we will call X). Regression assumes that the relationship between the IV and the DV can be represented by the regression equation.
The regression equation is: Y = a + bX, where:
• Y and X are variables
• a and b are constants
• The constant a stands for the value of Y when X = 0, and represents the y-intercept.
• The constant b represents the slope of the line.
- The regression line is one of many possible lines, but it is the line of best fit.
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$
$$a = \bar{Y} - b\bar{X}$$

b is a constant indicating the slope of the regression line, and it gives a measure of the change in Y for a unit change in X. It is also called the regression coefficient of Y on X.
Example 1: The following data show the scores of 12 students in Accounting and Statistics examinations.
a) Calculate the simple correlation coefficient.
b) Fit a regression line of Statistics on Accounting using least squares estimates.
c) Predict the Statistics score if the Accounting score is 85.
Student         1      2      3      4      5      6      7      8      9     10     11     12
Acc. (X)       74     93     55     41     23     92     64     40     71     33     30     71
Stat. (Y)      81     86     67     35     30    100     55     52     76     24     48     87

First draw a scatter plot of the raw data, which is used to determine the nature of the relationship. The next step is to compute r (the correlation coefficient). If r is significant, the next step is to determine the equation of the regression line. Determining the regression line when r is not significant, and making predictions from it, is meaningless.
As the scatter plot shows, there appears to be some linear relationship between the variables.
Student         1      2      3      4      5      6      7      8      9     10     11     12  Total   Mean
Acc. (X)       74     93     55     41     23     92     64     40     71     33     30     71    687  57.25
Stat. (Y)      81     86     67     35     30    100     55     52     76     24     48     87    741  61.75
X²           5476   8649   3025   1681    529   8464   4096   1600   5041   1089    900   5041  45591
Y²           6561   7396   4489   1225    900  10000   3025   2704   5776    576   2304   7569  52525
XY           5994   7998   3685   1435    690   9200   3520   2080   5396    792   1440   6177  48407

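$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
$$r = \frac{12(48407) - (687)(741)}{\sqrt{\left[12(45591) - (687)^2\right]\left[12(52525) - (741)^2\right]}} = \frac{71817}{\sqrt{(75123)(81219)}} = 0.92$$

a) The coefficient of correlation (r) has a value of 0.92. This indicates that the two variables are positively correlated (Y increases as X increases).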
b) Using OLS (ordinary least squares):
$$b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$$
$$b = \frac{48407 - 12(57.25)(61.75)}{45591 - 12(57.25)^2} = 0.9560$$
$$a = \bar{Y} - b\bar{X} = 61.75 - 0.9560(57.25) = 7.0194$$
$\hat{Y} = 7.0194 + 0.9560X$ is the estimated regression line.

This means that for each unit change in X, Y changes by 0.9560 units. The regression line can be used to make predictions for the dependent variable.
E.g. using the regression line, predict the value of the dependent variable ($\hat{Y}$) if X = 85.
c) Insert X = 85 in the estimated regression line.
$$\hat{Y} = 7.0194 + 0.9560X$$
$$\hat{Y} = 7.0194 + 0.9560(85) = 88.28$$
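
The least squares estimates and the prediction can be reproduced with a small sketch. The function name `fit_line` is illustrative; the data are the twelve Accounting and Statistics scores from the example, and the slope is computed with the same computational form used above.

```python
def fit_line(x: list[float], y: list[float]) -> tuple[float, float]:
    """Least squares intercept a and slope b for the line Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y
    sxx = sum(a * a for a in x) - n * mean_x**2
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


accounting = [74, 93, 55, 41, 23, 92, 64, 40, 71, 33, 30, 71]
statistics = [81, 86, 67, 35, 30, 100, 55, 52, 76, 24, 48, 87]
a, b = fit_line(accounting, statistics)
predicted = a + b * 85  # predicted Statistics score for an Accounting score of 85
print(round(a, 4), round(b, 4), round(predicted, 2))
```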
For a valid prediction, the value of the correlation coefficient must be significant. Also, for any specific value of X, the variable Y must be normally distributed about the regression line. The standard deviation of the dependent variable must be the same for each value of the independent variable.
- Prediction is made based on the present condition, or on the premise that the present trend will continue.
6.2.1. Coefficient of determination (r²)
It is the ratio of explained variation to total variation and is denoted by r². It measures the variation of the dependent variable that is explained by the regression line. Variation due to the relationship is called explained variation. Variation due to chance is called unexplained variation. Explained and unexplained variation together make up the total variation.
$$r^2 = \frac{\text{explained variation}}{\text{total variation}}$$
r² is found by squaring the correlation coefficient (r), and is usually expressed as a percent.
If r = 0.90, then r² = 0.81, which is equivalent to 81%. This means that 81% of the variation in the dependent variable is accounted for by the variation in the independent variable. The remaining 19% is unexplained variation. This is called the coefficient of non-determination (CND) and is found by subtracting the coefficient of determination (r²) from 1: CND = 1.00 - r².
E.g. if r = 0.6, then r² = 0.36, which means only 36% of the variation in the dependent variable can be attributed to variation in the independent variable.
