Stat Module I
Introduction
This chapter covers two main topics: sampling theory and sampling distributions. It
defines sampling and gives the rationale for selecting a sample survey over a census
survey. It then classifies sampling techniques and their subclassifications, and
introduces the basic statistical characteristics of sampling distributions, sample means,
and sample proportions.
Dear student! Data can be obtained from existing sources or from surveys and
experimental studies designed to collect new data through a census (total enumeration of
the population) or through sampling. In statistics, population refers to all items that
have been chosen for study, while sample refers to a portion or subset of the population
selected. Some basic concepts of sampling are discussed below.
The term population or universe conveys a different meaning here than the traditional
one. In a census survey, the count of individuals (men, women and children) is known as
the population. In research methodology, however, population means a specific group
defined by its characteristics. One example is the secondary school teachers of East
Hararge Zone, who have some specific features (teaching experience, male and female,
academic qualification, teaching attitudes, teaching aptitude, etc.). Another example is
the high school students of Harar town, who have some specific characteristics (age
group, boys and girls, personality, scholastic aptitude, academic motivation, etc.).
Population can be classified into two categories: finite and infinite. A population is
said to be finite if it consists of a fixed number of elements, so that it is possible to
enumerate it in its totality. Examples of finite populations are the population of a
city, the number of workers in a factory, etc. An infinite population is one in which it
is theoretically impossible to observe all the elements, because the number of items is
infinite. An example of an infinite population is the number of stars in the sky. From
practical consideration, we also use the term infinite population for a population that
cannot be enumerated in a reasonable period of time.
Sampling: It is the process of selecting the sample for estimating the population
characteristics. In other words, it is the process of obtaining information about an entire
population by examining only a part of it. Sampling is selecting samples (that is, part of
the items) from populations. Mathematically, we can describe samples and populations by
using measures such as the mean, median, mode, and standard deviation. When these
measures describe the characteristics of a sample, they are called statistics. When they
describe the characteristics of a population, they are referred to as parameters.
Sampling Unit: Elementary units or group of such units which besides being clearly
defined, identifiable and observable, are convenient for purpose of sampling are called
sampling units. For instance, in a family budget enquiry, usually a family is considered as
the sampling unit since it is found to be convenient for sampling and for ascertaining the
required information. In a crop survey, a farm or a group of farms owned or operated by a
household may be considered as the sampling unit.
Statistic: A characteristic of the sample, for example the sample mean or the sample
proportion. If we are convinced that the sample statistics are accurate estimates of the
population characteristics, we can use sample statistics to estimate the population
parameters without measuring every item under study. To be consistent, statisticians use
lower case Roman letters to denote sample statistics, and Greek or capital letters to
denote population parameters. Table 1.1 below summarizes the definitions and the symbols.
Table 1.1: Population and sample

                   Population                                     Sample
Definition         Collection of all items being dealt with       Subset of the population
Characteristics    Parameter                                      Statistic
Symbols            Population size = $N$                          Sample size = $n$
                   Population mean = $\mu$                        Sample mean = $\bar{x}$
                   Population standard deviation = $\sigma$       Sample standard deviation = $s$

Target Population: A target population is the entire group about which information is
desired and conclusion is made.
Sample Design: Sample design refers to the plans and methods to be followed in
selecting a sample from the target population, and to the estimation technique (formula)
for computing the sample statistics. These statistics are the estimates used to infer the
population parameters.
Although a census gives more reliable data, the sampling method is preferred when:
- the population is very large or infinite, so that it would be impossible to conduct a
  census survey;
- quick results are required, in which case a sample survey is more appropriate than a
  census survey;
- the study involves destruction of the elementary units under study, so that only sample
  testing is appropriate. Items such as light bulbs and ammunition often must be destroyed
  as a part of the testing process;
- the cost of conducting a census would be prohibitive, so that it is advisable to carry
  out a sample survey; and
- accuracy may be lost because of the large size of the population. Sampling involves a
  small portion of the population and therefore requires very few people for conducting
  the survey and for data collection and compilation. This is not so in the census method,
  where the chances of committing errors increase.
Dear learners! In order to answer the research questions, it is doubtful that the
researcher will be able to collect data from all cases. Thus, there is a need to select a
sample. The entire set of cases from which a researcher's sample is drawn is called the
population. Since researchers have neither the time nor the resources to analyze the
entire population, they apply sampling techniques to reduce the number of cases. The
sampling process comprises several stages:
Define the Population: Population must be defined in terms of elements, sampling units,
extent and time. Because there is very rarely enough time or money to gather information
from everyone or everything in a population, the goal becomes finding a representative
sample (or subset) of that population.
Sampling Frame: As a remedy, we seek a sampling frame which has the property that
we can identify every single element and include any in our sample. The most
straightforward type of frame is a list of elements of the population (preferably the entire
population) with appropriate contact information. A sampling frame may be a telephone
Sampling Unit: A sampling unit is a basic unit that contains a single element or a group
of elements of the population to be sampled. The sampling unit selected is often
dependent upon the sampling frame. If a relatively complete and accurate listing of
elements is available (e.g. register of purchasing agents) one may well want to sample
them directly. If no such register is available, one may need to sample companies as the
basic sampling unit.
Sampling Method: The sampling method outlines the way in which the sample units are
to be selected. The choice of the sampling method is influenced by the objectives of the
research, availability of financial resources, time constraints, and the nature of the
problem to be investigated. All sampling methods can be grouped under two distinct
heads, that is, probability and non-probability sampling.
Sample Size: The sample size calculation depends primarily on the type of sampling
design used. However, for all sampling designs, the estimates of the expected sample
characteristics (e.g. mean, proportion or total), the desired level of certainty, and the
level of precision must be clearly specified in advance. The statement of the precision
desired might be made by giving the amount of error that we are willing to tolerate in the
resulting estimates. Common levels of precision are 5% and 10%.
Sampling Plan: In this step, the specifications and decisions regarding the
implementation of the research process are outlined. As the interviewers and their
co-workers will be on field duty most of the time, a proper specification of the sampling
plan makes their work easier, and they do not have to keep reverting to the office with
operational problems.
Select the Sample: The final step in the sampling process is the actual selection of the
sample elements. This requires a substantial amount of office and fieldwork, particularly
if personal interviews are involved.
Dear student! Two major types of error can arise when a sample of observations is taken
from a population: sampling error and non-sampling error. Anyone reviewing the results
of sample surveys and studies, as well as statistics practitioners conducting surveys and
applying statistical techniques, should understand the sources of these errors.
Sampling Error
Sampling error refers to differences between the sample and the population that exist
only because of the observations that happened to be selected for the sample. Sampling
error is an error that we expect to occur when we make a statement about a population
that is based only on the observations contained in a sample taken from the population.
To illustrate, suppose that we wish to determine the mean annual income of North
American blue-collar workers. To determine this parameter we would have to ask each
North American blue-collar worker what his or her income is and then calculate the mean
of all the responses. Because the size of this population is several million, the task is both
expensive and impractical. We can use statistical inference to estimate the mean income
of the population if we are willing to accept less than 100% accuracy. We record the
incomes of a sample of the workers and find the mean of this sample of incomes. This
sample mean is an estimate of the desired population mean. But the value of the sample
mean will deviate from the population mean simply by chance, because the value of the
sample mean depends on which incomes just happened to be selected for the sample. The
difference between the true (unknown) value of the population mean and its estimate, the
sample mean, is the sampling error. The size of this deviation may be large simply
because of bad luck, that is, because a particularly unrepresentative sample happened to
be selected. The only way we can reduce the expected size of this error is to take a larger
sample.
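The effect of sample size on sampling error can be illustrated with a small simulation. The sketch below is only an illustration: the income figures are hypothetical values generated from a skewed distribution, and the function name `mean_abs_sampling_error` is our own. It draws repeated simple random samples and reports the average distance between the sample mean and the population mean for a small and a large sample size.

```python
import random
import statistics


def mean_abs_sampling_error(population, n, repetitions, rng):
    """Average |sample mean - population mean| over repeated simple random samples."""
    mu = statistics.fmean(population)
    errors = []
    for _ in range(repetitions):
        sample = rng.sample(population, n)  # n items drawn without replacement
        errors.append(abs(statistics.fmean(sample) - mu))
    return statistics.fmean(errors)


rng = random.Random(42)  # fixed seed so that the run is repeatable

# Hypothetical population of 100,000 annual incomes (log-normal, i.e. right-skewed).
incomes = [rng.lognormvariate(10.5, 0.5) for _ in range(100_000)]

error_small = mean_abs_sampling_error(incomes, 10, 200, rng)
error_large = mean_abs_sampling_error(incomes, 1000, 200, rng)

print(f"n = 10:   average sampling error = {error_small:,.0f}")
print(f"n = 1000: average sampling error = {error_large:,.0f}")
```

Under these assumptions the average error for the larger sample should be markedly smaller, which is the point made above: a larger sample reduces the expected size of the sampling error, but does not remove it.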
Given a fixed sample size, the best we can do is to state the probability that the sampling
error is less than a certain amount. It is common today for such a statement to accompany
the results of an opinion poll. If an opinion poll states that, based on sample results, the
Non-sampling Error
Non-sampling error is more serious than sampling error because taking a larger sample
won’t diminish the size, or the possibility of occurrence, of this error. Even a census can
(and probably will) contain non-sampling errors. Non-sampling errors result from
mistakes made in the acquisition of data or from the sample observations being selected
improperly.
Errors in data acquisition - This type of error arises from the recording of incorrect
responses. Incorrect responses may be the result of incorrect measurements being taken
because of faulty equipment, mistakes made during transcription from primary sources,
inaccurate recording of data because terms were misinterpreted, or inaccurate responses
were given to questions concerning sensitive issues such as sexual activity or possible tax
evasion.
Non-response error - Non-response error refers to error (or bias) introduced when
responses are not obtained from some members of the sample. When this happens, the
sample observations that are collected may not be representative of the target population,
resulting in biased results.
Activities
- What are the main stages of the sampling process?
- What is sampling error?
- What is non-sampling error?
Dear learner! In statistics, there are two methods of selecting samples from populations:
random or probability sampling, and non-random, non-probability or judgment sampling.
(I) Probability (Random) Sampling: This is sampling in which all items (i.e., each
element) in the population have a chance of being chosen in the sample, and the
probability that each element of the population is included in the sample is known.
There are several probability sampling techniques, which are discussed below.
Probability Sampling
There are a number of techniques for taking a probability sample. Only four important
techniques are discussed here:
1. Simple random sampling.
2. Systematic sampling.
3. Stratified sampling.
4. Cluster sampling.
Simple Random Sampling: This is selecting samples so that each possible sample has an
equal chance of being picked, and each element in the population has the same
probability of being included in the sample and is independent of whether some other
element is chosen. Example: Suppose that a restaurant has four branches (N, S, E and W)
and that it wants to select samples of two branches at a time in order to evaluate the
operation of the branches. Using simple random sampling, there are six different samples
of size 2 that can be drawn from the population (i.e., the four branches). These six
samples are (NS); (NE); (NW); (SE); (SW); and (EW). The probability that each sample is
selected from the population is 1/6, and the probability that a given element is included
in the sample is 1/2.
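The counts in this example can be checked by listing every possible sample. The following sketch uses Python's standard `itertools.combinations`; the variable names are our own and are only for illustration.

```python
from fractions import Fraction
from itertools import combinations

branches = ["N", "S", "E", "W"]

# Every possible sample of 2 branches (order does not matter).
samples = list(combinations(branches, 2))

# Under simple random sampling each sample is equally likely.
prob_each_sample = Fraction(1, len(samples))

# Probability that a given branch is included = share of samples that contain it.
prob_branch_included = {
    b: Fraction(sum(b in s for s in samples), len(samples)) for b in branches
}

print(samples)
print(prob_each_sample)
print(prob_branch_included)
```

Each branch appears in three of the six samples, which is where the inclusion probability of 1/2 comes from.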
Put another way, a simple random sample is one in which each element of the
population has an equal and independent chance of being included in the sample. A
sample selected by a randomization method is known as a simple random sample, and the
technique is known as simple random sampling. Randomization can be done by using a
number of techniques such as tossing a coin, throwing a die, the lottery method, the
blindfold method, and a table of random numbers such as ‘Tippett’s Table’.
Advantages
(a) It requires a minimum knowledge of population.
(b) It is free from subjectivity and free from personal error.
(c) It provides appropriate data for our purpose.
(d) The observations of the sample can be used for inferential purpose.
Disadvantages
(a) The representativeness of a sample cannot be ensured by this method.
(b) This method does not use the knowledge about the population.
(c) The inferential accuracy of the finding depends upon the size of the sample.
2. Systematic Sampling
Systematic sampling is an improvement over the simple random sampling. This method
requires the complete information about the population. There should be a list of
Let sample size = n and population size = N. We select every (N/n)th individual from
the list, which gives a sample of the desired size, known as a systematic sample.
Thus, for this technique of sampling, the population should be arranged in some
systematic way.
Illustration: Suppose that there are 1000 households in one Kebele with different income
levels. Suppose the statistician/researcher has a list of all households in random order
and wants to study the income disparity in that Kebele by taking a sample of 50
households. Since there are 1000 households, the sampling can be accomplished by taking
every 20th household on the list (1000/50 = 20). To determine which of the first 20
elements to begin with, the statistician/researcher can randomly choose a number from 1
to 20. Once this number is chosen (let’s say 3), the statistician selects the 3rd, 23rd,
43rd, 63rd, ... households from the list. This kind of sampling is systematic sampling.
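The selection rule in this illustration can be written as a short function. This is a minimal sketch, assuming the population size is an exact multiple of the sample size; the function name `systematic_positions` is hypothetical.

```python
import random


def systematic_positions(population_size, sample_size, start=None, rng=random):
    """Return the 1-based list positions chosen by systematic sampling."""
    interval = population_size // sample_size  # e.g. 1000 // 50 = 20
    if start is None:
        start = rng.randint(1, interval)  # random start among the first `interval` items
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the sampling interval")
    return [start + i * interval for i in range(sample_size)]


positions = systematic_positions(1000, 50, start=3)
print(positions[:5])   # first few selected households
print(len(positions))  # number of households selected
```

With a start of 3 and an interval of 20, consecutive selected positions always differ by 20.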
Systematic sampling is often regarded as identical to simple random sampling. This is
true only if the elements of the population are in random order on the list, that is, if
the elements on the list do not follow any periodicity or other type of pattern.
Advantages
(a) This is a simple method of selecting a sample.
(b) It reduces the field cost.
(c) Inferential statistics may be used.
(d) Sample may be comprehensive and representative of population.
(e) Observations of the sample may be used for drawing conclusions and
generalizations.
Disadvantages
3. Stratified Sampling
It is an improvement over the earlier methods. When employing this technique, the
researcher divides the population into strata on the basis of some characteristic and
draws at random a predetermined number of units from each of these smaller homogeneous
groups (strata). The researcher should choose the characteristic or criterion that seems
most relevant to the research work.
Illustration: Suppose a researcher wants to study the income inequality situation in Adama
city. The researcher can divide the households into different groups, as follows:
o Civil Servant
o Merchant
o Petty Traders & local drink sellers
Proportionate sampling refers to the selection from each sampling unit of a sample that
is proportionate to the size of the unit. Advantages of this procedure include
representativeness with respect to variables used as the basis of classifying categories and
increased chances of being able to make comparisons between strata. Lack of information
on proportion of the population in each category and faulty classification may be listed as
disadvantages of this method.
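Proportionate allocation can be sketched in a few lines. In the example below the household counts per stratum are hypothetical numbers chosen only for illustration (the module does not give them), and `proportional_allocation` is our own helper name. Each stratum receives a share of the sample equal to its share of the population.

```python
def proportional_allocation(stratum_sizes, total_sample_size):
    """Allocate a sample to strata in proportion to stratum size.

    Shares are rounded to whole units, so for awkward sizes the allocations
    may not add up exactly to total_sample_size.
    """
    population_size = sum(stratum_sizes.values())
    return {
        name: round(total_sample_size * size / population_size)
        for name, size in stratum_sizes.items()
    }


# Hypothetical household counts for the three strata of the illustration.
households = {
    "Civil servants": 5000,
    "Merchants": 3000,
    "Petty traders and local drink sellers": 2000,
}

allocation = proportional_allocation(households, 200)
print(allocation)
```

Within each stratum, the allocated number of households would then be drawn at random, for example by simple random sampling.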
Cluster Sampling: This is sampling in which one divides the elements in the population
into a number of clusters or groups and then chooses a random sample of these clusters.
Selecting each chosen group intact, as a whole, is known as cluster sampling. When a
simple random sample of the elements in each chosen cluster is selected instead, the
method is sometimes referred to as two-stage cluster sampling. In cluster sampling the
sampling units contain groups of elements (clusters) instead of individual members or
items in the population.
Illustration: Consider again the study of the income disparity condition in Adama. In this
case, Adama city is classified by locality (i.e., into the northern part, the southern
part, etc.). Once the city is classified into clusters, some of the clusters (i.e.,
localities in our case) are chosen at random, and the researcher randomly selects
elements from the chosen clusters.
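The two stages described in this illustration can be sketched as follows. The locality names and household labels are hypothetical placeholders, and `two_stage_cluster_sample` is our own function name; the sketch only shows the order of the two random selections.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: pick clusters at random. Stage 2: pick elements within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}


rng = random.Random(7)  # fixed seed so that the run is repeatable

# Hypothetical localities, each with 100 labelled households.
localities = {
    name: [f"{name}-household-{i}" for i in range(1, 101)]
    for name in ["North", "South", "East", "West", "Central"]
}

sample = two_stage_cluster_sample(localities, n_clusters=2, n_per_cluster=10, rng=rng)
for locality, households in sample.items():
    print(locality, households[:3])
```

Only households in the chosen localities can enter the sample, which is what makes the method economical and also why it can be less representative than simple random sampling.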
Advantages
(a) It may be a good representative of the population.
(b) It is an easy method.
(c) It is an economical method.
(d) It is practicable and highly applicable in education.
(e) Observations can be used for inferential purpose.
Disadvantages
(a) Cluster sampling is not free from error.
(b) It is not comprehensive.
The term incidental or accidental is applied to those samples that are taken because they are
most frequently available, i.e., this refers to groups which are used as samples of a
Advantages
(a) It is very easy method of sampling.
(b) It is frequently used in behavioral sciences.
(c) It reduces the time, money and energy i.e., it is an economical method.
Disadvantages
(a) It is not a representative of the population.
(b) It is not free from error.
(c) Parametric statistics cannot be used.
2. Judgment Sampling
This involves the selection of a group from the population on the basis of available
information thought to be representative of the total population, or the selection of a
group by intuition on the basis of a criterion deemed to be self-evident. Because the
sample depends on the investigator's own judgment, this kind of sampling is highly risky.
Advantages
(a) Knowledge of the investigator can be best used in this technique of sampling.
(b) This technique of sampling is also economical.
Disadvantages
(a) This technique is subjective.
(b) It is not free from error.
(c) It includes uncontrolled variation.
(d) Inferential statistics cannot be used for the observations of this sampling, so
generalization is not possible.
3. Purposive Sampling
Advantages
(a) Use of the best available knowledge concerning the sample subjects.
(b) Better control of significant variables.
(c) Sample groups data can be easily matched.
(d) Homogeneity of subjects used in the sample.
Disadvantages
(a) Reliability of the criterion is questionable.
(b) Knowledge of population is essential.
(c) Errors in classifying sampling subjects.
(d) Inability to utilize the inferential parametric statistics.
(e) Inability to make generalization concerning total population.
4. Quota Sampling
This combines both judgment sampling and probability sampling. The population is
classified into several categories and, on the basis of judgment, assumption or previous
knowledge, the proportion of the population falling into each category is decided.
Thereafter a quota of cases to be drawn is fixed and the observer is allowed to sample as
he or she likes. Quota sampling is very arbitrary and likely to figure in municipal surveys.
Advantages
(a) It is an improvement over the judgment sampling.
(b) It is an easy sampling technique.
(c) It is most frequently used in social surveys.
Disadvantages
(a) It is not a representative sample.
A research design is a plan by which research samples may be selected from a
population and under which experimental treatments are administered and controlled so
that their effect upon the sample may be measured. Therefore, a second step in the
establishment of an experimental design is to select the treatments that will be used to
control sources of learning change in the sample subjects.
Activities
- What are probability and non-probability sampling?
- List the advantages and disadvantages of each type of probability and
  non-probability sampling.
Dear student! So far, we have examined how samples can be taken from a population.
If we take several samples from a population using one of the sampling techniques already
discussed, the statistics we compute for each sample need not be the same and most
likely will vary from sample to sample. In this sub-topic we will discuss the sampling
distribution. A sampling distribution is a probability distribution of all the possible
values of a sample statistic. We have sampling distributions of the mean, the proportion, etc.
A sampling distribution is created by, as the name suggests, sampling. There are two
ways to create a sampling distribution. The first is to actually draw samples of the same
size from a population, calculate the statistic of interest, and then use descriptive
techniques to learn more about the sampling distribution. The second method relies on
the rules of probability and the laws of expected value and variance to derive the
sampling distribution.
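The first approach, actually drawing repeated samples, can be sketched with a short simulation. The sketch below is an illustration only: it uses the five-value population 3, 6, 9, 12, 15 that this chapter uses in its illustrations, and the number of repetitions (5000) and the seed are arbitrary choices of ours.

```python
import random
import statistics
from collections import Counter

population = [3, 6, 9, 12, 15]
rng = random.Random(1)  # fixed seed so that the run is repeatable

# Draw many samples of size 3 (without replacement) and record each sample mean.
sample_means = [statistics.fmean(rng.sample(population, 3)) for _ in range(5000)]

# Descriptive summary of the simulated sampling distribution.
mean_of_means = statistics.fmean(sample_means)
relative_frequency = {
    value: count / len(sample_means)
    for value, count in sorted(Counter(sample_means).items())
}

print(f"mean of the sample means: {mean_of_means:.3f}")
print(relative_frequency)
```

The mean of the simulated sample means should lie close to the population mean of 9; the second approach reaches the same result exactly, by probability rules rather than by repeated drawing.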
Illustration 1
Required:
Solution
The number of possible samples is ${}^{N}C_{n} = {}^{5}C_{3} = 10$.
For each sample we can compute the mean value (i.e., the sample statistic). The
following table shows the mean value for each sample.

Samples        Mean ($\bar{x}$)
3, 6, 9        6
3, 6, 12       7
The population mean is:
$$\mu = \frac{\sum x}{N} = \frac{3 + 6 + 9 + 12 + 15}{5} = \frac{45}{5} = 9$$
This population mean ($\mu$) differs from some of the sample means. This leads us to the
concept of a sampling distribution.
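The mean of the sample means is:
$$\mu_{\bar{x}} = \frac{\sum \bar{x}}{\text{no. of } \bar{x}} = 9$$
The standard deviation of the sample means is:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
where the population variance and standard deviation are:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{90}{5} = 18$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{18} = 4.243$$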
1. The expected value of the sample mean, $E(\bar{x})$ (or the mean of the sample means),
is equal to the population mean. Algebraically, $\mu_{\bar{x}} = E(\bar{x}) = \mu$.
2. Given the population mean ($\mu$), population standard deviation ($\sigma$), the sample
size ($n$) and population size ($N$), the standard deviation of the sample mean is given as:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \quad \text{(for an infinite population)}$$
A population is said to be infinite when it is not possible to list or count all the elements
included in the population (i.e., when the elements are unlimited). When the elements in
the population are limited, the population may still be considered as infinite if the
sample size is small; as a rule of thumb, statisticians consider the population as
infinite when $n \le 5\%$ of $N$. A population is treated as finite when $n > 0.05N$. In
that case $\sigma / \sqrt{n}$ is multiplied by the value
$$\sqrt{\frac{N - n}{N - 1}}$$
which is referred to as the finite population correction factor.
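Both properties can be verified exactly for a small population by listing every possible sample. The sketch below does this for the population 3, 6, 9, 12, 15 with samples of size 3 (here n is 60% of N, so the finite population correction factor applies). The variable names are our own.

```python
import statistics
from itertools import combinations
from math import isclose, sqrt

population = [3, 6, 9, 12, 15]
N, n = len(population), 3

# All 10 possible samples of size 3 and their means.
sample_means = [statistics.fmean(s) for s in combinations(population, n)]

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

mean_of_means = statistics.fmean(sample_means)  # property 1: equals mu
sd_of_means = statistics.pstdev(sample_means)   # standard deviation of the sample means

fpc = sqrt((N - n) / (N - 1))                   # finite population correction factor
sd_from_formula = sigma / sqrt(n) * fpc         # property 2, finite population

print(mean_of_means, mu)
print(sd_of_means, sd_from_formula)
```

The two printed pairs agree: the mean of the sample means equals the population mean, and the standard deviation of the sample means matches the corrected formula, not the uncorrected value of 4.243 / 1.732.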
Illustration 2
The average lifetime of a light bulb is 3000 hours with a standard deviation of 696 hours.
A simple random sample of 36 bulbs is taken.
(a) What is the expected value, standard deviation, and shape of the sampling
distribution of x ?
(b) What is the probability that the average life time in the sample will be between
2670.56 and 2809.76 hours?
(c) What is the probability that the average life time in the sample will be equal to or
greater than 3219.24 hours?
(d) What is the probability that the average life time in the sample will be equal to or
less than 3180.96 hours?
(e) How large a sample needs to be taken to provide a 0.01 probability that the
average life time in the sample will be equal to or greater than 3219.24 hours?
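Solution:
a) The expected value is $E(\bar{x}) = \mu = 3000$ hours, and the standard deviation is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{696}{\sqrt{36}} = 116$$
Because the sample is large ($n = 36$), the sampling distribution of $\bar{x}$ is
approximately normal.
b) $$P(2670.56 \le \bar{x} \le 2809.76) = P\left[\frac{2670.56 - 3000}{116} \le \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \le \frac{2809.76 - 3000}{116}\right]$$
$$= P(-2.84 \le Z \le -1.64) = 0.0482$$
c) $$P(\bar{x} \ge 3219.24) = P\left[\frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \ge \frac{3219.24 - 3000}{116}\right] = P(Z \ge 1.89) = 0.0294$$
d) $$P(\bar{x} \le 3180.96) = P\left[\frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \le \frac{3180.96 - 3000}{116}\right] = P(Z \le 1.56) = 0.9406$$
e) $$0.01 = P(\bar{x} \ge 3219.24) = P\left[\frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \ge \frac{3219.24 - 3000}{\sigma_{\bar{x}}}\right] = P\left(Z \ge \frac{219.24}{\sigma_{\bar{x}}}\right)$$
$$Z_{0.01} \approx \frac{219.24}{\sigma_{\bar{x}}} \quad\Rightarrow\quad 2.33 = \frac{219.24}{696 / \sqrt{n}}$$
$$2.33 \times \frac{696}{\sqrt{n}} = 219.24$$
$$n = 54.71 \approx 55$$
Statistics For Management II Page 24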
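The table look-ups in this solution can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes, as the solution does, that the sample mean is approximately normal. The variable names are our own.

```python
from math import ceil, sqrt
from statistics import NormalDist

mu, sigma, n = 3000, 696, 36

se = sigma / sqrt(n)        # (a) standard error of the sample mean
xbar = NormalDist(mu, se)   # approximate sampling distribution of the sample mean

p_b = xbar.cdf(2809.76) - xbar.cdf(2670.56)  # (b) between 2670.56 and 2809.76
p_c = 1 - xbar.cdf(3219.24)                  # (c) at least 3219.24
p_d = xbar.cdf(3180.96)                      # (d) at most 3180.96

# (e) Solve z * sigma / sqrt(n) = 3219.24 - mu for n, with an upper-tail area of 0.01.
z_01 = NormalDist().inv_cdf(0.99)
n_required = ceil((z_01 * sigma / (3219.24 - mu)) ** 2)

print(se)
print(round(p_b, 4), round(p_c, 4), round(p_d, 4))
print(n_required)
```

The code uses the unrounded z value (about 2.326) in part (e), while the hand calculation uses the table value 2.33; both lead to the same sample size after rounding up.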
Dear student, in the previous section of this chapter we discussed the sampling
distribution of the sample mean. Another sampling distribution that you will soon
encounter is that of the difference between two sample means. The sampling plan calls
for independent random samples drawn from each of two normal populations.
Suppose two populations of size $N_1$ and $N_2$ are given. For each sample of size $n_1$
from the first population, compute the sample mean $\bar{x}_1$ and standard deviation
$\sigma_{\bar{x}_1}$. Similarly, for each sample of size $n_2$ from the second population,
compute the sample mean $\bar{x}_2$ and standard deviation $\sigma_{\bar{x}_2}$.
For all combinations of these samples from these populations, we can obtain the sampling
distribution of the difference of two sample means ($\bar{x}_1 - \bar{x}_2$). The mean of
this distribution is given by:
$$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_{\bar{x}_1} - \mu_{\bar{x}_2} = \mu_1 - \mu_2$$
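Since the standard error of a sampling distribution is the standard deviation of the
sampling distribution, the standard error of the difference between means is:
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$$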
We find the Z score by assuming that there is no difference between the population
means.
Illustration
In a study of annual family expenditures for general health care, two populations were
surveyed with the following results:
If the variances of the populations are $\sigma_1^2 = 2800$ and $\sigma_2^2 = 3250$, what
is the probability of obtaining sample results ($\bar{x}_1 - \bar{x}_2$) as large as those
shown if there is no difference in the means of the two populations?
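Solution
$$Z \ge \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$$
$$Z \ge \frac{(346 - 320) - (0)}{\sqrt{\frac{2800}{40} + \frac{3250}{35}}}$$
$$Z \ge 2.04$$
$$P(Z \ge 2.04) = 0.5000 - 0.4793 = 0.0207$$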
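The same calculation can be reproduced step by step in code. This sketch takes the sample means (346 and 320) and sample sizes (40 and 35) from the figures used in the solution above, and uses `statistics.NormalDist` in place of the normal table. The variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

x1_bar, x2_bar = 346, 320  # sample means used in the solution
var1, var2 = 2800, 3250    # population variances
n1, n2 = 40, 35            # sample sizes

# Standard error of the difference between the two sample means.
se_diff = sqrt(var1 / n1 + var2 / n2)

# Z score under the assumption of no difference between the population means.
z = ((x1_bar - x2_bar) - 0) / se_diff

# Upper-tail probability of a difference at least this large.
p_value = 1 - NormalDist().cdf(z)

print(round(se_diff, 4), round(z, 2), round(p_value, 4))
```

The small difference between this result and 0.0207 comes from rounding the Z score to two decimals before using the table.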
Dear learner, in this part of the chapter we will discuss the sampling distribution of
the sample proportion. The sample proportion ($\bar{p}$) is the point estimator of the
population proportion $P$. The formula for computing the sample proportion is
$$\bar{p} = \frac{x}{n}$$
Where:
x = the number of elements in the sample that possess the characteristic of interest
n = sample size
The sample proportion ($\bar{p}$) is a random variable and its probability distribution is
called the sampling distribution of $\bar{p}$. The sampling distribution of $\bar{p}$ is
the probability distribution of all possible values of the sample proportion $\bar{p}$.
Illustration 1
Consider a population of N = 5 given numbers 3, 6, 9, 12, and 15, and let the
characteristic of interest be that a number is even. Consider samples of size 3 (n = 3)
drawn from the population. The samples and their sample proportions are given in the
table below.
Required:
Solution
The proportion of even numbers in the population is $P = \frac{2}{5} = 0.4$.
The number of possible samples is ${}^{N}C_{n} = {}^{5}C_{3} = 10$.
The following are the elements in the sample.
Given the above table, we can construct the probability distribution of the
sample proportions as shown in the table below.
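The probability distribution of the sample proportion for this small population can be built by listing all possible samples. The sketch below is our own illustration of that procedure; it counts the even numbers in each sample of size 3 and uses exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [3, 6, 9, 12, 15]
n = 3

# All possible samples of size 3.
samples = list(combinations(population, n))

# Sample proportion of even numbers in each sample.
proportions = [Fraction(sum(x % 2 == 0 for x in s), n) for s in samples]

# Probability distribution: each sample is equally likely.
distribution = {
    p: Fraction(count, len(samples))
    for p, count in sorted(Counter(proportions).items())
}

# Expected value of the sample proportion.
expected_value = sum(p * prob for p, prob in distribution.items())

for p, prob in distribution.items():
    print(f"sample proportion = {p}, probability = {prob}")
print(f"expected value = {expected_value}")
```

The expected value of the sample proportion equals the population proportion of even numbers, 2/5 = 0.4.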
Symbolically: $E(\bar{p}) = P$
2. Just as with the standard deviation of the sample means ($\sigma_{\bar{x}}$), the
standard deviation of the sample proportion ($\sigma_{\bar{p}}$) also depends on whether
the population is finite or infinite. It follows that the standard deviation of the sample
proportion is:
$$\sigma_{\bar{p}} = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{P(1 - P)}{n}} \quad \text{for a finite population (i.e., } n > 0.05N\text{)}$$
$$\sigma_{\bar{p}} = \sqrt{\frac{P(1 - P)}{n}} \quad \text{for an infinite population (i.e., } n \le 0.05N\text{)}$$
Illustration 2
A new soft drink is being market tested. It is estimated that 60% of consumers will like
the new drink. A sample of 96 taste-tested the new drink.
Required:
(c) What is the probability that equal to or more than 30% of consumers will indicate
they do not like the drink?
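Solution
(a) The standard deviation of the sample proportion:
$$\sigma_{\bar{p}} = \sqrt{\frac{P(1 - P)}{n}} = \sqrt{\frac{0.6 \times 0.4}{96}} = 0.05$$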
(b) The probability that 70.4% or more of consumers will indicate they like the drink:
$$P(\bar{p} \ge 0.704) = P\left(\frac{\bar{p} - P}{\sigma_{\bar{p}}} \ge \frac{0.704 - 0.6}{0.05}\right) = P(Z \ge 2.08) = 0.0188$$
(c) The probability that 30% or more of consumers will indicate they do not like the
drink. This is the same as the probability that fewer than 70% of consumers will
indicate they like the drink:
$$P(\bar{p} < 0.70) = P\left(\frac{\bar{p} - P}{\sigma_{\bar{p}}} < \frac{0.70 - 0.6}{0.05}\right) = P(Z < 2.00) = 0.9772$$
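These results can be checked numerically. The sketch below assumes, as the solution does, that the sample proportion is approximately normal and that the population is large enough to ignore the finite population correction factor; it uses `statistics.NormalDist` in place of the normal table. The variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

P, n = 0.60, 96  # population proportion and sample size

# (a) Standard error of the sample proportion.
se = sqrt(P * (1 - P) / n)

p_bar = NormalDist(P, se)  # approximate sampling distribution of the sample proportion

# (b) Probability that 70.4% or more of the sample like the drink.
prob_b = 1 - p_bar.cdf(0.704)

# (c) At least 30% dislike the drink, i.e. at most 70% like it.
prob_c = p_bar.cdf(0.70)

print(round(se, 4), round(prob_b, 4), round(prob_c, 4))
```

Because the normal distribution is continuous, "fewer than 70%" and "at most 70%" give the same probability here.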
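For all combinations of these samples from these populations, we can obtain the sampling
distribution of the difference of two sample proportions ($\bar{p}_1 - \bar{p}_2$). The
mean and the standard deviation of this distribution are given by:
$$\mu_{\bar{p}_1 - \bar{p}_2} = P_1 - P_2$$
$$\sigma_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{P_1 (1 - P_1)}{n_1} + \frac{P_2 (1 - P_2)}{n_2}}$$
$$Z = \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}$$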
If the sample sizes $n_1$ and $n_2$ are large, that is, $n_1 \ge 30$ and $n_2 \ge 30$, the
sampling distribution of the difference of two sample proportions is closely approximated
by a normal distribution.
Illustration
Solution
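The probability that the difference in sample proportions is less than or equal to 0.02,
$P(\bar{p}_1 - \bar{p}_2 \le 0.02)$:
$$P\left(Z \le \frac{(\bar{p}_1 - \bar{p}_2) - (P_1 - P_2)}{\sigma_{\bar{p}_1 - \bar{p}_2}}\right) = P(Z \le -1.32) = 0.0934$$
Hence, the desired probability that the difference in sample proportions satisfies
$\bar{p}_1 - \bar{p}_2 \le 0.02$ is 0.0934.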
Summary
$E(\bar{x}) = \mu$ and $E(\bar{p}) = P$. After developing the standard deviation or
standard error formulas for these estimators, we described the conditions necessary for
the sampling distributions of $\bar{x}$ and $\bar{p}$ to follow a normal distribution.
Other sampling methods including stratified random
sampling, cluster sampling, systematic sampling, convenience sampling, and judgment
sampling were discussed.
Glossary
Simple random sample – a simple random sample of size n from a finite population of
size N is a sample selected such that each possible sample of size n has the same
probability of being selected.
Random sample - a random sample from an infinite population is a sample selected such
that the following conditions are satisfied: (1) Each element selected comes from the
same population; (2) each element is selected independently.
and whenever a finite population, rather than an infinite population, is being sampled.
The generally accepted rule of thumb is to ignore the finite population correction factor
whenever n/N ≤ .05.
Central limit theorem - is a theorem that enables one to use the normal probability
distribution to approximate the sampling distribution of $\bar{x}$ whenever the sample
size is large.
2) Which sampling technique is more favorable? Justify your answer.
ii. If a customer buys a carton of four bottles, what is the probability that the
mean amount of the four bottles will be greater than 32 ounces?
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
5) In a specific election, a state representative received 52% of the votes cast. One year
after the election, the representative organized a survey that asked a random sample
of 300 people whether they would vote for him in the next election. If we assume that
his popularity has not changed, what is the probability that more than half of the
sample would vote for him?
-------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------
-------------------------------------------------
6) Assume there are two species of green beings on Mars. The mean height of Species 1
is 32 while the mean height of Species 2 is 22. The variances of the two species are
60 and 70, respectively and the heights of both species are normally distributed. You
randomly sample 10 members of Species 1 and 14 members of Species 2. What is the
probability that the mean of the 10 members of Species 1 will exceed the mean of the
14 members of Species 2 by 5 or more?
------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------
CHAPTER TWO
STATISTICAL ESTIMATION
Introduction
Dear Students! Welcome to the second chapter of this module. In Chapter 1 we
discussed sampling and sampling distributions. The sampling distribution of the
mean shows how far sample means could be from a known population mean. Similarly,
the sampling distribution of the proportion shows how far sample proportions could be
from a known population proportion.
Dear students! Do you know the meaning of statistical inference? Statistical inference
is the act of generalizing from a sample to a population with a calculated degree of
certainty. The two forms of statistical inference are estimation and hypothesis testing.
This chapter introduces estimation. The next chapter introduces hypothesis testing.
A statistical population represents the set of all possible values for a variable. In practice,
we do not study the entire population. Instead, we use data in a sample to shed light on
the wider population. The process of generalizing from the sample to the population is
statistical inference.
Estimation is the process of using sample statistics as estimates of population
parameters. It is any procedure in which sample information is used to estimate (predict)
the numerical value of some population measure (called a parameter).
Types of Estimates:
Dear student! There are two types of estimates of a population parameter: a point
estimate and an interval estimate.
The most important point estimates (given that they are single values) are:
o Sample mean ($\bar{x}$) for population mean (μ);
o Sample proportion ($\bar{p}$) for population proportion (P);
o Sample variance ($s^2$) for population variance ($\sigma^2$); and
o Sample standard deviation (s) for population standard deviation (σ).
Activities
- What is estimation?
Illustration One
Suppose that a business statistics professor wants to estimate the mean summer income of
his second-year business students. Selecting 25 students at random, he calculates the
sample mean weekly income to be Br. 400. The point estimate is the sample mean. In
other words, he estimates the mean weekly summer income of all second-year business
students to be Br. 400. Thus, Br. 400 is point estimate value of the actual mean (average)
weekly income of students.
Illustration Two
The computed value of the sample mean lifetime is $\bar{x} = 5.77$. It is reasonable to
regard 5.77 as a very plausible value of μ, "our best guess" for the value of μ based on
the available sample information.
Illustration Three
Suppose we have the sample 10, 20, 30, 40 and 50 selected randomly from a population
whose mean (μ) is unknown. The sample mean is:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30$$
The single value 30 is a point estimate of μ. On the other hand, if we state that the
mean, μ, lies within $\bar{x} \pm 10$, the range of values from 20 (30 − 10) to 40 (30 + 10)
is an interval estimate.
Dear student! In the previous part of this chapter, we discussed point estimation of
population mean and population proportion. In this part of the chapter, we will discuss
how we estimate the value of population mean based on interval estimation.
Interval estimation consists of two numerical values defining an interval within which
the unknown parameter we want to estimate lies with a specified degree of confidence.
Confidence level    α     Z(α/2)
95%                 5%    1.96
99%                 1%    2.58
$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
Because the sample mean can be greater than or less than the population mean, Z can be
positive or negative. Thus, the preceding expression takes the form:
$$\mu = \bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$$
The value of the population mean (μ) lies somewhere within this range. Rewriting this
expression yields the confidence interval for the population mean:
$$\bar{x} - Z \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z \frac{\sigma}{\sqrt{n}}$$
When the population distribution is normal and at the same time σ is known, we can
estimate μ (regardless of the sample size) using the following formula.
$$\mu = \bar{x} \pm Z \frac{\sigma}{\sqrt{n}}$$
Where,
$\bar{x}$ = sample mean
Z = value from the standard normal table reflecting confidence level
σ = population standard deviation
n = sample size
α = the proportion of incorrect statements (α = 1 – Confidence level)
μ = unknown population mean
From the above formula we can learn that an interval estimate is constructed by adding
and subtracting the error term to and from the point estimate. That is, the point estimate is
found at the center of the confidence interval.
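To find the interval estimate of the population mean, μ, we follow these steps.
1. Compute the standard error of the mean, $\sigma_{\bar{x}} = \sigma / \sqrt{n}$.
2. Compute α/2 from the confidence coefficient.
3. Find the Z value for α/2 from the table.
4. Construct the confidence interval: $\mu = \bar{x} \pm Z_{\alpha/2}\,\sigma_{\bar{x}}$.
5. State the conclusion.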
Illustration 1
The vice president of operations for Ethio Telecom is in the process of developing a
strategic management plan. He believes that the ability to estimate the length of the
average phone call on the system is important. He takes a random sample of 60 calls
from the company records and finds that the mean sample length for a call is 4.26
minutes. Past history for these types of calls has shown that the population standard
deviation for call length is about 1.1 minutes. Assuming that the population is normally
distributed and he wants to have a 95% confidence, help him in estimating the population
mean.
Solution:
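Given: n = 60 calls
$\bar{x}$ = 4.26 minutes
σ = 1.1 minutes
CL = 0.95
Step 1: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1.1}{\sqrt{60}} = 0.142$
Step 2: $\alpha = 1 - 0.95 = 0.05$, so $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
Step 3: $Z_{\alpha/2} = Z_{0.025} = 1.96$
Step 4: $\mu = \bar{x} \pm Z_{\alpha/2} \times \sigma_{\bar{x}}$
= 4.26 ± 1.96 × 0.142
= 4.26 ± 0.28
3.98 ≤ μ ≤ 4.54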
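The steps used in this illustration can also be sketched in Python. This is only a minimal illustration and not part of the original module: the function name `z_interval` is hypothetical, and the Z value is computed with the standard library's `statistics.NormalDist` instead of being read from a printed table, so it is about 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(mean, sigma, n, confidence):
    """Confidence interval for a population mean when sigma is known."""
    std_error = sigma / sqrt(n)               # Step 1: standard error of the mean
    alpha = 1 - confidence                    # Step 2: alpha, then alpha / 2
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Step 3: Z value for alpha / 2
    margin = z * std_error                    # Step 4: margin of error
    return mean - margin, mean + margin


# Ethio Telecom illustration: n = 60, sample mean 4.26, sigma = 1.1, CL = 0.95
lower, upper = z_interval(4.26, 1.1, 60, 0.95)
print(f"{lower:.2f} <= mu <= {upper:.2f}")   # prints 3.98 <= mu <= 4.54
```

Passing a different confidence level (for example 0.99) changes only the Z value, which is why the interval widens as the confidence level increases.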
Illustration 2
A survey conducted by “Addis Zemen Gazetta” found that the sample mean age of men
was 44 years and the sample mean age of women was 47 years. Altogether, 454 people
from Addis were included in the reader poll – 340 women and 114 men. Assume that the
population standard deviation of age for both men and women is 8 years.
a. Develop a 95% confidence interval estimate for the mean age of the population of
men who read the gazetta.
b. Develop a 95% confidence interval estimate for the mean age of the population of
women who read the gazetta.
c. Compare the widths of the two interval estimates from parts (a) and (b). Which one
has better precision? Why?
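Solution:
a. Given: n = 114 men
$\bar{x}$ = 44 years
σ = 8 years
CL = 0.95
Step 1: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{114}} = 0.75$
Step 2: $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
Step 3: $Z_{\alpha/2} = Z_{0.025} = 1.96$
Step 4: $\mu = \bar{x} \pm Z_{\alpha/2} \times \sigma_{\bar{x}}$
= 44 ± 1.96 × 0.75
= 44 ± 1.47
42.53 ≤ μ ≤ 45.47
Step 5: Conclusion: the 95% confidence interval estimate for the mean age of the
population of men who read the gazetta is between 42.53 and 45.47 years.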
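b. Given: n = 340 women
$\bar{x}$ = 47 years
σ = 8 years
CL = 0.95
Step 1: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{340}} = 0.434$
Step 2: $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
Step 3: $Z_{\alpha/2} = Z_{0.025} = 1.96$
Step 4: $\mu = \bar{x} \pm Z_{\alpha/2} \times \sigma_{\bar{x}}$
= 47 ± 1.96 × 0.434
= 47 ± 0.85
46.15 ≤ μ ≤ 47.85
Step 5: Conclusion: the 95% confidence interval estimate for the mean age of the
population of women who read the gazetta is between 46.15 and 47.85 years.
c. Part (b) has better precision: its interval is narrower (± 0.85 years compared with
± 1.47 years) because its sample size is larger than that of part (a).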
Illustration Three
Time magazine reports information on the time required for caffeine from products such
as coffee and soft drinks to leave the body after consumption. Assume that the 99%
confidence interval estimate of the population mean time for adults is 5.6 hrs. to 6.4 hrs.
i. What is the point estimate of the mean time for caffeine to leave the body after
consumption?
ii. If the population standard deviation is 2 hrs., how large a sample was used to
provide the interval estimate?
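Solution:
i. The point estimate is the midpoint of the interval:
$$\bar{x} = \frac{5.6 + 6.4}{2} = 6 \text{ hrs.}$$
Equivalently, adding the two limits $5.6 = \bar{x} - Z\frac{\sigma}{\sqrt{n}}$ and
$6.4 = \bar{x} + Z\frac{\sigma}{\sqrt{n}}$ gives $12 = 2\bar{x}$, so $\bar{x} = 6$ hours.
ii. α = 1 − CL = 1 − 0.99 = 0.01
α/2 = 0.005
$Z_{\alpha/2} = Z_{0.005} = 2.58$
$$6.4 = \bar{x} + Z\frac{\sigma}{\sqrt{n}}$$
$$6.4 = 6 + 2.58 \times \frac{2}{\sqrt{n}}$$
$$0.4 = \frac{5.16}{\sqrt{n}}$$
$$\sqrt{n} = \frac{5.16}{0.4} = 12.9$$
$$n = 166.41 \approx 166$$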
Confidence interval estimate of μ - Normal population, σ unknown, n large
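If we know that the population is normal, and we know the population standard deviation
(σ), the confidence interval for μ should be constructed in the manner already shown.
When σ is unknown but the sample size is large, the sample standard deviation (s) is
used in place of σ, i.e.,
$$\mu = \bar{x} \pm Z \frac{s}{\sqrt{n}}$$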
Illustration 1
Suppose that a car rental firm in Addis wants to estimate the average number of miles
traveled by each of its rented cars. A random sample of 110 rented cars reveals that the
sample mean travel distance per day is 85.5 miles, with a sample standard deviation of
19.3 miles. Compute a 99% confidence interval to estimate μ.
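Solution:
Given: n = 110 cars
$\bar{x}$ = 85.5 miles
s = 19.3 miles
CL = 0.99
Step 1: $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{19.3}{\sqrt{110}} = 1.84$
Step 2: $\frac{\alpha}{2} = \frac{0.01}{2} = 0.005$
Step 3: $Z_{\alpha/2} = Z_{0.005} = 2.58$
Step 4: $\mu = \bar{x} \pm Z_{\alpha/2} \times s_{\bar{x}}$
= 85.5 ± 2.58 × 1.84
= 85.5 ± 4.747
80.753 ≤ μ ≤ 90.247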
Step 5: Conclusion: we state with 99% confidence that the average distance
traveled by rented cars lies between 80.753 and 90.247 miles.
Illustration 2
A study is being conducted in a company that has 800 engineers. A random sample of 50
of these engineers reveals that the average sample age is 34.3 years, and the sample
standard deviation is 8 years. Assuming normality, construct a 98% confidence interval to
estimate the average age of all engineers in this company.
Solution:
Given: n = 50 engineers
N = 800 engineers
x = 34.3 years
s = 8 years
CL = 0.98
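Step 1: $s_{\bar{x}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} = \frac{8}{\sqrt{50}} \sqrt{\frac{800-50}{800-1}} = 1.10$
Step 2: $\frac{\alpha}{2} = \frac{0.02}{2} = 0.01$
Step 3: $Z_{\alpha/2} = Z_{0.01} = 2.33$
Step 4: $\mu = \bar{x} \pm Z_{\alpha/2} \times s_{\bar{x}}$
= 34.3 ± 2.33 × 1.10
= 34.3 ± 2.56
31.74 ≤ μ ≤ 36.86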
Step 5: Conclusion: We state with 98% confidence that the mean age of engineers
lies between 31.74 and 36.86 years.
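The effect of the finite population correction factor in this illustration can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function name `z_interval_finite` is hypothetical, and the Z value is computed with `statistics.NormalDist` (about 2.326) rather than taken from the table (2.33), so the limits can differ from the hand calculation in the second decimal place.

```python
from math import sqrt
from statistics import NormalDist


def z_interval_finite(mean, s, n, population_size, confidence):
    """Large-sample interval for a mean with the finite population correction."""
    # Finite population correction factor: sqrt((N - n) / (N - 1))
    fpc = sqrt((population_size - n) / (population_size - 1))
    std_error = (s / sqrt(n)) * fpc
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_error
    return mean - margin, mean + margin


# Engineers illustration: N = 800, n = 50, sample mean 34.3, s = 8, CL = 0.98
lower, upper = z_interval_finite(34.3, 8, 50, 800, 0.98)
print(round(lower, 2), round(upper, 2))
```

Here n/N = 50/800 = 0.0625, which is above the 0.05 rule of thumb, so the correction factor is applied.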
Dear student! Here we will discuss how we estimate the value of population proportion
based on interval estimation through illustrations.
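$$Z = \frac{\bar{p} - p}{\sigma_{\bar{p}}} = \frac{\bar{p} - p}{\sqrt{\frac{pq}{n}}}$$
However, here p is unknown and we want to estimate p by $\bar{p}$, and hence Z becomes:
$$Z = \frac{\bar{p} - p}{\sqrt{\frac{\bar{p}\,\bar{q}}{n}}}$$
Solving for p gives:
$$p = \bar{p} \pm Z \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
$$p = \bar{p} \pm Z_{\alpha/2} \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
$$p = \bar{p} \pm Z_{\alpha/2}\, s_{\bar{p}}$$
Where:
$\bar{p}$ = sample proportion
$\bar{q} = 1 - \bar{p}$
α = 1 – CL
n = sample size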
Illustration 1
Solution:
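Given:
n = 87
$\bar{p}$ = 0.39
$\bar{q}$ = 0.61
CL = 0.95
Step 1: $s_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}} = \sqrt{\frac{0.39 \times 0.61}{87}} = 0.0523$
Step 2: $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
Step 3: $Z_{\alpha/2} = Z_{0.025} = 1.96$
Step 4: $p = \bar{p} \pm Z_{\alpha/2} \times s_{\bar{p}}$
= 0.39 ± 1.96 × 0.0523
= 0.39 ± 0.1025
0.2875 ≤ p ≤ 0.4925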
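The interval for a population proportion can be sketched in Python in the same way. This is a minimal illustration, not the module's own procedure: the function name `proportion_interval` is hypothetical, and the Z value comes from `statistics.NormalDist` rather than from the table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_bar, n, confidence):
    """Large-sample confidence interval for a population proportion."""
    q_bar = 1 - p_bar
    std_error = sqrt(p_bar * q_bar / n)                  # s_p-bar
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z for alpha / 2
    margin = z * std_error
    return p_bar - margin, p_bar + margin


# Values from the illustration: p-bar = 0.39, n = 87, CL = 0.95
lower, upper = proportion_interval(0.39, 87, 0.95)
print(f"{lower:.4f} <= p <= {upper:.4f}")
```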
Illustration 2
Solution:
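Given:
n = 400
0.73 ≤ p ≤ 0.87
Point estimate = $\frac{0.73 + 0.87}{2} = 0.80$
Equivalently, adding the two limits:
$0.73 = \bar{p} - Z_{\alpha/2} \times s_{\bar{p}}$
$0.87 = \bar{p} + Z_{\alpha/2} \times s_{\bar{p}}$
$1.60 = 2\bar{p}$
$\bar{p} = 0.8$
Number of females:
The confidence level is found from the upper limit, $\bar{p} + Z_{\alpha/2} \times s_{\bar{p}}$, with
$s_{\bar{p}} = \sqrt{\frac{0.8 \times 0.2}{400}} = 0.02$:
$0.87 = 0.80 + Z_{\alpha/2} \times s_{\bar{p}}$
$0.07 = Z_{\alpha/2} \times 0.02$
$3.50 = Z_{\alpha/2}$
$P(0 \le Z \le 3.50) = 0.49977$
CL = 0.49977 × 2 = 0.99954
CL = 99.954%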
Illustration 3
A random sample of 400 faculty members at AAU contained 120 people who believed
that the University should improve its library service. On the basis of this sample
information, an analyst calculated the confidence interval (0.25, 0.35) for the population
proportion of faculty members favoring improvement. What is the level of confidence of
this interval?
Solution:
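Given:
n = 400
x = 120
$\bar{p} = \frac{120}{400} = 0.30$
Using the lower limit, $\bar{p} - Z_{\alpha/2} \times s_{\bar{p}}$:
$0.25 = 0.30 - Z_{\alpha/2} \times s_{\bar{p}}$
$0.05 = Z_{\alpha/2} \times \sqrt{\frac{0.70 \times 0.30}{400}}$
$0.05 = Z_{\alpha/2} \times 0.023$
$2.17 = Z_{\alpha/2}$
$P(0 \le Z \le 2.17) = 0.485$
CL = 0.485 × 2 = 0.97
CL = 97%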
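Working backwards from an interval to its confidence level, as in Illustrations 2 and 3, can be sketched in Python. This is only an illustration under stated assumptions: the function name `implied_confidence` is hypothetical, and the area under the normal curve is computed with `statistics.NormalDist` instead of being read from a table, so no intermediate rounding of the standard error takes place.

```python
from math import sqrt
from statistics import NormalDist


def implied_confidence(p_bar, n, lower, upper):
    """Confidence level implied by a given interval for a proportion."""
    std_error = sqrt(p_bar * (1 - p_bar) / n)
    margin = (upper - lower) / 2          # half-width of the interval
    z = margin / std_error                # Z value that produced this margin
    # Area between -z and +z under the standard normal curve
    return 2 * NormalDist().cdf(z) - 1


# Illustration 3: 120 of 400 faculty members, interval (0.25, 0.35)
print(round(implied_confidence(0.30, 400, 0.25, 0.35), 2))   # prints 0.97
```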
Dear learner! Have you understood the methods used to compute an interval estimate of a
population mean? Here we will discuss the concepts and methods used to compute an
interval estimate of the difference between two population means.
Letting μ1 denote the mean of population 1 and μ2 denote the mean of population 2, we
will focus on inferences about the difference between the means: μ1 − μ2. To make an
inference about this difference, we select a simple random sample of n1 units from
population 1 and a second simple random sample of n2 units from population 2. The two
samples, taken separately and independently, are referred to as independent simple
random samples. In this section, we assume that information is available such that the
two population standard deviations, σ1 and σ2, can be assumed known prior to collecting
the samples. We refer to this situation as the σ1 and σ2 known case. In the following
example we show how to compute a margin of error and develop an interval estimate of
the difference between the two population means when σ1 and σ2 are known.
Let us define population 1 as all customers who shop at the inner-city store and
population 2 as all customers who shop at the suburban store.
μ1 - mean of population 1 (i.e., the mean age of all customers who shop at the
inner-city store)
μ2 - mean of population 2 (i.e., the mean age of all customers who shop at the
suburban store)
$\bar{x}_1$ - sample mean age for the simple random sample of n1 inner-city customers
$\bar{x}_2$ - sample mean age for the simple random sample of n2 suburban customers
The point estimator of the difference between the two population means is the difference
between the two sample means (i.e., $\bar{x}_1 - \bar{x}_2$). As with other point estimators, the point
estimator $\bar{x}_1 - \bar{x}_2$ has a standard error that describes the variation in the sampling
distribution of the estimator. With two independent simple random samples, the standard
error of $\bar{x}_1 - \bar{x}_2$ is as follows:
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
The interval estimate of the difference between the two population means is then:
$$\bar{x}_1 - \bar{x}_2 \pm Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Illustration 1
Let us return to the Greystone example. Based on data from previous customer
demographic studies, the two population standard deviations are known with σ 1=9 years
and σ 2=10 years. The data collected from the two independent simple random samples of
Greystone customers provided the following results.
Solution
Using the above expression, we find that the point estimate of the difference between the
mean ages of the two populations is $\bar{x}_1 - \bar{x}_2 = 5$ years.
Thus, we estimate that the customers at the inner-city store have a mean age five years
greater than the mean age of the suburban store customers.
Using 95% confidence and $Z_{\alpha/2} = Z_{0.025} = 1.96$, we have the interval estimate:
$$\bar{x}_1 - \bar{x}_2 \pm Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = 5 \pm 4.06$$
Thus, the margin of error is 4.06 years and the 95% confidence interval estimate of the
difference between the two population means is 5 − 4.06 = 0.94 years to 5 + 4.06 = 9.06
years.
Illustration 2
A research team is interested in the difference between serum uric acid levels in patients
with and without Down's syndrome. In a large hospital for the treatment of the mentally
retarded, a sample of 12 individuals with Down's syndrome yielded a mean of x 1=4.5
mg/100 ml. In a general hospital a sample of 15 normal individuals of the same age and
sex were found to have a mean value of x 2=3.4 mg/100 ml. If it is reasonable to assume
that the two populations of values are normally distributed with variances equal to 1 and
1.5 respectively, find the 95 percent confidence interval for μ1−μ2 .
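Given:
$n_1 = 12$, $n_2 = 15$
$\bar{x}_1 = 4.5$, $\bar{x}_2 = 3.4$
$\sigma_1^2 = 1$, $\sigma_2^2 = 1.5$
$$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\frac{1}{12} + \frac{1.5}{15}} = 0.4282$$
$$\bar{x}_1 - \bar{x}_2 \pm Z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = (4.5 - 3.4) \pm 1.96 \times 0.4282 = 1.1 \pm 0.84$$
$$0.26 \le \mu_1 - \mu_2 \le 1.94$$
Discussion: As this is a z-interval, the correct value of z to use for 95% confidence is
1.96. We interpret this interval as follows: the point estimate of the difference between
the two population means is 1.1, and we are 95% confident that the true difference
μ1 − μ2 lies between 0.26 and 1.94.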
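The interval for the difference between two means with known variances can be sketched in Python using the uric acid data. This is a minimal illustration only: the function name `diff_means_interval` is hypothetical, and the Z value is computed with `statistics.NormalDist` rather than read from the table.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_interval(mean1, var1, n1, mean2, var2, n2, confidence):
    """Interval for mu1 - mu2 when both population variances are known."""
    std_error = sqrt(var1 / n1 + var2 / n2)   # standard error of x1-bar - x2-bar
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * std_error
    diff = mean1 - mean2                      # point estimate of mu1 - mu2
    return diff - margin, diff + margin


# Uric acid illustration: means 4.5 and 3.4, variances 1 and 1.5, n1 = 12, n2 = 15
lower, upper = diff_means_interval(4.5, 1.0, 12, 3.4, 1.5, 15, 0.95)
print(f"{lower:.2f} <= mu1 - mu2 <= {upper:.2f}")   # prints 0.26 <= mu1 - mu2 <= 1.94
```

Because the whole interval lies above zero, the data suggest that the first population mean is larger than the second.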
Dear learner! The previous examples of interval estimation are based on the standard
normal distribution (Z). The standard normal distribution is appropriate when the
population standard deviation is known, or when the sample standard deviation is used
and the sample size is large (n ≥ 30). If the sample standard deviation (s) is used as an
estimator of the population standard deviation (σ), the sample size is small (n < 30),
and the population has a normal distribution, interval estimation of the population mean
can be based upon a probability distribution known as the t-distribution.
Characteristics of t-distribution
1. The t-distribution is symmetric about its mean (0) and ranges from - ∞ to ∞.
iii. Look up $t_{\alpha/2,\,v}$
Illustration 1
A random sample of 27 items produces $\bar{x} = 128.4$ and s = 20.6. What is the 98%
confidence interval for µ? Assume that x is normally distributed in the population. What
is the point estimate?
Solution:
The point estimate of the population mean is the sample mean, in this case 128.4 is the
point estimate.
Given:
n=27
x=128.4
s=20.6
CL=0.98
v=n−1=27−1=26
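i. $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{20.6}{\sqrt{27}} = 3.96$
ii. $\frac{\alpha}{2} = \frac{0.02}{2} = 0.01$
iii. $t_{\alpha/2,\,v} = t_{0.01,\,26} = 2.479$
iv. $\mu = \bar{x} \pm t_{\alpha/2,\,v} \frac{s}{\sqrt{n}}$
v. = 128.4 ± 2.479(3.96)
= 128.4 ± 9.82
118.58 ≤ μ ≤ 138.22
We state with 98% confidence that the population mean lies between 118.58 and 138.22.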
Illustration 2
A sample of 20 cab fares in Bahir Dar city shows a sample mean of Br 2.50 and a sample
standard deviation of Br. 0.50. Develop a 90% confidence interval estimate of the mean
cab fares in Bahir Dar city. Assume the population of cab fares has a normal distribution.
Given:
n=20
x=2.50
s=0.50
CL=0.90
v=n−1=20−1=19
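i. $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{0.50}{\sqrt{20}} = 0.112$
ii. $\frac{\alpha}{2} = \frac{0.10}{2} = 0.05$
iii. $t_{\alpha/2,\,v} = t_{0.05,\,19} = 1.729$
iv. $\mu = \bar{x} \pm t_{\alpha/2,\,v} \frac{s}{\sqrt{n}}$
= 2.50 ± 1.729(0.112)
= 2.50 ± 0.194
2.31 ≤ μ ≤ 2.69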
We state with 90% confidence that the mean of cab fares in Bahir Dar city lies
between Birr 2.31 and 2.69.
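The small-sample procedure can be sketched in Python as well. Python's standard library has no t-distribution table, so in this minimal sketch the critical value is looked up by hand and passed in as an argument, just as in step iii above. The function name `t_interval` is hypothetical, and the value 1.729 for 19 degrees of freedom at α/2 = 0.05 is taken from a standard t table.

```python
from math import sqrt


def t_interval(mean, s, n, t_critical):
    """Small-sample interval for a mean using a t value looked up by hand."""
    std_error = s / sqrt(n)             # i. standard error of the mean
    margin = t_critical * std_error     # iv. margin of error
    return mean - margin, mean + margin


# Cab fare illustration: n = 20, sample mean 2.50, s = 0.50, t(0.05, 19) = 1.729
lower, upper = t_interval(2.50, 0.50, 20, 1.729)
print(f"{lower:.2f} <= mu <= {upper:.2f}")   # prints 2.31 <= mu <= 2.69
```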
Illustration 3
Sales personnel for X Company are required to submit weekly reports listing customer
contacts made during the week. A sample size of 61 weekly contact reports showed a
mean of 22.4 customer contacts per week for the sales personnel. The sample standard
deviation was 5 contacts.
a. Develop a 95% confidence interval estimate for the mean number of weekly
customer contacts for the population of sales personnel.
b. Assume that the population of weekly contact data has a normal distribution. Use
the t distribution to develop a 95% confidence interval for the mean number of
weekly customer contacts.
c. Compare your answer for parts (a) and (b). What do you conclude from your
results?
Solutions:
a) Given:
n=61 weekly contact reports
x=22.4 contact
s=5 contact
CL=0.95
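i. $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{5}{\sqrt{61}} = 0.64$
ii. $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
iii. $Z_{\alpha/2} = Z_{0.025} = 1.96$
iv. $\mu = \bar{x} \pm Z_{\alpha/2} \frac{s}{\sqrt{n}}$
v. = 22.4 ± 1.96(0.64)
= 22.4 ± 1.25
21.15 ≤ μ ≤ 23.65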
We can state with 95% confidence that the mean weekly contact lies between
21.15 and 23.65 contacts.
b) Given:
n=61 weekly contact reports
x=22.4 contact
s=5 contact
CL=0.95
ν = n – 1 = 61 – 1 = 60
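i. $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{5}{\sqrt{61}} = 0.64$
ii. $\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$
iii. $t_{\alpha/2,\,v} = t_{0.025,\,60} = 2.00$
iv. $\mu = \bar{x} \pm t_{\alpha/2,\,v} \frac{s}{\sqrt{n}}$
v. = 22.4 ± 2.00(0.64)
= 22.4 ± 1.28
21.12 ≤ μ ≤ 23.68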
We can state with 95% confidence that the mean weekly contact lies between 21.12
and 23.68 contacts.
Dear student! The reason for taking a sample from a population is that it would be too
costly to gather data for the whole population. But collecting sample data also costs
money; and the larger the sample, the higher the cost. To hold cost down, we want to use
as small a sample as possible. On the other hand, we want a sample to be large enough to
provide “good” approximation/estimates of population parameters. Consequently, the
question is “How large should the sample be?”
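Dear student! From the previous discussions, the confidence interval for μ is
$\mu = \bar{x} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$. In this expression, $Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ is called the error of
estimation (e). That is, the difference between $\bar{x}$ and µ which results from the
sampling process. So, $e = Z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$. Solving for n gives:
$$n = \left( \frac{Z_{\alpha/2} \times \sigma}{e} \right)^2$$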
Illustration 1
A gasoline service station shows a standard deviation of Birr 6.25 for the charges made
by its credit card customers. Assume that the station’s management would like to
estimate the population mean gasoline bill for its credit card customers to be within ±
Birr 1.00. For a 95% confidence level, how large a sample would be necessary?
Solution:
Given:
e = Birr 1.00
σ = Birr 6.25
CL = 0.95
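$Z_{\alpha/2} = Z_{0.025} = 1.96$
$$n = \left( \frac{Z_{\alpha/2} \times \sigma}{e} \right)^2$$
$$n = \left( \frac{1.96 \times 6.25}{1} \right)^2$$
$$n = 150.06 \approx 151 \text{ (rounded up)}$$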
Illustration 2
The National Travel and Tour Organization (NTO) would like to estimate the mean
amount of money spent by a tourist to be within Birr 100 with 95% confidence. If the
amount of money spent by tourist is considered to be normally distributed with a standard
Solution:
e = Birr 100
σ = Birr 200
CL = 0.95
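$Z_{\alpha/2} = Z_{0.025} = 1.96$
$$n = \left( \frac{Z_{\alpha/2} \times \sigma}{e} \right)^2$$
$$n = \left( \frac{1.96 \times 200}{100} \right)^2$$
$$n = 15.37 \approx 16 \text{ (rounded up)}$$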
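The sample size formula for estimating a mean can be sketched in Python. This is a minimal illustration: the function name `sample_size_for_mean` is hypothetical, and the Z value is passed in as read from the table (1.96 for 95% confidence). The result is always rounded up, because rounding down would give a margin of error slightly larger than the one requested.

```python
from math import ceil


def sample_size_for_mean(z, sigma, e):
    """Smallest n whose margin of error z * sigma / sqrt(n) does not exceed e."""
    n = (z * sigma / e) ** 2
    # Round to 6 decimals first to absorb floating-point noise, then round up
    return ceil(round(n, 6))


print(sample_size_for_mean(1.96, 6.25, 1.00))   # gasoline station: prints 151
print(sample_size_for_mean(1.96, 200, 100))     # tourist spending: prints 16
```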
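$$\sigma = \frac{H - L}{4}$$
where H − L is the range. This rough approximation works because 95.4% of the total
population falls within ± 2σ, so the range covers about 4σ; that is, $\sigma = \frac{1}{4}$ range.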
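$$p = \bar{p} \pm Z_{\alpha/2} \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
$$e = Z_{\alpha/2} \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
Squaring both sides:
$$e^2 = Z_{\alpha/2}^2 \, \frac{\bar{p}\,\bar{q}}{n}$$
Solving for n:
$$n = \frac{Z_{\alpha/2}^2 \, \bar{p}\,\bar{q}}{e^2}$$
Since we are trying to determine n, the sample has not yet been taken and we cannot
have $\bar{p}$ and $\bar{q}$. Instead, we should have p and q. So it becomes:
$$n = \left( \frac{Z_{\alpha/2}}{e} \right)^2 pq$$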
Illustration 1
Suppose that a production facility purchases a particular component part in large lots
from a supplier. The production manager wants to estimate the proportion of defective
parts received from this supplier. She believes that the proportion of defects is no more
than 0.2 and wants to be within 0.02 of the true proportion of defects with a 90% level of
confidence. How large a sample should she take?
Solution:
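Given:
p = 0.2
q = 0.8
e = 0.02
CL = 0.90
$Z_{\alpha/2} = Z_{0.05} = 1.64$
$$n = \left( \frac{Z_{\alpha/2}}{e} \right)^2 pq$$
$$n = \left( \frac{1.64}{0.02} \right)^2 \times 0.2 \times 0.8$$
$$n = 1075.84 \approx 1076$$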
Illustration 2
What is the largest sample size that would be needed in estimating a population
proportion to be within ± 0.02, with a confidence coefficient of 0.95?
Solution:
Given:
e = 0.02
CL = 0.95
$Z_{\alpha/2} = Z_{0.025} = 1.96$
$$n = \left( \frac{1.96}{0.02} \right)^2 \times 0.5 \times 0.5$$
$$n = 2401$$
If p is unknown and there is no possibility of estimating it, use 0.5 as the value of p
because it will generate the greatest possible sample size as compared with other values.
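The sample size formula for a proportion, including the p = 0.5 worst case, can be sketched in Python. This is only an illustration: the function name `sample_size_for_proportion` is hypothetical, and the Z values are passed in as read from the table (1.64 for 90% and 1.96 for 95% confidence), matching the worked illustrations.

```python
from math import ceil


def sample_size_for_proportion(z, e, p=0.5):
    """Sample size for estimating a proportion to within e.

    p is the planning value; the default 0.5 gives the largest possible n.
    """
    n = (z / e) ** 2 * p * (1 - p)
    # Round to 6 decimals first to absorb floating-point noise, then round up
    return ceil(round(n, 6))


print(sample_size_for_proportion(1.64, 0.02, p=0.2))   # defective parts: prints 1076
print(sample_size_for_proportion(1.96, 0.02))          # worst case p = 0.5: prints 2401
```

The product p(1 − p) is largest at p = 0.5 (where it equals 0.25), which is why that planning value gives the most conservative sample size.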
We presented interval estimates for a population mean for three cases. In the σ known
case, historical data or other information is used to develop an estimate of σ prior to
taking a sample. Analysis of new sample data then proceeds based on the assumption that
σ is known. In the σ unknown case and the sample size is large, the sample data are used
to estimate both the population mean and the population standard deviation. In the σ
unknown and the sample size is small case, the sample data are used to estimate both the
population mean and the population standard deviation through t distribution.
In the σ known case, the interval estimation procedure is based on the assumed value of σ
and the use of the standard normal distribution. In the σ unknown and the sample size is
large case; the interval estimation procedure uses the sample standard deviation s and the
Z distribution. In the σ unknown and the sample size is small case; the interval estimation
procedure uses the sample standard deviation s and the t distribution. In all cases the
quality of the interval estimates obtained depends on the distribution of the population
and the sample size. If the population is normally distributed the interval estimates will
be exact in both cases, even for small sample sizes. If the population is not normally
distributed, the interval estimates obtained will be approximate. Larger sample sizes will
provide better approximations, but the more highly skewed the population is, the larger
the sample size needs to be to obtain a good approximation.
The general form of the interval estimate for a population proportion is $\bar{p}$ ± margin of error.
In practice the sample sizes used for interval estimates of a population proportion are
generally large. Thus, the interval estimation procedure is based on the standard normal
distribution.
Point estimator - The sample statistic, such as $\bar{x}$, s, or $\bar{p}$, that provides the point
estimate of the population parameter.
Point estimate - The value of a point estimator used in a particular instance as an
estimate of a population parameter.
Interval estimate - an estimate of a population parameter that provides an interval
believed to contain the value of the parameter. For the interval estimates in this chapter, it
has the form: point estimate ± margin of error.
Margin of error - The ± value added to and subtracted from a point estimate in order to
develop an interval estimate of a population parameter.
σ known - The case when historical data or other information provides a good value for
the population standard deviation prior to taking a sample. The interval estimation
procedure uses this known value of σ in computing the margin of error.
Confidence level - The confidence associated with an interval estimate. For example, if
an interval estimation procedure provides intervals such that 95% of the intervals formed
using the procedure will include the population parameter, the interval estimate is said to
be constructed at the 95% confidence level.
Confidence interval – is another name for an interval estimate.
σ unknown - The more common case when no good basis exists for estimating the
population standard deviation prior to taking the sample. The interval estimation
procedure uses the sample standard deviation s in computing the margin of error.
t distribution - A family of probability distributions that can be used to develop an
interval estimate of a population mean whenever the population standard deviation σ is
unknown and is estimated by the sample standard deviation s.
Degrees of freedom – is a parameter of the t distribution. When the t distribution is used
in the computation of an interval estimate of a population mean, the appropriate t
distribution has n - 1 degrees of freedom, where n is the size of the simple random
sample.
A sample survey of 54 discount brokers showed that the mean price charged for a trade of
100 shares at $50 per share was $33.77. The survey is conducted annually. With the
historical data available, assume a known population standard deviation of $15.
A. $50 C. $15
B. $33.77 D. $100
2) Using the sample data, what is the margin of error associated with a 95% confidence
interval?
A. 1 C. 3
B. 2 D. 4
3) Develop a 95% confidence interval for the mean price charged by discount brokers
for a trade of 100 shares at $50 per share.
CHAPTER THREE
HYPOTHESIS TESTING
Introduction
Dear Students! In Chapter Two we discussed the first form of statistical inference, which
is estimation. In this chapter we will discuss the second form of statistical inference,
which is hypothesis testing. The chapter covers tests of hypotheses for a single
population and for two independent populations. It shows how hypotheses can be tested
for a single mean, a single proportion, and differences of means and proportions.
Dear learner! At times we wish to examine statistical evidence, and determine whether it
supports or contradicts a claim that has been made (or that we might wish to make)
concerning the entire population. This is done in a somewhat asymmetric fashion,
analogous to the approach taken in the Ethiopian system of criminal justice (adopted
The evidence is viewed as being the result of some statistical procedure. We calculate the
probability that the same procedure – if carried out in a world where the statement really
is true – would, purely due to sampling error, provide evidence at least as contradictory
to the statement on trial as is the evidence we have in fact seen. This probability, called
the significance level of the sample data with respect to the statement, is then interpreted.
If it is large, we conclude that the evidence against the statement is weak, since we must
acknowledge that, in a presumed world in which the statement is true; our studies would
frequently provide such evidence purely due to our exposure to sampling error. However,
if this probability is small, we conclude that the evidence at hand is quite different from
that which we would expect to see if the statement were true, i.e., we conclude that the
evidence strongly argues against the statement’s truth, and we lean towards finding the
statement “guilty.”
Just as in a criminal trial, we never conclude that the statement is “innocent” – at most,
we find it “not guilty.” In other words, our analysis leaves us in one of two camps: We
have strong evidence that the original statement is false, or we do not have such evidence.
Therefore, if we wish to make an affirmative case for a claim, we are forced to take the
opposite of that claim as the statement we put on trial. Only in this way might we
conclude, at the end, that the data – if strong evidence against the claim on trial – serves
to support the original claim.
Dear learner! What do you think when someone says Hypothesis Testing? In our day-to-
day life we are overwhelmed with various hypothetical thinking or assumptions which
are termed as hypothesis.
For example, suppose we wanted to determine whether a coin was fair and
balanced. A null hypothesis might be that half the flips would result in Heads and
half, in Tails. The alternative hypothesis might be that the number of Heads and
Tails would be very different. Symbolically, these hypotheses would be expressed
as:
H 0 : p=0.5
H 1 : p ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given
this result, we would be inclined to reject the null hypothesis. We would
conclude, based on the evidence, that the coin was probably not fair and balanced.
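How unlikely 40 Heads in 50 flips would be for a fair coin can be checked with a short Python sketch. Note the assumption: this uses an exact binomial calculation, which is a substitute technique chosen for illustration and not necessarily the test statistic the module develops later; the function name `two_sided_binomial_p_value` is hypothetical. It adds up the probabilities of every outcome that is at least as far from the expected 25 Heads as the observed result.

```python
from math import comb


def two_sided_binomial_p_value(heads, flips):
    """P(result at least as far from flips / 2 as observed) for a fair coin."""
    observed_gap = abs(heads - flips / 2)
    # Count the equally likely head/tail sequences that are at least as extreme
    favourable = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return favourable / 2 ** flips


p_value = two_sided_binomial_p_value(40, 50)
print(p_value < 0.001)   # prints True
```

A probability this small means such a result would almost never occur with a fair coin, which supports rejecting the null hypothesis p = 0.5.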
Dear learner! Some researchers say that a hypothesis test can have one of two outcomes:
you accept the null hypothesis or you reject the null hypothesis. Many statisticians,
however, take issue with the notion of "accepting the null hypothesis." Instead, they say:
you reject the null hypothesis or you fail to reject the null hypothesis.
Dear learner! Statisticians follow a formal process to determine whether to reject a null
hypothesis, based on sample data. This process, called hypothesis testing, consists of
four steps.
State the hypotheses. This involves stating the null and alternative hypotheses.
The hypotheses are stated in such a way that they are mutually exclusive. That is,
if one is true, the other must be false.
Formulate an analysis plan. The analysis plan describes how to use sample data
to evaluate the null hypothesis. The evaluation often focuses around a single test
statistic.
Analyze sample data. Find the value of the test statistic (mean score, proportion,
t-score, z-score, etc.) described in the analysis plan.
Interpret results. Apply the decision rule described in the analysis plan. If the
value of the test statistic is unlikely, based on the null hypothesis, reject the null
hypothesis.
A. Type I error. A Type I error occurs when the researcher rejects a null
hypothesis when it is true. The probability of committing a Type I error is
called the significance level. This probability is also called alpha, and is
often denoted by α.
B. Type II error. A Type II error occurs when the researcher fails to reject a
null hypothesis that is false. The probability of committing a Type II error
is called Beta, and is often denoted by β. The probability of not
committing a Type II error is called the Power of the test.
If we reject a hypothesis when it should be accepted, we say that a Type I error has
been made. If, on the other hand, we accept a hypothesis when it should be rejected,
we say that a Type II error has been made. In either case, a wrong decision or error in
judgment has occurred. In order for decision rules (or tests of hypotheses) to be good,
they must be designed so as to minimize errors of decision. This is not a simple
matter, because for any given sample size, an attempt to decrease one type of error is
generally accompanied by an increase in the other type of error. In practice, one type
of error may be more serious than the other, and so a compromise should be reached
in favor of limiting the more serious error. The only way to reduce both types of error
is to increase the sample size, which may or may not be possible.
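The trade-off between the two types of error can be illustrated numerically. The sketch below assumes a right-tailed Z test of H0: μ = μ0 with a known population standard deviation; all the numbers (μ0 = 100, a true mean of 105, σ = 15, n = 25 or 100) are hypothetical and chosen only for illustration, and type_ii_error is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, n, mu0, mu1, sigma):
    """Beta of a right-tailed Z test of H0: mu = mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # boundary of the rejection region
    shift = (mu1 - mu0) * sqrt(n) / sigma      # true mean, in standard errors
    return std_normal.cdf(z_crit - shift)      # P(fail to reject | mu = mu1)


# For a fixed sample size, lowering alpha raises beta.
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, n=25, mu0=100, mu1=105, sigma=15)
    print(f"n=25,  alpha={alpha:.2f} -> beta={beta:.3f}")

# A larger sample lowers beta without changing alpha.
beta_large = type_ii_error(0.05, n=100, mu0=100, mu1=105, sigma=15)
print(f"n=100, alpha=0.05 -> beta={beta_large:.3f}")
```

Under these assumptions, beta grows as alpha shrinks when n is held fixed, and only the larger sample reduces beta while keeping alpha at 0.05.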
Dear learner! In testing a given hypothesis, the maximum probability with which we
would be willing to risk a Type I error is called the level of significance, or
significance level, of the test. This probability, often denoted by α , is generally
specified before any samples are drawn so that the results obtained will not influence
our choice. In practice, a significance level of 0.05 or 0.01 is customary, although
other values are used. If, for example, the 0.05 (or 5%) significance level is chosen in
designing a decision rule, then there are about 5 chances in 100 that we would reject
the hypothesis when it should be accepted; that is, we are about 95% confident that
we have made the right decision. In such a case we say that the hypothesis has been
rejected at the 0.05 level of significance.
Activity
Consider the following hypotheses that relate to the medical example mentioned earlier.
Suppose a person takes a medical test that attempts to detect the disease. Discuss the
consequences of a Type I error and a Type II error.
Dear learner! The analysis plan includes decision rules for rejecting the null hypothesis.
In practice, statisticians describe these decision rules in two ways - with reference to a P-
value or with reference to a region of acceptance.
The set of values outside the region of acceptance is called the region of
rejection. If the test statistic falls within the region of rejection, the null
hypothesis is rejected. In such cases, we say that the hypothesis has been rejected
at α level of significance.
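The two ways of stating a decision rule lead to the same decision. The sketch below illustrates this for a right-tailed Z test; the function name right_tailed_decision and the example Z values are hypothetical, and the standard normal distribution is taken from Python's standard library.

```python
from statistics import NormalDist


def right_tailed_decision(z_cal, alpha):
    """Apply both decision rules to a right-tailed Z test."""
    std_normal = NormalDist()
    z_tab = std_normal.inv_cdf(1 - alpha)   # start of the region of rejection
    p_value = 1 - std_normal.cdf(z_cal)     # area to the right of z_cal
    reject_by_region = z_cal > z_tab        # rule 1: test statistic in rejection region
    reject_by_p_value = p_value < alpha     # rule 2: P-value below alpha
    return z_tab, p_value, reject_by_region, reject_by_p_value


for z in (1.0, 2.0):
    z_tab, p_value, by_region, by_p = right_tailed_decision(z, alpha=0.05)
    print(f"Z cal={z:.2f}  Z tab={z_tab:.3f}  P-value={p_value:.4f}  "
          f"reject by region={by_region}  reject by P-value={by_p}")
```

Because the P-value falls below α exactly when the test statistic lies beyond the critical value, either description of the rule can be used.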
Dear learner! A test of a statistical hypothesis, where the region of rejection is on only
one side of the sampling distribution, is called a one-tailed test. A one-tailed test can
be further classified as a right-tailed test or a left-tailed test. The type of test is
decided mainly by the comparison sign used in the alternative hypothesis.
For example, suppose the null hypothesis states that the mean is less than or equal to 10.
The alternative hypothesis would be that the mean is greater than 10. The region of
rejection would consist of a range of numbers located on the right side of sampling
distribution; that is, a set of numbers greater than 10.
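The boundary of the region of rejection depends on the tail of the test and on the level of significance. The following sketch computes Z critical values for the right-tailed and left-tailed cases, and for the two-tailed case in which α is split equally between the two tails. The function name critical_values is hypothetical; the values come from the standard normal distribution in Python's standard library.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Z value(s) that bound the region of rejection for the given tail type."""
    std_normal = NormalDist()
    if tail == "right":                      # Ha uses ">"
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "left":                       # Ha uses "<"
        return (std_normal.inv_cdf(alpha),)
    if tail == "two":                        # Ha uses "≠": alpha/2 in each tail
        upper = std_normal.inv_cdf(1 - alpha / 2)
        return (-upper, upper)
    raise ValueError("tail must be 'right', 'left', or 'two'")


for tail in ("right", "left", "two"):
    values = ", ".join(f"{z:.3f}" for z in critical_values(0.05, tail))
    print(f"{tail}-tailed test at alpha = 0.05: Z tab = {values}")
```

At the 5% level this gives about 1.645 for a right-tailed test, -1.645 for a left-tailed test, and ±1.960 for a two-tailed test.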
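Example: Identify the type of tailed test for each of the following pairs of hypotheses:
A) Ho: P < 0.4 and Ha: P ≥ 0.45
B) Ho: P ≥ 0.12 and Ha: P < 0.12
C) Ho: μ = 24 and Ha: μ ≠ 24
Solution:
A) Right-tailed test, because Ha points to values of P above the hypothesized value.
B) Left-tailed test, because Ha points to values of P below 0.12.
C) Two-tailed test, because Ha allows μ to differ from 24 in either direction.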
Dear learner! Here the sample is taken from a population whose parameters are
unknown or difficult to know. An assumption about the population is then tested, and it
is either rejected or not rejected. The sample taken from the population is considered
large when n > 30.
If the standard deviation of the population σ is known, then, based on the central limit
theorem, the sampling distribution of the mean $\bar{x}$ is approximately normal for a large
sample size, so the standardized sample mean follows the standard normal distribution.
The Z-statistic is given by:
$$Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
In this formula the numerator ($\bar{x} - \mu$) measures how far the observed sample mean $\bar{x}$ is
from the hypothesized mean $\mu$. The denominator $\sigma_{\bar{x}}$ is the standard error of the mean, so
the Z test statistic represents how many standard errors $\bar{x}$ is from $\mu$.
A packaging device is set to fill detergent powder packets with a mean weight of 5kg.
These are known to drift upwards over a period of time due to machine fault, which is not
tolerable. A random sample of 100 packets is taken and weighed. This sample has a mean
weight of 5.03kg and a standard deviation of 0.21kg. Can we conclude that the mean
weight produced by the machine has increased? Use a 5 percent level of significance.
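Solution:
Ho: μ ≤ 5 and Ha: μ > 5
Here the appropriate test statistic is Z because, although the population standard deviation
is unknown, the sample size is large (n = 100), so the sample standard deviation can be
used in its place.
Decision rule: Accept the null hypothesis if Z cal is less than Z tab = 1.645; otherwise reject it.
[Figure: right-tailed test. Accept Ho below Z tab = 1.645; reject Ho above it.]
$$Z_{cal} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{5.03 - 5}{0.21 / \sqrt{100}} = 1.43$$
Decision: Since Z cal = 1.43 is less than Z tab = 1.645, accept Ho, i.e., there is not
sufficient evidence that the mean weight has increased.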
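The arithmetic behind this decision can be checked with a few lines of code. This is a minimal sketch, assuming the sample standard deviation stands in for the unknown population standard deviation (as the solution does for a large sample); the function name one_sample_z is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, std_dev, n):
    """Z statistic: distance of the sample mean from mu0 in standard errors."""
    return (sample_mean - mu0) / (std_dev / sqrt(n))


z_cal_packets = one_sample_z(sample_mean=5.03, mu0=5.0, std_dev=0.21, n=100)
z_tab_packets = NormalDist().inv_cdf(1 - 0.05)   # right-tailed, alpha = 0.05
reject_packets = z_cal_packets > z_tab_packets

print(f"Z cal = {z_cal_packets:.2f}, Z tab = {z_tab_packets:.3f}")
print("Reject Ho" if reject_packets else "Accept Ho (fail to reject)")
```

Z cal works out to about 1.43, which is below 1.645, so the sample does not give sufficient evidence that the mean weight has increased.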
$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}} = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
Illustration
Suppose the average breaking strength of steel rods is specified to be 18.5 thousand lbs.
For this a sample of 14 rods was tested. The mean and standard deviation obtained were
17.85 and 1.955, respectively. Test the significance of the deviation at the 5% level of
significance.
Solution: Let us take the null hypothesis that there is no significant deviation in the
breaking strength of the rods, that is,
Ho: μ = 18.5 and Ha: μ ≠ 18.5
α = 0.05
Since the test is two-tailed, the given α has to be divided into two equal parts:
$$\frac{\alpha}{2} = 0.025$$
The sample size is small (n = 14) and the population standard deviation is unknown
(it is estimated using the sample standard deviation). Hence the appropriate test statistic
is the t statistic.
Decision rule: If the value of t cal is between -2.16 and 2.16, accept the hypothesis;
otherwise reject it.
[Figure: two-tailed test. Reject Ho below t tab = -2.16 and above t tab = 2.16; accept Ho between them.]
$$t_{cal} = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{17.85 - 18.5}{1.955 / \sqrt{14}} = -1.24$$
$$t_{tab} = t_{\alpha/2,\,13} = \pm 2.16$$
Decision: There is no significant deviation of the sample mean from the population mean,
i.e., accept Ho.
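The steel rod calculation can be reproduced as follows. This is a sketch only: the tabulated value 2.16 is taken from the illustration rather than computed, because Python's standard library has no t distribution, and the function name one_sample_t is hypothetical.

```python
from math import sqrt


def one_sample_t(sample_mean, mu0, s, n):
    """t statistic for a small sample with unknown population standard deviation."""
    return (sample_mean - mu0) / (s / sqrt(n))


t_cal_rods = one_sample_t(sample_mean=17.85, mu0=18.5, s=1.955, n=14)
t_tab_rods = 2.16   # t table value for alpha/2 = 0.025 and 13 degrees of freedom

# Two-tailed rule: accept Ho when t cal lies between -t tab and +t tab.
accept_rods = -t_tab_rods <= t_cal_rods <= t_tab_rods

print(f"t cal = {t_cal_rods:.2f}, t tab = ±{t_tab_rods}")
print("Accept Ho" if accept_rods else "Reject Ho")
```

The computed t cal is about -1.24, which lies inside the interval from -2.16 to 2.16.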
Dear learner! We have seen how to conduct hypothesis tests for a mean. We now turn to
proportions. The process is completely analogous, although we will need to use the
standard deviation formula for a proportion.
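$$Z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} = \frac{\hat{p} - p}{\sqrt{pq / n}}$$
where $\hat{p}$ is the sample proportion, p is the hypothesized population proportion, and q = 1 - p.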
Illustration
Suppose a manufacturer claims that at least 95% of the equipment which he supplied to a
factory conformed to the specification. An examination of the sample of 200 pieces of
equipment revealed that 18 were faulty. Test the claim of the manufacturer.
Solution:
Ho: p ≥ 0.95 and Ha: p < 0.95
[Figure: left-tailed test. Reject Ho in the left tail; accept Ho elsewhere.]
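$$Z_{cal} = \frac{\hat{p} - p}{\sigma_{\hat{p}}} = \frac{\hat{p} - p}{\sqrt{pq / n}}$$
The sample proportion conforming to the specification is (200 - 18)/200 = 0.91, so
$$Z_{cal} = \frac{0.91 - 0.95}{\sqrt{(0.95)(0.05) / 200}} = -2.60$$
Z tab = -1.645 (taking the 5 percent level of significance)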
Decision: Reject Ho because Z cal is less than Z tab and therefore falls within the region
of rejection. Hence, we conclude that the proportion of equipment conforming to
specifications is less than 95 percent.
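The proportion test above can be verified numerically. The illustration does not state a level of significance, so the sketch below assumes the 0.05 and 0.01 levels; the variable names are hypothetical. Evaluating the formula directly gives a Z of roughly -2.60.

```python
from math import sqrt
from statistics import NormalDist

n_items = 200
faulty = 18
p0 = 0.95                               # claimed proportion conforming (Ho: p >= 0.95)
p_hat = (n_items - faulty) / n_items    # observed proportion conforming = 0.91

std_error = sqrt(p0 * (1 - p0) / n_items)   # standard error under Ho
z_cal_equipment = (p_hat - p0) / std_error

for alpha in (0.05, 0.01):                  # assumed levels of significance
    z_tab = NormalDist().inv_cdf(alpha)     # left-tailed critical value
    decision = "reject Ho" if z_cal_equipment < z_tab else "fail to reject Ho"
    print(f"alpha={alpha}: Z cal={z_cal_equipment:.2f}, Z tab={z_tab:.3f} -> {decision}")
```

Under these assumptions the claim is rejected at both levels, since Z cal lies below -1.645 and below -2.326.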
Dear learner! Testing the difference means checking, on the basis of sample information,
whether two population parameters differ and, if so, in which direction.
Let $\bar{x}_1$ and $\bar{x}_2$ be the sample means obtained in large samples of sizes $n_1$ and $n_2$ drawn
from respective populations having means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$.
Consider the null hypothesis that there is no difference between the population means
(i.e., $\mu_1 = \mu_2$), which is to say that the samples are drawn from two populations having the
same mean. Under this hypothesis, the sampling distribution of the difference between the
sample means has mean and standard deviation
$$\mu_{\bar{x}_1 - \bar{x}_2} = 0 \quad \text{and} \quad \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
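The test statistic is computed as:
$$Z_{cal} = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_{\bar{x}_1 - \bar{x}_2}}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
When the population standard deviations are unknown, the sample standard deviations $s_1$ and $s_2$ are used in their place:
$$Z_{cal} = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_{\bar{x}_1 - \bar{x}_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
or, when the t distribution is used:
$$t_{cal} = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_{\bar{x}_1 - \bar{x}_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$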
Illustration
Solution
$$H_0: \mu_1 - \mu_2 \le 0$$
$$H_1: \mu_1 - \mu_2 > 0$$
Next, we need to find the standard error. Recalling the formulas above, the mean of the
difference is:
$$\mu_{\bar{x}_1 - \bar{x}_2} = \mu_1 - \mu_2 = 0$$
Note: We can substitute the sample means and sample standard deviations as point
estimates of the population means and standard deviations. Hence,
$$\bar{x}_1 - \bar{x}_2 = 5.2 - 4.8 = 0.4$$
$$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{2.4^2}{45} + \frac{1.2^2}{40}} = 0.405$$
[Figure: right-tailed test. Accept Ho below t tab = 1.69; reject Ho above it.]
$$t_{cal} = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_{\bar{x}_1 - \bar{x}_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{0.4 - 0}{0.405} = 0.988$$
To decide whether to accept or reject the null hypothesis, it is necessary to determine
both t cal and t tab and compare them. t tab is the value of the t score obtained from the
table for the given degrees of freedom and level of significance α.
To calculate the degrees of freedom, we can take the smaller of the two numbers n1 - 1
and n2 - 1. So, in this example we use 39 degrees of freedom. The t table gives a value of
1.690 for t 0.05. Since the t-score of 0.988 is smaller than 1.690, we fail to reject the
null hypothesis and state that there is insufficient evidence to conclude that employees
perform better at work with music playing.
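The two-sample calculation can be sketched in code as follows. The sample values (means 5.2 and 4.8, standard deviations 2.4 and 1.2, sizes 45 and 40) and the tabulated value 1.690 are those used in the solution; the function name two_sample_t is hypothetical.

```python
from math import sqrt


def two_sample_t(mean1, mean2, s1, s2, n1, n2, hypothesized_diff=0.0):
    """Return (t statistic, standard error) for the difference between two means."""
    std_error = sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    t_value = ((mean1 - mean2) - hypothesized_diff) / std_error
    return t_value, std_error


t_cal_music, se_music = two_sample_t(5.2, 4.8, 2.4, 1.2, 45, 40)

df_music = min(45 - 1, 40 - 1)   # smaller of n1 - 1 and n2 - 1, as in the text
t_tab_music = 1.690              # t table value quoted in the text for alpha = 0.05

print(f"standard error = {se_music:.3f}, t cal = {t_cal_music:.3f}, df = {df_music}")
print("Reject Ho" if t_cal_music > t_tab_music else "Fail to reject Ho")
```

The standard error is about 0.405 and t cal about 0.988, which is below the tabulated value, so the null hypothesis is not rejected.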
3.5.5. Hypothesis Test for the Difference between Two Population Proportions
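The standard error of the difference between two sample proportions is:
$$\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$
The Z statistic for the difference between two population proportions is stated as:
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sigma_{\hat{p}_1 - \hat{p}_2}}$$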
Invariably, the standard error $\sigma_{\hat{p}_1 - \hat{p}_2}$ of the difference between sample proportions is not
known. Thus, when a null hypothesis states that there is no difference between the
population proportions, we combine the two sample proportions ($\hat{p}_1$ and $\hat{p}_2$) to get one unbiased
estimate of the population proportion as follows:
Pooled estimate:
$$\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$$
Illustration
$$n_1 = 60,\ \hat{p}_1 = \frac{18}{60} = 0.30; \quad n_2 = 100,\ \hat{p}_2 = \frac{22}{100} = 0.22$$
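Test statistic:
$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{s_{\hat{p}_1 - \hat{p}_2}}$$
Where,
$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{\bar{p}\,\bar{q}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}; \quad \bar{q} = 1 - \bar{p}$$
$$\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{60 \times 0.30 + 100 \times 0.22}{60 + 100} = 0.25$$
$$s_{\hat{p}_1 - \hat{p}_2} = \sqrt{(0.25)(0.75)\left(\frac{1}{60} + \frac{1}{100}\right)} = 0.0707$$
[Figure: one-tailed test showing the Reject Ho and Accept Ho regions of the Z distribution.]
$$Z_{cal} = \frac{0.30 - 0.22}{0.0707} = 1.131$$
Z tab at α = 0.05 = 1.64
Decision: Since Z cal = 1.131 is less than Z tab = 1.64, Ho is not rejected.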
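The pooled-proportion calculation can be checked with the sketch below. The counts (18 out of 60 and 22 out of 100) are those of the illustration; the function name two_proportion_z is hypothetical, and the critical value assumes a right-tailed test at α = 0.05.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for Ho: p1 = p2, using the pooled estimate of the proportion."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # same as (n1*p1_hat + n2*p2_hat)/(n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / std_error, pooled, std_error


z_cal_two, pooled_p, se_two = two_proportion_z(18, 60, 22, 100)
z_tab_two = NormalDist().inv_cdf(1 - 0.05)  # right-tailed, alpha = 0.05 (assumed)

print(f"pooled p = {pooled_p:.2f}, standard error = {se_two:.4f}")
print(f"Z cal = {z_cal_two:.3f}, Z tab = {z_tab_two:.3f}")
print("Reject Ho" if z_cal_two > z_tab_two else "Fail to reject Ho")
```

The pooled proportion is 0.25, the standard error about 0.0707, and Z cal about 1.131, which does not exceed the critical value.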
Hypothesis testing is a statistical procedure that uses sample data to determine whether a
statement about the value of a population parameter should or should not be rejected. The
hypotheses are two competing statements about a population parameter. One statement is
called the null hypothesis (H0), and the other statement is called the alternative
hypothesis (Ha).
Whenever historical data or other information provides a basis for assuming that the
population standard deviation is known, the hypothesis testing procedure for the
population mean is based on the standard normal distribution. Whenever σ is unknown,
the sample standard deviation s is used to estimate σ and the hypothesis testing procedure
is based on the t distribution. In both cases, the quality of results depends on both the
form of the population distribution and the sample size. If the population has a normal
distribution, both hypothesis testing procedures are applicable, even with small sample
sizes. If the population is not normally distributed, larger sample sizes are needed. In the
case of hypothesis tests about a population proportion, the hypothesis testing procedure
uses a test statistic based on the standard normal distribution.
In all cases, the calculated value of the test statistic (Z cal or t cal) is compared with the
tabulated critical value (Z tab or t tab) at the chosen level of significance α. If the
calculated value falls in the region of rejection, the null hypothesis is rejected.
Equivalently, if the P-value of the test is less than or equal to α, the null hypothesis
can be rejected.
Null hypothesis - The hypothesis tentatively assumed true in the hypothesis testing
procedure.
Level of significance - The probability of making a Type I error when the null
hypothesis is true as an equality.
One-tailed test - A hypothesis test in which rejection of the null hypothesis occurs for
values of the test statistic in one tail of its sampling distribution.
Test statistic - A statistic whose value helps determine whether a null hypothesis should
be rejected.
Two-tailed test - A hypothesis test in which rejection of the null hypothesis occurs for
values of the test statistic in either tail of its sampling distribution.
1. A random sample of 1500 pine trees was tested for traces of Bark Beetle
infestation. It was found that 153 of the trees showed such traces. Test the
hypothesis that more than 10% of the Tahoe trees have been infested. (Use a 5%
level of significance.)
2. A manufacturer claimed that at least 95% of the equipment that she supplied to a
factory conformed to specifications. An examination of a sample of 200 pieces of
equipment revealed that 18 were faulty. Test her claim at significance levels of (a)
0.01 and (b) 0.05.
3. A random sample of 12 families in one city showed an average monthly food
expenditure of Birr 1380 with a standard deviation of Birr 100 and a random
sample of 15 families in another city showed an average monthly food
expenditure of Birr 1320 with a standard deviation of Birr 120. Test whether the
difference between the two means is significant at the 0.01 level of significance.
4. A television research analyst wishes to test a claim that more than 50% of the
households will tune in for a TV episode. Specify the null and the alternative
hypotheses to test the claim.
True or False
2. Type I error is the probability of accepting the null hypothesis when it is true
Multiple choices
Chapter One
4. i, 0.7486
ii. 0.9082
5. 0.7549
6. 0.9345
Chapter Two
1. B
2. D
3. C
4. (a) 2.179
(b) -1.676
(c) 2.457
(d) -1.708 and 1.708
(e) -2.014 and 2.014
Chapter Three
True/false
1. True
2. False
3. False
4. True
Multiple choices
1. A
2. C
3. A
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952