The document provides an overview of sample surveys, detailing their objectives, parameters, and the differences between sampling and complete enumeration. It discusses the advantages of sampling, the types of errors that can occur, and the planning and administration necessary for effective surveys. Key aspects include defining survey objectives, data collection methods, and the importance of questionnaire design to minimize errors.

Uploaded by

UMAR MADAKI
Copyright
© All Rights Reserved

UNIT-I

INTRODUCTION TO SAMPLE SURVEYS

The main objective of a sample survey is to obtain information about a population. A
population may be defined as a group of units defined according to the objectives of the
survey. It may consist of all the households in a village or locality, or of
all the fields under a particular crop in a geographical area. The information that we seek
about the population is normally the total number of units, aggregate values of various
characteristics, averages of these characteristics, proportions of units possessing specified
attributes, etc.

Parameter and Statistics

Parameters are the numerical constants of the population, e.g. the population mean (μ) and
the population variance (σ²). If x1, x2, ..., xn is a random sample, then any function of the
sample observations is called a statistic. Its value may vary from sample to sample, e.g.
the sample mean (x̄) and the sample variance (s²). A statistic used for obtaining an estimate of a
population parameter from a set of observations is called its estimator, whereas the value
of an estimator for a given sample is known as the estimate of the unknown parameter.
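
The distinction between a parameter and a statistic can be illustrated with a small sketch in Python. The population values, the sample size and the function names below are hypothetical and chosen only for illustration. The sample variance is computed with the divisor n - 1, which is a common convention and is assumed here.

```python
import random
import statistics


def population_parameters(values):
    """Return (mean, variance) of a fully enumerated finite population."""
    # pvariance divides by N, the population size.
    return statistics.mean(values), statistics.pvariance(values)


def sample_statistics(sample):
    """Return (sample mean, sample variance) computed from a sample."""
    # variance divides by n - 1 (assumed convention for s^2).
    return statistics.mean(sample), statistics.variance(sample)


# Hypothetical population of N = 8 household sizes.
population = [2, 3, 3, 4, 5, 5, 6, 8]
mu, sigma2 = population_parameters(population)

# Two different random samples generally give two different estimates.
rng = random.Random(1)
for _ in range(2):
    sample = rng.sample(population, 4)
    x_bar, s2 = sample_statistics(sample)
    print(f"sample={sample} mean={x_bar} s2={s2:.2f} (mu={mu}, sigma2={sigma2})")
```

The parameters μ and σ² are fixed once the population is fixed, whereas the printed estimates change with the sample drawn.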

Accuracy and Precision

Accuracy refers to the size of the deviation of the estimate from the true value,
whereas precision refers to the size of the deviations of the estimate from its average
value over repeated applications of the sampling procedure. Precision is usually expressed
in terms of the standard error of the estimator. Less precision is reflected by a larger
standard error.
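
As a rough illustration of these two ideas, the sketch below repeats a simple random sampling procedure many times on a hypothetical population and summarises the resulting sample means. The population, the number of repetitions and the function names are assumptions made for the example, not part of the notes.

```python
import random
import statistics


def repeated_sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


def bias_and_standard_error(population, n, repeats=2000, seed=0):
    """Empirical bias and standard error of the sample mean."""
    true_mean = statistics.mean(population)
    means = repeated_sample_means(population, n, repeats, seed)
    bias = statistics.mean(means) - true_mean  # relates to accuracy
    std_error = statistics.stdev(means)        # relates to precision
    return bias, std_error


population = list(range(1, 101))  # hypothetical population, N = 100
for n in (5, 20):
    bias, se = bias_and_standard_error(population, n)
    print(f"n={n}: bias={bias:.3f}, standard error={se:.3f}")
```

The spread of the repeated estimates (the empirical standard error) shrinks as the sample size grows, which is what is meant by greater precision.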

Sampling versus Complete Enumeration

The word population or universe in Statistics refers to any aggregate
collection of individuals or of their characteristics which can be numerically specified.
Depending upon the number of elements, a population may be finite or infinite. A population
containing a limited number of individuals or members is called a finite population; for
example, the books in a library form a finite population. A
population with an unlimited number of individuals or members is known as an infinite
population, e.g. the population of pressures at various points in the atmosphere, the number of
stars in the sky, etc. A sample is a finite part of a statistical population whose properties are
studied to gain information about the whole. When dealing with people, it can be defined
as a set of respondents selected from a larger population for the purpose of a survey.

The recording of all the units of a population for a certain characteristic is known as
complete enumeration. It is also termed a census. Sampling, on the other hand, is the act or
process of selecting a suitable sample, or a representative part of a population, for the
purpose of determining parameters or characteristics of the whole population. The study
population may be regarded as consisting of units which are to be used for the purpose of
sampling. Each unit is regarded as an individual or an indivisible part when the selection
is made. Such a unit is known as a sampling unit, e.g. a person, an animal, a household, a
village, etc. A list of all the sampling units in the population, with proper identification
particulars, is called a sampling frame. The sampling frame provides the basis for the selection
and identification of the units in the sample, since it forms the basic
material from which a sample is drawn. The frame often contains information about the
size and structure of the population, which is used in sample surveys in a number of ways.
The fraction of the population selected in the sample is called the sampling fraction.
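
A minimal sketch of these definitions in Python follows. The frame contents, the field names `serial_no` and `village`, and the sample size are hypothetical; the sampling fraction is written f = n/N, where n is the sample size and N the number of units in the frame.

```python
def sampling_fraction(sample_size, population_size):
    """Return f = n / N, the fraction of the population selected."""
    if not 0 < sample_size <= population_size:
        raise ValueError("need 0 < n <= N")
    return sample_size / population_size


# Hypothetical sampling frame: one record per sampling unit (household),
# each with identification particulars.
frame = [{"serial_no": i, "village": "A" if i <= 60 else "B"} for i in range(1, 101)]

N = len(frame)  # population size taken from the frame
n = 20          # intended sample size (assumed for the example)
print(sampling_fraction(n, N))
```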

Advantages of Sampling over Complete Enumeration

Major advantages of sampling over complete enumeration are:

i) reduction in cost

ii) greater speed

iii) wider scope

iv) greater accuracy

Since a sample is only a part of the population, it is obviously less costly than
a census. A sample may also provide the needed information quickly. For
example, suppose you are a doctor and a contagious disease of unknown cause has broken out
in a village within your jurisdiction and is killing within hours. You are required to
conduct quick tests to control the situation. If you attempt a census of
those affected, it may take a long time to arrive at your results. In such a case just a few
of those already infected could be used to provide the required information. Many
populations about which inferences must be made are quite large, and the sheer size of the
population can make it physically impossible to conduct a census. In such a case, selecting a
representative sample may be the only way to get the information. Some
populations are also so difficult to access, for economic or time-related reasons, that only
a sample can be used. Finally, a sample may be more accurate than a census: a census can
provide less reliable information than a carefully obtained sample, for instance when its
scale makes close supervision of the field work difficult.

Sampling and Non-Sampling Errors

A sample is expected to represent the population from which it is taken; however,
there is no guarantee that any sample will be precisely representative of the population
from which it comes. A sample may be unrepresentative because of sampling or
non-sampling errors.

Sampling Errors

Sampling errors comprise the differences between the sample and the population
that are due solely to the particular units that happen to have been selected. For example,
suppose that a sample of 100 females from Haryana is taken and all are found to be taller
than six feet. It is clear, even without any statistical proof, that this would be a highly
unrepresentative sample leading to invalid conclusions. Sampling error may arise
by chance: unusual units do exist in a population, and there is always a
possibility that an abnormally large number of them will be chosen. Sampling error may
also arise from sampling bias, which is a tendency to favour the selection of units
that have particular characteristics. Sampling bias is usually the result of a poor sampling
plan. The most notable case is the bias of non-response, when for some reason some units have
no chance of appearing in the sample.

Non-Sampling Errors

Non-sampling errors occur whether a census or a sample is being used. A
non-sampling error is an error that results solely from the manner in which the observations are
made. The simplest example of non-sampling error is inaccurate measurement due to
malfunctioning instruments or poor procedures. For example, if persons are asked to state
their own weights, no two answers will be of equal reliability. An individual's
weight fluctuates during the day, and so the time of weighing will also affect the answer.

Planning and Administration of Surveys

Sample surveys are widely used as a cost-effective instrument of data collection
and for making valid inferences about population parameters. Planning and
administration of surveys deals with the planning, preparation and execution of surveys
such that the cost and time for collection of information, and the errors, are minimized to the
extent possible. Planning and preparation of the survey precede the actual operation of
the survey. This is an extremely important task, since the quality of the survey results
depends considerably on the preparations made before the survey is conducted. The
amount of work needed for planning varies greatly with the type of material available and
the nature of the information to be collected.

Important aspects requiring attention at the planning stage are as follows:

i) Objectives of the survey

ii) Purpose of Survey

iii) The Data to be collected

iv) Methods of data collection

v) Questionnaire and schedules

vi) Survey, reference and reporting periods

vii) Sample size and sampling design

viii) Planning of pilot survey

Objectives of the survey

The first step in planning a survey is to formulate its objectives. There must be a
need for carrying out the enquiry, and it is important to know just what is desired. The
objectives of the survey must be spelled out clearly, along with the manner in which the
results are going to be used. The administrator who is in need of some statistical
information is expected to formulate the objectives of a survey, but this formulation will
usually be rough and vague. It is for the survey statistician to give a clearer formulation of the
objectives and get it approved by the administrator. The survey statistician's formulation
of the objectives should include a clear statement regarding the items of information to be
covered, the population to be studied, the form in which the data would be tabulated,
and the accuracy aimed at in the final results. The survey statistician may start with
the final tabulation that would be required by the administrator and then specify the items
of information which should be collected in the survey for obtaining these tables. As
regards the accuracy of the final figures, the statistician may have to take into account the financial
and manpower resources that would be available for the survey and the use to which the
figures would be put. Some compromise may have to be reached
between the cost of the survey and the accuracy of the results.

Purpose of Survey

The purpose of the survey should ordinarily indicate the population to be
sampled. For example, in a survey of manufacturing establishments the population may
be the totality of establishments operating in the country during a certain period. The
populations intended to be covered by the survey are called target populations. When the
sampled population differs from the target population, the results of the survey will apply
to the sampled population only.

If there are different agencies which can collect the required information in their
fields of specialization as a by-product of their normal administrative duties, the surveys
may be conducted in the form of different uni-purpose surveys, each survey being
undertaken by one of the agencies specializing in that field. The decision between
uni-purpose and multi-purpose surveys will depend much on the situation under which the
survey is to be carried out. If the data for the different subjects of enquiry can be
collected by the method of mail enquiry, where there is not much journey time involved,
uni-purpose surveys may be adopted. On the other hand, if the method of interview is
adopted and if the same primary stage units are to be used for the different subjects of
enquiry, then a multi-purpose survey may be considered.

The Data to be Collected

It should be possible from the purpose of the survey to derive a fairly broad list of
items that would provide information on the problems under investigation. This list
should be supplemented by other items that are correlated with the main items and can
throw additional light on related questions. For example, in a survey of general attitudes,
one may collect information on the related items such as marital status, number of
children, religion, occupation etc. When all the items have been assembled, the utility of
obtaining information on them should be considered. A number of items can be discarded
at this stage and only items relevant to the purposes of the survey should be retained. In
this process care should be taken that no important item is missing.

Methods of Data Collection

Having decided about the items of information that should be collected and the
form in which the data collected are to be tabulated, it is necessary to examine the mode
of collecting this information. First the survey statistician should consider whether it is
necessary to collect the information by complete enumeration or by sample survey. If the
objective of the survey is to supply accurate information for each unit and if that unit
happens to be the unit of enquiry, then complete enumeration is indispensable. If instead
the objective is to provide estimates of aggregates or ratios at some regional level, then
complete enumeration need not necessarily be the best method and the possibility of
using sampling methods should be explored. Depending on the nature of information
required and the population under study we have the following important methods for
collecting information:

i) Direct Personal Interview Method

ii) Mail questionnaires Method

iii) Interviews by Enumerators

iv) Interview on Telephone

Direct Personal Interview

The method of personal interview is widely used in social and economic surveys.
In these surveys, the investigator personally contacts the respondents and can obtain the
required data fairly accurately. The interviewer asks the questions pertaining to the
objectives of the survey, and the information so obtained is recorded on a schedule prepared
for the purpose. This method is most suitable for collecting data on conceptually difficult
items from respondents. In this method, the response rate is usually good and the
information is more reliable and correct. However, more expense and time are required
to contact the respondents.

Mail questionnaires Method

In this method, the investigator prepares a questionnaire and sends it by mail to
the respondents. The respondents are requested to complete the questionnaires and return
them to the investigator within a specified time. This method is suitable where
respondents are spread over a wide area. Though the method is less expensive, normally
it has a poor response rate. The other problem with this method is that it can be adopted
only where the respondents are literate and can understand the questions. The success of
the method depends on the skill with which the questionnaire is drafted, and the extent to
which willing cooperation of the respondents is secured.

Interviews by Enumerators

This method involves the appointment of enumerators by the surveying agency.
Enumerators go to the respondents and fill up the responses in the schedule themselves.
For success of this method, the enumerators should be given proper training for soliciting
co-operation of the respondents. This method can be usefully employed where the
respondents to be covered are illiterate.

Telephone Interview

If the respondents in the population to be covered can be approached by
phone, their responses to the various questions included in the schedule can be obtained over
the phone. If long distance calls are not involved and only local calls are to be made, this
mode of collecting data may also prove quite economical. It is, however, desirable that
interviews conducted over the phone are kept short so as to maintain the interest of the
respondent.

The decision regarding the method of enquiry to be used
in a survey should be taken after considering the practicability, accuracy and cost of
the different methods of data collection.

Questionnaire and schedules

The survey statistician should consider whether a questionnaire or a schedule is
to be used in the survey for collecting the information. A questionnaire consists of a list
of questions which the investigator is expected to read out to the respondents. The
responses of the respondents to these questions are recorded. In this case the investigator
is not supposed to influence the response, in any way, by his interpretation of the terms in
the questions. In mail enquiries usually the questionnaire method is used for collecting
the information. A schedule consists of only the items on which the information is to be
collected and the actual procedure of collection of data is left to the investigator. In this
case proper training in the concepts and definitions and in the technique of interview is to
be imparted to the investigators.

It may be seen that the schedule method of enquiry is subject to more investigator
bias than the questionnaire method of enquiry. Though this may be true in the case of items
of information which are easily understood and can be reported accurately, it is
not likely to be true when more complicated items of information are to be canvassed. A
questionnaire suits items that are likely to be familiar to the respondents, whereas a
schedule is to be preferred when the items of information are complicated and need
considerable explanation. Preparation of a schedule or a questionnaire with suitable
instructions needs to be given considerable attention in designing a survey, as the utility of
the results of the survey depends to a large extent on this. The order of presenting the
questions is important. The questions which are likely to help the investigators in
establishing cordial relations with the respondents should be put first. The questions on
similar subjects should come together in the questionnaire or schedule. Sometimes the
questions should be arranged in a way that is suitable for tabulation. If this
arrangement of the questions is unsuitable for the investigators, suitable instructions
should be given as to the sequence in which the questions are to be put. The wording of the questions
should not lead to ambiguous answers. As far as possible the questions should be such that the
answers can be recorded in terms of numbers, dates or specific codes.

To reduce the non-sampling errors that would arise due to ambiguous definitions
and misunderstanding of the questions by the investigators or respondents, it is necessary
to give detailed explanatory notes and instructions on the items of information included in
the questionnaire or schedule. The instructions should include the concepts and
definitions that are to be used in the survey and the method of enquiry. Doubts raised by
the investigators should be clarified in such a way that there is uniformity in the concepts
and definitions used by the different investigators. If the data are being collected by the
method of mail enquiry, the explanatory notes and instructions should be unambiguous,
comprehensive and as brief as possible. If the data are being collected by the method of
interview or by actual physical observation, the instructions can be made more detailed.
In this case also, the instructions should be unambiguous and clear.

Survey and reference periods

The objectives of the survey should also specify whether the survey is to be an
isolated one or a periodically repeated one. In case of repeated surveys, the series of
surveys should be so planned as to minimize the overall cost, ensuring a specified
precision for the estimates or to maximize the precision of the estimates for a given fixed
cost.

Another aspect which needs careful consideration is the period of the survey
together with the reference period (time period to which the information refers) for the
different items of information. The reference period depends much on the items of
information and the conditions under which the survey is to be conducted. It may be
necessary to have different reference periods for different items of information. The
reference period may be taken as one year in case of rare items which occur only once or
twice in a year and which are usually remembered well.

For items of information subject to seasonal fluctuations it is desirable to stagger
the survey over the whole year and collect data every month or season from the same or a
different set of sample units. The same set of sample units may be used if the main
objective is to study the seasonal variation in the estimates, and a different set of sample
units may be used if the main objective is to obtain a picture for the year as a whole,
provided the data collected in the successive periods have a positive correlation. In the
latter case data collected at one point of time may be misleading, because they are subject
to seasonal effects and the seasonal variation is not reflected. Further, staggering the
survey has the advantage of enabling us to use a smaller number of survey personnel over
a longer period instead of having to use a larger number of survey personnel for a shorter
period.

Sample Size and Sampling Design

The question of how large a sample should be is a difficult one. Sample size can
be determined by various constraints. In general, sample size depends on:

i) The nature of the analysis to be performed

ii) The desired precision of the estimates one wishes to achieve

iii) The kind and number of comparisons that will be made

iv) The number of variables that have to be examined simultaneously and how
heterogeneous a universe is sampled.

These constraints influence the sample size as well as the sample design and data
collection procedures. Technical considerations suggest that the required sample size is a
function of the precision of the estimates one wishes to achieve, the variability or
variance one expects to find in the population, and the statistical level of confidence one
wishes to use.
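
The dependence of sample size on precision, variability and confidence level can be sketched with a common textbook formula for estimating a mean under simple random sampling: n0 = (z·S/d)², optionally reduced by a finite population correction n = n0/(1 + n0/N). This formula is not given in these notes and is used here as an assumption. S is an anticipated population standard deviation, d the tolerated margin of error, and z the normal deviate for the chosen confidence level (1.96 for roughly 95%). The function name is hypothetical.

```python
import math


def required_sample_size(std_dev, margin, z=1.96, population_size=None):
    """Approximate sample size for estimating a mean to within +/- margin.

    Assumes simple random sampling and a normal approximation.
    """
    if std_dev <= 0 or margin <= 0:
        raise ValueError("std_dev and margin must be positive")
    n0 = (z * std_dev / margin) ** 2           # infinite-population size
    if population_size is not None:
        n0 = n0 / (1 + n0 / population_size)   # finite population correction
    return math.ceil(n0)                       # round up to a whole unit


# Hypothetical inputs: S = 10, margin d = 2, about 95% confidence.
print(required_sample_size(10, 2))
print(required_sample_size(10, 2, population_size=1000))
```

Halving the margin of error roughly quadruples the required sample size, which is one reason the compromise between cost and accuracy has to be settled at the planning stage.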

Having decided the items of information to be collected and the method of
enquiry, an adequate sampling frame is to be procured. The frame is to be carefully
examined for possible errors of omission and duplication. The information available for
the sampling units or for groups of sampling units should be collected, and the survey design
should make use of this information to the fullest in improving the efficiency of the
survey. A rational choice is to be made from the possible sampling designs, taking into
consideration sampling error, non-sampling errors and cost. In this connection the results
of past surveys, if any, may be of help. If no past experience is available it is necessary
to conduct empirical studies with some past census data or to carry out pilot surveys with
a view to comparing the efficiencies of possible sampling designs. The survey design
should have built-in devices to assess and control non-sampling errors.

Pilot Surveys

Pilot surveys can be used for many purposes. They can be used for obtaining
information regarding the variability in the population and the cost of the survey under
various sampling schemes, and to build up cost and variance functions which can be used
for planning future surveys. A pilot survey will also help to develop the field procedure, to test the
schedule for assessing the possibility of getting accurate information on the items
included, and to train the investigators as to how they should fill up the schedule. In some
surveys answers to certain questions are required for a fixed time period such as a day, a
week, a month or a year. Sometimes it may not be possible to take a decision regarding
the time period to which the question should relate. In such cases, the questionnaire can be put
to test in a pilot survey. Pilot surveys may not be required in situations where there is
already some information on the various points from previous surveys. Pilot surveys
should also be suitably planned to give the required information. The planning of a pilot
survey will depend on the purpose for which it is intended. The extent and the scope of
the pilot survey will also depend on the amount of money one is prepared to spend for
this purpose. In sample surveys where no prior information is available, it is worthwhile
to incur some expenditure on a pilot survey.

Field organization

One way of organizing the work of data collection in a particular subject is to
make use of an already existing agency, which can collect this information as a
by-product of its normal administrative duties. In this case the additional cost is likely to be only
marginal or nominal, because the agency carries out the work alongside its normal duties.
But the quality of work in this case may not be satisfactory, because the main
interest of the survey personnel is likely to be in their normal work and not in this survey.
Further, in this system there may not be much scope for selection of investigators suitable
for this work, since one has to make use of the available personnel. An alternative system
is to have a permanent field organization with a permanent field staff. The experience
gained by the investigators in such an organization in earlier surveys may be of much
help in efficiently carrying out future surveys. With such an organization it is possible to
develop a suitable method of supervision and field scrutiny for improving the quality of
the data.

The survey personnel intended for collection of data should have a sound
physique to endure long journeys under trying circumstances. If it is intended to collect
data by actual inspection of sample units, by counting, by using
measuring instruments or by eye estimation, the investigator should be capable of taking
measurements with instruments or by eye estimation fairly accurately. The investigator must know
well the language in which the schedules and instructions are printed. If the data are
going to be collected by interrogating an informant, the investigator must be able to put
respondents at ease and to persuade them to give the required information. The
investigator should have a good deal of patience, should not get annoyed easily and
should not annoy the informants. While selecting personnel for field surveys, attempts should
be made to test for these qualities with the help of objective tests. Before the selected
survey personnel are sent out for field work they should be given sufficient training on
the methods of approaching the informant, putting questions, and identifying and
selecting sample units.

In order to make sure that the data collected are accurate, it is desirable to get
some schedules checked by an independent batch of workers. It will have a favourable
psychological effect on the investigators, if they are told that their work is being checked
by an independent batch of workers. If possible some sample units can be made common
to two investigators without letting them know as to which are the common sample units.
They might be told that some units are common. This will also have a favourable
psychological effect on them.

Sampling Designs

A sampling design is a definite plan for obtaining a sample from a given
population. It refers to the technique or the procedure the researcher would adopt in
selecting items for the sample. Sampling design is determined before any data are
collected. While developing a sampling strategy, the researcher must pay attention to the
following points:

i) The first step in developing any sample design is to clearly define the population to
be sampled.

ii) A decision has to be taken concerning the sampling unit before selecting the sample.
The sampling unit may be a geographical unit such as a state, district, village,
etc., a construction unit such as a house, flat, etc., a social unit such as a
family, club, school, etc., or an individual.

iii) The frame should be comprehensive, correct, reliable and appropriate. It is extremely
important for the frame to be as representative of the population as possible.

iv) The size of sample should neither be excessively large, nor too small. It should be
optimum. An optimum sample is one which fulfills the requirements of efficiency,
representativeness, reliability and flexibility.

v) In determining the sample design, one must take into consideration the specific
population parameters, which are of interest.

vi) Cost considerations, from a practical point of view, have a major impact upon
decisions relating not only to the size of the sample but also to the sample design.
Cost constraints can even lead to the use of a non-probability sample.

Characteristics of a Good Sample Design

A good sampling design is said to possess the following characteristics:

i) It should result in a truly representative sample.

ii) It should result in a small sampling error.

iii) It should be feasible within the budget available for the research study.

iv) It should be such that systematic bias can be controlled.

Classification of sampling designs

The technique of selecting a sample is of fundamental importance in sampling
theory and usually depends upon the nature of investigation. Sampling designs may be
broadly classified as (i) Probability or random sampling designs (ii) Non-probability
sampling designs.

Probability sampling designs are the methods of selecting samples in which each
unit of the population has some definite probability of being selected in the sample. In
other words, each possible sample has a known probability of selection assigned to it.
Sampling units are selected in a random manner, and hence it is possible to determine the
precision of the estimates and to construct confidence intervals for the parameters. Simple
random sampling, stratified random sampling and systematic sampling are the most
commonly used probability sampling designs.

In the case of non-probability sampling designs or methods, the selection of sampling
units depends entirely on the discretion or judgment of the investigator. In these methods,
the investigator inspects the entire population and selects a sample of typical units that he
considers as representative of the population. Often the non-probability sampling designs
or methods are also called purposive or judgment sampling methods. A major
disadvantage of these methods is that they lack a proper mathematical foundation and
hence are not amenable to the development of sampling theory.

Simple Random Sampling

The simplest method of probability sampling is simple random sampling, often called
random sampling. In this method an equal probability of selection is assigned to each
available unit of the population at the first and each subsequent draw. Thus, if the number
of units in the population is N, the probability of selecting any unit at the first draw is 1/N,
the probability of selecting any unit from among the available (N-1) units at the second
draw is 1/(N-1), and so on. Simple random sampling may also be defined as a method of
selecting n units out of the N units in the population such that each of the ${}^{N}C_{n}$
possible samples has an equal chance of being selected. In simple random sampling the
probability of a specified unit of the population being selected at any given draw is equal
to the probability of its being selected at the first draw. The successive draws may be
made with or without replacing the units selected in the preceding draws. The former is
called sampling with replacement, the latter sampling without replacement. The basic
assumption for simple random sampling is that the population can be subdivided into a
finite number of distinct and identifiable units and that a sampling frame is available.
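
The two drawing schemes described above can be sketched in a few lines of Python. This is only an illustration: the frame of N = 10 labelled units, the sample size and the seed are hypothetical and do not come from the text.

```python
import random

def srswr(units, n, rng):
    """With replacement: every draw picks any unit with probability 1/N."""
    return [rng.choice(units) for _ in range(n)]

def srswor(units, n, rng):
    """Without replacement: n distinct units, every subset equally likely."""
    return rng.sample(units, n)

rng = random.Random(2024)            # fixed seed only to make the sketch reproducible
population = list(range(1, 11))      # hypothetical frame of N = 10 labelled units
with_repl = srswr(population, 4, rng)
without_repl = srswor(population, 4, rng)
print("SRSWR :", with_repl)          # labels may repeat
print("SRSWOR:", without_repl)       # labels are always distinct
```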

Procedure of Selecting a Random Sample

Commonly used procedures for selecting a random sample are:

i) Lottery Method

ii) Random Number Tables Method.

Stratified Sampling

Simple random sampling is most appropriate when the entire population from
which the sample is taken is homogeneous. Stratified sampling techniques are generally
followed when the population is heterogeneous and where it is possible to divide it into
certain homogeneous sub-populations, which are called strata. The strata differ from one
another but each is homogeneous within itself. The units are selected at random from
each of these strata. The number of units selected from different strata may vary
according to their relative importance in the population. The sample, which is the
aggregate of the units sampled from each stratum, is called a stratified sample and the
technique of drawing this sample is known as stratified sampling.

Advantages of Stratified Sampling over Random Sampling

i) The cost per observation in the survey may be reduced.

ii) Estimates of the population parameters may be obtained for each sub-population.

iii) Accuracy may be increased at a given cost.

iv) Administrative control may be better.

A Systematic Random Sample

If the sampling units are arranged in some systematic order, the sample may be drawn
not at random but by taking sampling units at equally spaced intervals along that order.
The sample obtained in this manner is called a systematic sample and the technique is
called systematic sampling. A systematic random sample is obtained by selecting one unit
on a random basis and then choosing additional elementary units at equally spaced
intervals until the desired number of units is obtained. For example, suppose there are
100 students in your class and you want to select a sample of 20 students. Further
suppose that the names are listed on a piece of paper in alphabetical order. If you
choose to use systematic random sampling, divide 100 by 20 to get 5 as the sampling
interval. Randomly select any number between 1 and 5. Suppose the number you have
picked is 4; that will be your starting number. So student number 4 has been selected at
random, and you then select every 5th name after it (9, 14, ..., 99) until you reach the end
of the list. You will end up with 20 selected students.
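
The class example can be reproduced with a short Python sketch. The function name `systematic_sample` and its parameters are illustrative assumptions, and the sketch assumes that N is an exact multiple of n, as in the example.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return the 1-based labels of a systematic sample (assumes N is a multiple of n)."""
    k = N // n                       # sampling interval
    if start is None:
        start = rng.randint(1, k)    # random start between 1 and k inclusive
    return [start + j * k for j in range(n)]

# Class example from the text: N = 100 students, n = 20, random start 4.
sample = systematic_sample(100, 20, start=4)
print(sample)
```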

Cluster Sampling

In some situations the elementary units are in the form of groups, composed of
smaller units. A group of elementary units is called a cluster. Sampling is done by
selecting a sample of clusters and then carrying out a complete enumeration of the
selected clusters. This is called cluster sampling. For example, in taking a sample of
households we select a few villages and then enumerate them completely. Systematic
sampling may also be regarded as cluster sampling in which a sample of one cluster is
taken and then completely investigated. Cluster sampling is typically used when the
researchers cannot get a complete list of the members of a population they wish to study
but can get a complete list of groups or 'clusters' of the population. It is also used when a
random sample would produce a list of individuals so widely scattered that surveying
them would prove to be very expensive. This sampling technique may be more practical
and/or economical than simple random sampling or stratified sampling. For example, a
cluster may be something like a village or a school in a state. Suppose you decide that all
the schools in Hisar are clusters and you want 20 schools selected. You can use simple or
systematic random sampling to select the schools; every school selected then becomes a
cluster. If your interest is to interview teachers on their opinion of some new program
which has been introduced, then all the teachers in a selected cluster must be interviewed.
Though very economical, cluster sampling is very susceptible to sampling bias. In the
above case, for instance, you are likely to get similar responses from teachers in one
school because they interact with one another.

Quota Sampling

Quota sampling is a method of sampling widely used in opinion poll surveys and
market research. Quota sampling starts with the idea that a sample should be well
spread geographically over the population and that it should contain the same fraction of
individuals having a certain characteristic as does the population. In this technique the
population is divided into a number of strata whose weights are obtained from a recent
census or a large-scale survey. Interviewers are then assigned quotas for the number of
interviews to be taken from each stratum. For example, an interviewer might be told to go
out and select 20 adult men, 20 adult women, 10 teenage girls and 10 teenage boys and
interview them about their television viewing. The interviewer is free to choose his
sample provided the quota requirements are fulfilled. The main difference between quota
sampling and stratified simple random sampling is that in quota sampling the selection
of the sample within strata is not strictly random. The interviewer may omit certain
sections of individuals or may disregard certain areas entirely according to his
convenience. Quota sampling suffers from the drawback that the sample is not a random
sample and therefore the sampling distributions of the statistics are unknown.

Inverse Sampling

In general, the SRS methodology for a qualitative characteristic assumes that the
attribute under study is not rare. If the attribute is rare, estimating the population
proportion P by the sample proportion from a sample of fixed size is not suitable, because
such a sample may contain no individuals with the attribute. Inverse sampling
methodology is often used in the following cases:

i) Estimation of the frequency of a rare type of gene.

ii) Estimation of the proportion of some rare type of cancer cells in a biopsy.

iii) Estimation of the proportion of a rare type of blood cell affecting the red blood
cells.

In inverse sampling, sampling is continued until a predetermined number of units
possessing the attribute under study occur in the sample. The sampling units are drawn
one by one with equal probability and without replacement, and sampling is discontinued
as soon as the number of units in the sample possessing the attribute equals the
predetermined number.
Assume that there are N units in the population and M of them possess a rare
attribute. Units are selected one by one using simple random sampling. Sampling is
terminated as soon as a pre-specified number m of units with the rare attribute is found.
Let n be the number of units in the resulting sample. Haldane (1945) proposed the
estimator

$$\hat{p} = \frac{m-1}{n-1}$$

for the population proportion P = M/N and showed that it is unbiased in the case of an
infinite population. Note that n in inverse sampling is a random variable, while N, M and
m are fixed. Observe that the estimator $\hat{p} = (m-1)/(n-1)$ is not the proportion of
individuals in the sample with the rare attribute. This may look like a defect of the
proposed estimator. However, the sample proportion m/n is biased for the true proportion
in the case of inverse sampling.
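
The contrast between (m-1)/(n-1) and m/n can be checked numerically. The sketch below assumes an infinite population, so that n follows a negative binomial distribution with P(n = k) = C(k-1, m-1) p^m (1-p)^(k-m) for k = m, m+1, ...; the values p = 0.05 and m = 3 and the truncation point of the infinite sum are illustrative choices, not taken from the text.

```python
from math import comb

def inverse_sampling_expectations(p, m, max_n=5000):
    """Expected values of (m-1)/(n-1) and m/n under inverse sampling.

    Assumes an infinite population with proportion p and sampling stopped at the
    m-th unit having the attribute (m >= 2). The infinite sum over n is truncated
    at max_n, where the remaining probability is negligible for the values used here.
    """
    e_haldane = 0.0
    e_naive = 0.0
    for n in range(m, max_n + 1):
        prob = comb(n - 1, m - 1) * p**m * (1 - p) ** (n - m)
        e_haldane += prob * (m - 1) / (n - 1)
        e_naive += prob * m / n
    return e_haldane, e_naive

e_haldane, e_naive = inverse_sampling_expectations(p=0.05, m=3)
print("E[(m-1)/(n-1)] =", e_haldane)   # agrees with p up to rounding
print("E[m/n]         =", e_naive)     # larger than p, i.e. biased upwards
```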

UNIT-II

SIMPLE RANDOM SAMPLING

Simple Random Sampling (SRS) is the simplest and most common method of
selecting a sample, in which the sample is selected unit by unit, with equal probability of
selection for each unit at each draw. In other words, simple random sampling is a method
of selecting a sample of n units from a population of size N by giving equal probability of
selection to all units. It is a sampling scheme in which every possible combination of n
units that may be formed from the population of N units has the same chance of selection.
It is of two types:
(i) Simple random sampling with replacement: If a unit is selected, observed, and replaced
in the population before the next draw is made and the procedure is repeated n times, it
gives rise to a simple random sample of n units. This procedure is known as simple
random sampling with replacement and is denoted as SRSWR.
(ii) Simple random sampling without replacement: If a unit is selected, observed,
and not replaced in the population before making the next draw, and the procedure is
repeated until n distinct units are selected, ignoring all repetitions, it is called simple
random sampling without replacement and is denoted by SRSWOR. Let us discuss the
properties of the estimators of population mean, variance, and proportion in each of these
cases.
METHOD OF SELECTING RANDOM SAMPLE
SAMPLE SELECTION
A sample can be selected from a population in many ways. In this chapter, we
will discuss only two simple methods of sample selection.
CHIT METHOD OR LOTTERY METHOD
Suppose we have N = 10,000 blocks in India. We wish to draw a sample of n =
100 blocks to draw an inference about a character under study, e.g., the average amount of
alcohol used, or the number of bulbs produced by a certain company that are used, in each
block. Assign numbers to the 10,000 blocks, write these numbers on chits and fold them in
such a way that all chits look identical. Put all the chits in a box. Then there are two
possibilities:
WITH REPLACEMENT SAMPLING
Select one chit out of the 10,000 chits in the box and note the number of the block
written on it. This is the first unit selected in the sample. Before selecting the second chit,
we replace the first chit in the box and mix it thoroughly with the other chits. Then select
the second chit and note the number of the block written on it. This is the second unit
selected in the sample. Go on repeating the process until 100 chits have been selected.
Note that because each chit is selected after replacing the previous chit in the box, some
chits may be selected more than once. Such a sampling procedure is called Simple Random
Sampling with Replacement or simply SRSWR sampling. Let us illustrate it with a small
population of blocks as follows:
Suppose a population consists of N = 3 blocks, say A, B and C. We wish to draw
all possible samples of size n = 2 using SRSWR sampling. The possible ordered samples
are: AA, AB, AC, BA, BB, BC, CA, CB, CC. Thus a total of 9 samples of size 2 can be drawn
from the population of size 3, which in fact is given by $3^2 = 9$. In general, the total
number of samples of size n drawn from a population of size N in with replacement
sampling is $N^n$ and is denoted by s(Ω).
Thus s(Ω) = $N^n$
WITHOUT REPLACEMENT SAMPLING
In case of without replacement sampling, we do not replace the chit before
selecting the next chit; i.e., the number of chits in the box goes on decreasing as we go on
selecting chits. Hence, there is no chance for a chit to be selected more than once. Such a
sampling procedure is called Simple Random Sampling Without Replacement or simply
SRSWOR sampling. Let us explain it as follows: Suppose a population consists of N = 3
blocks A, B and C. We wish to draw all possible unordered samples of size n = 2.
Evidently, the possible samples are: AB, AC, BC. Thus a total of 3 samples of size 2 can
be drawn from the population of size 3, which in fact is given by ${}^{3}C_{2} = 3$. In general,
the total number of samples of size n drawn without replacement from a population of
size N is given by ${}^{N}C_{n}$.
Thus

$$s(\Omega) = {}^{N}C_{n} = \frac{N!}{n!\,(N-n)!}$$

where n! = n(n-1)(n-2)(n-3)...2.1, and 0! = 1
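
The two counts for the three-block population can be confirmed by listing the samples directly. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from itertools import combinations, product
from math import comb

blocks = ["A", "B", "C"]   # N = 3 blocks, as in the text
n = 2

# SRSWR: ordered samples with repetition allowed, N**n of them.
ordered_with_replacement = ["".join(s) for s in product(blocks, repeat=n)]
# SRSWOR: unordered samples of distinct units, N!/(n!(N-n)!) of them.
unordered_without_replacement = ["".join(s) for s in combinations(blocks, n)]

print(ordered_with_replacement, len(ordered_with_replacement))
print(unordered_without_replacement, len(unordered_without_replacement))
```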


Note that it is a very cumbersome job to make identical chits if the size of the
population is very large. In such situations, another method of sample selection is based
on the use of a random number table. A random number table is a set of numbers used for
drawing random samples. The numbers are usually compiled by a process involving a
chance element, and in their simplest form, consist of a series of digits 0 to 9 occurring at
random with equal probability.
RANDOM NUMBER TABLE METHOD
As mentioned above, in this table the numbers from 0 to 9 are written both in
columns and rows. For the purpose of illustrations, we used Pseudo-Random Numbers
(PRN). We generally apply the following rules to select a sample:
Rule 1: First we write all random numbers into groups of columns. We take as many
columns in each group as the number of digits in the population size.
Rule 2: List all the individuals or units in the population and assign them numbers 1, 2,
3,...,N.
Rule 3: Randomly select any starting point in the table of random numbers. Write all the
numbers less than or equal to N that follow the starting point until we obtain n numbers.
If we are using SRSWOR sampling discard any number that is repeated in the random
number table. If we are using SRSWR sampling retain the repeated numbers.
Rule 4: Select those units that are assigned the numbers listed in Rule 3. This will
constitute a required random sample.
Let us explain these rules as follows:
Suppose we are given a population of N = 225 units and we want to select a
sample of say n = 36 units from it. To pick up a random sample of 36 units out of a
population of 225 units, use any three columns from the random number table. For
example, use column 1 to 3, 4 to 6, etc. rejecting any number greater than 225 (and also
the number 000). As an example, the following table lists the 36 units selected using
SRSWR sampling procedure with the use of Pseudo-Random Numbers (PRN).

Units selected in the sample
014 049 053 039 196 183 171 225 179 153 142 138
070 083 001 209 222 075 219 092 155 012 099 211
027 039 048 048 080 161 006 059 199 150 025 173
In the case of SRSWOR sampling, the numbers 039 and 048 would not be repeated,
i.e. we would take every unit only once, so we continue and select two more distinct
random numbers, 078 and 163.
Although the above method of selecting a sample by using a random number table
is simple, it may involve a lot of rejections of random numbers; we therefore discuss a
shortcut method called the remainder method.
REMAINDER METHOD
Using the above example, if any selected three-digit random number is greater
than 225, divide it by 225. We choose the serial number from 1 through 224
corresponding to the remainder when it is not zero, and the serial number 225 when the
remainder is zero. However, it is necessary to reject the numbers from 901 to 999
(besides 000) in adopting this procedure, as otherwise units with serial numbers 1 to 99
will have a larger probability (5/999) of selection, while those with serial numbers 100 to
225 will have probability only 4/999. If we use this procedure with the same
three-figure random numbers as given in columns 1 to 3, 4 to 6, etc., we obtain the
sample of units which are assigned the numbers given below. Again, in SRSWR sampling
the numbers that give rise to the same remainder are not discarded, while in SRSWOR
sampling such numbers are discarded. Thus an SRSWR sample is as given
below:
Units selected in the sample
138 151 099 025 014 022 197 176 I I 209 042 194
015 049 095 040 027 124 116 097 126 142 073 158
108 053 046 001 207 156 201 027 II I 209 065 184
Note that in the SRSWR sample only one unit, 209, is repeated; thus for SRSWOR
sampling we continue to apply the remainder approach until another distinct unit is
selected, which is 089 in this case. Further note that the first random number, 992, was
discarded because it lies between 901 and 999.
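
The remainder rule, including the rejection of 000 and of 901 to 999, can be written as a small function. This is a sketch for the N = 225 example; the function name `remainder_unit` and its parameters are illustrative assumptions.

```python
def remainder_unit(r, N=225, digits=3):
    """Map a random number r to a unit label in 1..N, or None if r must be rejected.

    Numbers above the largest multiple of N that fits in the given number of
    digits (900 when N = 225 and digits = 3) are rejected, and so is 0, so that
    every unit corresponds to the same count of admissible random numbers.
    """
    largest_multiple = (10**digits - 1) // N * N
    if r == 0 or r > largest_multiple:
        return None
    remainder = r % N
    return remainder if remainder != 0 else N

for r in (138, 226, 450, 900, 992, 0):
    print(r, "->", remainder_unit(r))
```

With this rule each of the 225 units corresponds to exactly four of the admissible numbers 001 to 900, which is why the selection probabilities are equal.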
Results on simple random sampling with replacement
Suppose we select a sample of n ≥ 2 units from the population of size N by using
SRSWR sampling. Let $y_i$, i = 1, 2, ..., n, denote the value of the ith unit selected in the
sample and $Y_i$, i = 1, 2, ..., N, be the value of the ith unit in the population. Then we have
the following theorems:

Theorem 3.1: The sample mean $\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i$ is an unbiased estimator of the
population mean $\bar{Y} = \frac{1}{N}(Y_1 + Y_2 + Y_3 + \dots + Y_N) = \frac{1}{N}\sum_{i=1}^{N} Y_i$.

Corollary 3.1.1: The estimator $\hat{Y} = N\bar{y}_n$ is an unbiased estimator of the population total Y.

Theorem 3.2: The variance of the estimator $\bar{y}_n$ of the population mean $\bar{Y}$ is

$$V(\bar{y}_n) = \frac{\sigma_y^2}{n}, \quad \text{where } \sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2}{N}.$$

Theorem 3.3: An unbiased estimator of the variance $V(\bar{y}_n)$ is given by

$$\hat{V}(\bar{y}_n) = \frac{s_y^2}{n}, \quad \text{where } s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}_n^2}{n-1}.$$

Corollary 3.4: The variance of the estimator $\hat{Y} = N\bar{y}_n$ of the population total is
$V(\hat{Y}) = N^2 V(\bar{y}_n)$.

Theorem 3.5: The covariance between two sample means $\bar{x}_n$ and $\bar{y}_n$ under SRSWR
sampling is given by

$$\mathrm{Cov}(\bar{x}_n, \bar{y}_n) = \frac{\sigma_{xy}}{n}, \quad \text{where } \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}).$$

Theorem 3.6: An unbiased estimator of the covariance between two sample means
$\bar{x}_n$ and $\bar{y}_n$ under SRSWR sampling is given by

$$\widehat{\mathrm{Cov}}(\bar{x}_n, \bar{y}_n) = \frac{s_{xy}}{n}, \quad \text{where } s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)(x_i - \bar{x}_n).$$

Results on simple random sampling without replacement


Suppose we select a sample of n ≥ 2 units from the population of size N by using
SRSWOR sampling. Let $y_i$, i = 1, 2, ..., n, denote the value of the ith unit selected in the
sample and $Y_i$, i = 1, 2, ..., N, be the value of the ith unit in the population. Then we have
the following theorems:

Theorem 3.7: The sample mean $\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i$ is an unbiased estimator of the
population mean $\bar{Y} = \frac{1}{N}(Y_1 + Y_2 + Y_3 + \dots + Y_N) = \frac{1}{N}\sum_{i=1}^{N} Y_i$.

Theorem 3.8: The probability for any population unit to get selected in the sample at any
particular draw is equal to the inverse of the population size, that is, the probability of
selecting the ith unit at a given draw is $\frac{1}{N}$.

Theorem 3.9: The variance of the estimator $\bar{y}_n$ of the population mean $\bar{Y}$ under
SRSWOR is

$$V(\bar{y}_n) = \frac{(1-f)S_y^2}{n}, \quad \text{where } S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2}{N-1},$$

f = n/N denotes the sampling fraction and (1 - f) is the finite population correction
factor (f.p.c.).

Theorem 3.10: An unbiased estimator of the variance $V(\bar{y}_n)$ is given by

$$\hat{V}(\bar{y}_n) = \frac{(1-f)s_y^2}{n}, \quad \text{where } s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}_n^2}{n-1}.$$

Theorem 3.11: The covariance between two sample means $\bar{x}_n$ and $\bar{y}_n$ under SRSWOR
sampling is given by

$$\mathrm{Cov}(\bar{x}_n, \bar{y}_n) = \frac{(1-f)S_{xy}}{n}, \quad \text{where } S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X}).$$

Theorem 3.12: An unbiased estimator of the covariance between two sample means
$\bar{x}_n$ and $\bar{y}_n$ under SRSWOR sampling is given by

$$\widehat{\mathrm{Cov}}(\bar{x}_n, \bar{y}_n) = \frac{(1-f)s_{xy}}{n}, \quad \text{where } s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)(x_i - \bar{x}_n).$$

Estimation of population proportion


Let N be the total number of units in the population Ω and $N_a$ be the number of
units possessing a certain attribute, A (say). Then the population proportion is the ratio of
the number of units possessing the attribute A to the total number of units in the
population, i.e. $P_y = \frac{N_a}{N}$.

Theorem 3.13: Under SRSWR the population proportion $P_y$ is a special case of the
population mean $\bar{Y}$, obtained by taking $Y_i = 1$ if the ith unit possesses the attribute A
and $Y_i = 0$ otherwise.

Theorem 3.14: An unbiased estimator of the population proportion $P_y$ is given by
$p_y = \frac{r}{n}$, where r is the number of units in the sample that possess the attribute A.

Example-1: Consider a population of 4 units with values 1, 2, 3, 4.
A. Write down all possible samples of size 2 using SRSWOR and SRSWR.
B. Find the population mean, the population mean square and the population variance.
C. Find the sample mean and the sample variance of each sample.
D. Verify that the sample mean is an unbiased estimator of the population mean, and that
the sample variance is an unbiased estimator of the population mean square under
SRSWOR and of the population variance under SRSWR.
E. Also find the variance of the sample mean in case of SRSWR and SRSWOR.
Solution:
A. Case I: Suppose we are drawing all possible samples of size n = 2 by using
SRSWOR sampling.
The total number of possible samples is $s(\Omega) = {}^{N}C_{n} = {}^{4}C_{2} = 6$ and $p_t = 1/6$ for
all t = 1, 2, ..., 6.
Hence under SRSWOR the possible samples are (1,2), (1,3), (1,4),
(2,3), (2,4) and (3,4).
Case II: Suppose we are drawing all possible samples of size n = 2 by using
SRSWR sampling.
The total number of possible samples is $s(\Omega) = N^n = 4^2 = 16$ and $p_t = 1/16$ for
all t = 1, 2, ..., 16.
Hence under SRSWR the possible samples are (1,1), (1,2), (1,3),
(1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4), (4,1), (4,2), (4,3) and (4,4).
B. The population mean is given by

$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \frac{1}{4}(1 + 2 + 3 + 4) = 2.5$$

The population mean square is given by

$$S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{4-1}\left[(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2\right] = \frac{5}{3} = 1.67$$

The population variance is given by

$$\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \frac{1}{4}\left[(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2\right] = \frac{5}{4} = 1.25$$

C. Under SRSWOR the possible samples are listed in the table below. For the tth
sample, the sample mean and the sample variance are given by

$$\bar{y}_t = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s_{y(t)}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_t)^2 .$$

For example, for the sample (1,2) we get $\bar{y}_1 = 1.5$ and
$s_{y(1)}^2 = (1-1.5)^2 + (2-1.5)^2 = 0.5$.

D. Under SRSWOR
The expected value of the sample mean is given by

$$E(\bar{y}_t) = \frac{1}{{}^{N}C_{n}}\sum_{t=1}^{{}^{N}C_{n}} \bar{y}_t = \frac{1}{6}(1.5 + 2.0 + 2.5 + 2.5 + 3.0 + 3.5) = 2.5 = \bar{Y}$$

which implies that the sample mean is an unbiased estimator of the population mean, and
that of the sample variance is given by

$$E(s_{y(t)}^2) = \frac{1}{{}^{N}C_{n}}\sum_{t=1}^{{}^{N}C_{n}} s_{y(t)}^2 = \frac{1}{6}(0.5 + 2.0 + 4.5 + 0.5 + 2.0 + 0.5) = \frac{5}{3} = S_y^2$$

Under SRSWOR the possible samples, sample means and sample variances are given
in the table:

Sample No.   Sampled units   Sample mean   Sample variance   Probability of selecting the sample
1            (1,2)           1.5           0.5               1/6
2            (1,3)           2.0           2.0               1/6
3            (1,4)           2.5           4.5               1/6
4            (2,3)           2.5           0.5               1/6
5            (2,4)           3.0           2.0               1/6
6            (3,4)           3.5           0.5               1/6
Total                        15.0          10.0              1

Under SRSWR
The expected value of the sample mean is given by

$$E(\bar{y}_t) = \frac{1}{N^n}\sum_{t=1}^{N^n} \bar{y}_t = \frac{1}{16}(1 + 1.5 + \dots + 4) = \frac{40}{16} = 2.5 = \bar{Y}$$

which implies that the sample mean is an unbiased estimator of the population mean, and
that of the sample variance is given by

$$E(s_{y(t)}^2) = \frac{1}{N^n}\sum_{t=1}^{N^n} s_{y(t)}^2 = \frac{1}{16}(0 + 0.5 + \dots + 0.5 + 0) = \frac{20}{16} = 1.25 = \sigma_y^2$$

Under SRSWR the possible samples, sample means and sample variances are given in
the table:

Sample No.   Sampled units   Sample mean   Sample variance   Probability of selecting the sample
1            (1,1)           1.0           0.0               1/16
2            (1,2)           1.5           0.5               1/16
3            (1,3)           2.0           2.0               1/16
4            (1,4)           2.5           4.5               1/16
5            (2,1)           1.5           0.5               1/16
6            (2,2)           2.0           0.0               1/16
7            (2,3)           2.5           0.5               1/16
8            (2,4)           3.0           2.0               1/16
9            (3,1)           2.0           2.0               1/16
10           (3,2)           2.5           0.5               1/16
11           (3,3)           3.0           0.0               1/16
12           (3,4)           3.5           0.5               1/16
13           (4,1)           2.5           4.5               1/16
14           (4,2)           3.0           2.0               1/16
15           (4,3)           3.5           0.5               1/16
16           (4,4)           4.0           0.0               1/16
Total                        40.0          20.0              1

E. Under SRSWOR the variance of the sample mean is given by the formula

$$\mathrm{Var}(\bar{y}_n) = \frac{N-n}{Nn} S_y^2 = \frac{4-2}{4 \times 2} \times \frac{5}{3} = \frac{5}{12} = 0.41667$$

Under SRSWR the variance of the sample mean is given by the formula

$$\mathrm{Var}(\bar{y}_n) = \frac{\sigma_y^2}{n} = \frac{1.25}{2} = \frac{5}{8} = 0.625$$
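
Parts D and E can be verified by brute force, enumerating every possible sample of the four-unit population. The sketch below uses exact fractions from the Python standard library; the helper name `summarize` is an illustrative choice.

```python
from fractions import Fraction
from itertools import combinations, product
from statistics import mean, pvariance, variance

population = [Fraction(v) for v in (1, 2, 3, 4)]
n = 2

def summarize(samples):
    """Return (mean of sample means, mean of sample variances, variance of sample means)."""
    means = [mean(s) for s in samples]
    variances = [variance(s) for s in samples]   # divisor n - 1
    return mean(means), mean(variances), pvariance(means)

wor = summarize(list(combinations(population, n)))     # 6 equally likely samples
wr = summarize(list(product(population, repeat=n)))    # 16 equally likely ordered samples

print("SRSWOR:", wor, " S^2 =", variance(population))
print("SRSWR :", wr, " sigma^2 =", pvariance(population))
```

The enumeration gives a variance of the sample mean of 5/12 under SRSWOR and 5/8 under SRSWR, matching $(N-n)S_y^2/(Nn)$ and $\sigma_y^2/n$ respectively.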

Example-2: The following data give the number of trees for each of a population of 32
households in a forest.
5 3 7 11 4 6 10 9 8 12 11
10 10 11 8 7 6 8 9 4 1 5
7 7 12 8 9 10 9 7 6 8
Draw a simple random sample with replacement of 6 draws and hence obtain
(i) an estimate of the average number of trees per household;
(ii) a 95% confidence interval for the average number of trees in the
population;
(iii) an estimate of the total number of trees in the population;
(iv) a 95% confidence interval for the total number of trees in the population.
Solution: Using a random number table, take two-digit random numbers,
starting from the top left-hand corner or any suitable starting point, and proceed
horizontally (or vertically), considering only the random numbers between 01 and 96
(96 being the largest two-digit number which is an integer multiple of 32). Divide the
random number selected by 32; the remainder gives the label of the household selected,
a remainder of zero corresponding to household 32. If we considered only the random
numbers between 01 and 32, many random numbers would be rejected.

Random number   Household selected   Value of y_i   (y_i - ȳ_n)²
08              08                   9              1.36
83              19                   9              1.36
23              23                   7              0.69
49              17                   6              3.36
13              13                   10             4.69
63              31                   6              3.36
Total                                47             14.83

(i) The average number of trees in the sample is given by

$$\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{9+9+7+6+10+6}{6} = \frac{47}{6} = 7.833$$

(ii) A (1-α)100% confidence interval for the population mean $\bar{Y}$ is given by

$$\bar{y}_n \pm t_{\alpha/2}(df = n-1)\sqrt{\hat{V}(\bar{y}_n)}, \quad \text{where } \hat{V}(\bar{y}_n) = s_y^2/n .$$

Now we have

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2 = \frac{1}{5}\left[(9-7.833)^2 + (9-7.833)^2 + (7-7.833)^2 + (6-7.833)^2 + (10-7.833)^2 + (6-7.833)^2\right] = \frac{14.833}{5} = 2.967$$

Thus $\hat{V}(\bar{y}_n) = s_y^2/n = \frac{2.967}{6} = 0.494$ and $\sqrt{\hat{V}(\bar{y}_n)} = 0.703$.

Therefore the 95% confidence interval for the average number of trees is given by

$\bar{y}_n \pm t_{.05/2}(df = 6-1)\sqrt{\hat{V}(\bar{y}_n)}$, or 7.833 ± 2.57 × 0.703, or (6.026, 9.640).

(iii) An estimate of the total number of trees is given by

$$\hat{Y} = N\bar{y}_n = 32 \times \frac{47}{6} = 250.67$$

(iv) The 95% confidence interval for the total number of trees is given by
32 × (6.026, 9.640) = (192.83, 308.48)
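
The arithmetic of this example can be checked with a short Python sketch. It applies the SRSWR variance estimator $s_y^2/n$ of Theorem 3.3 to the six sampled values; the t value 2.571 for 5 degrees of freedom is entered by hand from a t table, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

y = [9, 9, 7, 6, 10, 6]       # sampled values from the table above
N, n = 32, len(y)
t_5df = 2.571                 # tabulated t for 5 d.f., two-sided 5% level

y_bar = mean(y)
var_hat = variance(y) / n     # SRSWR: s^2 / n, no finite population correction
se = sqrt(var_hat)
ci_mean = (y_bar - t_5df * se, y_bar + t_5df * se)
total_hat = N * y_bar
ci_total = (N * ci_mean[0], N * ci_mean[1])

print(round(y_bar, 3), round(se, 3))
print([round(v, 2) for v in ci_mean], round(total_hat, 2), [round(v, 1) for v in ci_total])
```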
Example-3: The following data give the geographical area (in acres) under paddy for 58
villages. Draw an SRSWOR of eight villages,
98 270 79 273 130 158 116 194 41 33 78
56 58 19 64 81 141 58 29 46 93 127
114 88 108 58 47 69 44 56 102 102 187
161 179 76 137 179 76 137 127 104 117 170
210 101 222 223 96 114 318 272 155 292 240
201 261 189
Hence obtain
(i) an estimate of the average area under paddy per village;
(ii) a 95% confidence interval for the average area under paddy in the
population;
(iii) an estimate of the total area under paddy in the population;
(iv) a 95% confidence interval for the total area under paddy in the
population.

Solution: For drawing a sample by SRSWOR, the random number table method is used.
The only care to be taken is that if a random number results in the selection of a unit
already selected, that random number is rejected and the next random number is tried.
Suppose the sample so selected, arranged in increasing order of labels, is
(5, 9, 17, 19, 21, 37, 41, 53), with values 44, 56, 78, 81, 93, 137, 161 and 261.

Village selected   Value of y_i   (y_i - ȳ_n)²
5                  44             4882.516
9                  56             3349.516
17                 78             1287.016
19                 81             1080.766
21                 93             435.7656
37                 137            534.7656
41                 161            2220.766
53                 261            21645.77
Total              911            35436.88

(i) The average area under paddy in the sample is given by

$$\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{44+56+78+81+93+137+161+261}{8} = 113.875$$

(ii) A (1-α)100% confidence interval for the population mean $\bar{Y}$ is given by

$$\bar{y}_n \pm t_{\alpha/2}(df = n-1)\sqrt{\hat{V}(\bar{y}_n)}, \quad \text{where under SRSWOR } \hat{V}(\bar{y}_n) = \frac{(1-f)s_y^2}{n}, \; f = n/N .$$

Now we have

$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y}_n)^2 = \frac{1}{7}\left[(44-113.875)^2 + (56-113.875)^2 + (78-113.875)^2 + \dots + (261-113.875)^2\right] = 5062.41$$

Thus $\hat{V}(\bar{y}_n) = \left(1 - \frac{8}{58}\right)\frac{5062.41}{8} = \frac{50}{58} \times 632.80 = 545.52$ and
$\sqrt{\hat{V}(\bar{y}_n)} = 23.356$.

Therefore the 95% confidence interval for the average area under paddy is given by

$\bar{y}_n \pm t_{.05/2}(df = 8-1)\sqrt{\hat{V}(\bar{y}_n)}$, or 113.875 ± 2.365 × 23.356, or (58.64, 169.11).

(iii) An estimate of the total area under paddy is given by

$$\hat{Y} = N\bar{y}_n = 58 \times 113.875 = 6604.75$$

(iv) The 95% confidence interval for the total area under paddy is given by
58 × (58.64, 169.11) = (3401.12, 9808.38)
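
As a numerical cross-check, the sketch below applies the SRSWOR variance estimator of Theorem 3.10, which includes the finite population correction (1 - f), to the eight sampled values, with n - 1 = 7 degrees of freedom. The t value 2.365 is entered by hand from a t table, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

y = [44, 56, 78, 81, 93, 137, 161, 261]   # sampled areas from the table above
N, n = 58, len(y)
t_7df = 2.365                             # tabulated t for 7 d.f., two-sided 5% level

y_bar = mean(y)
f = n / N                                 # sampling fraction
var_hat = (1 - f) * variance(y) / n       # SRSWOR: (1 - f) s^2 / n
se = sqrt(var_hat)
ci_mean = (y_bar - t_7df * se, y_bar + t_7df * se)
total_hat = N * y_bar
ci_total = (N * ci_mean[0], N * ci_mean[1])

print(y_bar, round(se, 3), [round(v, 2) for v in ci_mean])
print(total_hat, [round(v, 1) for v in ci_total])
```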
Size of sample
We want to avoid making the sample so small that the estimate is too inaccurate to be
useful. Equally, we want to avoid taking a sample that is too large, in that the estimate is
more accurate than we require. Consequently, the first step is to decide how large an
error we can tolerate in the estimate. This demands careful thinking about the use to be
made of the estimate and about the consequences of a sizeable error. The figure finally
reached may be to some extent arbitrary, yet after some thought samplers often find
themselves less hesitant about naming a figure than they expected to be.
The next step is to express the allowable error in terms of confidence limits.
Suppose that L is the allowable error in the sample mean and that we are willing to take a
5% chance that the error will exceed L. In other words, we want to be reasonably certain
that the error will not exceed L. Recall that the 95% confidence limits computed
from a sample mean, assumed approximately normally distributed, are

$$\bar{y} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \quad \text{when } \sigma \text{ is known.}$$

Here we take $E = Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$ as the marginal error, so the sample size is

$$n = \frac{(Z_{\alpha/2})^2 \sigma^2}{E^2}$$

The value of $\sigma^2$ is known, and $Z_{\alpha/2}$ = 1.960 at the 5% level and 2.576 at the 1% level.

If σ is unknown, the sample size is

$$n = \frac{(t_{(n-1),\alpha})^2 s^2}{E^2}, \quad \text{where } s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

and $t_{(n-1),\alpha}$ is the tabulated value of t at (n-1) d.f. and α level of significance.

Example-4: An experimental sample was taken in 1938 to estimate the yield per acre of
wheat. For a sample of 222 fields, the variance of the yield per acre from field to field was
$s^2$ = 90.3 (quintal)². How many fields are indicated if we wish to estimate the true mean
yield within ±1 quintal, with a 5% risk that the error will exceed one quintal?

Solution: We have

$$n = \frac{(t_{(n-1),\alpha})^2 s^2}{E^2}$$

where
E = marginal error = 1,
$t_{221,\,.05}$ = 1.96
and $s^2$ = 90.3.
So,

$$n = \frac{(1.96)^2 \times 90.3}{(1)^2} = 3.8416 \times 90.3 = 346.90$$

which is rounded up to 347 fields.
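
The sample-size formula is easy to wrap in a small helper. The function name `required_sample_size` and its arguments are illustrative; the result is rounded up because a fraction of a field cannot be sampled.

```python
from math import ceil

def required_sample_size(variance, margin, critical_value=1.96):
    """n = (critical value)^2 * variance / margin^2, rounded up to a whole unit."""
    return ceil(critical_value**2 * variance / margin**2)

# Example-4: s^2 = 90.3, E = 1 quintal, 5% risk (critical value 1.96).
n_fields = required_sample_size(variance=90.3, margin=1.0)
print(n_fields)
```

Halving the allowable error to 0.5 quintal would quadruple the unrounded sample size, since n is inversely proportional to $E^2$.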

UNIT-III

PROBABILITY PROPORTIONAL TO SIZE WITH REPLACEMENT

Sometimes we have the situation where the units vary considerably in size and
simple random sampling does not take into account the possible importance of the larger
units in the population. Such ancillary information about the size of the unit can be
utilized in selecting the sample so as to get more efficient estimator of the population
parameters. One such method is to assign unequal probabilities for selection to different
units of the population. For example, villages with larger geographical area are likely to
have larger area under food crops, and in estimating the production it would be desirable
to adopt a sampling scheme in which villages are selected with probability proportional
to geographical area. When units vary in their size and the variable under study is directly
related with the size of the unit, the probabilities may be assigned proportional to the size
of the unit. This type of sampling where the probability of selection is proportional to the
size of the unit is known as ‘PPS Sampling’.
Procedure of selecting a PPS with replacement sample
There are two methods of selection which are as follows:
(a) Cumulative Total Method
Let the size of the ith unit be $X_i$ (i = 1, 2, ..., N). We associate the numbers 1 to $X_1$
with the first unit, the numbers ($X_1$ + 1) to ($X_1 + X_2$) with the second unit, and so on,
so that the last number associated is X = $X_1 + X_2 + \dots + X_N$. Then a random
number r is chosen from 1 to X and the unit with which this number is
associated is selected. To select a sample of n units with replacement, the procedure is
repeated n times.
Example-1: A village has 8 orchards containing 50, 30, 25, 40, 26, 44, 20 and 35 trees,
respectively. Select a sample of 3 orchards with replacement and with probability
proportional to number of trees in the orchards.
Solution:
We prepare the following cumulative total table:

Sr. No. of the orchard   Size (Xi)   Cumulative size   Numbers associated
1                        50          50                1-50
2                        30          80                51-80
3                        25          105               81-105
4                        40          145               106-145
5                        26          171               146-171
6                        44          215               172-215
7                        20          235               216-235
8                        35          270               236-270

Now, we select three random numbers between 1 and 270. The random numbers
selected are 200, 116 and 047. The units associated with these three numbers are the 6th,
4th and 1st, respectively. Hence the sample so selected contains the units with serial
numbers 6, 4 and 1.
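
The lookup in the cumulative total table amounts to finding the first cumulative size that is at least r. A minimal Python sketch, using the orchard sizes and the three random numbers of this example, is given below; the function name `cumulative_total_select` is an illustrative choice.

```python
from bisect import bisect_left
from itertools import accumulate

def cumulative_total_select(sizes, r):
    """Return the 1-based serial number of the unit whose number range contains r."""
    cumulative = list(accumulate(sizes))      # 50, 80, 105, ..., 270
    return bisect_left(cumulative, r) + 1     # first cumulative total >= r

sizes = [50, 30, 25, 40, 26, 44, 20, 35]      # trees in the 8 orchards
random_numbers = [200, 116, 47]               # random numbers quoted in the example
selected = [cumulative_total_select(sizes, r) for r in random_numbers]
print(selected)
```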
(b) Lahiri’s Method
We have noticed that the cumulative total method involves writing down the
successive cumulative totals, which is time-consuming and tedious, especially in large
populations. Lahiri suggested an alternative procedure which avoids the necessity of
writing down the cumulative totals. Lahiri's method consists of selecting a pair of random
numbers, say (i, j), such that 1 ≤ i ≤ N and 1 ≤ j ≤ M, where M is the maximum of the
sizes of the N units of the population. If j ≤ $X_i$, the ith unit is selected; otherwise, the pair
of random numbers is rejected and another pair is chosen. For selecting a sample of n
units, the procedure is repeated till n units are selected. This procedure leads to the
required probabilities of selection.
Example-2: A village has 8 orchards containing 50, 30, 25, 40, 26, 44, 20 and 35 trees,
respectively. Select a sample of 3 orchards by PPS with replacement using
Lahiri's method.
Solution:
Here N = 8, M = 50 and n = 3

Sr. No. of the orchard    Size (Xi)
1                         50
2                         30
3                         25
4                         40
5                         26
6                         44
7                         20
8                         35

We have to select three pairs of random numbers such that the first number is less
than or equal to 8 and the second random number is less than or equal to 50.
Referring to the random number table, the three pairs selected are (2, 23), (7, 8) and
(3, 30). In the third pair j = 30 exceeds X3 = 25, so this pair is rejected and a fresh pair
has to be selected. The next pair of random numbers from the same table is (2, 18), and
hence the sample so selected consists of the units with serial numbers 2, 7 and 2.
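Lahiri's accept/reject rule can be sketched as follows. This is a minimal sketch: the function name `lahiri_select` and the optional `pairs` argument (used to replay the pairs read from the random number table in Example-2) are assumptions introduced for illustration.

```python
import random


def lahiri_select(sizes, n, rng=None, pairs=None):
    """Select n units (1-based serial numbers) by Lahiri's method.

    sizes : list of unit sizes X_1, ..., X_N
    pairs : optional iterable of (i, j) pairs to use instead of fresh
            random numbers (hypothetical helper argument)
    """
    N, M = len(sizes), max(sizes)
    rng = rng or random.Random()
    source = iter(pairs) if pairs is not None else None
    sample = []
    while len(sample) < n:
        if source is not None:
            i, j = next(source)
        else:
            i, j = rng.randint(1, N), rng.randint(1, M)
        # Accept unit i only if j does not exceed its size X_i.
        if j <= sizes[i - 1]:
            sample.append(i)
    return sample


orchards = [50, 30, 25, 40, 26, 44, 20, 35]
table_pairs = [(2, 23), (7, 8), (3, 30), (2, 18)]
print(lahiri_select(orchards, 3, pairs=table_pairs))
```

Because unit i is accepted with probability Xi/M once it has been drawn, larger units survive the rejection step more often, which is what makes the selection proportional to size.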
Estimation Procedure
Let a sample of n units be drawn from a population consisting of N units by PPS
with replacement. Further, let (yi, pi) be the value and the probability of selection of the
ith unit of the sample, i = 1, 2, …, n.

An unbiased estimator of the population total $Y$ is given by
$$\hat{Y}_{PPS} = \frac{1}{n} \sum_{i=1}^{n} \frac{y_i}{p_i}$$
The variance of the above estimator is given by
$$V_{PPS}(\hat{Y}_{PPS}) = \frac{1}{n} \left[ \sum_{i=1}^{N} \frac{Y_i^2}{P_i} - Y^2 \right]$$
where $P_i = X_i / X$.
An estimator of the above variance expression is given by
$$\hat{V}_{PPS}(\hat{Y}_{PPS}) = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \frac{y_i}{p_i} - \hat{Y}_{PPS} \right)^2$$
The estimator of the population mean $\bar{Y}$ can be obtained simply by dividing the
estimator of the population total $Y$ by N, the size of the population. The corresponding
variance and variance estimator are obtained by dividing $V_{PPS}(\hat{Y}_{PPS})$ and
$\hat{V}_{PPS}(\hat{Y}_{PPS})$ by $N^2$.
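The estimator of the total and its variance estimator can be sketched as below. The function name `pps_total_estimate` and the two-unit data set are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
def pps_total_estimate(y, p):
    """PPS with-replacement estimator of the population total.

    y : sample values y_1, ..., y_n
    p : selection probabilities p_1, ..., p_n of the sampled units
    Returns (estimate of total, estimated variance of that estimate).
    """
    n = len(y)
    ratios = [yi / pi for yi, pi in zip(y, p)]      # y_i / p_i
    total_hat = sum(ratios) / n
    var_hat = sum((r - total_hat) ** 2 for r in ratios) / (n * (n - 1))
    return total_hat, var_hat


# Hypothetical sample: ratios are 10/0.25 = 40 and 30/0.5 = 60.
total_hat, var_hat = pps_total_estimate([10.0, 30.0], [0.25, 0.5])
print(total_hat, var_hat)
```

Dividing `total_hat` by N and `var_hat` by N² gives the corresponding results for the population mean.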

Example-3: A sample of 20 farms from 100 farms, selected by the cumulative total method
with probability proportional to area under wheat crop with replacement, was
[(5.2, 28); (5.9, 29); (3.0, 30); (4.2, 22); (4.7, 24); (4.8, 25); (4.9, 28); (6.8, 37); (4.7, 26);
(5.7, 32); (5.2, 25); (5.2, 38); (4.9, 31); (4.0, 16); (1.3, 6); (7.4, 61); (7.4, 61); (4.8, 29);
(6.2, 47); (6.2, 47)]
And the sample selected by Lahiri method was
[(4.8, 22); (4.1, 19); (1.3, 6); (5.2, 25); (6.9, 54); (6.0, 43); (2.0, 4); (6.3, 40); (5.2, 28);
(4.2, 29); (4.8, 22); (5.9, 39); (5.8, 44); (5.1, 30); (4.7, 27); (5.6, 34); (5.2, 31); (5.8, 45);
(4.0, 18); (4.6, 31)]
The figures in brackets are the area under crops (x) in hectares and the yield of
crop (y) in quintals/ha.
The total area under crop was X = 484.5 hectares.
On the basis of the selected samples, estimate the average yield per farm along
with its standard error.
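Solution: (1) Cumulative Total Method
(i) The estimate of average yield/farm is given by
$$\hat{\bar{Y}}_{PPS} = \frac{X}{nN} \sum_{i=1}^{n} \frac{y_i}{x_i}$$
We have N = 100, X = 484.5, n = 20 and $\sum_{i=1}^{n} y_i/x_i = 120.5930$.
Therefore,
$$\hat{\bar{Y}}_{PPS} = \frac{1}{20 \times 100} \times 484.50 \times 120.5930 = 29.2136$$

(ii) The estimate of $V_{PPS}(\hat{\bar{Y}}_{PPS})$ is given by
$$\hat{V}_{PPS}(\hat{\bar{Y}}_{PPS}) = \frac{1}{n(n-1)} \left[ \frac{X^2}{N^2} \sum_{i=1}^{n} \left( \frac{y_i}{x_i} \right)^2 - n \hat{\bar{Y}}_{PPS}^2 \right]$$
$$= \frac{1}{20 \times 19} \left[ (4.845)^2 \times 728.1481 - 20 (29.2136)^2 \right] = 0.06$$
Standard error of $\hat{\bar{Y}}_{PPS} = \sqrt{0.06} = 0.2449$

(2) Lahiri's Method
(i) The estimate of average yield/farm is
$$\hat{\bar{Y}}_{PPS} = \frac{1}{100} \times \frac{484.5}{20} \times 115.7091 = 28.0307$$

(ii) The estimate of $V_{PPS}(\hat{\bar{Y}}_{PPS})$ will be
$$\hat{V}_{PPS}(\hat{\bar{Y}}_{PPS}) = \frac{1}{20 \times 19} \left[ 23.4740 \times 708.6056 - 20 (28.0307)^2 \right] = 2.43$$
Standard error of $\hat{\bar{Y}}_{PPS} = \sqrt{2.43} = 1.5588$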
UNIT-IV

STRATIFIED SAMPLING

Stratified Random Sampling


The simplest method of selection of sample is simple random sampling (SRS) in
which every sample gets an equal chance of selection. In SRS, units are selected with
equal probability at every draw. It is well known that the precision of a sample estimate
of the population mean depends not only upon the size of the sample and the sampling
fraction but also on the population variability. Selection of a simple random sample from
the entire population may be desirable when we do not have any knowledge about the
nature of the population, such as its variability. However, if it is known that the
variability of the population differs from one pocket to another, this
information can be used to exercise control over the selection. The approach through
which such a controlled selection can be exercised is called stratified sampling.
In stratified sampling, the population consisting of N units is first divided into K
homogeneous sub-populations of N1, N2, …, NK units, respectively. These sub-populations
are non-overlapping and together they comprise the whole of the population, i.e.
$\sum_{h=1}^{K} N_h = N$. These sub-populations are called strata.

Let $Y_{hi}$ be the ith population value of the study variable in the hth stratum,
i = 1, 2, …, Nh, so that the hth stratum population mean is given by
$$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi} \quad \text{for } h = 1, 2, 3, \ldots, K.$$
Using the concept of weighted average, the true mean of the whole population can be
written as:
$$\bar{Y} = \frac{N_1 \bar{Y}_1 + N_2 \bar{Y}_2 + N_3 \bar{Y}_3 + \cdots + N_K \bar{Y}_K}{N_1 + N_2 + N_3 + \cdots + N_K} = \frac{N_1 \bar{Y}_1 + N_2 \bar{Y}_2 + N_3 \bar{Y}_3 + \cdots + N_K \bar{Y}_K}{N}$$
$$= \left( \frac{N_1}{N} \right) \bar{Y}_1 + \left( \frac{N_2}{N} \right) \bar{Y}_2 + \left( \frac{N_3}{N} \right) \bar{Y}_3 + \cdots + \left( \frac{N_K}{N} \right) \bar{Y}_K$$
$$= W_1 \bar{Y}_1 + W_2 \bar{Y}_2 + W_3 \bar{Y}_3 + \cdots + W_K \bar{Y}_K = \sum_{h=1}^{K} W_h \bar{Y}_h$$
Suppose a sample of size nh is drawn by SRSWOR from the hth stratum consisting of
Nh units, such that $\sum_{h=1}^{K} n_h = n$, the required sample size.
Let yhi denote the value of the study variable for the ith unit selected from the hth
stratum, where i = 1, 2, ..., nh, and let Wh = Nh/N be the known proportion of population
units falling in the hth stratum.
In this sampling scheme we have the following results.
Theorem 1: An unbiased estimator of the population mean is given by
$$\bar{y}_{st} = \sum_{h=1}^{K} W_h \bar{y}_h$$
where $\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$ denotes the hth stratum sample mean.

Theorem 2: Under SRSWOR sampling, the variance of the estimator $\bar{y}_{st}$ is given by
$$V(\bar{y}_{st}) = \sum_{h=1}^{K} W_h^2 \left( \frac{1 - f_h}{n_h} \right) S_{hy}^2$$
where $S_{hy}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2$ denotes the hth stratum population variance,
$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$ denotes the hth stratum population mean and $f_h = \frac{n_h}{N_h}$.

Theorem 3: Under SRSWOR sampling, an unbiased estimator of $V(\bar{y}_{st})$ is given by
$$\hat{V}(\bar{y}_{st}) = \sum_{h=1}^{K} W_h^2 \left( \frac{1 - f_h}{n_h} \right) s_{hy}^2$$
where $s_{hy}^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2$ denotes the hth stratum sample variance,
$\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$ denotes the hth stratum sample mean and $f_h = \frac{n_h}{N_h}$.

Note: The (1−α) 100% confidence interval estimate in stratified random sampling is
given by
$$\bar{y}_{st} \pm t_{\alpha/2}(df = n - K) \sqrt{\hat{V}(\bar{y}_{st})}$$
Generally the stratification is done according to administrative groupings,
geographical regions and on the basis of auxiliary characters correlated with the character
under study.
In stratified sampling, having decided the strata and the sample size, the next question
which a survey statistician has to face is regarding the method of selection within each
stratum and the allocation of the sample to the different strata. The allocation of the
sample to different strata can be made according to various methods.
Allocation of Sample Size
The allocation of sample size to different strata is affected by three factors, viz.
I. The total number of elements in each stratum,
II. The variability of observations within each stratum, and
III. The cost of obtaining an observation from each stratum.
A good allocation is one where maximum precision is obtained with minimum
cost, or in other words, the criterion for allocation is to minimize the cost for a given
variance or minimize the variance for a given cost.
The cost function in stratified sampling may be taken as:
$$C = C_0 + \sum_{h=1}^{L} n_h c_h$$
where C0 is the overhead cost, which is constant for certain broad ranges of the total
sample size, ch is the average cost of surveying one unit in the hth stratum, which may
depend on the nature and size of the units in the stratum, and L is the total number of
strata (denoted K elsewhere in this unit).
There are four methods of allocation of sample sizes to different strata which are
given below:
(i) Equal samples from each stratum: In this method the total sample size n is
divided equally among all the strata, i.e. for the hth stratum
$$n_h = \frac{n}{L}$$
Under this allocation the variance of the estimator $\bar{y}_{st}$ is given by
$$V(\bar{y}_{st})_E = \sum_{h=1}^{L} W_h^2 \left( \frac{1 - f_h}{n_h} \right) S_{hy}^2 = \sum_{h=1}^{L} W_h^2 \left( \frac{N_h - n_h}{N_h n_h} \right) S_{hy}^2 = \frac{1}{n N^2} \sum_{h=1}^{L} N_h (L N_h - n) S_{hy}^2$$
where $S_{hy}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2$ denotes the hth stratum population variance,
$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$ denotes the hth stratum population mean and $f_h = \frac{n_h}{N_h}$.

An unbiased estimator of $V(\bar{y}_{st})_E$ is given by
$$\hat{V}(\bar{y}_{st})_E = \sum_{h=1}^{L} W_h^2 \left( \frac{1 - f_h}{n_h} \right) s_{hy}^2 = \sum_{h=1}^{L} W_h^2 \left( \frac{N_h - n_h}{N_h n_h} \right) s_{hy}^2 = \frac{1}{n N^2} \sum_{h=1}^{L} N_h (L N_h - n) s_{hy}^2$$

(ii) Proportional allocation: This system of allocation is very common because of its
simplicity. In this, the items are selected from each stratum in the same proportion
as they exist in the population. The allocation of the sample sizes is termed
proportional if the sampling fraction, i.e. the ratio of the sample size to the
population size, remains the same in all the strata. Mathematically,
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \frac{n_3}{N_3} = \cdots = \frac{n_K}{N_K}$$
By the property of ratio and proportion, each of these ratios is equal to the ratio
of the sum of numerators to the sum of denominators, i.e.
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \frac{n_3}{N_3} = \cdots = \frac{n_K}{N_K} = \frac{n_1 + n_2 + n_3 + \cdots + n_K}{N_1 + N_2 + N_3 + \cdots + N_K} = \frac{n}{N}$$
Since the total sample size n and the population size N are fixed,
$$n_h = N_h \left( \frac{n}{N} \right) = n \left( \frac{N_h}{N} \right) = n W_h, \quad h = 1, 2, \ldots, K$$
Under this allocation the variance of the estimator $\bar{y}_{st}$ is given by
$$V(\bar{y}_{st})_p = \sum_{h=1}^{K} W_h^2 \left( \frac{N_h - n W_h}{N_h \, n W_h} \right) S_{hy}^2 = \frac{1}{nN} \sum_{h=1}^{K} N_h \left( 1 - \frac{n}{N} \right) S_{hy}^2 = \frac{(1 - f)}{n} \sum_{h=1}^{K} W_h S_{hy}^2$$
If $f = \frac{n}{N}$ is negligible,
$$V(\bar{y}_{st})_p = \frac{1}{n} \sum_{h=1}^{K} W_h S_{hy}^2$$
where $S_{hy}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2$ denotes the hth stratum population variance,
$\bar{Y}_h = \frac{1}{N_h} \sum_{i=1}^{N_h} Y_{hi}$ denotes the hth stratum population mean and $f_h = \frac{n_h}{N_h}$.
Example-1: At a private college the students may be classified according to the
following scheme:
Classification    Number of students
Senior            150
Junior            163
Sophomore         195
Freshman          220

If we use proportional allocation to select a stratified random sample of size n = 40,
how large a sample must be taken from each stratum?
Here N = 150 + 163 + 195 + 220 = 728, so that
$$n_1 = \frac{150}{728} \times 40 \approx 8, \quad n_2 = \frac{163}{728} \times 40 \approx 9, \quad n_3 = \frac{195}{728} \times 40 \approx 11, \quad n_4 = \frac{220}{728} \times 40 \approx 12$$

Stratification is efficient when the units within the stratum are homogeneous and
between strata are heterogeneous.
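The proportional allocation of Example-1 can be checked with a short Python sketch. The function name `proportional_allocation` is an assumption made here; each stratum size is simply rounded to the nearest integer, which happens to add up to n for these data but need not do so in general.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate a total sample of size n in proportion to the stratum sizes N_h."""
    N = sum(stratum_sizes)
    # n_h = n * N_h / N, rounded to the nearest whole unit
    return [round(n * Nh / N) for Nh in stratum_sizes]


students = [150, 163, 195, 220]   # Senior, Junior, Sophomore, Freshman
allocation = proportional_allocation(students, 40)
print(allocation, sum(allocation))
```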
(iii) Optimum allocation: By optimum allocation, we mean the best utilization of
resources. In this method the allocation of sample size to various strata is made in
such a way so as to minimize (i) the sampling variance for a given cost, or (ii) the
cost of the survey for a specified value of sampling variance.
This method is based on the cost aspect of the survey. Let Ch be the cost of
observing the variable y on one unit in the hth stratum and let Ct be the total cost of the
survey; then
$$C_t = C_0 + \sum_{h=1}^{K} n_h C_h$$
where C0 stands for the overhead cost. We know that the variance of the estimator $\bar{y}_{st}$
is given by
$$V(\bar{y}_{st}) = \sum_{h=1}^{K} W_h^2 \left( \frac{1 - f_h}{n_h} \right) S_{hy}^2$$
Here we shall discuss two cases.

Case 1: Total cost is fixed: If the total cost is fixed, then $V(\bar{y}_{st})$ is minimum if
$$n_h = \frac{n \, W_h S_{hy} / \sqrt{C_h}}{\sum_{h=1}^{K} W_h S_{hy} / \sqrt{C_h}}$$
If we put the above value of $n_h$ in $V(\bar{y}_{st})$, the variance of the estimator $\bar{y}_{st}$ under
optimum allocation is given as
$$V(\bar{y}_{st})_{opt} = \frac{1}{n} \left( \sum_{h=1}^{K} W_h S_{hy} \sqrt{C_h} \right) \left( \sum_{h=1}^{K} \frac{W_h S_{hy}}{\sqrt{C_h}} \right)$$
where f is considered negligible.
Example-2: From the following data, draw a sample of 100 farms using optimum
allocation in stratified sampling.

Categories    No. of farms (Nh)    Stratum variance (Shy²)    Cost per unit (Ch)
Small         900                  4                          25
Medium        1500                 25                         64
Large         600                  9                          100

Solution: Given n = 100 and N = 3000,

Categories    Nh      Wh = Nh/N           Shy²    Shy    Ch     √Ch    Wh Shy/√Ch    Sample size (nh)
Small         900     900/3000 = 0.3      4       2      25     5      0.12          25
Medium        1500    1500/3000 = 0.5     25      5      64     8      0.31          63
Large         600     600/3000 = 0.2      9       3      100    10     0.06          12
Total         3000                                                     0.49          100

The sample size of each stratum under optimum allocation is
$$n_h = \frac{n \, W_h S_{hy} / \sqrt{C_h}}{\sum_{h=1}^{K} W_h S_{hy} / \sqrt{C_h}}, \qquad \sum_{h=1}^{K} \frac{W_h S_{hy}}{\sqrt{C_h}} = 0.12 + 0.31 + 0.06 = 0.49$$
$$n_h \text{ for small category} = \frac{100 \times 0.12}{0.49} = 24.49, \text{ taken as } 25 \text{ so that the sizes add up to } n = 100$$
$$n_h \text{ for medium category} = \frac{100 \times 0.31}{0.49} \approx 63$$
$$n_h \text{ for large category} = \frac{100 \times 0.06}{0.49} \approx 12$$
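The optimum allocation of Example-2 can be sketched in Python as below. The function name `optimum_allocation` is an assumption introduced here. The sketch keeps the unrounded value 0.3125 for the medium stratum, so its fractional sample sizes differ slightly from the hand calculation, which rounds that term to 0.31 before dividing.

```python
from math import sqrt


def optimum_allocation(N_h, S_h, C_h, n):
    """Fractional stratum sample sizes n_h proportional to W_h * S_h / sqrt(C_h)."""
    N = sum(N_h)
    terms = [(Nh / N) * Sh / sqrt(Ch) for Nh, Sh, Ch in zip(N_h, S_h, C_h)]
    total = sum(terms)
    return [n * t / total for t in terms]


# Small, medium and large farms: sizes, standard deviations, unit costs
alloc = optimum_allocation([900, 1500, 600], [2, 5, 3], [25, 64, 100], 100)
print([round(a, 2) for a in alloc])
```

Setting all unit costs equal reduces the rule to allocation proportional to W_h S_h alone; the fractional sizes must still be rounded to whole numbers that add up to n.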
(iv) Neyman allocation (J. Neyman, 1937): The allocation of the sample among
different strata based on the joint consideration of the standard deviation of the
stratum and the stratum size is known as Neyman allocation. This method of
allocation is considered to be more efficient than proportional allocation,
particularly when the stratum standard deviations vary considerably among
different strata. Neyman allocation is used when the sample size is fixed and
the sampling cost per unit is the same in all strata. Sample sizes are determined by:
$$n_h = \frac{n W_h S_{hy}}{\sum_{h=1}^{K} W_h S_{hy}}$$
If we put the above value of $n_h$ in $V(\bar{y}_{st})$, the variance of the estimator $\bar{y}_{st}$ under
Neyman allocation is given as
$$V(\bar{y}_{st})_{Ney} = \frac{1}{n} \left( \sum_{h=1}^{K} W_h S_{hy} \right)^2$$
Here f is considered negligible.

If we compare the variances of the estimators of the population mean under simple
random sampling, proportional allocation and Neyman allocation, respectively, we have
the following result:
$$V(\bar{y}) \ge V(\bar{y}_{st})_p \ge V(\bar{y}_{st})_{Ney}$$
Example-3: An assignment was given to four students attending a sample survey course.
The problem was to estimate the average time per week devoted to study in the Punjab
Agricultural University (PAU) Library by the students of this University. The University is
running undergraduate, master's degree and doctoral programmes. The numbers of students
registered for the three programmes are 1300, 450 and 250, respectively. Since the value of
the study variable is likely to differ considerably with the programme, the investigator
divided the population of students into three strata: undergraduate programme (stratum I),
master's programme (stratum II) and doctoral programme (stratum III). The first of the four
students selected WOR simple random samples of sizes 20, 10 and 12 students from
strata I, II and III, so that the total sample is of size 42. The information about weekly
time devoted in the library is given below:
Time (in hours) devoted to study in the university library during a week
Stratum I (n1 = 20):    0, 4, 3, 5, 2, 0, 3, 1, 4, 3, 6, 8, 10, 2, 9, 4, 6, 1, 2, 3
Stratum II (n2 = 10):   12, 9, 11, 13, 8, 6, 10, 9, 11, 7
Stratum III (n3 = 12):  10, 14, 20, 11, 16, 13, 24, 15, 14, 18, 19, 20

Estimate the average time per week devoted to study by a student in PAU library.
Also, build up the confidence interval for this average.
Solution: Calculated values of strata weights, sample means and sample mean squares

              Stratum I        Stratum II       Stratum III
nh            n1 = 20          n2 = 10          n3 = 12
Nh            N1 = 1300        N2 = 450         N3 = 250
Wh            W1 = 0.650       W2 = 0.225       W3 = 0.125
ȳh            ȳ1 = 3.800       ȳ2 = 9.600       ȳ3 = 16.167
sh²           s1² = 7.958      s2² = 4.933      s3² = 17.049

The stratum weights Wh, sample means $\bar{y}_h$ and sample mean squares $s_h^2$ are given in the
table above. For the calculation of $\bar{y}_h$ and $s_h^2$ one proceeds within each stratum in the
same way as for $\bar{y}$ and $s^2$ in simple random sampling. Now the estimate of the average
time per week devoted to study by a student in the university library is
$$\bar{y}_{st} = \frac{1}{N} \left( N_1 \bar{y}_1 + N_2 \bar{y}_2 + N_3 \bar{y}_3 \right) = \frac{1}{2000} \left[ 1300 (3.800) + 450 (9.600) + 250 (16.167) \right] = 6.651$$
Also the estimate of variance is computed as:
$$\hat{V}(\bar{y}_{st}) = \frac{W_1^2 (N_1 - n_1) s_1^2}{N_1 n_1} + \frac{W_2^2 (N_2 - n_2) s_2^2}{N_2 n_2} + \frac{W_3^2 (N_3 - n_3) s_3^2}{N_3 n_3}$$
$$= \frac{(0.650)^2 (1300 - 20)(7.958)}{(1300)(20)} + \frac{(0.225)^2 (450 - 10)(4.933)}{(450)(10)} + \frac{(0.125)^2 (250 - 12)(17.049)}{(250)(12)}$$
$$= 0.16553 + 0.02442 + 0.02113 = 0.21108$$
Also we obtain the limits of the confidence interval as:
$$\bar{y}_{st} \pm 2 \sqrt{\hat{V}(\bar{y}_{st})} = 6.651 \pm 2 \sqrt{0.21108} = 5.732, \; 7.570$$
Thus, the average time per week devoted to study by a student in PAU Library
falls in the closed interval [5.732, 7.570] hours, with probability approximately equal to
0.95.
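The stratified estimate and its estimated variance can be reproduced from the summary statistics of this example. The sketch below is illustrative only; the function name `stratified_mean` is an assumption, and the multiplier 2 is the approximate 95% value used in the worked solution.

```python
from math import sqrt


def stratified_mean(N_h, n_h, ybar_h, s2_h):
    """Stratified estimate of the mean and its estimated variance (SRSWOR in each stratum)."""
    N = sum(N_h)
    W = [Nh / N for Nh in N_h]                       # stratum weights
    mean = sum(w * yb for w, yb in zip(W, ybar_h))
    var = sum(w ** 2 * (1 - nh / Nh) / nh * s2
              for w, nh, Nh, s2 in zip(W, n_h, N_h, s2_h))
    return mean, var


mean, var = stratified_mean(
    N_h=[1300, 450, 250],
    n_h=[20, 10, 12],
    ybar_h=[3.800, 9.600, 16.167],
    s2_h=[7.958, 4.933, 17.049],
)
half_width = 2 * sqrt(var)
print(round(mean, 3), round(var, 5), round(mean - half_width, 3), round(mean + half_width, 3))
```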
Example-4: The following data show the daily temperature (in °F) in Chandigarh and
New Delhi:
S. No.    City    Temperature (°F)
1 New Delhi 48
2 Chandigarh 54
3 New Delhi 52
4 New Delhi 47
5 Chandigarh 57
6 New Delhi 54
7 New Delhi 49
8 Chandigarh 59
9 New Delhi 53
10 New Delhi 50
11 New Delhi 52
12 New Delhi 57
13 Chandigarh 55
14 New Delhi 54
15 Chandigarh 68
16 New Delhi 49
17 New Delhi 51
18 Chandigarh 61
19 New Delhi 55
20 New Delhi 53
21 Chandigarh 50

(a) Select an SRSWOR sample of 4 units. Find the variance of the estimator of the
population mean.
(b) Stratify the population on the basis of location (city), and then select two
units from each city. Find the variance of the estimator of the mean in stratified
sampling.
(c) Find the relative efficiency of stratified sampling over simple random
sampling.
(d) Select an SRSWOR sample of four units from the above population and construct
a 95% confidence interval estimate of the population mean.
(e) Select two units using SRSWOR sampling from Chandigarh and two units from
New Delhi using stratified random sampling and construct a 95% confidence
interval estimate of the population mean.
Solution:

S. No. City Yi Yi²
1 New Delhi 48 2304
2 Chandigarh 54 2916
3 New Delhi 52 2704
4 New Delhi 47 2209
5 Chandigarh 57 3249
6 New Delhi 54 2916
7 New Delhi 49 2401
8 Chandigarh 59 3481
9 New Delhi 53 2809
10 New Delhi 50 2500
11 New Delhi 52 2704
12 New Delhi 57 3249
13 Chandigarh 55 3025
14 New Delhi 54 2916
15 Chandigarh 68 4624
16 New Delhi 49 2401
17 New Delhi 51 2601
18 Chandigarh 61 3721
19 New Delhi 55 3025
20 New Delhi 53 2809
21 Chandigarh 50 2500
Total 1128 61064

(a) SRSWOR: The population mean square is
$$S_y^2 = \frac{1}{N - 1} \left[ \sum_{i=1}^{N} Y_i^2 - \frac{\left( \sum_{i=1}^{N} Y_i \right)^2}{N} \right] = \frac{1}{21 - 1} \left[ 61064 - \frac{(1128)^2}{21} \right] = 23.71$$
Thus the variance of the sample mean estimator under SRSWOR is
$$V_{srswor}(\bar{y}) = \left( \frac{1 - f}{n} \right) S_y^2 = \frac{(1 - 4/21)}{4} \times 23.71 = 4.798$$
(b) Stratified Sampling: Stratify the population into two strata based on locations as
follows:

Stratum I (Chandigarh)        Stratum II (New Delhi)
Y1i       Y1i²                Y2i       Y2i²
54        2916                48        2304
57        3249                52        2704
59        3481                47        2209
55        3025                54        2916
68        4624                49        2401
61        3721                53        2809
50        2500                50        2500
                              52        2704
                              57        3249
                              54        2916
                              49        2401
                              51        2601
                              55        3025
                              53        2809
Total: 404   23516            Total: 724   37548

From stratum I, we have
$$N_1 = 7, \quad W_1 = \frac{N_1}{N} = \frac{7}{21} = 0.333, \quad f_1 = \frac{n_1}{N_1} = \frac{2}{7}$$
$$\sum_{i=1}^{N_1} Y_{1i} = 404, \quad \sum_{i=1}^{N_1} Y_{1i}^2 = 23516$$
$$S_{1y}^2 = \frac{1}{N_1 - 1} \left[ \sum_{i=1}^{N_1} Y_{1i}^2 - \frac{\left( \sum_{i=1}^{N_1} Y_{1i} \right)^2}{N_1} \right] = \frac{1}{7 - 1} \left[ 23516 - \frac{(404)^2}{7} \right] = 33.24$$
From stratum II, we have
$$N_2 = 14, \quad W_2 = \frac{N_2}{N} = \frac{14}{21} = 0.667, \quad f_2 = \frac{n_2}{N_2} = \frac{2}{14}$$
$$\sum_{i=1}^{N_2} Y_{2i} = 724, \quad \sum_{i=1}^{N_2} Y_{2i}^2 = 37548$$
$$S_{2y}^2 = \frac{1}{N_2 - 1} \left[ \sum_{i=1}^{N_2} Y_{2i}^2 - \frac{\left( \sum_{i=1}^{N_2} Y_{2i} \right)^2}{N_2} \right] = \frac{1}{14 - 1} \left[ 37548 - \frac{(724)^2}{14} \right] = 8.22$$
Thus the variance of the estimator of the population mean in stratified random
sampling is given by:
$$V(\bar{y}_{st}) = \sum_{h=1}^{2} W_h^2 \left( \frac{1 - f_h}{n_h} \right) S_{hy}^2 = W_1^2 \left( \frac{1 - f_1}{n_1} \right) S_{1y}^2 + W_2^2 \left( \frac{1 - f_2}{n_2} \right) S_{2y}^2$$
$$= \left( \frac{7}{21} \right)^2 \left( \frac{1 - 2/7}{2} \right) \times 33.24 + \left( \frac{14}{21} \right)^2 \left( \frac{1 - 2/14}{2} \right) \times 8.22$$
$$= 1.32 + 1.57 = 2.89$$
(c) Relative efficiency: The per cent relative efficiency of stratified random
sampling over SRSWOR is given by:
$$RE = \frac{V_{srswor}(\bar{y})}{V(\bar{y}_{st})} \times 100 = \frac{4.798}{2.89} \times 100 = 166.0$$
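Parts (a) to (c) can be recomputed directly from the 21 temperatures listed in the solution table. The sketch below is a check under stated assumptions: the helper name `pop_mean_square` is introduced here for illustration, and the values are carried unrounded, so the last digits can differ from a hand calculation that rounds intermediate results.

```python
chandigarh = [54, 57, 59, 55, 68, 61, 50]
new_delhi = [48, 52, 47, 54, 49, 53, 50, 52, 57, 54, 49, 51, 55, 53]


def pop_mean_square(values):
    """Population mean square S^2 with divisor (N - 1)."""
    N = len(values)
    return (sum(v * v for v in values) - sum(values) ** 2 / N) / (N - 1)


N, n = 21, 4                      # population size, SRSWOR sample size
S2 = pop_mean_square(chandigarh + new_delhi)
v_srs = (1 - n / N) / n * S2      # variance of the mean under SRSWOR

n_h = 2                           # two units drawn from each stratum
v_st = sum((len(s) / N) ** 2 * (1 - n_h / len(s)) / n_h * pop_mean_square(s)
           for s in (chandigarh, new_delhi))

re_percent = 100 * v_srs / v_st   # relative efficiency of stratification
print(round(S2, 2), round(v_srs, 3), round(v_st, 3), round(re_percent, 1))
```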
(d) 95% CI estimate using SRSWOR sampling: Use the pseudo random number
table to select 4 distinct random numbers between 01 and N = 21 as follows:

Random No.    City selected    Temperature (°F) yi    yi²
01            New Delhi        48                     2304
04            New Delhi        47                     2209
05            Chandigarh       57                     3249
03            New Delhi        52                     2704
Total:                         204                    10466

Thus the sample mean estimate of the population mean based on SRSWOR is:
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{204}{4} = 51$$
The sample variance is given by:
$$s_y^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} y_i^2 - \frac{\left( \sum_{i=1}^{n} y_i \right)^2}{n} \right] = \frac{10466 - \frac{(204)^2}{4}}{4 - 1} = 20.66$$
An estimator of the variance of the sample mean is given by:
$$\hat{V}_{srswor}(\bar{y}) = \left( \frac{1 - f}{n} \right) s_y^2 = \left( \frac{1 - \frac{4}{21}}{4} \right) \times 20.66 = 4.182$$
A (1 − α) 100% confidence interval estimate of the population mean is given by:
$$\bar{y} \pm t_{\alpha/2}(d.f. = n - 1) \sqrt{\hat{V}_{srswor}(\bar{y})}$$
Thus a 95% confidence interval estimate of the population mean is given by:
$$51 \pm t_{0.025}(d.f. = 4 - 1) \sqrt{4.182} = 51 \pm 3.182 \sqrt{4.182}$$
or [44.49, 57.51].

The true population mean $\bar{Y}$ = 53.71 lies in this 95% confidence interval.
The interpretation of this interval is that we are 95% sure that the true population mean
lies between 44.49°F and 57.51°F.
(e) 95% confidence estimate using stratified random sampling: Using the lottery
method, we select two units from stratum I and two units from stratum II:

Stratum I (Chandigarh)        Stratum II (New Delhi)
y1i       y1i²                y2i       y2i²
57        3249                52        2704
55        3025                57        3249
Total: 112   6274             Total: 109   5953

From sample stratum I:
$$\sum_{i=1}^{n_1} y_{1i} = 112, \quad \bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i} = \frac{112}{2} = 56, \quad \sum_{i=1}^{n_1} y_{1i}^2 = 6274$$
and
$$s_{1y}^2 = \frac{1}{n_1 - 1} \left[ \sum_{i=1}^{n_1} y_{1i}^2 - \frac{\left( \sum_{i=1}^{n_1} y_{1i} \right)^2}{n_1} \right] = \frac{1}{2 - 1} \left[ 6274 - \frac{(112)^2}{2} \right] = 2.0$$
From sample stratum II:
$$\sum_{i=1}^{n_2} y_{2i} = 109, \quad \bar{y}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i} = \frac{109}{2} = 54.5, \quad \sum_{i=1}^{n_2} y_{2i}^2 = 5953$$
and
$$s_{2y}^2 = \frac{1}{n_2 - 1} \left[ \sum_{i=1}^{n_2} y_{2i}^2 - \frac{\left( \sum_{i=1}^{n_2} y_{2i} \right)^2}{n_2} \right] = \frac{1}{2 - 1} \left[ 5953 - \frac{(109)^2}{2} \right] = 12.5$$
Thus an estimator of the variance of the estimator of the population mean in
stratified random sampling is:
$$\hat{V}(\bar{y}_{st}) = \sum_{h=1}^{2} W_h^2 \left( \frac{1 - f_h}{n_h} \right) s_{hy}^2 = W_1^2 \left( \frac{1 - f_1}{n_1} \right) s_{1y}^2 + W_2^2 \left( \frac{1 - f_2}{n_2} \right) s_{2y}^2$$
$$= \left( \frac{7}{21} \right)^2 \left( \frac{1 - 2/7}{2} \right) \times 2.0 + \left( \frac{14}{21} \right)^2 \left( \frac{1 - 2/14}{2} \right) \times 12.5$$
$$= 0.079 + 2.381 = 2.460$$
Also the stratified estimate of the population mean is given by:
$$\bar{y}_{st} = \sum_{h=1}^{2} W_h \bar{y}_h = W_1 \bar{y}_1 + W_2 \bar{y}_2 = \left( \frac{7}{21} \times 56 \right) + \left( \frac{14}{21} \times 54.5 \right) = 55$$
Now a (1 − α) 100% confidence interval estimate of the population mean $\bar{Y}$ using
stratified random sampling, with L = 2 strata, is given by:
$$\bar{y}_{st} \pm t_{\alpha/2}(d.f. = n - L) \sqrt{\hat{V}(\bar{y}_{st})}$$
or
$$\bar{y}_{st} \pm t_{0.025}(d.f. = 4 - 2) \sqrt{\hat{V}(\bar{y}_{st})}$$
$$55 \pm 4.303 \sqrt{2.460} = 55 \pm 4.303 \times 1.5685$$
or [48.251, 61.749].

The true population mean $\bar{Y}$ = 53.71 lies in this 95% confidence interval.
The interpretation of this confidence interval estimate is that we are 95% sure that
the true population mean lies between 48.251°F and 61.749°F.
UNIT-V

CLUSTER SAMPLING

In survey sampling the basic assumption is that the population consists of a finite
number of distinct and identifiable units. A group of such units is called a cluster. If,
instead of randomly selecting a unit for sample, a group of units is selected as a single
unit in the sample, it is called cluster sampling. If the entire area containing the
population under study is divided into smaller segments, and if each unit of the
population belongs to only one segment, the procedure is called area sampling or non-
overlapping cluster sampling. If one or a few units appear in more than one segment or
cluster, then such a procedure is called overlapping cluster sampling. The main purpose
of cluster sampling is to divide the population into small groups with each group serving
as a sample unit. Clusters are generally made up of neighbouring elements; therefore the
elements within a cluster tend to be homogeneous. However, at some stage in the
research we become interested in heterogeneous clusters rather than homogeneous. More
broadly, the concept of forming strata was to form homogeneous groups, whereas the
concept of forming clusters will be to form groups of a heterogeneous nature. After
dividing the population into clusters the sample of clusters can be selected with either
equal or unequal probability. The concept of unequal probability may be based on the
size of the cluster, that is, the larger the cluster, the larger the probability of its being
selected in the sample. All the units in the selected cluster will be enumerated. As a
simple rule the number of units in a cluster should be small and the number of clusters
should be large. The main advantage of cluster sampling is that it is cheaper, since the
collection of data for neighbouring units is easier and faster. It is also useful when the
frame for selecting the sample is not available at the unit level. For example, a list of
persons may not be available, whereas a list at a dwelling level may be available. For a
given sample size cluster sampling is less efficient than simple random sampling.
However, in most situations the loss in efficiency can be balanced by the reduction in
cost. Any sampling procedure, for example simple random sampling, stratified sampling,
or systematic sampling, may be applied to cluster sampling by using the clusters
themselves as sampling units. The efficiency of cluster sampling decreases as the
size of the cluster increases.
Well-known examples of clusters and their units are given below:

Cluster    School      Dwelling    Employer     Bus           Day
Units      Students    Persons     Employees    Passengers    Hours
Theorem 1: The sample mean per unit
$$\bar{\bar{y}}_c = \frac{1}{nM} \sum_{i=1}^{n} \sum_{j=1}^{M} y_{ij}$$
is an unbiased estimate of the population mean $\bar{\bar{Y}}$.

Theorem 2: The variance of the estimator $\bar{\bar{y}}_c$ is given by
$$V(\bar{\bar{y}}_c) = \left( \frac{1}{n} - \frac{1}{N} \right) S_b^2$$
Corollary: An unbiased estimate of the variance of the estimator $\bar{\bar{y}}_c$ is given by
$$\hat{V}(\bar{\bar{y}}_c) = \left( \frac{1}{n} - \frac{1}{N} \right) s_b^2$$
where $s_b^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}}_c)^2$.

Note: The (1−α) 100% confidence interval estimate of the population mean $\bar{\bar{Y}}$ will be
given by
$$\bar{\bar{y}}_c \pm t_{\alpha/2}(df = n(M - 1)) \sqrt{\hat{V}(\bar{\bar{y}}_c)}$$
Theorem 3: The relative efficiency of cluster sampling with respect to SRS is given by
$$RE = \frac{S^2}{M S_b^2}$$
Theorem 4: The relative efficiency in terms of the intraclass correlation coefficient ρ is
given by
$$RE = [1 + (M - 1) \rho]^{-1}$$
Important Note:
 Cluster sampling is more efficient than SRS if the intraclass correlation coefficient
ρ < 0 and M > 1. If ρ = 0, then cluster sampling and SRS are equally efficient.
 In practice, units which are near one another are more similar than units which are
far apart; therefore ρ is positive and hence, in general, the efficiency of cluster
sampling is less than that of SRS.
Example-1: In a developing country, a certain company has 25 centres located at
different places in a state. Each centre has been provided with four telephones. A student
attending a sample survey course was given an assignment to estimate the average
number of calls per telephone made on a typical day for this company. The student did
not have the telephone facility, and was also short of funds. Because of this, he selected
five centres using SRS without replacement. The numbers of calls made on a typical
working day from each telephone of the sample centres were recorded personally. The
data so obtained are summarized in the table below.

Number of calls made from selected centres

Centre    Mi    Calls made        Total (yi)    Mean (ȳi)
1         4     26 34 27 25       112           28
2         4     44 33 28 31       136           34
3         4     18 33 25 28       104           26
4         4     37 21 22 40       120           30
5         4     23 34 42 29       128           32

Estimate the average number of daily calls per telephone made from all the 25
centres by using the cluster sample estimator. Also, estimate the relative efficiency of the
estimator used with respect to the usual simple mean estimator, from the sample selected above.
Solution:
Here N = 25, n = 5 and Mi = M = 4. The sample cluster means are given in the
last column of the table above. The estimate of the average number of daily calls is
computed using the estimator $\bar{\bar{y}}_c$.
Thus for Mi = M,
$$\bar{\bar{y}}_c = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i = \frac{1}{5} (28 + 34 + 26 + 30 + 32) = 30$$
For Mi = M, the variance estimator $\hat{V}(\bar{\bar{y}}_c)$ becomes
$$\hat{V}(\bar{\bar{y}}_c) = \frac{N - n}{N n (n - 1)} \left[ \sum_{i=1}^{n} \bar{y}_i^2 - n \bar{\bar{y}}_c^2 \right]$$
On making substitutions, one gets:
$$\hat{V}(\bar{\bar{y}}_c) = \frac{25 - 5}{(25)(5)(4)} \left[ (28)^2 + (34)^2 + \cdots + (32)^2 - 5 (30)^2 \right] = 1.6$$
The variance estimator of the simple mean estimator $\bar{y}$ from the selected cluster
sample will be
$$\hat{V}(\bar{y}_{nM}) = \frac{N - n}{(NM - 1) n} \left[ \frac{1}{nM} \sum_{i=1}^{n} \sum_{j=1}^{M} y_{ij}^2 + \hat{V}(\bar{\bar{y}}_c) - \bar{\bar{y}}_c^2 \right]$$
We first compute the term involving the sum of squares of all the individual
observations.
Thus,
$$\sum_{i=1}^{n} \sum_{j=1}^{M} y_{ij}^2 = (26)^2 + (34)^2 + \cdots + (29)^2 = 18962$$
If an equivalent sample of nM elements were selected from the population of NM
elements by SRS, the variance of the mean per element would be:
$$\hat{V}(\bar{y}_{nM}) = \frac{25 - 5}{[(25)(4) - 1](5)} \left[ \frac{18962}{(5)(4)} + 1.6 - (30)^2 \right] = 2.0081$$
The estimate of per cent relative efficiency will be:
$$RE = \frac{2.0081}{1.6} (100) = 1.2550625 (100) = 125.50625 \approx 125.5$$
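The cluster-sampling calculations of Example-1 can be checked with a short Python sketch that works from the raw call counts. The variable names are chosen here for illustration only.

```python
# Calls recorded at the 4 telephones of each of the 5 sampled centres
calls = [
    [26, 34, 27, 25],
    [44, 33, 28, 31],
    [18, 33, 25, 28],
    [37, 21, 22, 40],
    [23, 34, 42, 29],
]
N, n, M = 25, 5, 4   # centres in population, centres sampled, telephones per centre

cluster_means = [sum(c) / M for c in calls]
y_c = sum(cluster_means) / n                       # mean per telephone

# Variance estimator of the cluster-sample mean (equal cluster sizes)
v_c = (N - n) / (N * n * (n - 1)) * (sum(m * m for m in cluster_means) - n * y_c ** 2)

# Estimated variance had nM telephones been drawn by SRS from all NM telephones
sum_sq = sum(y * y for c in calls for y in c)
v_srs = (N - n) / ((N * M - 1) * n) * (sum_sq / (n * M) + v_c - y_c ** 2)

re_percent = 100 * v_srs / v_c
print(y_c, round(v_c, 4), sum_sq, round(v_srs, 4), round(re_percent, 1))
```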
Example-2: Using the map of Haryana, construct clusters from the 20 districts of
Haryana. Each cluster contains an equal number of districts.
Solution: There are several possible ways to construct clusters based on their locations.
We prepare the clusters on the basis of distance travelled:

Cluster Number         Names of the Districts
1. Ambala Division     Kaithal, Ambala, Panchkula, Kurukshetra, Yamunanagar
2. Gurgaon Division    Faridabad, Mahendergarh, Gurgaon, Rewari
3. Hisar Division      Bhiwani, Fatehabad, Jind, Hisar, Sirsa
4. Karnal Division     Karnal, Jhajjar, Rohtak, Panipat and Sonepat

Example-3: Select two clusters from Example-2 above using SRSWOR sampling.
Record the values of the sex ratio of the selected districts. Estimate the average sex ratio
in Haryana using cluster sampling. Estimate the variance of the estimator used for
estimating the average sex ratio. Also construct a 95% confidence interval.
Solution:
Use the lottery method. We select two clusters, numbers 01 and 03. Thus the following
clusters, of 5 districts each, are included in the SRSWOR sample.

Cluster    Names of the districts                                    Values of the sex ratio (yij)    Total (yi)
01         Kaithal, Ambala, Panchkula, Kurukshetra, Yamuna Nagar     881, 885, 873, 888, 877          4404
03         Bhiwani, Fatehabad, Jind, Hisar, Sirsa                    902, 886, 871, 872, 897          4428
Total:                                                                                                8832
Here m = 5, n = 2 and N = 4.
Thus an estimate of the average sex ratio in Haryana is given by:
$$\bar{\bar{y}}_c = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} = \frac{1}{2 \times 5} \times 8832 = 883.2$$
Now

Cluster Number    ȳi       (ȳi − ȳc)²
01                880.8    5.76
03                885.6    5.76
Sum:                       11.52

Thus
$$s_b^2 = \frac{\sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}}_c)^2}{n - 1} = \frac{11.52}{2 - 1} = 11.52$$
Thus an estimate of $V(\bar{\bar{y}}_c)$ is
$$\hat{V}(\bar{\bar{y}}_c) = \left( \frac{1}{n} - \frac{1}{N} \right) s_b^2 = \left( \frac{1}{2} - \frac{1}{4} \right) \times 11.52 = \left( \frac{4 - 2}{2 \times 4} \right) \times 11.52 = 2.88$$
The (1−α) 100% confidence interval of the population mean $\bar{\bar{Y}}$ is given by:
$$\bar{\bar{y}}_c \pm t_{\alpha/2}(d.f. = n(m - 1)) \sqrt{\hat{V}(\bar{\bar{y}}_c)}$$
Thus the 95% confidence interval for the average sex ratio in Haryana is given by:
$$\bar{\bar{y}}_c \pm t_{0.025}(d.f. = 2 \times 4) \sqrt{\hat{V}(\bar{\bar{y}}_c)} = 883.2 \pm 2.306 \sqrt{2.88}$$
$$= (879.29, \; 887.11)$$
UNIT-VI

MULTI-STAGE SAMPLING

In this type of sampling, we first select the clusters and then select only some
elements from each selected cluster. This is known as sub-sampling or two stage sampling.
The clusters which form the units of sampling at the first stage are called the first stage
units (fsu) or primary sampling units (psu), and the elements within clusters are called
second stage units (ssu). This procedure can be generalized to three or more stages and is
then termed multi-stage sampling.
For example, in crop surveys for estimating yield of a crop in a district, a block
may be considered a primary sampling unit, the villages the second stage unit, the crop
fields the third stage units and plot of fixed sizes the ultimate unit of sampling.
Multi-stage sampling has been found to be very useful in practice and this
procedure is being commonly used in large scale surveys.
Advantage
The main advantage of this sampling procedure is that, at the first stage, the frame
of psu’s is required, which can be prepared easily. At the second stage, the frame of ssu’s
is required only for the selected psu’s and so on.
Two stage sampling (Equal first stage unit)
We consider the case of equal clusters and assume that the population is composed
of NM elements grouped into N first stage units of M second stage units each. Let n
denote the number of first stage units in the sample and m the number of second stage
units selected from each sampled first stage unit. Further, we suppose that the units at
each stage are selected with equal probability.
Now let
$Y_{ij}$ = the value of the jth second stage unit in the ith first stage unit
(i = 1, 2, ..., N; j = 1, 2, ..., M)

$\bar{Y}_i$ = the mean per second stage unit in the ith first stage unit of the population
(i = 1, 2, ..., N)

$$\bar{Y}_i = \frac{1}{M}\sum_{j=1}^{M} Y_{ij}$$

$$\bar{Y} = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} Y_{ij} = \frac{1}{N}\sum_{i=1}^{N}\bar{Y}_i = \text{population mean}$$

and let
$y_{ij}$ = the value of the jth ssu in the ith psu of the sample (i = 1, 2, ..., n; j = 1, 2, ..., m)

$$\bar{y}_i = \frac{1}{m}\sum_{j=1}^{m} y_{ij} = \text{mean per second stage unit of the ith first stage unit in the sample}$$

and

$$\bar{y}_{ts} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{ij} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i = \text{mean per second stage unit in the sample}$$

1 n
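Theorem-1: An unbiased estimator of the population mean is given by

$$\bar{y}_{ts} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i$$

Theorem-II: The estimated variance of $\bar{y}_{ts}$ is given by:

$$\hat{V}(\bar{y}_{ts}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_b^2 + \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M}\right) s_w^2$$

where

$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{y}_{ts})^2$$

and

$$s_w^2 = \frac{1}{n(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2$$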

Example-1: At an experimental station, there were 100 fields sown with wheat. Each
field was divided into 16 plots of equal size (1/16th hectare). Out of 100 fields, 10 were
selected by simple random sampling WOR. From each selected field 4 plots were chosen
by simple random sampling WOR. The yields in kg/plot are given below:
Selected field   Plot 1   Plot 2   Plot 3   Plot 4
1                4.32     4.84     3.96     4.04
2                4.16     4.36     3.50     5.00
3                3.06     4.24     4.76     3.12
4                4.00     4.84     4.32     3.72
5                4.12     4.68     3.46     4.02
6                4.08     3.96     3.42     3.08
7                5.16     4.24     4.96     3.84
8                4.40     4.72     4.04     3.98
9                4.20     4.66     3.64     5.00
10               4.28     4.36     3.00     3.52
Estimate the wheat yield per hectare for the experimental station along with its
standard error.
Solution: We have N = 100, M = 16, n = 10, m = 4
Calculations have been made as shown below:
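Columns: (1) Sr. No.; (2) $\sum_{j=1}^{m} y_{ij}$; (3) $\bar{y}_i$ = (2)/4; (4) $(\bar{y}_i - \bar{y}_{ts})^2$; (5) $\sum_{j=1}^{m} y_{ij}^2$; (6) $\bar{y}_i^2$; (7) $\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2$

(1)      (2)      (3)      (4)      (5)       (6)      (7)
1.       17.16    4.290    0.0267   74.091    18.404   0.475
2.       17.02    4.255    0.0165   73.565    18.105   1.145
3.       15.18    3.795    0.1099   59.733    14.402   2.125
4.       16.88    4.220    0.0087   71.926    17.808   0.693
5.       16.28    4.070    0.0032   67.009    16.565   0.749
6.       14.54    3.635    0.2416   53.511    13.213   0.658
7.       18.20    4.550    0.1794   83.950    20.703   1.140
8.       17.14    4.285    0.0251   73.800    18.361   0.356
9.       17.50    4.375    0.0618   77.605    19.141   1.043
10.      15.16    3.790    0.1132   58.718    14.364   1.262
Total:            41.265   0.7861   693.908            9.646

An estimate of the average wheat yield per plot is given as:

$$\bar{y}_{ts} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i = \frac{41.265}{10} = 4.1265$$

The estimate of the variance of $\bar{y}_{ts}$ is:

$$\hat{V}(\bar{y}_{ts}) = \left(\frac{1}{n} - \frac{1}{N}\right) s_b^2 + \frac{1}{N}\left(\frac{1}{m} - \frac{1}{M}\right) s_w^2$$

Calculating these values, we get:

$$s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{y}_{ts})^2 = \frac{0.7861}{9} = 0.0873$$

and

$$s_w^2 = \frac{1}{n(m-1)}\sum_{i=1}^{n}\sum_{j=1}^{m}(y_{ij} - \bar{y}_i)^2 = \frac{9.646}{30} = 0.3215$$

So

$$\hat{V}(\bar{y}_{ts}) = \left(\frac{1}{10} - \frac{1}{100}\right) \times 0.0873 + \frac{1}{100}\left(\frac{1}{4} - \frac{1}{16}\right) \times 0.3215 = 0.00786 + 0.00060 = 0.00846$$

and the standard error of $\bar{y}_{ts}$ is $\sqrt{\hat{V}(\bar{y}_{ts})} = \sqrt{0.00846} = 0.0920$ kg per plot.

Since each plot is 1/16th of a hectare, the estimated wheat yield per hectare is 16 x 4.1265 = 66.02 kg, with standard error 16 x 0.0920 = 1.47 kg.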
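The calculation in Example-1 can be reproduced directly from the plot yields. The following is a minimal sketch of the two-stage formulas of Theorem-1 and Theorem-II; the function name `two_stage_estimate` and its argument names are illustrative assumptions, not part of the original notes.

```python
from math import sqrt

def two_stage_estimate(sample, N, M):
    """Two-stage estimate of the mean per second stage unit (equal psu sizes).

    sample: list of n lists, each holding the m observed ssu values of one psu.
    N, M:   number of psu's in the population and number of ssu's per psu.
    (Hypothetical helper written for this example.)
    """
    n = len(sample)
    m = len(sample[0])
    psu_means = [sum(row) / m for row in sample]
    y_ts = sum(psu_means) / n
    # Between-psu mean square s_b^2
    s_b2 = sum((yb - y_ts) ** 2 for yb in psu_means) / (n - 1)
    # Within-psu mean square s_w^2
    s_w2 = sum((y - yb) ** 2
               for row, yb in zip(sample, psu_means)
               for y in row) / (n * (m - 1))
    var = (1 / n - 1 / N) * s_b2 + (1 / N) * (1 / m - 1 / M) * s_w2
    return y_ts, s_b2, s_w2, var, sqrt(var)

# Yields in kg/plot for the 10 selected fields (4 plots each) of Example-1
yields = [
    [4.32, 4.84, 3.96, 4.04],
    [4.16, 4.36, 3.50, 5.00],
    [3.06, 4.24, 4.76, 3.12],
    [4.00, 4.84, 4.32, 3.72],
    [4.12, 4.68, 3.46, 4.02],
    [4.08, 3.96, 3.42, 3.08],
    [5.16, 4.24, 4.96, 3.84],
    [4.40, 4.72, 4.04, 3.98],
    [4.20, 4.66, 3.64, 5.00],
    [4.28, 4.36, 3.00, 3.52],
]

y_ts, s_b2, s_w2, var, se = two_stage_estimate(yields, N=100, M=16)
print(f"mean per plot    = {y_ts:.4f} kg, SE = {se:.4f} kg")
# Each plot is 1/16 hectare, so scale by 16 for a per-hectare figure
print(f"mean per hectare = {16 * y_ts:.2f} kg, SE = {16 * se:.2f} kg")
```

Working from the raw yields rather than from a hand-built table avoids transcription slips in the intermediate columns.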

UNIT-VII

SYSTEMATIC SAMPLING

In systematic sampling only the first unit is selected at random, while the rest of the
units are selected automatically according to a pre-determined pattern.
There are two possibilities:
(i) N = nk, where k is an integer; (ii) N ≠ nk.
Let us discuss each of these cases in detail.
Case-I: N = nk. The N units in the population can be arranged in n rows, as shown in
the table below, which may be called a sequential list of population units. Such a list
can be prepared only if the population contains a finite number of units.
Sequential list of the population units

Row 1    1          2          3          ...   r          ...   k
Row 2    k+1        k+2        k+3        ...   k+r        ...   2k
Row 3    2k+1       2k+2       2k+3       ...   2k+r       ...   3k
...      ...        ...        ...        ...   ...        ...   ...
Row n    (n-1)k+1   (n-1)k+2   (n-1)k+3   ...   (n-1)k+r   ...   nk = N
Mean     $\bar{y}_1$   $\bar{y}_2$   $\bar{y}_3$   ...   $\bar{y}_r$   ...   $\bar{y}_k$

The first step is to select a random number from 1 to k, that is, from the range of
integers listed in the first row. Suppose the selected random number is 2. Then the first
unit selected in the sample is number 2 in the sequential list. After this first unit has
been selected, every kth unit following it is automatically included in the sample. Thus
the units in the sample of size n are at the serial numbers 2, k+2, 2k+2, ..., (n-1)k+2.
The random number selected from 1 to k is called the random start. The number k is called
the sampling interval. Corresponding to each random number from 1 to k, there is only one
possible sample of size n. Thus in systematic sampling the total number of possible samples
is k. If r denotes the random start, then the systematic sample consists of the units at the
serial numbers given by the sequence {r + ik, i = 0, 1, 2, ..., (n-1)}.
For example, suppose we have a population of 30 units and wish to select a sample of 6
units. Since 30 = 6 × 5, the sampling interval is k = 5. If the random start is 2, the
sample consists of the units numbered 2, 7, 12, 17, 22 and 27. The five possible random
starts together exhaust all the units of the population. The sequential list of population
units can be made either by using the known magnitude of auxiliary information or by just
numbering the list of population units.
If N = nk, the sample mean is an unbiased estimator of the population mean.
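The selection rule for Case-I can be sketched in a few lines of code. This is only an illustration under the assumption N = nk; the function name `linear_systematic_sample` and its `start` argument are hypothetical.

```python
import random

def linear_systematic_sample(N, n, start=None):
    """Serial numbers of a linear systematic sample when N = n*k.

    start: the random start r (1..k); drawn at random if not supplied.
    (Hypothetical helper written for this example.)
    """
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n                      # sampling interval
    r = start if start is not None else random.randint(1, k)
    if not 1 <= r <= k:
        raise ValueError("random start must lie between 1 and k")
    # Units r, r+k, r+2k, ..., r+(n-1)k
    return [r + i * k for i in range(n)]

sample = linear_systematic_sample(30, 6, start=2)
print(sample)

# The k possible samples are disjoint and together cover the population
all_units = sorted(u for r in range(1, 6)
                   for u in linear_systematic_sample(30, 6, start=r))
```

Here `all_units` collects the five possible samples for N = 30 and n = 6, which illustrates why only k distinct samples exist.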
Case II: N ≠ nk
Two difficulties arise:
(i) The sample size becomes a random variable, that is, the sample size is not fixed. This
difficulty is removed by a modified sampling scheme.
(ii) The sample mean is not an unbiased estimator of the population mean. This difficulty
is removed by Circular Systematic Sampling.
Circular Systematic Sampling: Murthy (1961), Sukhatme and Sukhatme (1970), and
Konijn (1973) have suggested using the circular systematic sampling (CSS) design in the
situations when N is not a multiple of n.
The main steps involved in selecting a sample using the CSS scheme are as follows:
(a) Select a random number r from 1 to N and call it the 'random start';
(b) Choose an integer k, equal to N/n or to N/n rounded to the nearest integer, and call it
the skip;
(c) Select all units in the sample with serial numbers
r + jk if r + jk ≤ N,
(r + jk − N) if r + jk > N; j = 0, 1, 2, ..., (n−1).
This can be illustrated with the following example.
Example-1: Suppose a population consists of 20 trees. We wish to select a sample of 3
trees, to study the quantity of fruit, using the CSS scheme with skip k = 6 (20/3 ≈ 6.67,
taken here as 6). If the random start is r = 6, the trees included in the sample are those
with serial numbers 6 + 0 x 6 = 6, 6 + 1 x 6 = 12 and 6 + 2 x 6 = 18. Similarly, if r = 16,
the trees in the sample are those with serial numbers 16 + 0 x 6 = 16, 16 + 1 x 6 − 20 = 2
and 16 + 2 x 6 − 20 = 8.
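Steps (a) to (c) of the CSS scheme can be sketched as follows. The function name `circular_systematic_sample` is an assumption made for illustration; the skip k and random start r are passed in explicitly so that the tree example can be reproduced.

```python
def circular_systematic_sample(N, n, k, r):
    """Serial numbers selected by circular systematic sampling.

    N: population size, n: sample size, k: skip, r: random start (1..N).
    (Hypothetical helper written for this example.)
    """
    units = []
    for j in range(n):
        s = r + j * k
        # Wrap around to the start of the list once the end is passed
        units.append(s if s <= N else s - N)
    return units

trees_a = circular_systematic_sample(N=20, n=3, k=6, r=6)
trees_b = circular_systematic_sample(N=20, n=3, k=6, r=16)
print(trees_a, trees_b)
```

Because the random start ranges over all N units, there are N possible samples rather than k.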

Advantages and Disadvantages
The advantages of systematic sampling are mainly the simplicity of selection,
operational convenience and the even spread of the sample over the population. Systematic
sampling provides more efficient estimates than simple random sampling for many
populations, such as populations with a linear trend, populations with autocorrelation and
populations with a negative sum of serial correlations.
In the case of periodicity, systematic sampling has to be used with considerable care.
If, for a periodic population, the sampling interval is an odd multiple of half the period
of the cycle, systematic sampling gives zero variance. At the other extreme, when the
sampling interval is an integral multiple of the period of the cycle, systematic sampling
is no better than selecting one unit at random.
There is, however, a serious disadvantage of systematic sampling: it is not possible to
obtain an unbiased estimate of the sampling variance from a single systematic sample.
Estimation Procedure
We consider now the problem of estimating the population mean under the two
situations, when N = nk and N ≠ nk. In the first situation, when N = nk and the selection
is by linear systematic sampling, the sample mean provides an unbiased estimate of the
population mean.
The variance of the sample mean is given by

$$V(\hat{\bar{Y}}_{sy}) = \frac{1}{k}\sum_{r=1}^{k}(\bar{y}_r - \bar{Y})^2$$

where $\bar{y}_r$ is the mean of the rth systematic sample.

An approximate and biased estimate of the variance is given by:

$$\hat{V}(\hat{\bar{Y}}_{sy}) = \frac{N-n}{2Nn(n-1)}\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2$$

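In the second case, when N ≠ nk and the selection of the sample is by linear
systematic sampling, an unbiased estimate of the mean $\bar{Y}$ is given by:

$$\hat{\bar{Y}}^{*}_{sy} = \frac{k}{N}\sum_{i} y_i$$

where the sum runs over the n or (n − 1) units that fall in the selected sample.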
Example-2: Out of 225 holdings, 45 holdings in the village were selected by systematic
sampling (with 5 as the sampling interval). The total arable land for the 45 selected
holdings is given below:

Sr. No.   Total arable land   Sr. No.   Total arable land   Sr. No.   Total arable land
1         60                  16        25                  31        30
2         50                  17        192                 32        70
3         14                  18        25                  33        30
4         10                  19        0                   34        35
5         1                   20        13                  35        0
6         0                   21        0                   36        30
7         0                   22        0                   37        0
8         0                   23        50                  38        0
9         0                   24        0                   39        10
10        0                   25        10                  40        0
11        150                 26        0                   41        20
12        150                 27        0                   42        70
13        100                 28        0                   43        16
14        22                  29        85                  44        15
15        0                   30        30                  45        35

Give an estimate of the total arable land in the village, and also an approximate
standard error of the estimate.

Solution: An estimate of the total arable land, $\hat{Y}$, is

$$\hat{Y} = \frac{N}{n}\sum_{i=1}^{n} y_i = \frac{225}{45} \times 1348.20 = 6741.00$$

The estimate of the variance of $\hat{Y}$ is given by:

$$\hat{V}(\hat{Y}) = \frac{N^2 (k-1)}{2nk(n-1)}\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2 = \frac{(225)^2 \times (5-1)}{2 \times 45 \times 5 \times (45-1)} \times 116845 = 1195005.68$$

Thus, the standard error of the estimate $\hat{Y}$ is given by:

$$S.E.(\hat{Y}) = \sqrt{\hat{V}(\hat{Y})} = \sqrt{1195005.68} = 1093.16$$
Example-3: Select a sample of 7 districts from a population consisting of following 21
districts of Haryana by using the systematic sampling scheme. Collect the information on
the sex ratio from the selected districts. Use an appropriate method for estimating the
variance of the estimator of population mean.

Sr. No. District Sex ratio Sr. No. District Sex ratio
1 Ambala 885 12 Mahendergarh 895
2 Bhiwani 886 13 Mewat 907
3 Faridabad 873 14 Palwal 880
4 Fatehabad 902 15 Panchkula 873
5 Gurgaon 854 16 Panipat 864
6 Hisar 872 17 Rewari 898
7 Jhajjar 862 18 Rohtak 867
8 Jind 871 19 Sirsa 897
9 Kaithal 881 20 Sonepat 856
10 Karnal 881 21 Yamunanagar 877
11 Kurukshetra 888

Solution: We have N = 21 and n = 7; therefore k = N/n = 21/7 = 3 and f = n/N = 7/21.

We use a random number table to select a random number between 1 and 3.
Suppose the random number observed is 2. Then the systematic sample consists of the
following 7 distinct units: 02, 05, 08, 11, 14, 17 and 20.

Unit No.   District      Sex ratio (y_i)   y_{i+1} - y_i   (y_{i+1} - y_i)^2
02         Bhiwani       886               -32             1024
05         Gurgaon       854               17              289
08         Jind          871               17              289
11         Kurukshetra   888               -8              64
14         Palwal        880               18              324
17         Rewari        898               -42             1764
20         Sonepat       856
Total                                                      3754

Thus an estimate of the variance of the estimator of the mean under systematic
sampling is:

$$\hat{V}(\hat{\bar{Y}}_{sy}) = \frac{(1-f)}{2n(n-1)}\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2 = \frac{\left(1 - \frac{n}{N}\right)}{2n(n-1)}\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2$$

$$= \frac{(1 - 7/21)}{2 \times 7 \times (7-1)} \times 3754 = \frac{0.6667}{84} \times 3754 = 29.79$$
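The successive-difference estimate used in Example-3 is easy to compute from the ordered sample values. The sketch below assumes the sample is supplied in the order of selection; the function name `successive_difference_variance` is an illustrative choice.

```python
def successive_difference_variance(y, N):
    """Approximate variance estimate of the systematic-sample mean.

    y: sample values in the order in which the units were selected.
    N: population size.
    (Hypothetical helper written for this example.)
    """
    n = len(y)
    f = n / N                                   # sampling fraction
    # Sum of squared differences between successive sample units
    ssd = sum((b - a) ** 2 for a, b in zip(y, y[1:]))
    return (1 - f) / (2 * n * (n - 1)) * ssd

# Sex ratios of the 7 districts selected with random start 2 and k = 3
sex_ratio = [886, 854, 871, 888, 880, 898, 856]
v_hat = successive_difference_variance(sex_ratio, N=21)
print(round(v_hat, 2))
```

Using the exact sampling fraction 7/21 rather than a rounded decimal keeps the result free of rounding error in (1 − f).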

UNIT-VIII

USE OF AUXILIARY INFORMATION IN SAMPLE SURVEYS

The use of auxiliary information in sample surveys plays an important role in
providing efficient estimates of population parameters. In most surveys, auxiliary
information is available in one form or another, or may be made available by diverting
a part of the resources.

Use of auxiliary information in sample survey is done at three stages viz. at (i)
pre-selection stage; (ii) selection stage and (iii) estimation stage.
(i) At the pre-selection stage, to construct strata as in stratified sampling
according to the frequency distribution of auxiliary variable or to arrange
the population units in increasing/decreasing order of their magnitude as in
systematic sampling.
(ii) At the selection stage (or design stage), to select the sampling units with
unequal probabilities, the inclusion probabilities being proportional to the
known auxiliary variable values e.g. Probability Proportional to Size (PPS
sampling).
(iii) At the estimation stage by constructing improved estimators e.g. ratio,
ratio type, regression, difference and product estimators etc.
At estimation stage auxiliary information can be utilized after the draw of sample
for building the estimator. When there is a positive correlation between the characteristics
under study ‘y’ and the auxiliary characteristics ‘x’, the ratio method of estimation is
quite effective. On the other hand, if the correlation is negative, the product method of
estimation can be employed.
In practice, knowledge of ratio of population totals of two characters is more
important than that of population total and means. For instance, in socio-economic
surveys, one may be interested in ratio such as per household, per capita income or
expenditure, proportion of expenditure on different items, proportion of unemployed
persons, sex ratio, birth rate, death rate etc.

Ratio Estimators: In the ratio method an auxiliary variate x_i, correlated with y_i, is
obtained for each unit in the sample. The population total X of the x_i must be known. In
practice, x_i is often the value of y_i at some previous time when a complete census was
taken. The ratio estimates of the population total Y, the population mean $\bar{Y}$ and the
population ratio Y/X are, respectively,

$$\hat{Y}_R = \frac{\bar{y}}{\bar{x}} X, \qquad \bar{y}_R = \hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\bar{X}, \qquad \hat{R} = \frac{\bar{y}}{\bar{x}}$$

Results on the ratio estimators


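Theorem-1: The bias of the ratio estimator $\bar{y}_R$ of the population mean $\bar{Y}$, to the
first order of approximation, is given by

$$\text{Bias}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\bar{Y}\left[C_x^2 - \rho_{xy} C_x C_y\right]$$

Theorem-2: The mean square error of the ratio estimator $\bar{y}_R$ of the population mean
$\bar{Y}$, to the first order of approximation, is given by

$$\text{MSE}(\bar{y}_R) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2 + C_x^2 - 2\rho_{xy} C_x C_y\right]$$

where

$$\mu_{rs} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^r (X_i - \bar{X})^s, \qquad r, s = 0, 1, 2, \ldots$$

$$\mu_{20} = S_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})^2, \qquad \mu_{02} = S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - \bar{X})^2,$$

$$\mu_{11} = S_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i - \bar{Y})(X_i - \bar{X})$$

$$C_x^2 = \frac{S_x^2}{\bar{X}^2}, \qquad C_y^2 = \frac{S_y^2}{\bar{Y}^2}, \qquad \rho_{xy} = \frac{S_{yx}}{S_x S_y}$$

Second form: The variance of the estimator $\hat{Y}_R = \frac{\bar{y}}{\bar{x}} X$ is also given by

$$V(\hat{Y}_R) = \frac{N^2 (1-f)}{n}\left(S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x\right)$$

since $\bar{Y} = R\bar{X}$ and $\rho = \frac{S_{yx}}{S_x S_y}$.

This can also be written as:

$$V(\hat{Y}_R) = (1-f)\frac{Y^2}{n}\left(C_y^2 + C_x^2 - 2\rho_{xy} C_x C_y\right)$$

Third form: The variances of the estimators $\hat{Y}_R = \frac{\bar{y}}{\bar{x}} X$, $\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\bar{X}$ and $\hat{R} = \frac{\bar{y}}{\bar{x}}$ are given by

$$V(\hat{Y}_R) = \frac{N^2 (1-f)}{n}\left[\frac{\sum_{i=1}^{N}(y_i - R x_i)^2}{N-1}\right]$$

$$V(\hat{\bar{Y}}_R) = \frac{(1-f)}{n}\left[\frac{\sum_{i=1}^{N}(y_i - R x_i)^2}{N-1}\right]$$

$$V(\hat{R}) = \frac{(1-f)}{n\bar{X}^2}\left[\frac{\sum_{i=1}^{N}(y_i - R x_i)^2}{N-1}\right]$$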

Theorem-3: An estimator of the mean square error of the ratio estimator $\bar{y}_R$ of the
population mean $\bar{Y}$, to the first order of approximation, is given by

$$\widehat{\text{MSE}}(\bar{y}_R) = \frac{(1-f)}{n}\left[s_y^2 + r^2 s_x^2 - 2 r s_{xy}\right]$$

where $r = \frac{\bar{y}}{\bar{x}}$, the ratio of the two sample means, is the estimator of the ratio R.

Theorem-4: If the sample size n is sufficiently large so that terms of $O\left(\frac{1}{n^2}\right)$ can be
ignored, the ratio estimator $\bar{y}_R$ is more efficient than the sample mean $\bar{y}$ if

$$\rho_{xy} > \frac{1}{2}\cdot\frac{C_x}{C_y}$$

If we assume that $C_y \cong C_x$, this condition holds for all values of the correlation
coefficient $\rho_{xy}$ in the range (0.5, 1.0].

Theorem-5: The ratio estimator $\bar{y}_R$ is more efficient than the sample mean $\bar{y}$ if $\rho_{xy}$ >
0.5, i.e. if the correlation between X and Y is positive and high.
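The condition in Theorem-4 can be checked by comparing the first-order MSE of Theorem-2 with the variance of the sample mean under SRSWOR, $V(\bar{y}) = \left(\frac{1-f}{n}\right) S_y^2 = \left(\frac{1-f}{n}\right)\bar{Y}^2 C_y^2$. A short worked step, assuming $C_x > 0$ and $C_y > 0$:

$$\text{MSE}(\bar{y}_R) - V(\bar{y}) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_x^2 - 2\rho_{xy} C_x C_y\right] < 0 \iff \rho_{xy} > \frac{1}{2}\cdot\frac{C_x}{C_y}$$

With $C_y \cong C_x$ the right-hand side reduces to 1/2, which gives the condition of Theorem-5.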
The ratio estimate may be called the best among a wide class of estimates if:
(i) the relation between y_i and x_i is a straight line through the origin;
(ii) the variance of y_i about this line is proportional to x_i.
Product estimator:
Murthy (1964) considered another estimator of the population mean $\bar{Y}$, using the known
population mean $\bar{X}$ of the auxiliary variable, called the product estimator.
The product estimator of the population mean is given by

$$\bar{y}_p = \hat{\bar{Y}}_p = \bar{y}\left(\frac{\bar{x}}{\bar{X}}\right)$$

Theorem-1: The bias of the product estimator $\bar{y}_p$ of the population mean $\bar{Y}$, to the
first order of approximation, is given by

$$\text{Bias}(\bar{y}_p) = \left(\frac{1-f}{n}\right)\bar{Y}\rho_{xy} C_x C_y$$

Theorem-2: The mean square error of the product estimator $\bar{y}_p$ of the population mean
$\bar{Y}$, to the first order of approximation, is given by

$$\text{MSE}(\bar{y}_p) = \left(\frac{1-f}{n}\right)\bar{Y}^2\left[C_y^2 + C_x^2 + 2\rho_{xy} C_x C_y\right]$$

Theorem-3: An estimator of the mean square error of the product estimator $\bar{y}_p$ of the
population mean $\bar{Y}$, to the first order of approximation, is given by

$$\widehat{\text{MSE}}(\bar{y}_p) = \frac{(1-f)}{n}\left[s_y^2 + r^2 s_x^2 + 2 r s_{xy}\right]$$

Theorem-4: The product estimator $\bar{y}_p$ is more efficient than the sample mean $\bar{y}$ if

$$\rho_{xy}\frac{C_y}{C_x} < -\frac{1}{2}$$

If we assume that $C_y \cong C_x$, this condition holds for all values of the correlation
coefficient $\rho_{xy}$ in the range [-1.0, -0.5).

Theorem-5: The product estimator $\bar{y}_p$ is more efficient than the sample mean $\bar{y}$ if
$\rho_{xy}$ < -0.5, i.e. if the correlation between X and Y is negative and high.
Important note:
We observed that the product and ratio estimators are better than sample mean if
the value of 𝜌𝑥𝑦 lies in the interval [-1.0, -0.5) and (+0.5, +1.0], respectively. Thus the
sample mean estimator remains better than both the ratio and product estimators of the
population mean if 𝜌𝑥𝑦 lies in the range [-0 .5, + 0.5].
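The comparison in this note can be illustrated numerically from the first-order MSE expressions of the three estimators. In the sketch below the common factor ((1 − f)/n) times the squared population mean is dropped, since it does not affect the ranking; the function names and the trial values of C_x, C_y and the correlation are assumptions chosen only for illustration.

```python
def relative_mse(cx, cy, rho):
    """First-order MSEs, up to the common factor ((1-f)/n) * Ybar**2.

    cx, cy: coefficients of variation of x and y; rho: correlation.
    (Hypothetical helper written for this example.)
    """
    return {
        "mean": cy ** 2,
        "ratio": cy ** 2 + cx ** 2 - 2 * rho * cx * cy,
        "product": cy ** 2 + cx ** 2 + 2 * rho * cx * cy,
    }

def best_estimator(cx, cy, rho):
    """Name of the estimator with the smallest first-order MSE."""
    mse = relative_mse(cx, cy, rho)
    return min(mse, key=mse.get)

# With Cx = Cy the switch-over points are rho = +0.5 and rho = -0.5
for rho in (0.8, 0.2, -0.8):
    print(rho, best_estimator(0.3, 0.3, rho))
```

When C_x and C_y differ, the switch-over points move to ±C_x/(2C_y), as in Theorem-4 of each estimator.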

Regression estimator: We considered ratio type estimators, which use data on an auxiliary
characteristic X correlated with the character under study Y. It was found that ratio type
estimators result in increased precision if the regression of Y on X is linear and passes
through the origin.
If the regression of Y on X is linear but does not pass through the origin, it is more
appropriate to use regression type estimators.
Consider an estimator of the population mean $\bar{Y}$ of the form

$$\bar{y}_{diff} = \bar{y} + d(\bar{X} - \bar{x})$$

where d is a constant to be chosen such that the variance of the estimator, $V(\bar{y}_{diff})$, is
minimum.
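The minimising value of d follows from a short calculation. Assuming the standard SRSWOR results $V(\bar{y}) = \left(\frac{1-f}{n}\right) S_y^2$, $V(\bar{x}) = \left(\frac{1-f}{n}\right) S_x^2$ and $\text{Cov}(\bar{y}, \bar{x}) = \left(\frac{1-f}{n}\right) S_{xy}$, which are not written out in this unit:

$$V(\bar{y}_{diff}) = \left(\frac{1-f}{n}\right)\left[S_y^2 + d^2 S_x^2 - 2 d S_{xy}\right]$$

$$\frac{\partial V(\bar{y}_{diff})}{\partial d} = \left(\frac{1-f}{n}\right)\left[2 d S_x^2 - 2 S_{xy}\right] = 0 \implies d = \frac{S_{xy}}{S_x^2}$$

$$\min V(\bar{y}_{diff}) = \left(\frac{1-f}{n}\right)\left[S_y^2 - \frac{S_{xy}^2}{S_x^2}\right] = \left(\frac{1-f}{n}\right) S_y^2\left[1 - \rho_{xy}^2\right]$$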

For the optimum value $d = \rho_{xy}\frac{C_y \bar{Y}}{C_x \bar{X}} = \frac{S_{xy}}{S_x^2} = \beta$ (the regression coefficient), the
difference estimator becomes

$$\bar{y}_{diff} = \bar{y} + \frac{S_{xy}}{S_x^2}(\bar{X} - \bar{x})$$

Thus the difference estimator becomes non-functional if the value of the
regression coefficient $\beta = \frac{S_{xy}}{S_x^2}$ is unknown. In such situations, Hansen, Hurwitz, and
Madow (1953) consider the linear regression estimator of the population mean $\bar{Y}$ as

$$\bar{y}_{LR} = \bar{y} + \hat{\beta}(\bar{X} - \bar{x})$$

where $\hat{\beta} = \frac{s_{xy}}{s_x^2}$ denotes the estimator of the regression coefficient $\beta = \frac{S_{xy}}{S_x^2}$.

Then we have the following theorems:

Theorem-1: The mean squared error of the linear regression estimator $\bar{y}_{LR}$, to the first
order of approximation, is

$$\text{MSE}(\bar{y}_{LR}) = \left(\frac{1-f}{n}\right) S_y^2\left[1 - \rho_{xy}^2\right]$$

Theorem-2: An estimator of the mean squared error of the linear regression estimator
$\bar{y}_{LR}$, to the first order of approximation, is given by

$$\widehat{\text{MSE}}(\bar{y}_{LR}) = \frac{(1-f)}{n} s_y^2\left[1 - r_{xy}^2\right]$$

Theorem-3: The linear regression estimator $\bar{y}_{LR}$ is always more efficient than the sample
mean $\bar{y}$ if $\rho_{xy}$ ≠ 0.
Difference between Ratio, Product and Regression Estimators

1. Correlation between x and y
   Ratio estimator: must be positive and high (between +0.5 and +1.0).
   Product estimator: must be negative and high (between -1.0 and -0.5).
   Regression estimator: must be non-zero (anywhere between -1.0 and +1.0).

2. Regression line between y and x
   Ratio estimator: should pass through the origin.
   Product estimator: may or may not pass through the origin. Note: if the regression line
   for two negatively correlated variables passes through the origin, then one of the
   variables y and x must take negative values, which may not be practicable.
   Regression estimator: may have both parameters, namely intercept and slope.

3. Estimated mean square error
   Ratio estimator: the usual estimators of the approximate mean square error may be low if
   the sample size is large, which may give a smaller confidence interval estimate than the
   actual one.
   Product estimator: the same remark applies.
   Regression estimator: the same remark applies.

4. Degrees of freedom for the confidence interval estimate
   Ratio estimator: only one model parameter has to be estimated, so df = (n-1).
   Product estimator: if both variables are positive but the correlation is negative, the
   model has both an intercept and a slope, so df = (n-2) should be used.
   Regression estimator: there are two unknown parameters, namely intercept and slope, so
   df = (n-2) must be used.
Ratio Method of Estimation
Example-1: The data on the study variable (y) and the auxiliary variable (x), given below,
are from a hypothetical population of 8 units:

y : 10 12 15 17 18 22 24 30
x : 4 6 9 10 13 14 16 20

Work out the efficiency of the ratio estimator of the mean ($\bar{y}_r$) in relation to the usual
estimator $\bar{y}$ for SRSWOR of size 3.

Solution:
Ratio estimator of the mean: $\bar{y}_r = \frac{\bar{y}}{\bar{x}}\bar{X}$

$$\text{MSE}(\bar{y}_r) = \left(\frac{N-n}{Nn}\right)\left[S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right]$$

and

$$V(\bar{y})_{SRSWOR} = \left(\frac{N-n}{Nn}\right) S_y^2$$

Now,

$$\bar{Y} = \frac{1}{8}[10 + 12 + 15 + 17 + 18 + 22 + 24 + 30] = 18.5$$

$$\bar{X} = \frac{1}{8}[4 + 6 + 9 + 10 + 13 + 14 + 16 + 20] = 11.5$$

$$R = \frac{\bar{Y}}{\bar{X}} = \frac{18.5}{11.5} = 1.6087$$

$$S_y^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} Y_i^2 - N\bar{Y}^2\right] = \frac{1}{8-1}\left[10^2 + 12^2 + 15^2 + 17^2 + 18^2 + 22^2 + 24^2 + 30^2 - 8 \times (18.5)^2\right] = \frac{1}{7}[3042 - 2738] = 43.4286$$

$$S_x^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N} X_i^2 - N\bar{X}^2\right] = \frac{1}{8-1}\left[4^2 + 6^2 + 9^2 + 10^2 + 13^2 + 14^2 + 16^2 + 20^2 - 8 \times (11.5)^2\right] = \frac{1}{7}[1254 - 1058] = 28.0000$$

$$S_{xy} = \frac{1}{N-1}\left[\sum_{i=1}^{N} X_i Y_i - N\bar{X}\bar{Y}\right] = \frac{1}{8-1}\left[(10 \times 4) + (12 \times 6) + (15 \times 9) + (17 \times 10) + (18 \times 13) + (22 \times 14) + (24 \times 16) + (30 \times 20) - 8(18.5 \times 11.5)\right]$$

$$= \frac{1}{7}[40 + 72 + 135 + 170 + 234 + 308 + 384 + 600 - 1702] = \frac{1}{7}[1943 - 1702] = 34.4286$$

The correlation coefficient (ρ) between x and y is

$$\rho = \frac{S_{xy}}{S_x S_y} = \frac{34.4286}{\sqrt{43.4286 \times 28.0000}} = \frac{34.4286}{6.5900 \times 5.2915} = 0.9873$$

Now

$$V(\bar{y})_{SRSWOR} = \left(\frac{N-n}{Nn}\right) S_y^2 = \left(\frac{8-3}{8 \times 3}\right) \times 43.4286 = \frac{5}{24} \times 43.4286 = 9.0476$$

$$\text{MSE}(\bar{y}_r) = \left(\frac{N-n}{Nn}\right)\left[S_y^2 + R^2 S_x^2 - 2R\rho S_x S_y\right] = \left(\frac{8-3}{8 \times 3}\right)\left[43.4286 + (1.6087)^2 \times 28 - 2 \times 1.6087 \times 0.9873 \times 6.5900 \times 5.2915\right]$$

$$= \frac{5}{24}\left[43.4286 + 72.4616 - 110.7711\right] = \frac{5}{24} \times 5.1191 = 1.0665$$

The MSE($\bar{y}_r$) is less than V($\bar{y}$). It implies that the ratio estimator of the
population mean is more efficient than the usual SRS based estimator $\bar{y}$.

The relative efficiency of the estimator $\bar{y}_r$ with reference to $\bar{y}$ is:

$$RE = \frac{V(\bar{y})}{\text{MSE}(\bar{y}_r)} \times 100 = \frac{9.0476}{1.0665} \times 100 \approx 848.3$$
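The whole calculation for this example can be repeated from the population values in a few lines. The sketch below follows the solution in taking the series 10, 12, ..., 30 as y and 4, 6, ..., 20 as x; the function name `ratio_vs_mean` is an illustrative assumption.

```python
def ratio_vs_mean(y, x, n):
    """First-order MSE of the ratio estimator against V(ybar) under SRSWOR.

    y, x: full population values of the study and auxiliary variables.
    n:    sample size.
    (Hypothetical helper written for this example.)
    """
    N = len(y)
    Y_bar = sum(y) / N
    X_bar = sum(x) / N
    R = Y_bar / X_bar
    S_y2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    S_x2 = sum((u - X_bar) ** 2 for u in x) / (N - 1)
    S_xy = sum((v - Y_bar) * (u - X_bar) for v, u in zip(y, x)) / (N - 1)
    factor = (N - n) / (N * n)
    var_mean = factor * S_y2
    # rho * S_x * S_y equals S_xy, so the cross term is 2 * R * S_xy
    mse_ratio = factor * (S_y2 + R ** 2 * S_x2 - 2 * R * S_xy)
    return var_mean, mse_ratio, 100 * var_mean / mse_ratio

y = [10, 12, 15, 17, 18, 22, 24, 30]
x = [4, 6, 9, 10, 13, 14, 16, 20]
var_mean, mse_ratio, rel_eff = ratio_vs_mean(y, x, n=3)
print(round(var_mean, 4), round(mse_ratio, 4), round(rel_eff, 1))
```

Small differences from a hand calculation in the last decimal place come from rounding R and ρ at intermediate steps.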
Regression Method of Estimation
Example-2: A physiologist has undertaken a project to estimate the average leaf area of a
newly developed strain of wheat, of which 120 plants were grown. In all, there were 2106
leaves and their total weight was 240.078 gm. Since measuring the area of all 2106 leaves
is difficult, an SRSWOR sample of 33 leaves was drawn. The area and weight of each sampled
leaf were recorded and are given in the table below. Estimate the average leaf area for the
population under consideration. Also work out the confidence interval for the actual
value of this population parameter.
Table: Area (cm²) and weight (mg) of leaves in the sample

Leaf   Area (y)   Weight (x)   Leaf   Area (y)   Weight (x)   Leaf   Area (y)   Weight (x)
1      27.37      105          12     24.18      106          23     14.31      78
2      30.21      109          13     35.72      125          24     39.28      128
3      22.08      100          14     19.76      97           25     24.16      102
4      36.76      125          15     33.46      125          26     26.51      114
5      28.51      116          16     43.62      131          27     29.69      119
6      30.34      118          17     16.11      85           28     20.03      101
7      21.81      104          18     21.07      112          29     18.41      93
8      29.11      108          19     26.71      117          30     35.72      124
9      38.90      129          20     18.51      96           31     29.33      117
10     17.21      90           21     23.43      103          32     21.88      107
11     42.44      130          22     31.66      121          33     21.29      103
∑y = 899.679, ∑x = 3647.985
Solution: N = 2106, n = 33, X = 240.078 gm

$$\bar{X} = \frac{240.078}{2106} = 0.1140 \text{ gm} = 114.0 \text{ mg}$$

$$\bar{y} = \frac{\sum y_i}{n} = \frac{899.679}{33} = 27.263$$

$$\bar{x} = \frac{\sum x_i}{n} = \frac{3647.985}{33} = 110.545$$

$$s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right] = \frac{1}{32}\left[(105)^2 + (109)^2 + \ldots + (103)^2 - 33(110.545)^2\right] = 187.631$$

$$s_y^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right] = \frac{1}{32}\left[(27.37)^2 + (30.21)^2 + \ldots + (21.29)^2 - 33(27.263)^2\right] = 61.003$$

$$s_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right] = \frac{1}{32}\left[(105 \times 27.37) + (109 \times 30.21) + \ldots + (103 \times 21.29) - 33(110.545 \times 27.263)\right] = 100.312$$

Thus

$$r = \frac{s_{xy}}{s_x s_y} = \frac{100.312}{\sqrt{(187.631)(61.003)}} = 0.9376$$

$$\hat{\beta} = b = \frac{s_{xy}}{s_x^2} = \frac{100.312}{187.631} = 0.5346$$

Estimate of the average leaf area:

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) = 27.263 + (0.5346)(114.0 - 110.545) = 29.110$$

Estimation of the mean square error:

$$\widehat{\text{MSE}}(\bar{y}_{lr}) = \left(\frac{N-n}{Nn}\right)\left(1 - r^2\right) s_y^2 = \left(\frac{2106 - 33}{2106 \times 33}\right)\left[1 - (0.9376)^2\right] \times 61.003 = 0.2200$$

The range of the average leaf area for the population is calculated by using a
confidence interval.

$$\text{Confidence interval} = \bar{y}_{lr} \pm 2\,SE(\bar{y}_{lr}) = \bar{y}_{lr} \pm 2\sqrt{\widehat{\text{MSE}}(\bar{y}_{lr})} = 29.110 \pm 2\sqrt{0.2200}$$

= (28.17, 30.05)

$$v(\bar{y})_{SRSWOR} = \left(\frac{N-n}{Nn}\right) s_y^2 = \left(\frac{2106 - 33}{2106 \times 33}\right) \times 61.003 = 1.8196$$

$$\text{Relative Efficiency} = \frac{v(\bar{y})}{\widehat{\text{MSE}}(\bar{y}_{lr})} \times 100 = \frac{1.8196}{0.2200} \times 100 = 827.09$$
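The regression estimate, its estimated mean square error and the relative efficiency can be computed from the summary statistics of the sample. The sketch below takes the sample means, variances and covariance quoted in the solution as given inputs and derives the population mean weight from the stated total of 240.078 gm; the function name `regression_estimate` is an illustrative assumption.

```python
from math import sqrt

def regression_estimate(y_bar, x_bar, X_bar, s_y2, s_x2, s_xy, N, n):
    """Linear regression estimate of the mean, its estimated MSE and the
    relative efficiency (in percent) with respect to the sample mean.
    (Hypothetical helper written for this example.)
    """
    b = s_xy / s_x2                       # estimated regression coefficient
    r2 = s_xy ** 2 / (s_x2 * s_y2)        # squared sample correlation
    y_lr = y_bar + b * (X_bar - x_bar)
    factor = (N - n) / (N * n)
    mse = factor * (1 - r2) * s_y2
    var_mean = factor * s_y2
    return y_lr, mse, 100 * var_mean / mse

# Population mean leaf weight in mg, from the total weight in gm
X_bar_mg = 240.078 * 1000 / 2106

y_lr, mse, rel_eff = regression_estimate(
    y_bar=27.263, x_bar=110.545, X_bar=X_bar_mg,
    s_y2=61.003, s_x2=187.631, s_xy=100.312, N=2106, n=33,
)
half_width = 2 * sqrt(mse)                # approximate 95% half-width
print(round(y_lr, 3), round(mse, 4), round(rel_eff, 1))
print(round(y_lr - half_width, 2), round(y_lr + half_width, 2))
```

The relative efficiency equals 100/(1 − r²), so it depends only on the sample correlation and not on N or n.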
