Lecture 2-Data Collection & Sample Design
• Statistical Inference
The advantages of using a Sample rather
than the Population
• Time-If the data are needed quickly, then there may not be enough time
to cover the whole population. For example, if we are concerned with
checking the quality of goods produced by a mass production process,
the delay in delivery whilst every item is checked will be unacceptable.
• Cost-The cost of collecting data from the whole population may be
prohibitively high. In the example above, the cost of checking every item
could make the mass-produced item excessively expensive.
• Errors-If data were collected from a large population, then the actual
task of collecting, handling and processing the data would involve a
large number of people, and the risk of error would increase rapidly. Hence,
the use of a sample, with its smaller data set, will often result in fewer
errors.
• Tests-The tests may be destructive, in which case it is obviously
undesirable to deal with the entire population. For example, a
manufacturer wishes to make a claim about the durability of a particular
type of battery. He runs tests on some batteries until they fail, to
determine what would be a reasonable claim about all the batteries.
When would we use a population rather
than a sample?
• Small populations-If the population is small, so that any sample taken would
be large relative to the size of the population, then the time, cost and
accuracy involved in using the population rather than the sample will not be
significantly different.
• Accuracy-If it is essential that the information gained from the data is
accurate, then statistical inference from sample data may not be sufficiently
reliable. For example, it is necessary for a shop to know exactly how much
money has been taken over the counter in the course of a year. It is not
sufficient for the owner to record takings on a sample of days out of the year.
• The problem of errors is still relevant here, but any errors in the data will be
ones of arithmetic rather than unreliability of statistical estimates.
The role of Sampling in Statistical Analysis
• The use of data from a sample instead of a population has important
implications for the statistical investigation since it leads us into the
realms of statistical inference; it becomes necessary to know what
may be inferred about the population from the sample.
• What does the sample statistic tell us about the population
parameter, or what does the evidence of the sample allow us to
conclude about our belief with respect to the population?
• How does the mean age of a sample of professional accountants
relate to the mean age of the population of professional accountants?
• If, in a sample of adults, the joggers are fitter, can we claim for the
population as a whole that jogging keeps you fit?
• With an appropriately chosen sample, it is possible to estimate
population parameters from sample statistics, and to use sample
evidence to test beliefs held about the population.
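The idea of estimating a population parameter from a sample statistic can be illustrated with a small simulation. This is only a sketch: the population of accountants' ages is synthetic, and the population size (10,000), age range (22 to 65) and sample size (200) are assumptions chosen for illustration, not figures from the lecture.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: ages of 10,000 professional accountants (synthetic data).
population_ages = [rng.randint(22, 65) for _ in range(10_000)]

# Simple random sample of 200 accountants, drawn without replacement.
sample_ages = rng.sample(population_ages, 200)

population_mean = statistics.mean(population_ages)  # parameter (normally unknown)
sample_mean = statistics.mean(sample_ages)          # statistic (what we observe)

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
```

In practice only the sample mean is available; the population mean is computed here purely to show how close the estimate tends to be.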
Statistical Inference
• Statistical inference is a large and important aspect of statistics.
Information is gathered from a sample and this information is used to
make deductions about some aspect of the population.
• For example, an auditor may check a sample of a company's
transactions and, if the sample is satisfactory, he will assume, or infer,
that all the company's transactions are satisfactory.
• He uses a sample because it is cheaper, quicker and more practical
than checking all of the transactions carried out in the company.
Sample Selection
• It is extremely important that the members of a sample are selected
so that the sample is as representative of the population as possible,
given the constraints of availability, time and money.
• A biased sample will give a misleading impression about the
population.
Methods of Selecting a Sample
1. Probability/Random Sample Designs
• There are several potential ways to decide upon the size of your sample, but one
of the simplest involves using a formula with your desired margin of error (often
loosely called the confidence interval) and confidence level, the estimated size of
the population you are working with, and the standard deviation of whatever you
want to measure in your population.
• The most commonly used margin of error and confidence level are 0.05 and 0.95,
respectively. Since you may not know the standard deviation of the population
you are studying, you should choose a conservative value that covers a variety of
possibilities (such as 0.5, the largest standard deviation a proportion can have).
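The lecture does not state which formula it has in mind. One commonly used choice, assumed here, is Cochran's formula with a finite population correction: n0 = z² s² / e², then n = n0 / (1 + (n0 - 1) / N), where e is the margin of error (the 0.05 "confidence interval" above), z is the normal score for the confidence level, s is the assumed standard deviation and N is the population size. The function name and parameter names below are illustrative.

```python
import math
from statistics import NormalDist


def required_sample_size(population_size=None, margin_of_error=0.05,
                         confidence_level=0.95, std_dev=0.5):
    """Sample size from Cochran's formula (an assumed choice of formula).

    population_size=None treats the population as effectively infinite.
    """
    # z-score for a two-sided interval at the given confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for a very large population
    n0 = (z ** 2) * (std_dev ** 2) / (margin_of_error ** 2)
    if population_size is not None:
        # Finite population correction: smaller populations need fewer units
        n0 = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n0)


print(required_sample_size())                        # very large population
print(required_sample_size(population_size=10_000))  # population of 10,000
```

With the default values this gives 385 for a very large population and 370 for a population of 10,000.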
Randomly selecting a sample
• This can be done in one of two ways: the lottery or random number method.
• In the lottery method, you choose the sample at random by “drawing from a
hat” or by using a computer program that will simulate the same action.
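A computer version of "drawing from a hat" might look like the sketch below. The membership list and the function name `lottery_sample` are hypothetical; `random.sample` from the Python standard library draws without replacement, so no member can be picked twice.

```python
import random


def lottery_sample(population, sample_size, seed=None):
    """Draw sample_size distinct members at random, like names from a hat."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(list(population), sample_size)


# Hypothetical sampling frame of 100 members.
members = [f"member_{i:03d}" for i in range(1, 101)]
chosen = lottery_sample(members, 10, seed=42)
print(chosen)
```

Every member has the same chance of selection, which is what makes this a simple random sample.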
• Multi-stage sampling: as the name implies, the selection of the sample by this method requires several stages.
• The method is most used when the population is distributed over a wide geographical area.
• For example, the population might be the world-wide membership of the SHU Alumni Society.
• The first stage is to divide the population into a few clearly defined areas.
• In our example, these could be individual countries. The proportion of the sample allocated to each
area is determined by the proportion of the population in that area.
• Hence, if 80% of the Alumni membership were in England and Wales, and we were looking for a
sample of 1000 members, then we would allocate 800 sample members to England and Wales, as with
stratified sampling.
• The next stage is to define some smaller areas; these might be local government districts, and then
companies within these districts.
• A sample is taken of the local government districts. Within the chosen districts, a sample of the
companies is selected. Finally, individual members are sampled from the chosen companies in the
selected districts.
• With this method the actual sampling is quicker and more convenient.
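The staged procedure above can be sketched in code. This is a minimal illustration under stated assumptions: the 20% "Rest of world" share, the toy districts, companies and members, and the numbers sampled at each stage are all hypothetical; only the 80% share and the sample of 1000 come from the example.

```python
import random


def proportional_allocation(area_shares, total_sample):
    """Split the total sample across areas in proportion to population share."""
    # Rounding may leave the parts slightly off the total for awkward shares.
    return {area: round(share * total_sample)
            for area, share in area_shares.items()}


def multistage_sample(districts, n_districts, n_companies, n_members, rng):
    """Sample districts, then companies within them, then members within those."""
    chosen = []
    for district in rng.sample(sorted(districts), n_districts):
        companies = districts[district]
        for company in rng.sample(sorted(companies), n_companies):
            chosen.extend(rng.sample(companies[company], n_members))
    return chosen


# Stage 1: allocate the sample of 1000 across areas (second share is assumed).
allocation = proportional_allocation(
    {"England and Wales": 0.8, "Rest of world": 0.2}, 1000)
print(allocation)

# Later stages within one area, on a toy frame:
# 5 districts, each with 4 companies, each with 10 members.
districts = {
    f"district_{d}": {
        f"company_{d}_{c}": [f"member_{d}_{c}_{m}" for m in range(10)]
        for c in range(4)
    }
    for d in range(5)
}
rng = random.Random(7)
sample = multistage_sample(districts, n_districts=2, n_companies=2,
                           n_members=3, rng=rng)
print(len(sample), "members selected")
```

Only the chosen districts and companies need a full membership list, which is why the fieldwork is quicker than sampling from the whole population directly.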
Cluster sampling design