Statistical Methods
Distance & Online Education
Bachelor of Arts in Economics
Semester II
Preface
The importance of Business Statistics, as a field of study and practice, is being
increasingly realized in schools, colleges and universities, and in commercial and
industrial organizations, both in India and abroad.
It is a technical and practical subject, and learning it means familiarizing oneself with
many new terms and concepts. As this Study Material is intended to serve beginners in
the field, I have kept it simple. It is intended for students of the BBA course of Amity
University and is written in a student-oriented, teach-yourself style.
The primary objective of this study material is to facilitate a clear understanding of the
subject of Business Statistics. It contains a wide range of theoretical and practical
questions varying in content, length and complexity. Most of the illustrations and
exercise problems have been taken from various university examinations. The material
contains a sufficiently large number of illustrations to assist a better grasp of the subject,
and the reader will find accuracy in the formulae and in the answers to the exercise
questions. For the convenience of the students I have also included multiple-choice
questions and case studies for a better understanding of the subject.
I hope that this material will prove useful to both students and teachers. Its contents are
divided into eight chapters covering various aspects of the syllabus of the BBA and other
related courses. At the end of the material, three assignments related to the subject
matter have been provided.
I have taken a considerable amount of help from various books, journals and other
media. I express my gratitude to all those who have devoted their lives to knowledge,
especially Statistics, from whom I could learn, and on the basis of whose work I now try
to pass on my knowledge to others through this material.
It is by God's loving grace that He brought me into this world and blessed me with loving
and caring parents, my respected father Mr. Manohar Lal Arora and my loving mother
Mrs. Kamla Arora, who have supported me in preparing this Study Material.
I am thankful to my beloved wife Mrs. Deepti Arora, without whose constant
encouragement, advice and material sacrifice this achievement would have been a far-off
dream.
BUSINESS STATISTICS
Course Contents:
Module I: Introduction to Statistics
Definitions, Functions of Statistics, Statistics and Computers, Limitation of Statistics,
Application of Statistics.
Module II: Data Collection and Analysis
Methods of Data Collection, Primary and Secondary Data; Measures of Dispersion:
Range, Quartile Deviation, Mean Deviation, Standard Deviation, Coefficient of
Variation (Absolute & Relative Measures of Dispersion); Skewness: Karl Pearson's
Coefficient of Skewness, Bowley's Coefficient of Skewness; Kurtosis.
Module III: Correlation Analysis And Regression Analysis
Introduction, Importance of Correlation, Types of Correlation, Scatter Diagram Method,
Karl Pearson's Coefficient of Correlation (Grouped and Ungrouped), Spearman's
Coefficient of Rank Correlation, Rank Correlation for Tied Ranks; Regression Analysis:
Concepts of Regression, Difference between Correlation and Regression, Regression Lines.
Module IV: Time Series Analysis
Meaning and Significance, Components of Time Series, Trend Measurement, Moving
Average Method, Least Square Method (Fitting of Straight Line Only).
Module V: Probability And Probability Distribution
Introduction, Terminology used in Probability, Definitions of Probability, Mathematical,
Statistical and Axiomatic Approaches to Probability, Probability Rules: Addition Rule,
Multiplication Rule of Probability, Conditional Probability, Bayes' Theorem, Problems on
Bayes' Theorem. Discrete Probability Distributions: Binomial Probability Distribution,
Poisson Probability Distribution, Properties, Applications. Continuous Probability
Distributions: Normal Probability Distribution, Properties of the Normal Curve,
Applications, Relations between Distributions.
Module VI: Sampling Design
Introduction: Some Fundamental Definitions, Census and Sample Survey, Steps in
Sampling Design, Criteria for Selecting a Sampling Procedure, Characteristics of a Good
Sample Design, Different Types of a Sample Design.
Module VII: Testing Of Hypothesis
What is a Hypothesis? Basic Concepts concerning a Hypothesis. Procedure for
Hypothesis Testing. Tests of Hypothesis. Parametric Tests: Z-Test, T-Test.
Index
1. Introduction to Statistics
2. Primary and Secondary Data
3. Measures of Dispersion
4. Measures of Skewness
5. Correlation Analysis
6. Regression Analysis
7. Time Series Analysis
8. Probability
CHAPTER ONE
INTRODUCTION TO STATISTICS
1.1 Introduction
In the modern world of computers and information technology, the importance of
statistics is very well recognised by all disciplines. Statistics originated as a science of
statehood and slowly and steadily found applications in agriculture, economics,
commerce, biology, medicine, industry, planning, education and so on. Today there is
hardly any walk of human life where statistics cannot be applied.
Statistics is a discipline which is concerned with the collection and presentation of data,
and with the analysis and interpretation of the data and the drawing of valid, worthwhile
conclusions from them.
It is in the second sense, that of statistical methods, that we are writing this guide on
statistics.
Lastly, the word statistics is used in a specialized sense. It describes the various
numerical measures which are produced by applying statistical methods (the second
sense) to statistical data (the first sense). Averages, standard deviation, etc. are all
statistics in this specialized third sense.
1.4 Definitions :
Statistics has been defined differently by different authors over a period of time. In
olden days statistics was confined only to state affairs, but in modern days it embraces
almost every sphere of human activity. Therefore, a number of old definitions, which
were confined to a narrow field of enquiry, were replaced by newer definitions which are
much more comprehensive and exhaustive. Secondly, statistics has been defined in two
different ways: as statistical data and as statistical methods. The following are some of
the definitions of statistics as numerical data.
The above definitions seem to be the most comprehensive and exhaustive.
1) Collection of data
2) Presentation of data
3) Analysis of data
4) Interpretation of data
1.6.1 Condensation:
Generally speaking, by the word 'to condense' we mean to reduce or to lessen.
Condensation is mainly aimed at easing the understanding of a huge mass of data by
providing only a few representative figures. If, in a particular class of a Chennai school,
only the marks of every individual student are listed, the mass of figures is difficult to
grasp; condensed measures serve the purpose better.
1.6.2 Comparison:
Classification and tabulation are the two methods that are used to condense the data.
They help us to compare data collected from different sources. Grand totals, measures of
central tendency, measures of dispersion, graphs and diagrams, coefficients of
correlation etc. provide ample scope for comparison.
If we have one group of data, we can compare within itself. If the rice production (in
tonnes) in Tanjore district is known, then we can compare one region with another
within the district. Or if the rice production (in tonnes) of two different districts within
Tamilnadu is known, then also a comparative study can be made. As statistics is an
aggregate of facts and figures, comparison is always possible, and in fact comparison
helps us to understand the data in a better way.
1.6.3 Forecasting:
By the word forecasting we mean to predict or to estimate beforehand. Given the data of
the last ten years on the rainfall of a particular district in Tamilnadu, it is possible to
predict or forecast the rainfall for the coming years.
1.6.4 Estimation:
One of the main objectives of statistics is to draw inferences about a population from the
analysis of a sample drawn from that population. The four major branches of statistical
inference are:
1. Estimation theory
2. Tests of Hypothesis
3. Non Parametric tests
4. Sequential analysis
In estimation theory, we estimate the unknown value of a population parameter on the
basis of the sample observations. Suppose we are given a sample of the heights of a
hundred students in a school; based upon the heights of these 100 students, it is possible
to estimate the average height of all students in that school.
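As a rough sketch of this idea in Python (the heights below are invented for illustration, not taken from any survey), the sample mean serves as a point estimate of the unknown population mean:

# Point estimation: use the mean of a sample of heights (hypothetical
# values, in cm) to estimate the average height of all students.
sample_heights = [152, 160, 148, 171, 158, 165, 155, 162]

estimate = sum(sample_heights) / len(sample_heights)
print("Estimated average height:", round(estimate, 2), "cm")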
If we are merely handed the complete list of marks, little purpose will be served. Instead,
if we are given the average mark in that particular examination, it definitely serves the
purpose better. Similarly, the range of marks is another measure of the data. Thus,
statistical measures help to reduce the complexity of the data and consequently to
understand any huge mass of data. In this connection, market surveys play an important
role in exhibiting present conditions and in forecasting likely changes in the future.
1.8.2 Statistics does not study individuals: Statistics does not give any specific
importance to individual items; in fact it deals with an aggregate of objects. Individual
items, taken individually, do not constitute statistical data and do not serve any purpose
for any statistical enquiry.
1.8.3 Statistics can be misused: Statistical tools are dangerous in the hands of the
inexpert. The use of statistical tools by inexperienced and untrained persons might lead
to wrong conclusions. Statistics can easily be misused by quoting wrong figures. As
King aptly says, 'statistics are like clay of which one can make a God or a Devil as one
pleases'.
Chapter One
Introduction to Statistics
End Chapter Quizzes
1) The statement, 'Statistics is both a science and an art', was given by
a- R. A. Fisher
b- Tippet
c- L. R. Connor
d- A. L. Bowley
3) 'Statistics provides tools and techniques for research workers' was stated by
a- John I. Griffin
b- W. I. King
c- A. M. Mood
d- A. L. Boddington
5) Who stated that there are three kinds of lies: lies, damned lies and statistics?
a- Mark Twain
b- Disraeli
c- Darrell Huff
d- G. W. Snedecor
b- quantitative information
d- none of (a) and (b)
9) The statement, Designing of an appropriate questionnaire itself wins half the battle,
was given by
a- A. R. Ilersic
b- W. I. King
c- H. Huge
d- H. Secrist
10) Who originally gave the formula for the estimation of errors of the type
a- L. R. Connor
b- W. I. King
c- A. L. Bowley
d- A. L. Boddington
CHAPTER TWO
PRIMARY AND SECONDARY DATA
2.1 Primary Data
The foundation of a statistical investigation rests on data, so utmost care must be taken
while collecting them. If the collected data are inaccurate and inadequate, the whole
analysis and interpretation will also become misleading and unreliable. The method of
collection of data depends upon the nature, object and scope of the statistical enquiry on
the one hand, and the availability of time and money on the other.
Data, or facts, may be derived from several sources, and can be classified as primary
data and secondary data. Primary data are data gathered for the first time by the
researcher. So if the investigator himself prefers to collect the data for the purpose of the
enquiry and uses them, it is called collection of primary data. These data are original in
nature.
According to Horace Secrist, by primary data are meant those data 'which are original,
that is, those in which little or no grouping has been made, for instance being recorded
or itemized as encountered. They are essentially raw material.'
designed and executed surveys when these are based on relatively small
sample sizes.
It should not be forgotten that secondary data can play a substantial
role in the exploratory phase of the research when the task at hand is to
define the research problem and to generate hypotheses. The assembly and
analysis of secondary data almost invariably improves the researcher's
understanding of the marketing problem, the various lines of inquiry that
could or should be followed and the alternative courses of action which
might be pursued.
Secondary sources help define the population. Secondary data can
be extremely useful both in defining the population and in structuring the
sample to be taken. For instance, government statistics on a country's
agriculture will help decide how to stratify a sample and, once sample
estimates have been calculated, these can be used to project those estimates
to the population.
Organisations generate a great deal of information internally in the course of their
everyday operations. Orders are received and delivered, costs are recorded, sales
personnel submit visit reports, invoices are sent out, returned goods are recorded, and so
on. Much of this information is of potential use in marketing research, but surprisingly
little of it is actually used. Organisations frequently overlook this valuable resource by
not maintaining files on the cost of producing, storing, transporting and marketing each
of their products and product lines. Such data have many uses in marketing research,
including allowing measurement of the efficiency of marketing operations. They can
also be used to estimate the costs attached to new products under consideration, and the
level of utilisation (in production, storage and transportation) at which an organisation's
unit costs begin to fall.
Enterprises which keep good data on their transport operations are well placed to
establish which are the most profitable routes and loads, as well as the most cost-
effective routing patterns. Good data on transport operations enable the enterprise to
perform trade-off analysis and thereby establish whether it makes economic sense to
own or hire vehicles, or the point at which a balance of the two gives the best financial
outcome.
These sorts of publications rarely provide the data in which the researcher is
interested but serve in helping him/her locate potentially useful data sources.
The main sources of external secondary data are:
(1) Government statistics, which may include population censuses, social surveys,
family expenditure surveys, import/export statistics, production statistics and
agricultural statistics.
(2) Trade associations
(3) Commercial services
(4) National and international institutions
Published statistics and figures are available on the internet either free or for a fee.
The database division of the Business India publications has been publishing the Delhi
Pages directory.
IPF (Industrial Product Finder) describes the applications of new products and tells
what is available and from whom. Most manufacturers of industrial products ensure that
a description of their product is published in IPF before it hits the market.
Tele-data services have also come up in major cities in recent times; Melior
Communication, for example, offers such a service, through which basic data on a
number of subjects/products can be had by a call to the agency. The service is termed the
'Tell me Business through phone' service. Its main aim, like that of the yellow pages, is
to bring buyers and sellers of products together. It also provides some elementary
databank support to researchers.
Those who consult secondary sources commonly encounter problems of measurement
error, source bias and the time scale of the statistics.
Secondary data, though old, may be the only possible source of the desired data on
subjects for which primary data cannot be collected at all. For example, survey reports or
secret records already collected by a business group can offer information that cannot be
obtained from original sources.
The form in which secondary data have been accumulated and delivered may not
accommodate the exact needs and particular requirements of the current research study.
Many a time, alteration or modification to the exact needs of the investigator may not be
possible, and to that extent the usefulness of the secondary data is lost. Primary data are
completely tailor-made, and there is no problem of adjustment.
Secondary data are available effortlessly, rapidly and inexpensively, whereas primary
data take a lot of time to collect and their unit cost is relatively high.
Chapter Two
Primary and Secondary Data
End Chapter Quizzes
1.
c- always incorrect
d- misleading
2. will be considered as
a- primary data
b- secondary data
3. respondents
a- live in cities
c- are educated
d- are known
b- a given purpose
c- any purpose
a- sample survey
b- pilot survey
c- census survey
a- absolutely correct
b- not true
c- true on average
d- universally true
CHAPTER THREE
MEASURES OF DISPERSION
3.1 Meaning
There may be variations in the items of different distributions from the average even
when the distributions have the same value of the mean. Hence, the measures of central
tendency alone are incapable of giving a complete description of the data and have to be
supplemented by some other measures.
3.2 Definitions:
i) Distributions may agree in their averages and yet 'they may differ widely in the scatter
or in their values about the measures of central tendencies.'
ii) Simpson and Kafka said, 'An average alone does not tell the full story. It is hardly
fully representative of a mass, unless we know the manner in which the individual items
scatter around it. A further description of a series is necessary, if we are to gauge how
representative the average is.'
From this discussion we now focus our attention on the scatter or variability, which is
known as dispersion. Let us take the following three sets of data.
Table: marks of students in Group X, Group Y and Group Z; each group has the same
mean, 50.
Thus, the three groups have the same mean, i.e. 50; in fact the medians of groups X and
Y are also equal. Yet if one were to say that the students of the three groups are of equal
capabilities, it would be a totally wrong conclusion. Close examination reveals that in
group X the students have marks equal to the mean, in group Y the marks are very close
to the mean, but in group Z the marks are widely scattered. It is thus clear that a measure
of central tendency alone is not sufficient to describe the data.
A good measure of dispersion should be:
Simple to understand
Easy to compute
3.5.1 Range
In any statistical series, the difference between the largest and the smallest values is
called the range.
Example (Individual Series): The largest item of a series is 100 and the smallest is 10.
Find the range and the coefficient of range.
Solution: R = L - S = 100 - 10 = 90
Coefficient of Range = (L - S)/(L + S) = (100 - 10)/(100 + 10) = 90/110 = 0.818
Example (Discrete Series): Find the range and the coefficient of range of the following
items:
x: 10, 12, 13, 14, 17, 12, 10
Solution: Arranging the items in ascending order: 10, 10, 12, 12, 13, 14, 17
Range = L - S = 17 - 10 = 7
Coefficient of Range = (L - S)/(L + S) = (17 - 10)/(17 + 10) = 7/27 = 0.259
Example (Continuous Series): Find the range and the coefficient of range of the
following distribution:
X (Marks): 0-10, 10-20, 20-30, 30-40, 40-50, with frequencies F (Students) as given.
Solution: For a continuous series the range depends only on the class limits, not on the
frequencies.
Range = L - S = 50 - 0 = 50
Coefficient of Range = (L - S)/(L + S) = (50 - 0)/(50 + 0) = 50/50 = 1
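A minimal Python sketch of the same computations, reusing the marks from the discrete-series example above:

# Range and coefficient of range for an individual/discrete series.
marks = [10, 12, 13, 14, 17, 12, 10]

L, S = max(marks), min(marks)        # largest and smallest items
value_range = L - S                  # R = L - S  -> 7
coeff_range = (L - S) / (L + S)      # relative measure -> 7/27

print("Range =", value_range)
print("Coefficient of range =", round(coeff_range, 3))   # 0.259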
Now the lower quartile (Q1) is the 25th percentile and the upper quartile (Q3) is the 75th
percentile. It is interesting to note that the 50th percentile is the middle quartile (Q2),
which is in fact what you have studied under the title 'Median'.
If we divide (Q3 - Q1) by 2 we get what is known as the semi-interquartile range:
Q.D. = (Q3 - Q1)/2, where Q1 = first quartile and Q3 = third quartile.
Relative or Coefficient of Q.D.:
To find the coefficient of Q.D., we divide the semi-interquartile range by the sum of the
semi-quartiles. Symbolically:
Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1)
Example (Individual Series): Find the quartile deviation and its coefficient for the
following series of N = 10 marks (the first few items are 10, 12, 15, 11, 12).
Solution: The items are first arranged in ascending order.
Q1 = (N + 1)/4 th item, where N = number of items in the data
= (10 + 1)/4 = 11/4 = 2.75th item
2.75th item = 2nd item + 0.75 × (3rd item - 2nd item)
= 8 + 0.75 × (9 - 8) = 8 + 0.75 = 8.75
Q3 = 3(N + 1)/4 th item = 3(10 + 1)/4 = 33/4 = 8.25th item
8.25th item = 8th item + 0.25 × (9th item - 8th item)
= 15 + 0.25 × (15 - 15) = 15 + 0 = 15
Hence Q.D. = (Q3 - Q1)/2 = (15 - 8.75)/2 = 3.125
and Coefficient of Q.D. = (Q3 - Q1)/(Q3 + Q1) = 6.25/23.75 = 0.263
Example (Discrete Series): Find the quartile deviation and its coefficient from a discrete
frequency distribution with N = 68, whose cumulative frequencies (c.f.) run 10, 16, 24,
36, 52, 59, 64 and 68 against the successive values of the items (x).
Q1 = (N + 1)/4 th item
= (68 + 1)/4 = 69/4 = 17.25th item
The 17.25th item lies in c.f. 24, against which the value of X = 6.
Q1 = 6
Q3 = 3(N + 1)/4 th item
= 3(68 + 1)/4 = (3 × 69)/4 = 51.75th item
The 51.75th item lies in c.f. 52, against which the value of X = 8.
Q3 = 8
Q.D. = (Q3 - Q1)/2 = (8 - 6)/2 = 1
Coefficient of Q.D. = (Q3 - Q1)/(Q3 + Q1) = (8 - 6)/(8 + 6) = 2/14 = 0.143
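The positional (N + 1)/4 rule used in these examples can be sketched in Python as follows; the data are hypothetical, chosen so that the quartiles interpolate between items as in the individual-series example:

# Quartile deviation via the (N+1)/4 positional rule with linear
# interpolation between neighbouring items, as used above.
def item_at(sorted_x, pos):
    # pos is a 1-based, possibly fractional, position
    i = int(pos)
    frac = pos - i
    if i >= len(sorted_x):
        return sorted_x[-1]
    return sorted_x[i - 1] + frac * (sorted_x[i] - sorted_x[i - 1])

x = sorted([8, 9, 10, 11, 12, 12, 15, 15, 15, 20])   # hypothetical marks
n = len(x)
q1 = item_at(x, (n + 1) / 4)          # 2.75th item -> 9.75
q3 = item_at(x, 3 * (n + 1) / 4)      # 8.25th item -> 15

qd = (q3 - q1) / 2                    # semi-interquartile range
coeff = (q3 - q1) / (q3 + q1)
print(q1, q3, qd, round(coeff, 3))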
(2) Using any one of the three averages (mean, median or mode), find the deviations
(differences) of the items of the series from it, i.e. xi - x̄, xi - Me or xi - Mo,
where x̄ = Mean, Me = Median and Mo = Mode.
(3) Find the absolute values of these deviations, i.e. ignore their positive (+) and
negative (-) signs: |xi - x̄|, |xi - Me| or |xi - Mo|.
(4) Find the sum of these absolute deviations, i.e. Σ|xi - x̄|, Σ|xi - Me| or Σ|xi - Mo|.
Note that:
(i) Generally the M.D. obtained from the median is the best for practical purposes.
(ii) Coefficient of M.D. = M.D. / (the average from which the deviations are taken),
e.g. M.D./Median; on this score the mean deviation improves upon the range and the
quartile deviation.
Demerits
1. This method lacks algebraic treatment, as signs are ignored while taking deviations
from an average.
2. Mean deviation cannot be considered a scientific method, as it ignores signs.
Example: Calculate the mean and the mean deviation of the following distribution:
Marks | No. of students
0-5 | 449
5-10 | 705
10-15 | 507
15-20 | 281
20-25 | 109
25-30 | 52
30-35 | 16
35-40 | …
Calculations:
1) x̄ = Σfx / Σf (the mean of the distribution)
2) M.D. = Σf|x - x̄| / Σf
For the standard deviation of a frequency distribution:
s.d.(x) = √[Σfi(xi - x̄)² / n], where n = Σfi,
or equivalently s.d.(x) = √[Σfixi²/n - (Σfixi/n)²].
Then V(x), the variance, equals [s.d.(x)]².
Example: Calculate the standard deviation of the following ten observations:
10, 12, 16, 8, 25, 30, 14, 11, 13, 11
Solution:
xi | (xi - x̄) | (xi - x̄)²
10 | -5 | 25
12 | -3 | 9
16 | +1 | 1
8 | -7 | 49
25 | +10 | 100
30 | +15 | 225
14 | -1 | 1
11 | -4 | 16
13 | -2 | 4
11 | -4 | 16
n = 10, Σxi = 150, Σ(xi - x̄)² = 446
Calculations:
i) x̄ = Σxi/n = 150/10 = 15
ii) s.d. = √[Σ(xi - x̄)²/n] = √(446/10) = √44.6 = 6.68
iii) Variance V(x) = 44.6
Example: Calculate the s.d. of the marks of 100 students.
Solution:
Marks | No. of students (fi) | Midvalues (xi) | fixi | fixi²
0-2 | 10 | 1 | 10 | 10
2-4 | 20 | 3 | 60 | 180
4-6 | 35 | 5 | 175 | 875
6-8 | 30 | 7 | 210 | 1470
8-10 | 5 | 9 | 45 | 405
Total | n = 100 | | Σfixi = 500 | Σfixi² = 2940
Calculations:
1) x̄ = Σfixi/n = 500/100 = 5
2) s.d. = √[Σfixi²/n - (Σfixi/n)²] = √(2940/100 - 5²) = √(29.4 - 25) = √4.4 = 2.1 (approx.)
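A short Python sketch of this grouped computation, using the midvalues and frequencies from the worked example above:

# Standard deviation of a grouped distribution using
# sd = sqrt( sum(f*x^2)/n - (sum(f*x)/n)^2 ), as in the worked example.
mids  = [1, 3, 5, 7, 9]          # midvalues of classes 0-2, 2-4, ..., 8-10
freqs = [10, 20, 35, 30, 5]      # numbers of students

n    = sum(freqs)                                     # 100
sfx  = sum(f * x for f, x in zip(freqs, mids))        # 500
sfx2 = sum(f * x * x for f, x in zip(freqs, mids))    # 2940

mean = sfx / n                     # 5.0
var  = sfx2 / n - mean ** 2        # 29.4 - 25 = 4.4
sd   = var ** 0.5                  # about 2.1
print(mean, var, round(sd, 2))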
Chapter Three
Measures of Dispersion
End Chapter Quizzes
1. Which of the following is not a measure of dispersion?
a- mean deviation
b- quartile deviation
c- standard deviation

b- mean deviation
c- coefficient of variation
d- range

e- range
f- mean deviation
g- standard deviation

a- mean
b- median
c- mode
d- zero

a- mean
b- median
c- mode
d- zero

a- range
b- mean deviation
c- standard deviation
d- quartile deviation

9. is called
a- variance
b- absolute deviation
c- standard deviation
d- mean deviation

10.
a- interquartile range
CHAPTER FOUR
MEASURES OF SKEWNESS
4.1 Skewness
Voluminous raw data cannot be easily understood; hence we calculate measures of
central tendency to obtain a representative figure. From the measures of variability we
can know whether most of the items of the data are close to or away from these central
tendencies. But these statistical measures of average and variation are not enough to
draw sufficient inferences about the data. Another aspect of the data is its symmetry. In
the chapter 'Graphic display' we have seen that a frequency distribution may or may not
be symmetrical about the mode. This symmetry is studied through the 'skewness'. Still
one more aspect of the curve that we need to know is the flatness or peakedness of its
top. This is understood by what is known as 'kurtosis'.
It may happen that two distributions have the same mean and standard deviation. For
example, see the following diagram.
Although the two distributions have the same means and standard
deviations they are not identical. Where do they differ ?
They differ in symmetry. The left-hand distribution is symmetrical, whereas the
distribution on the right-hand side is asymmetrical or skewed. For a symmetrical
distribution, the values at equal distances on either side of the mode have equal
frequencies. Thus, the mode, median and mean all coincide. Its curve rises slowly,
reaches a maximum (peak) and falls equally slowly (Fig. 1). But for a skewed
distribution, the mean, mode and median do not coincide. Skewness is positive or
negative according as the mean and median lie to the right or to the left of the mode.
The curve of a positively skewed distribution (Fig. 2) rises rapidly, reaches its maximum
and falls slowly; in other words, the tail as well as the median lie on the right-hand side.
The curve of a negatively skewed distribution (Fig. 3) rises slowly, reaches its maximum
and falls rapidly; in other words, the tail as well as the median lie on the left-hand side.
Table: sizes and frequencies of three distributions, used to test for the presence of
skewness.
3. The sum of the positive deviations from the median is not equal to the sum of the
negative deviations.
4. Frequencies are not equally distributed at points of equal deviation from the mode.
5. When the data are plotted on a graph they do not give the normal bell-shaped form.
J = (Mean - Mode) / s.d., i.e. Skp = (x̄ - Mo) / σ
Pearson has suggested the use of the relation
(Mean - Mode) = 3 (Mean - Median)
if it is not possible to determine the mode (Mo) of a distribution; thus
J = Skp = 3 (x̄ - Me) / σ.
Note: i) Although the coefficient of skewness always lies within ±1, Karl Pearson's
coefficient lies within ±3.
ii) If J = 0, there is no skewness.
iii) If J is positive, the skewness is positive.
iv) If J is negative, the skewness is negative.
Unless an indication is given to the contrary, you should use only Karl Pearson's
formula.
Table: cumulative distribution of marks (0 to 80) against the number of students (150,
140, 100, 80, 80, 70, 30, 14).
Note: You will generally find different values of J when calculated by Karl Pearson's
and by Bowley's formula. But the value of J by Bowley's formula always lies within ±1.
Example: Calculate the coefficient of skewness from the following distribution of
incomes:
Solution:
Income group | No. of workers | c.f.
Below 50 | 1 | 1
50-70 | 16 | 17
70-90 | 39 | 56
90-110 | 58 | 114
110-130 | 60 | 174
130-150 | 46 | 220
150-170 | 22 | 242
170-190 | 15 | 257
190-210 | 15 | 272
210-230 | 9 | 281
230 & above | 10 | 291
n = Σf = 291
Calculations:
1) Median = size of (n/2)th item = size of the (291/2)th = 145.5th item, which lies in
the class 110-130.
Median = L + ((n/2 - c.f.)/f) × h = 110 + ((145.5 - 114)/60) × 20 = 110 + 10.5 = 120.5
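Karl Pearson's coefficient of skewness described earlier in this chapter can be sketched in Python as below; the data are hypothetical, invented only to illustrate the formula:

# Karl Pearson's coefficient of skewness in its mode-free form,
# Skp = 3 * (mean - median) / s.d., as given earlier in this chapter.
data = sorted([12, 15, 11, 18, 13, 14, 16, 14, 13, 24])

n = len(data)
mean = sum(data) / n
median = (data[n // 2 - 1] + data[n // 2]) / 2    # n is even here
sd = (sum((v - mean) ** 2 for v in data) / n) ** 0.5

skp = 3 * (mean - median) / sd
print(round(skp, 3))    # positive value -> positively skewed data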
Chapter Four
Measures of Skewness
End Chapter Quizzes
1. For a positively skewed distribution, which of the following inequalities holds?

a- Q1 + Q3 > 2Q2
b- Q1 + Q2 > 2Q3
c- Q1 + Q3 > Q2
d- Q3 - Q1 > Q2

a- 10
b- 35
c- 20
d- zero

6. The first and third quartiles of a frequency distribution are 30 and 75, and its
coefficient of skewness is 0.6. The median of the frequency distribution is
a- 40
b- 39
c- 41
d- 38

7. For a negatively skewed distribution, the correct relation between mean, median and
mode is

a- left tail
b- right tail
c- middle
d- anywhere

a- middle
b- right tail
c- left tail
d- whole curve
CHAPTER FIVE
CORRELATION
5.1 Introduction
So far we have considered only univariate distributions. From the averages, dispersion
and skewness of a distribution we get a complete idea about its structure. Many a time,
however, we come across problems which involve two or more variables. If we carefully
study the figures of rainfall and production of paddy, the figures of accidents and motor
cars in a city, of demand and supply of a commodity, or of sales and profit, we may find
that there is some relationship between the two variables. On the other hand, if we
compare the figures of rainfall in America and the production of cars in Japan, we may
find that there is no relationship between the two variables. If there is a relation between
two variables, i.e. when one variable changes the other also changes in the same or in
the opposite direction, we say that the two variables are correlated.
W. I. King: 'If it is proved that in a large number of instances two variables tend always
to fluctuate in the same or in the opposite direction, then it is established that a
relationship exists between the variables. This is called a correlation.'
The correlation is one of the most common and most useful statistics.
A correlation is a single number that describes the degree of relationship
between two variables. Let's work through an example to show you how this
statistic is computed.
Correlation is a statistical technique that can show whether and how
strongly pairs of variables are related. For example, height and weight are
related; taller people tend to be heavier than shorter people. The relationship
isn't perfect. People of the same height vary in weight, and you can easily
think of two people you know where the shorter one is heavier than the taller
one. Nonetheless, the average weight of people 5'5'' is less than the average
weight of people 5'6'', and their average weight is less than that of people
5'7'', etc. Correlation can tell you just how much of the variation in peoples'
weights is related to their heights.
Although this correlation is fairly obvious, your data may contain unsuspected
correlations. You may also suspect there are correlations but not know which are the
strongest. An intelligent correlation analysis can lead to a greater understanding of your
data.
Correlation thus means the study of the existence, magnitude and direction of the
relation between two or more variables. In technology and in statistics, correlation is
very important. The famous astronomer Bravais, Prof. Sir Francis Galton, Karl Pearson
(who used this concept in biology and in genetics), Prof. Neiswanger and many others
have contributed to this great subject.
5.2 Definitions :
An analysis of the covariation of two or more variables is usually
called correlation.
A. M. Tuttle
Correlation analysis attempts to determine the degree of relationship
between variables.
Ya Lun Chou
'The effect of correlation is to reduce the range of uncertainty of one's prediction.'
Tippett
The first caveat is that correlation does not imply causation. Two variables may move
together, but you cannot assume that buying computers causes people to buy athletic
shoes (or vice versa).
The second caveat is that the Pearson correlation technique works best
with linear relationships: as one variable gets larger, the other gets larger (or
smaller) in direct proportion. It does not work well with curvilinear
relationships (in which the relationship does not follow a straight line). An
example of a curvilinear relationship is age and health care. They are
related, but the relationship doesn't follow a straight line. Young children
and older people both tend to use much more health care than teenagers or
young adults. Multiple regression (also included in the Statistics Module)
can be used to examine curvilinear relationships, but it is beyond the scope
of this article.
Correlation Example
Let's assume that we want to look at the relationship between two variables, height (in
inches) and self esteem. Perhaps we have a hypothesis that how tall you are affects your
self esteem (incidentally, we probably don't have to worry about the direction of
causality here -- it's not likely that self esteem causes your height!). Let's say we collect
some information on twenty individuals (all male -- we know that the average height
differs for males and females, so to keep this example simple we'll just use males).
Height is measured in inches. Self esteem is measured based on the average of 10 1-to-5
rating items (where higher scores mean higher self esteem). Here's the data for the 20
cases (don't take this too seriously -- the data were made up to illustrate what a
correlation is):
Person | Height | Self Esteem
1 | 68 | 4.1
2 | 71 | 4.6
3 | 62 | 3.8
4 | 75 | 4.4
5 | 58 | 3.2
6 | 60 | 3.1
7 | 67 | 3.8
8 | 68 | 4.1
9 | 71 | 4.3
10 | 69 | 3.7
11 | 68 | 3.5
12 | 67 | 3.2
13 | 63 | 3.7
14 | 62 | 3.3
15 | 60 | 3.4
16 | 63 | 4.0
17 | 65 | 4.1
18 | 67 | 3.8
19 | 63 | 3.4
20 | 61 | 3.6
Now, let's take a quick look at the histogram for each variable, together with the
descriptive statistics:
Variable | Mean | StDev | Variance | Sum | Range
Height | 65.4 | 4.40574 | 19.4105 | 1308 | 17 (58 to 75)
Self Esteem | 3.755 | 0.426 | 0.181 | 75.1 | 1.5 (3.1 to 4.6)
You should immediately see in the bivariate plot that the relationship
between the variables is a positive one (if you can't see that, review the
section on types of relationships) because if you were to fit a single straight
line through the dots it would have a positive slope or move up from left to
right. Since the correlation is nothing more than a quantitative estimate of
the relationship, we would expect a positive correlation.
What does a "positive relationship" mean in this context? It means
that, in general, higher scores on one variable tend to be paired with higher
scores on the other and that lower scores on one variable tend to be paired
with lower scores on the other. You should confirm visually that this is
generally true in the plot above.
The nature of the graph gives us the idea of the linear type of correlation. High degree,
moderate degree and low degree are the three categories of this kind of correlation. The
following table shows the degree of the coefficient of correlation.
Degree | Positive | Negative
Absence of correlation | Zero | Zero
Perfect correlation | +1 | -1
High degree | +0.75 to +1 | -0.75 to -1
Moderate degree | +0.25 to +0.75 | -0.25 to -0.75
Low degree | 0 to +0.25 | 0 to -0.25
We use the symbol r to stand for the correlation. Through the magic of mathematics it
turns out that r will always be between -1.0 and +1.0. If the correlation is negative, we
have a negative relationship; if it's positive, the relationship is positive. You don't need
to know how we came up with this formula unless you want to be a statistician. But you
probably will need to know how the formula relates to real data -- how you can use the
formula to compute the correlation. Let's look at the data we need for the formula.
Here's the original data with the other necessary columns:
Person | Height (x) | Self Esteem (y) | x*y | x*x | y*y
1 | 68 | 4.1 | 278.8 | 4624 | 16.81
2 | 71 | 4.6 | 326.6 | 5041 | 21.16
3 | 62 | 3.8 | 235.6 | 3844 | 14.44
4 | 75 | 4.4 | 330.0 | 5625 | 19.36
5 | 58 | 3.2 | 185.6 | 3364 | 10.24
6 | 60 | 3.1 | 186.0 | 3600 | 9.61
7 | 67 | 3.8 | 254.6 | 4489 | 14.44
8 | 68 | 4.1 | 278.8 | 4624 | 16.81
9 | 71 | 4.3 | 305.3 | 5041 | 18.49
10 | 69 | 3.7 | 255.3 | 4761 | 13.69
11 | 68 | 3.5 | 238.0 | 4624 | 12.25
12 | 67 | 3.2 | 214.4 | 4489 | 10.24
13 | 63 | 3.7 | 233.1 | 3969 | 13.69
14 | 62 | 3.3 | 204.6 | 3844 | 10.89
15 | 60 | 3.4 | 204.0 | 3600 | 11.56
16 | 63 | 4.0 | 252.0 | 3969 | 16.00
17 | 65 | 4.1 | 266.5 | 4225 | 16.81
18 | 67 | 3.8 | 254.6 | 4489 | 14.44
19 | 63 | 3.4 | 214.2 | 3969 | 11.56
20 | 61 | 3.6 | 219.6 | 3721 | 12.96
Sum | 1308 | 75.1 | 4937.6 | 85912 | 285.45
The first three columns are the same as in the table above. The next
three columns are simple computations based on the height and self esteem
data. The bottom row consists of the sum of each column. This is all the
information we need to compute the correlation. Here are the values from
the bottom row of the table (where N is 20 people) as they are related to the
symbols in the formula:
Now, when we plug these values into the formula given above, we get the following
(shown here step by step):
r = [N Σxy - Σx Σy] / √{[N Σx² - (Σx)²][N Σy² - (Σy)²]}
= [20(4937.6) - (1308)(75.1)] / √{[20(85912) - (1308)²][20(285.45) - (75.1)²]}
= [98752 - 98230.8] / √{[1718240 - 1710864][5709 - 5640.01]}
= 521.2 / √(7376 × 68.99)
= 521.2 / √508870.24
= 521.2 / 713.4
= 0.73
So, the correlation for our twenty cases is .73, which is a fairly strong
positive relationship. I guess there is a relationship between height and self
esteem, at least in this made up data!
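The step-by-step arithmetic above is easy to verify with a short Python sketch over the same twenty (height, self-esteem) pairs:

# Pearson's r by the raw-score formula used above:
# r = (N*Sxy - Sx*Sy) / sqrt((N*Sx2 - Sx^2) * (N*Sy2 - Sy^2))
heights = [68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
           68, 67, 63, 62, 60, 63, 65, 67, 63, 61]
esteem  = [4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
           3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6]

n = len(heights)
sx, sy = sum(heights), sum(esteem)
sxy = sum(x * y for x, y in zip(heights, esteem))
sx2 = sum(x * x for x in heights)
sy2 = sum(y * y for y in esteem)

r = (n * sxy - sx * sy) / ((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)) ** 0.5
print(round(r, 2))    # 0.73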
vi) If the points are spread widely over a broad strip, falling
downward, the correlation is low degree negative (see fig.6)
vii) If the points are spread (scattered) without any specific pattern,
the correlation is absent. i.e. r = 0. (see fig.7)
Though this method is simple and gives a rough idea about the existence and the degree
of correlation, it is not reliable. As it is not a mathematical method, it cannot measure the
degree of correlation exactly.
r = Σ(x - x̄)(y - ȳ) / (N σx σy)
where N = number of pairs of observations.
Note: r is also known as the product-moment coefficient of correlation.
OR r = Σxy / √(Σx² Σy²), where x and y are deviations from the respective means,
OR r = [N Σxy - Σx Σy] / √{[N Σx² - (Σx)²][N Σy² - (Σy)²]}.
Now the covariance of x and y is defined as Cov(x, y) = Σ(x - x̄)(y - ȳ)/N.
Example: Calculate the coefficient of correlation between the heights of fathers and
their sons from the following data:
Height of father (cm): 165, 166, 167, 167, 168, 169, 170, 172
Height of son (cm): 167, 168, 165, 168, 172, 172, 169, 171
Solution:
xi | yi | x = xi - x̄ | y = yi - ȳ | xy | x² | y²
165 | 167 | -3 | -2 | 6 | 9 | 4
166 | 168 | -2 | -1 | 2 | 4 | 1
167 | 165 | -1 | -4 | 4 | 1 | 16
167 | 168 | -1 | -1 | 1 | 1 | 1
168 | 172 | 0 | 3 | 0 | 0 | 9
169 | 172 | 1 | 3 | 3 | 1 | 9
170 | 169 | 2 | 0 | 0 | 4 | 0
172 | 171 | 4 | 2 | 8 | 16 | 4
Σxi = 1344, Σyi = 1352, Σxy = 24, Σx² = 36, Σy² = 44
Calculation:
x̄ = 1344/8 = 168, ȳ = 1352/8 = 169
Now, r = Σxy / √(Σx² Σy²) = 24 / √(36 × 44) = 24/39.8 = 0.603
Since r is positive and about 0.6, the correlation is positive and moderate (i.e. direct and
reasonably good).
Example From the following data compute the coefficient of
correlation between x and y.
R = 1 - [6 ΣD² / (N(N² - 1))]
where R = rank correlation coefficient,
D = difference between the ranks of the two items,
N = the number of observations.
Note: -1 ≤ R ≤ 1.
Computation:
i.Give ranks to the values of items. Generally the item with the highest
value is ranked 1 and then the others are given ranks 2, 3, 4, .... according to
their values in the decreasing order.
4.5th rank. If three items are of equal rank say 4th then they are given
= 5th rank each. If m be the number of items of equal ranks, the
is added to S D2. If there are more than one of such cases
factor
then this factor added as many times as the number of such cases, then
Example: The ranks of ten students in Maths and in Stats are given. Calculate the rank
correlation coefficient.
Solution: For each student the difference of ranks D = R1 - R2 and its square D² are
computed. The totals are ΣD = 0 and ΣD² = 96, with N = 10.
Calculation of R:
R = 1 - [6 ΣD² / (N(N² - 1))] = 1 - (6 × 96)/(10 × 99) = 1 - 576/990 = 1 - 0.58 = 0.42
Example: The marks of six students in Stats and in English are given below. Calculate
the rank correlation coefficient.
Marks in Stats: 40, 42, 45, 35, 36, 39
Marks in English: 46, 43, 44, 39, 40, 43
Solution:
Marks in Stats | R1 | Marks in English | R2 | D = R1 - R2 | D²
40 | 3 | 46 | 1 | 2 | 4
42 | 2 | 43 | 3.5 | -1.5 | 2.25
45 | 1 | 44 | 2 | -1 | 1
35 | 6 | 39 | 6 | 0 | 0
36 | 5 | 40 | 5 | 0 | 0
39 | 4 | 43 | 3.5 | 0.5 | 0.25
ΣD = 0, ΣD² = 7.50, N = 6
Since the mark 43 occurs twice in English (m = 2), the factor m(m² - 1)/12
= 2(4 - 1)/12 = 0.5 is added to ΣD², giving 7.5 + 0.5 = 8.
R = 1 - [6 × 8 / (6 × (36 - 1))] = 1 - 48/210 = 1 - 0.229 = 0.77
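A Python sketch of the whole procedure, including the average-rank treatment of ties and the m(m² - 1)/12 correction used above:

# Spearman's rank correlation R = 1 - 6*sum(D^2) / (N*(N^2 - 1)),
# with tied values sharing the average of the ranks they occupy.
def ranks(values):
    order = sorted(values, reverse=True)       # rank 1 = highest value
    return [(2 * order.index(v) + order.count(v) + 1) / 2 for v in values]

stats_marks   = [40, 42, 45, 35, 36, 39]
english_marks = [46, 43, 44, 39, 40, 43]

r1 = ranks(stats_marks)        # [3, 2, 1, 6, 5, 4]
r2 = ranks(english_marks)      # [1, 3.5, 2, 6, 5, 3.5]

n  = len(r1)
d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))     # 7.5
d2 += 2 * (2 ** 2 - 1) / 12                        # tie factor for m = 2
R = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(R, 2))             # 0.77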
by independently awarding marks as follows:
Solution:
Chapter Five
Correlation Analysis
End Chapter Quizzes
a- 0
b- between -1 and 0
c- between -1 and +1

7. From a given (2 × c) contingency table, the appropriate measure of association is
a- correlation ratio
b- biserial correlation
c- intraclass correlation
d- tetrachoric correlation

9.
a- far apart
b- coincident
c- near to each other
d- none of the above
CHAPTER SIX
REGRESSION ANALYSIS
6.1 Meaning
In statistics, regression analysis is a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also called
response variable or measurement) and of one or more independent variables (also known
as explanatory variables or predictors). The dependent variable in the regression
equation is modeled as a function of the independent variables, corresponding
parameters ("constants"), and an error term.
So Regression analysis is any statistical method where the mean of one or more
random variables is predicted based on other measured random variables. There are two
types of regression analysis, chosen according to whether the data approximate a straight
line, when linear regression is used, or not, when non-linear regression is used.
Regression can be used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships. These uses of
regression rely heavily on the underlying assumptions being satisfied. Regression
analysis has been criticized as being misused for these purposes in many cases where the
appropriate assumptions cannot be verified to hold. One factor contributing to the
misuse of regression is that it can take considerably more skill to critique a model than
to fit one.
6.2 Definitions :
'Regression is the measure of the average relationship between two or more variables in
terms of the original units of the data.'
Morris M. Blair
'One of the most frequently used techniques in economics and business research, to find
a relation between two or more variables that are related causally, is regression
analysis.'
Taro Yamane
It is often more important to find out what the relation actually is, in order to
estimate or predict one variable and the statistical technique appropriate to such a case is
called regression analysis.
Wallis and Roberts
The regression line of Y on X, Y = a + bX, is obtained by solving the two normal
equations:
(i) ΣY = na + bΣX
(ii) ΣXY = aΣX + bΣX²
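A small Python sketch of fitting Y = a + bX by solving these two normal equations; the data pairs are hypothetical:

# Fit Y = a + bX from the normal equations:
#   sum(Y)  = n*a + b*sum(X)
#   sum(XY) = a*sum(X) + b*sum(X^2)
X = [1, 2, 3, 4, 5]            # hypothetical explanatory values
Y = [3, 5, 6, 8, 11]           # hypothetical responses

n = len(X)
sx, sy = sum(X), sum(Y)
sxy = sum(x * y for x, y in zip(X, Y))
sx2 = sum(x * x for x in X)

# Eliminating a between the two equations gives b, then a follows:
b = (n * sxy - sx * sy) / (n * sx2 - sx ** 2)
a = (sy - b * sx) / n
print("Y = {:.2f} + {:.2f} * X".format(a, b))   # Y = 0.90 + 1.90 * X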
Correlation
1. Correlation measures the relationship between two variables which vary in the same
or opposite direction.
2. Here both X and Y are random variables.
Regression Analysis
1. Regression means going back or the act of return. It is a mathematical measure which
shows the average relationship between two variables.
2. Here X is a random variable and Y is a fixed variable; however, both variables may be
random.
Chapter Six
Regression Analysis
End Chapter Quizzes
5. Scatter diagram of the variate values (X, Y) gives the idea about
a- functional relationship
b- regression model
c- distribution of errors
d- none of the above
CHAPTER SEVEN
TIME SERIES ANALYSIS
7.1 Meaning
In statistics, signal processing, and many other fields, a time series is a sequence of data
points, measured typically at successive times, spaced at (often uniform) time intervals.
Time series analysis comprises methods that attempt to understand such time series,
often either to understand the underlying context of the data points (where did they come
from? what generated them?), or to make forecasts (predictions). Time series forecasting
is the use of a model to forecast future events based on known past events: to forecast
future data points before they are measured. A standard example in econometrics is the
opening price of a share of stock based on its past performance.
The term time series analysis is used to distinguish a problem, firstly from more
ordinary data analysis problems (where there is no natural ordering of the context of
individual observations), and secondly from spatial data analysis where there is a context
that observations (often) relate to geographical locations. There are additional
possibilities in the form of space-time models (often called spatial-temporal analysis). A
time series model will generally reflect the fact that observations close together in time
will be more closely related than observations further apart. In addition, time series
models will often make use of the natural one-way ordering of time so that values in a
series for a given time will be expressed as deriving in some way from past values, rather
than from future values (see time reversibility.)
So a time series is a sequence of observations which are ordered in time (or
space). If observations are made on some phenomenon throughout time, it is most
sensible to display the data in the order in which they arose, particularly since successive
observations will probably be dependent. Time series are best displayed in a scatter plot.
The series value X is plotted on the vertical axis and time t on the horizontal axis. Time is
called the independent variable (in this case however, something over which you have
little control). There are two kinds of time series data:
1. Continuous, where we have an observation at every instant of time, e.g. lie detectors,
electrocardiograms. We denote this using observation X at time t, X(t).
2. Discrete, where we have an observation at (usually regularly) spaced intervals. We
denote this as Xt.
7.2 Definitions
A set of data depending on the time is called a time series.
------- Kenny and Keeping
A time series consists of data arranged chronologically.
------- Croxton and Cowden
'A time series may be defined as a sequence of repeated measurements of a variable
made periodically through time.'
------- C. H. Mayers
7.3 Applications of time series: Time series models serve two broad purposes,
understanding the underlying process and forecasting it, in fields such as:
Economic Forecasting
Sales Forecasting
Budgetary Analysis
Yield Projections
Inventory Studies
Workload Projections
Utility Studies
Census Analysis
7.4.4 Evaluation of actual data: On the basis of deviation analysis of actual data and
estimated data obtained from analysis of time series, we can come to know about the
causes of this change.
7.4.5 Prediction of trade cycle: We can know about the factors of cyclical variations
like boom, depression, recession and recovery which are very important to business
community.
7.4.6 Universal utility: The analysis of time series is not only useful to the business
community and economists but is equally useful to agriculturists, governments,
researchers, political and social institutions, scientists, etc.
7.7.3 Moving Average method: This method is a better technique for determining trend
than the semi-average method. The trend values are obtained with a fair degree of
accuracy by eliminating cyclical fluctuations. In this method we calculate averages on a
moving basis. The period of the moving average is determined from the length of the
cyclical fluctuations, which varies from 3 to 11 years.
Merits:
-This technique is easier in relation to method of least square.
-This technique is effective if the trend of series is irregular.
Demerits:
-In this method we cannot obtain the trend values for all the years, as we leave out the
first and last year values of the data while computing a three-year moving average, and
so on.
-The basic purpose of a trend value is to predict the future trend. In this method we
cannot extend the trend line in either direction, so this method cannot be used for
prediction purposes.
7.7.4 Method of least square: This is the best method of measuring secular trend. It is a
mathematical as well as an analytical tool. This method can be fitted to economic and
business time series to make future predictions.
The trend line may be linear or non-linear.
Merits :
-The method of least square does not suffer from subjectivity or personal
judgement as it is a mathematical method.
-We can compute the trend value of all the given years by this method.
Demerits:
-The method is based on mathematical technique, so it is not easily understandable to a
non-mathematical person.
-If we add or delete some observations in the data, the values of the constants 'a' and 'b'
will change and a new trend line will follow.
rainy, autumn etc. Thus, seasonal variations refer to the annual repetitive pattern in
economic and business activity.
The following measures are used to measure the seasonal variations:
4. Chain relatives are adjusted for each quarter by subtracting the quarterly effect
(quarterly effect × 1, quarterly effect × 2, quarterly effect × 3) from the II, III and IV
quarters.
5. The seasonal index is finally computed. Since the total of the quarterly indices should
be 400, while the actual total will generally be more, the seasonal index is computed as
Seasonal index = (Chain index of the quarter × 400) / Actual total of the chain indices of
the four quarters.
7.9 Practical Problems:
Illustration: Find the 3-yearly moving averages from the following data:
Year | Sales (in lakh Rs.)
1990 | 3
1991 | 8
1992 | 10
1993 | 9
1994 | 12
1995 | …
1996 | …
1997 | …
1998 | …
1999 | …
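A Python sketch of the three-yearly moving average; only the 1990-1994 sales figures survive in the table above, so the computation is shown on those:

# Three-yearly moving average: each average is centred on the middle
# year, so no trend value exists for the first and last years.
years = [1990, 1991, 1992, 1993, 1994]
sales = [3, 8, 10, 9, 12]      # in lakh Rs., from the illustration

for i in range(1, len(sales) - 1):
    ma = sum(sales[i - 1:i + 2]) / 3
    print(years[i], round(ma, 2))
# 1991 7.0, 1992 9.0, 1993 10.33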
CHAPTER EIGHT
PROBABILITY
8.1 Introduction
The theory of probability was developed towards the end of the 18th century, and its
history suggests that it developed with the study of games of chance, such as rolling a
die, drawing a card, flipping a coin, etc. Apart from these, uncertainty prevails in every
sphere of life. For instance, one often predicts: 'It will probably rain tonight', 'It is quite
likely that there will be a good yield of cereals this year', and so on. This indicates that,
in layman's terminology, the word probability connotes that there is an uncertainty about
the happening of events. To put probability on a better footing we define it; but before
doing so, we have to explain a few terms.
8.2.1 Trial
A procedure or an experiment to collect statistical data, such as rolling a die or flipping
a coin, is called a trial.
8.2.4 Event
Any subset of a sample space is called an event. A sample space S serves as the
universal set for all questions related to an experiment, and an event A with respect to it
is the set of all possible outcomes favourable to the event A.
For example,
A random experiment: flipping a coin twice
Sample space: S = {(HH), (HT), (TH), (TT)}
The question: 'both the flips show the same face'
Therefore, the event A: {(HH), (TT)}
8.3 Definitions
We shall now consider two definitions of probability.
Example: Find the probability of getting 3 or 6 in throwing a die.
Solution: Experiment: throwing a die.
Sample space: S = {1, 2, 3, 4, 5, 6}, so n(S) = 6.
Event A: getting 3 or 6, so A = {3, 6} and n(A) = 2.
Therefore, p(A) = n(A)/n(S) = 2/6 = 1/3.
Example: Two dice are rolled. Find the probability that the score on the second die is
greater than the score on the first die.
Solution: Experiment: two dice are rolled.
Sample space: S = {(1, 1), (1, 2), …, (6, 6)}, so n(S) = 6 × 6 = 36.
Event A: the score on the second die > the score on the first die, i.e.
A = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4), (2, 5), (2, 6), (3, 4), (3, 5), (3, 6),
(4, 5), (4, 6), (5, 6)}
n(A) = 15
Therefore, p(A) = n(A)/n(S) = 15/36 = 5/12.
Example: A coin is tossed three times. Find the probability of getting at least one head.
Solution: Experiment: a coin is tossed three times.
Sample space: S = {(HHH), (HHT), (HTH), (HTT), (THT), (TTH), (THH), (TTT)},
n(S) = 8.
Event A: getting at least one head, so that the complementary event A' is getting no
head at all:
A' = {(TTT)}, n(A') = 1, P(A') = 1/8.
Therefore, P(A) = 1 - P(A') = 1 - 1/8 = 7/8.
Example: A ball is drawn at random from a box containing 6 red balls, 4 white balls and
5 blue balls. Determine the probability that the ball drawn is (i) red, (ii) white, (iii) blue,
(iv) not red, (v) red or white.
Solution: Let R, W and B denote the events of drawing a red ball, a white ball and a
blue ball respectively. There are 15 balls in all.
(i) P(R) = 6/15 = 2/5
(ii) P(W) = 4/15
(iii) P(B) = 5/15 = 1/3
(iv) P(not R) = 1 - 6/15 = 9/15 = 3/5
(v) P(R or W) = (6 + 4)/15 = 10/15 = 2/3
Example: If four ladies and six gentlemen sit for a photograph in a row at random, what
is the probability that no two ladies will sit together?
Solution: Ten persons can sit in a row in 10! ways. If no two ladies are to be together,
the six gentlemen can first be arranged in 6! ways; the ladies then have 7 possible
positions, 2 at the ends and 5 between the gentlemen:
Arrangement: L, G1, L, G2, L, G3, L, G4, L, G5, L, G6, L
The four ladies can occupy these 7 positions in 7P4 = 7 × 6 × 5 × 4 = 840 ways.
Therefore, P = (6! × 840)/10! = (720 × 840)/3628800 = 1/6.
Example: In a class there are 13 students; 5 of them are boys and the rest are girls. Find
the probability that two students selected at random will both be girls.
Solution: Two students out of 13 can be selected in 13C2 = 78 ways, and two girls out
of 8 can be selected in 8C2 = 28 ways.
Therefore, P(both girls) = 28/78 = 14/39.
In general, if the letters A and B stand for any two events, then
P(A or B) = P(A) + P(B) - P(A and B).
Example: Two dice are rolled. Find the probability that the score is an even number or a
multiple of 3.
Solution: Two dice are rolled.
Sample space S = {(1, 1), (1, 2), …, (6, 6)}, n(S) = 6 × 6 = 36.
Event E: the score is an even number or a multiple of 3.
Note that here 'score' means the sum of the numbers on both dice when they land; for
example, (1, 1) has score 1 + 1 = 2. Clearly the least score is 2 and the highest score is
(6, 6): 6 + 6 = 12, i.e. the possible scores are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.
Let event A: the score is an even number.
A = {(1, 1), (1, 3), (1, 5), (2, 2), (2, 4), (2, 6), (3, 1), (3, 3), (3, 5), (4, 2), (4, 4), (4, 6),
(5, 1), (5, 3), (5, 5), (6, 2), (6, 4), (6, 6)}
Therefore n(A) = 18.
Let event B: the score is a multiple of 3, i.e. 3, 6, 9 or 12.
B = {(1, 2), (1, 5), (2, 1), (2, 4), (3, 3), (3, 6), (4, 2), (4, 5), (5, 1), (5, 4), (6, 3), (6, 6)}
n(B) = 12.
Let event A ∩ B: the score is an even number and a multiple of 3 (i.e. common to both
A and B), that is, the score is 6 or 12:
A ∩ B = {(1, 5), (5, 1), (2, 4), (4, 2), (3, 3), (6, 6)}
n(A ∩ B) = 6.
Therefore, P(E) = P(A) + P(B) - P(A ∩ B) = 18/36 + 12/36 - 6/36 = 24/36 = 2/3.
Example: Three machines I, II and III manufacture respectively 0.4, 0.5 and 0.1 of the
total production. The percentage of defective items produced by I, II and III is 2, 4 and 1
percent respectively. For an item chosen at random, what is the probability that it is
defective?
Solution: P(defective) = 0.4 × 0.02 + 0.5 × 0.04 + 0.1 × 0.01
= 0.008 + 0.020 + 0.001 = 0.029.
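A Python sketch of the same computation, extended with a Bayes' theorem step in the spirit of Module V (asking which machine a defective item most likely came from):

# Total probability for the three-machine example, plus a Bayes step.
share     = {"I": 0.4, "II": 0.5, "III": 0.1}     # production shares
defective = {"I": 0.02, "II": 0.04, "III": 0.01}  # P(defective | machine)

p_def = sum(share[m] * defective[m] for m in share)
print(p_def)                       # 0.029

# Bayes' theorem: P(II | defective) = P(II) * P(defective | II) / P(defective)
p_II = share["II"] * defective["II"] / p_def
print(round(p_II, 3))              # 0.69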
Example: In shuffling a pack of cards, 4 cards are accidentally dropped one after
another. Find the chance that the missing cards are one from each suit.
Solution: Let H, D, C and S denote heart, diamond, club and spade cards respectively.
The four missing cards can be chosen in 52C4 = 270725 ways, and one card from each
suit in 13 × 13 × 13 × 13 = 28561 ways.
Therefore, P = 28561/270725 = 2197/20825 ≈ 0.105.
Conditional Probability
In many situations you get more information than simply the total outcomes and
favorable outcomes you already have, and hence you are in a position to make more
informed judgements regarding the probabilities of such situations. For example,
suppose a card is drawn at random from a deck of 52 cards. Let B denote the event 'the
card is a diamond' and A denote the event 'the card is red'. We may then consider the
following probabilities.
Since there are 26 red cards, of which 13 are diamonds, the probability that the card is a
diamond is P(B) = 13/52 = 1/4, and the probability that it is red is P(A) = 26/52 = 1/2.
The probability of B under the condition that A has occurred is known as the conditional
probability and is denoted by P(B/A). Here P(B/A) = 13/26 = 1/2. It should be observed
that the probability of the event B is increased by the additional information that the
event A has occurred.
Conditional probability is found using the formula
P(B/A) = P(A ∩ B)/P(A).
Justification: P(B/A) = P(A ∩ B)/P(A), and similarly P(A/B) = P(A ∩ B)/P(B).
In both cases, if A and B are independent events, then P(A/B) = P(A) and
P(B/A) = P(B).
Therefore P(A) = P(A ∩ B)/P(B) and P(B) = P(A ∩ B)/P(A), i.e. P(A ∩ B) = P(A) P(B).
2. The various tests of significance, like the Z test, F test and Chi-square test, are
derived from the theory of probability.
3. This theory gives solutions to problems relating to games of chance.
4. The decision theories are based on the fundamental laws of probability.
5. The theory is generally used in economic and business decision making, and is very
useful in situations where risk and uncertainty prevail.
6. Subjective probability is widely used in those situations where the actual
measurement of probability is not feasible; it has thus added a new dimension to the
theory of probability. These probabilities can be revised at a later stage on the basis of
experience.
Illustration:
Find the probability of having at least one son in a family with two children.
Solution:
Two children in a family may be:
(1) both sons,
or (2) son and daughter,
or (3) daughter and son,
or (4) both daughters.
Thus, the total number of equally likely cases n = 4.
'At least one son' implies that a family may have one son or two sons, so the favourable
number of cases m = 3 (i.e. options 1, 2 and 3).
P(A) = m/n = 3/4.
Illustration: Find the chance of getting an ace in a draw from a pack of 52 cards.
Solution:
Total number of cards n = 52.
Number of favourable cases m = 4 (the number of aces).
P(A) = m/n = 4/52 = 1/13.
Illustration: Suppose an ideal die is tossed twice. What is the probability of getting a
sum of 10 in the two tosses?
Solution:
A die can land the first time in 6 ways and the second time in 6 ways, so two tosses can
occur in 6 × 6 = 36 ways (by the rule of counting).
The number of ways in which the two tosses give a sum of 10 is m = 3
(i.e. 4 + 6, 5 + 5 and 6 + 4).
P(A) = m/n = 3/36 = 1/12.
Classical Probability
Classical Definition of Probability
The classical definition of probability is the proportion of times that an event will occur,
assuming that all outcomes in a sample space are equally likely to occur. The probability
of an event is determined by counting the number of outcomes in the sample space that
satisfy the event and dividing by the total number of outcomes in the sample space. The
probability of an event A is
P(A) = NA/N
Where NA is the number of outcomes that satisfy the condition of event A and N is the
total number of outcomes in the sample space. The important idea here is that one can
develop a probability from fundamental reasoning about the process.
Example:
In a pack of cards we have N = 52 equally likely outcomes. We have to determine the
probability that a drawn card is a King, that it is a Queen, and that it is not a King.
Solution:
Probability of a King = 4/52 = 1/13
Probability of a Queen = 4/52 = 1/13
Probability of not a King = 1 - 4/52 = 48/52 = 12/13
Probability Rules
Complement Rule
Let A be an event and A' its complement. Then the complement rule is:
P(A') = 1 - P(A)
The Addition Rule of Probabilities
Let A and B be two events. The probability of their union is
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Conditional Probability
Let A and B be two events. The conditional probability of event A, given that event B
has occurred, is denoted by the symbol P(A|B) and is found to be:
P(A|B) = P(A ∩ B)/P(B)
The Multiplication Rule of Probabilities
Let A and B be two events. The probability of their intersection can be derived from
conditional probability as
P(A ∩ B) = P(A|B) P(B)
Statistical Independence
Let A and B be two events. These events are said to be statistically independent if and
only if
P(A ∩ B) = P(A) P(B)
From the multiplication rule it also follows that
P(A|B) = P(A)  (if P(B) > 0)
More generally, the events E1, E2, …, EK are mutually statistically independent if and
only if
P(E1 ∩ E2 ∩ … ∩ EK) = P(E1) P(E2) … P(EK)
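These rules can be checked numerically; the Python sketch below does so on the 36-outcome sample space of two dice, reusing the 'even score' and 'multiple of 3' events from the example earlier in this chapter:

# Verifying the addition rule on the two-dice sample space.
import math
from itertools import product

space = list(product(range(1, 7), repeat=2))

def P(event):                       # classical probability
    return sum(1 for s in space if event(s)) / len(space)

A = lambda s: (s[0] + s[1]) % 2 == 0     # score is even
B = lambda s: (s[0] + s[1]) % 3 == 0     # score is a multiple of 3

lhs = P(lambda s: A(s) or B(s))                     # P(A u B) = 2/3
rhs = P(A) + P(B) - P(lambda s: A(s) and B(s))      # 1/2 + 1/3 - 1/6
print(math.isclose(lhs, rhs))       # True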
Probability Distribution
Probability distributions are related to frequency distributions: a probability distribution
is like a theoretical frequency distribution, one that describes how outcomes are
expected to vary. Because these distributions deal with expectations, they are useful
models in making inferences and decisions under conditions of uncertainty.
Generally, statisticians use a capital letter to represent a random variable and a lower-
case letter to represent one of its values; for example, X may denote the random variable
and x a particular value of X.
The relationship between random variables and probability distributions can be easily
understood by example. Suppose you flip a coin two times. This simple statistical
experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the variable
X represent the number of Heads that result from this experiment. The variable X can
take on the values 0, 1, or 2. In this example, X is a random variable; because its value is
determined by the outcome of a statistical experiment.
A probability distribution is a table or an equation that links each outcome of a
statistical experiment with its probability of occurrence. Consider the coin flip
experiment described above. The table below, which associates each outcome
with its probability, is an example of a probability distribution.
Number of Heads | Probability
0 | 0.25
1 | 0.50
2 | 0.25
The above table represents the probability distribution of the random variable X.
Cumulative Probability Distributions
A cumulative probability refers to the probability that the value of a random variable
falls within a specified range.
Let us return to the coin flip experiment. If we flip a coin two times, we might ask: What
is the probability that the coin flips would result in one or fewer heads? The answer
would be a cumulative probability. It would be the probability that the coin flip
experiment results in zero heads plus the probability that the experiment results in one
head.
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75
Like a probability distribution, a cumulative probability distribution can be represented
by a table or an equation. In the table below, the cumulative probability refers to the
probability that the random variable X is less than or equal to x.
Number of heads: x       0     1     2
Probability: P(X = x)    0.25  0.50  0.25
Cumulative: P(X ≤ x)     0.25  0.75  1.00
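Both tables can be rebuilt by enumerating the four outcomes of the experiment; a short Python sketch using only the standard library:

from itertools import product

flips = list(product("HT", repeat=2))    # HH, HT, TH, TT
pmf = {0: 0.0, 1: 0.0, 2: 0.0}
for outcome in flips:
    heads = outcome.count("H")           # the value taken by the random variable X
    pmf[heads] += 1 / len(flips)

cumulative = 0.0
for x in (0, 1, 2):
    cumulative += pmf[x]
    print(x, pmf[x], cumulative)         # 0 0.25 0.25 | 1 0.5 0.75 | 2 0.25 1.0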
Example 1:
Suppose a die is tossed. What is the probability that the die will land on 6 ?
Solution: When a die is tossed, there are 6 possible outcomes, represented by S = { 1, 2,
3, 4, 5, 6 }. Each outcome is a possible value of the random variable X, and each outcome
is equally likely to occur. Thus, we have a uniform distribution. Therefore, P(X = 6) = 1/6.
Example 2:
Suppose we repeat the die-tossing experiment described in Example 1. This time, we ask:
what is the probability that the die will land on a number that is smaller than 5?
Solution: When a die is tossed, there are 6 possible outcomes represented by: S = { 1, 2,
3, 4, 5, 6 }. Each possible outcome is equally likely to occur. Thus, we have a uniform
distribution.
This problem involves a cumulative probability. The probability that the die will land on
a number smaller than 5 is equal to:
P( X < 5 ) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 1/6 + 1/6 + 1/6 + 1/6 = 2/3
Discrete and Continuous Probability Distributions
If a variable can take on any value between two specified values, it is called a continuous
variable; otherwise, it is called a discrete variable.
Some examples will clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between
150 and 250 pounds. The weight of a fire fighter would be an example of a
continuous variable; since a fire fighter's weight could take on any value between
150 and 250 pounds.
Suppose we flip a coin and count the number of heads. The number of heads
could be any integer value between 0 and plus infinity. However, it could not be
just any number between 0 and plus infinity: we could not, for example, get 2.5
heads. Therefore, the number of heads must be a discrete variable.
Number of heads:   0     1     2
Probability:       0.25  0.50  0.25
The above table represents a discrete probability distribution because it relates each value
of a discrete random variable with its probability of occurrence. In subsequent lessons,
we will cover discrete probability distributions such as the binomial distribution.
Note: With a discrete probability distribution, each possible value of the discrete random
variable can be associated with a non-zero probability. Thus, a discrete probability
distribution can always be presented in tabular form.
Continuous Probability Distributions
If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
A continuous probability distribution differs from a discrete probability distribution in
several ways.
The probability that a continuous random variable will assume a particular value
is zero.
As a result, a continuous probability distribution cannot be expressed in tabular
form.
Instead, an equation or formula is used to describe a continuous probability
distribution.
Most often, the equation used to describe a continuous probability distribution is called a
probability density function. Sometimes, it is referred to as a density function, a PDF,
or a pdf. For a continuous probability distribution, the density function has the following
properties:
Since the continuous random variable is defined over a continuous range of values
(called the domain of the variable), the graph of the density function will also be
continuous over that range.
The area bounded by the curve of the density function and the x-axis is equal to 1,
when computed over the domain of the variable.
The probability that a random variable assumes a value between a and b is equal
to the area under the density function bounded by a and b.
For example, consider the probability density function shown in the graph below.
Suppose we wanted to know the probability that the random variable X was less than or
equal to a. The probability that X is less than or equal to a is equal to the area under the
curve bounded by a and minus infinity as indicated by the shaded area.
Note: The shaded area in the graph represents the probability that the random variable X
is less than or equal to a. This is a cumulative probability. However, the probability that
X is exactly equal to a would be zero. A continuous random variable can take on an
infinite number of values. The probability that it will equal a specific value (such as a) is
always zero.
Later in this chapter we will discuss the following distributions:
Binomial Distribution
To understand binomial distributions and binomial probability, it helps to understand
binomial experiments and some associated notation; so we cover those topics first.
Binomial Experiment
A binomial experiment is a statistical experiment that has the following properties:
The experiment consists of n repeated trials.
Each trial can result in just two possible outcomes, called success and failure. (A single
trial of this kind is known as a Bernoulli trial.)
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome of one trial does not affect the others.
Consider the following statistical experiment. You flip a coin 2 times and count the
number of times the coin lands on heads. This is a binomial experiment because it
consists of repeated trials, each flip has just two possible outcomes (heads or tails), the
probability of heads is 0.5 on every flip, and the flips are independent.
Notation
The following notation is helpful when we talk about binomial probability:
x: the number of successes; n: the number of trials; P: the probability of success on an
individual trial; and nCx: the number of combinations of n things taken x at a time.
Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a
binomial experiment. The probability distribution of a binomial random variable is called
a binomial distribution (in the special case of a single trial, n = 1, it is known as a
Bernoulli distribution).
Suppose we flip a coin two times and count the number of heads (successes). The
binomial random variable is the number of heads, which can take on values of 0, 1, or 2.
The binomial distribution is presented below.
Number of heads:   0     1     2
Probability:       0.25  0.50  0.25
Binomial Probability
The binomial probability refers to the probability that a binomial experiment results in
exactly x successes. For example, in the above table, we see that the binomial probability
of getting exactly one head in two coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the following
formula:
Binomial Formula
Suppose a binomial experiment consists of n trials and results in x successes. If the
probability of success on an individual trial is P, then the binomial probability is:
b(x; n, P) = nCx * P^x * (1-P)^(n-x)
Example
Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the
number of successes is equal to 2, and the probability of success on a single trial is 1/6 or
about 0.167. Therefore, the binomial probability is:
b(2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3
b(2; 5, 0.167) = 0.161
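The computation above is easy to reproduce in code; a minimal Python sketch, with math.comb supplying nCx:

import math

def binomial(x, n, p):
    # b(x; n, P) = nCx * P^x * (1 - P)^(n - x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# probability of exactly 2 fours in 5 tosses of a die
print(round(binomial(2, 5, 1 / 6), 3))   # 0.161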
Cumulative Binomial Probability
A cumulative binomial probability refers to the probability that the binomial random
variable falls within a specified range (e.g., is greater than or equal to a stated lower limit
and less than or equal to a stated upper limit).
Now let's tackle the question of finding the probability that a best-of-seven World Series
between two evenly matched teams ends in 5 games. The trick in finding this solution is to
recognize that the series can only end in 5 games if one team has won 3 out of the first 4
games. So let's first find the probability that the American League team wins exactly 3 of
the first 4 games.
b(3; 4, 0.5) = 4C3 * (0.5)^3 * (0.5)^1 = 0.25
Next, given that the American League team has won 3 of the first 4 games, it has a 50/50 chance of
winning the fifth game to end the series. Therefore, the probability of the American
League team winning the series in 5 games is 0.25 * 0.50 = 0.125. Since the National
League team could also win the series in 5 games, the probability that the series ends in 5
games would be 0.125 + 0.125 = 0.25.
The rest of the problem would be solved in the same way. You should find that the
probability of the series ending in 6 games is 0.3125; and the probability of the series
ending in 7 games is also 0.3125.
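These series-length figures can all be generated by one short loop. A sketch, assuming evenly matched teams (p = 0.5) as in the text: the series ends in a given number of games when one team wins 3 of the earlier games and then wins the final one.

import math

def binomial(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

for games in (4, 5, 6, 7):
    # one team wins exactly 3 of the first (games - 1) games, then wins the last;
    # doubling covers either team ending the series
    p_end = 2 * binomial(3, games - 1, 0.5) * 0.5
    print(games, p_end)   # 4: 0.125, 5: 0.25, 6: 0.3125, 7: 0.3125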
Normal Distribution
The normal distribution refers to a family of continuous probability distributions
described by the normal equation.
The Normal Equation
The normal distribution is defined by the following equation:
Normal equation. The value of the probability density Y at the point x is:
Y = [ 1 / (σ * sqrt(2π)) ] * e^( -(x - μ)² / (2σ²) )
where X is a normal random variable, μ is the mean, σ is the standard deviation, π is
approximately 3.14159, and e is approximately 2.71828.
The random variable X in the normal equation is called the normal random variable.
The normal equation is the probability density function for the normal distribution.
The Normal Curve
The graph of the normal distribution depends on two factors - the mean and the standard
deviation. The mean of the distribution determines the location of the center of the graph,
and the standard deviation determines the height and width of the graph. When the
standard deviation is large, the curve is short and wide; when the standard deviation is
small, the curve is tall and narrow. All normal distributions look like a symmetric, bell-shaped curve, as shown below.
The curve on the left is shorter and wider than the curve on the right, because the curve
on the left has a bigger standard deviation.
Probability and the Normal Curve
The normal distribution is a continuous probability distribution. This has several
implications for probability.
Additionally, every normal curve (regardless of its mean or standard deviation) conforms
to the following "rule".
About 68% of the area under the curve falls within 1 standard deviation of the
mean.
About 95% of the area under the curve falls within 2 standard deviations of the
mean.
About 99.7% of the area under the curve falls within 3 standard deviations of the
mean.
Collectively, these points are known as the empirical rule or the 68-95-99.7 rule.
Clearly, given a normal distribution, most outcomes will be within 3 standard deviations
of the mean.
Example:
An average light bulb manufactured by the Acme Corporation lasts 300 days with a
standard deviation of 50 days. Assuming that bulb life is normally distributed, what is the
probability that an Acme light bulb will last at most 365 days?
Solution: Given a mean score of 300 days and a standard deviation of 50 days, we want
to find the cumulative probability that bulb life is less than or equal to 365 days. Thus, we
know the following: the value of the normal random variable is 365, the mean is 300, and
the standard deviation is 50.
We enter these values into the Normal Distribution Calculator and compute the
cumulative probability. The answer is: P( X < 365) = 0.90. Hence, there is a 90% chance
that a light bulb will burn out within 365 days.
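In place of an online calculator, Python's standard library offers NormalDist, which gives the same cumulative probability; a minimal sketch:

from statistics import NormalDist

bulb_life = NormalDist(mu=300, sigma=50)   # mean 300 days, standard deviation 50 days
print(round(bulb_life.cdf(365), 2))        # P(X <= 365) = 0.9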
Example:
Suppose scores on an IQ test are normally distributed. If the test has a mean of 100 and a
standard deviation of 10, what is the probability that a person who takes the test will
score between 90 and 110?
Solution: Here, we want to know the probability that the test score falls between 90 and
110. The "trick" to solving this problem is to realize the following:
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 )
We use the Normal Distribution Calculator to compute both probabilities on the right side
of the above equation.
To compute P( X < 110 ), we enter the following inputs into the calculator: The
value of the normal random variable is 110, the mean is 100, and the standard
deviation is 10. We find that P( X < 110 ) is 0.84.
To compute P( X < 90 ), we enter the following inputs into the calculator: The
value of the normal random variable is 90, the mean is 100, and the standard
deviation is 10. We find that P( X < 90 ) is 0.16.
P( 90 < X < 110 ) = P( X < 110 ) - P( X < 90 ) = 0.84 - 0.16 = 0.68
Thus, about 68% of the test scores will fall between 90 and 110.
Standard Normal Distribution
The standard normal distribution is a special case of the normal distribution. It is the
distribution that occurs when a normal random variable has a mean of zero and a standard
deviation of one.
The normal random variable of a standard normal distribution is called a standard score
or a z-score. Every normal random variable X can be transformed into a z score via the
following equation:
z = (X - μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard
deviation of X.
Standard Normal Distribution Table
A standard normal distribution table shows a cumulative probability associated with a
particular z-score. Table rows show the whole number and tenths place of the z-score.
Table columns show the hundredths place. The cumulative probability (often from minus
infinity to the z-score) appears in the cell of the table.
For example, a section of the standard normal table is reproduced below. To find the
cumulative probability of a z-score equal to -1.31, cross-reference the row of the table
containing -1.3 with the column containing 0.01. The table shows that the probability that
a standard normal random variable will be less than -1.31 is 0.0951; that is, P(Z < -1.31)
= 0.0951.
z       0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.0    0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
...
-1.4    0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
-1.3    0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2    0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
...
3.0     0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
Of course, you may not be interested in the probability that a standard normal random
variable falls between minus infinity and a given value. You may want to know the
probability that it lies between a given value and plus infinity. Or you may want to know
the probability that a standard normal random variable lies between two given values.
These probabilities are easy to compute from a normal distribution table. Here's how.
Find P(Z > a). The probability that a standard normal random variable (Z) is
greater than a given value (a) is easy to find. The table shows P(Z < a), and
P(Z > a) = 1 - P(Z < a).
Suppose, for example, that we want to know the probability that a z-score will be
greater than 3.00. From the table (see above), we find that P(Z < 3.00) = 0.9987.
Therefore, P(Z > 3.00) = 1 - P(Z < 3.00) = 1 - 0.9987 = 0.0013.
Find P(a < Z < b). The probability that a standard normal random variable lies
between two values is also easy to find: P(a < Z < b) = P(Z < b) - P(Z < a).
For example, suppose we want to know the probability that a z-score will be
greater than -1.40 and less than -1.20. From the table (see above), we find that
P(Z < -1.20) = 0.1151; and P(Z < -1.40) = 0.0808. Therefore, P(-1.40 < Z < -1.20)
= P(Z < -1.20) - P(Z < -1.40) = 0.1151 - 0.0808 = 0.0343.
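Both computations can be reproduced with NormalDist in place of the printed table; a minimal sketch:

from statistics import NormalDist

Z = NormalDist()                                # standard normal: mean 0, sd 1
print(round(1 - Z.cdf(3.00), 4))                # P(Z > 3.00) = 0.0013
print(round(Z.cdf(-1.20) - Z.cdf(-1.40), 4))    # P(-1.40 < Z < -1.20) = 0.0343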
In school or on the Advanced Placement Statistics Exam, you may be called upon to use
or interpret standard normal distribution tables. Standard normal tables are commonly
found in appendices of most statistics texts.
The Normal Distribution as a Model for Measurements
Often, phenomena in the real world follow a normal (or near-normal) distribution. This
allows researchers to use the normal distribution as a model for assessing probabilities
associated with real-world phenomena. Typically, the analysis involves two steps.
Transform raw data. Usually, the raw data are not in the form of z-scores. They
need to be transformed into z-scores, using the transformation equation presented
earlier: z = (X - μ) / σ.
Find probability. Once the data have been transformed into z-scores, you can use
standard normal distribution tables, online calculators (e.g., Stat Trek's free
normal distribution calculator), or handheld graphing calculators to find
probabilities associated with the z-scores.
Example: Mr. X earned a score of 940 on a national achievement test. The mean test
score was 850 with a standard deviation of 100. What proportion of students had a higher
score than Mr. X? (Assume that test scores are normally distributed.)
(A) 0.10
(B) 0.18
(C) 0.50
(D) 0.82
(E) 0.90
Solution:
The correct answer is B. As part of the solution to this problem, we assume that test
scores are normally distributed. In this way, we use the normal distribution as a model for
measurement. Given an assumption of normality, the solution involves three steps.
First, we transform Mr. X's test score into a z-score, using the z-score
transformation equation.
z = (X - μ) / σ = (940 - 850) / 100 = 0.90
Then from the standard normal distribution table, we find the cumulative
probability associated with the z-score. In this case, we find P(Z < 0.90) = 0.8159.
Finally, we compute the probability of a higher score: P(Z > 0.90) = 1 - 0.8159 = 0.1841.
Thus, we estimate that 18.41 percent of the students tested had a higher score than Mr. X.
Chapter Eight
Probability
End Chapter Quizzes
1. Probability is expressed as
a- ratio
b- proportion
c- percentage
d- all the above
Chapter 9
Sampling Design
Introduction
In this lesson, we shall describe how to collect data. We shall also discuss a variety of
methods of selecting the sample, called sampling designs, which can be used to generate
our sample data sets.
A population is commonly understood to be a natural, geographical, or political collection
of people, animals, plants, or objects. Some statisticians use the word in the more
restricted sense of the set of measurements of some attribute of such a collection; thus
they might speak of the population of heights of male college students. Or they might
use the word to designate a set of categories of some attribute of a collection, for
example, the population of religious affiliations of U.S. government employees.
In statistical discussions, we often refer to the physical collection of interest as well as to
the collection of measurements or categories derived from the physical collection. In
order to clarify which type of collection is being discussed, in this book we use the term
population as it is used by the research scientist: The population is the physical
collection. The derived set of measurements or categories is called the set of values of the
variable of interest. Thus, in the first example above, we speak of the set of all values of
the variable height for the population of male college students.
After we have defined the population and the appropriate variable, we usually find it
impractical, if not impossible, to observe all the values of the variable. For example, all
the values of the variable miles per gallon in city driving for this year's model of a certain
type of car could not be obtained since some of the cars probably are yet to be produced.
Even if they did exist, the task of obtaining a measurement from each car is not feasible.
In another example, the values of the variable condition of all packaged bandages (sterile
or contaminated) produced on a particular day by a certain firm could be obtained, but
this is not desirable since the bandages would be made useless in the process of testing.
Instead, we consider a sample (a portion of the population), obtain measurements or
observations from this sample (the sample data), and then use statistics to make an
inference about the entire set of values. To carry out this inference, the sample must be
random.
For example: in the textile industry, the wages of the workers of one department may
form a sample, while the wages of all the workers of the company constitute the population.
The total number of units in the population is known as population size.
The total number of units in the sample is known as sample size.
Any characteristic of population is called parameter and that of sample is called statistic.
Sampling Frame
To select a random sample of sampling units, we need a list of all sampling units
contained in the population. Such a list is called a Sampling Frame.
Types of Sampling
The type of enquiry you want to have and the nature of data that you want to collect
fundamentally determines the technique or method of selecting a sample.
The procedure of selecting a sample may be broadly classified under the following three
heads:
Non-Probability Sampling Methods
Probability Sampling
Mixed Sampling
Now let us discuss these in detail. We will start with the non-probability sampling then
we will move on to probability sampling.
Non-Probability Sampling Methods: The common feature in non-probability sampling
methods is that subjective judgments are used to determine which population units are
included in the sample. We classify non-probability sampling into four groups:
1. Convenience Sampling
2. Judgement Sampling
3. Quota Sampling
4. Snowball sampling
Convenience Sampling
This type of sampling is used primarily for reasons of convenience.
It is used for exploratory research and speedy situations.
It is often used for new product formulations or to provide gross-sensory
evaluations by using employees, students, peers, etc.
Convenience sampling is extensively used in marketing studies
This would be clear from the following examples:
1. Suppose a marketing research study aims at estimating the proportion of Pan (betel
leaf) shops in Delhi which stock a particular drink, Maaza. It is decided to take a sample
of size 150. What the investigator does is to visit 150 Pan shops near his place of office,
as this is very convenient to him, and observe whether a Pan shop stocks Maaza or not.
This is definitely not a representative sample, as most Pan shops in Delhi had no chance of
being selected; only those Pan shops which were near the office of the investigator
had a chance of being selected.
2. A ball pen manufacturing company is interested in knowing the opinions about the ball
pen (like smooth flow of ink, resistance to breakage of the cover, etc.) it is presently
manufacturing, with a view to modifying it to suit customers' needs. The job is given to a
marketing researcher who visits a college near his place of residence and asks a few
students (a convenient sample) their opinion about the ball pen in question.
Judgement Sampling
It is that sample in which the selection criteria are based upon the researcher's
personal judgment that the members of the sample are representative of the
population under study.
It is used for most test markets and many product tests conducted in shopping
malls. If personal biases are avoided, then the relevant experience and the
acquaintance of the investigator with the population may help to choose a
relatively representative sample from the population. It is not possible to make an
estimate of sampling error as we cannot determine how precise our sample
estimates are.
Judgement sampling is used in a number of cases, some of which are:
1. Suppose we have a panel of experts to decide about the launching of a new product in
the next year. If for some reason or the other, a member drops out, from the panel, the
chairman of the panel may suggest the name of another person whom he thinks has the
same expertise and experience to be a member of the said panel. This new member was
chosen deliberately - a case of Judgment sampling.
2. The method could be used in a study involving the performance of salesmen. The
salesmen could be grouped into top-grade and low-grade performers according to certain
specified qualities. Having done so, the sales manager may indicate who in his opinion,
would fall into which category. Needless to mention this is a biased method. However in
the absence of any objective data, one might have to resort to this type of sampling.
Quota Sampling
This is a very commonly used sampling method in marketing research studies. Here the
sample is selected on the basis of certain basic parameters such as age, sex, income and
occupation that describe the nature of a population, so as to make it representative of the
population. The Investigators or field workers are instructed to choose a sample that
conforms to these parameters. The field workers are assigned quotas of the number of
units satisfying the required characteristics on which data should be collected. However,
before collecting data on these units, the investigators are supposed to verify that the
units qualify these characteristics. Suppose we are conducting a survey to study the
buying behavior of a product and it is believed that the buying behavior is greatly
influenced by the income level of the consumers. We assume that it is possible to divide
our population into three income strata such as high-income group, middle-income group
and low-income group. Further it is known that 20% of the population is in high income
group, 35% in the middle-income group and 45% in the low-income group. Suppose it is
decided to select a sample of size 200 from the population. Therefore, samples of size 40,
70 and 90 should come from the high income, middle income and low income groups
respectively. Now the various field workers are assigned quotas to select the sample from
each group in such a way that a total sample of 200 is selected in the same proportion as
mentioned above.
Snowball Sampling
The sampling in which the selection of additional respondents (after the first small
group of respondents is selected) is based upon referrals from the initial set of
respondents.
It is used to sample low incidence or rare populations
It is done for the efficiency of finding the additional, hard-to-find members of the
sample.
Advantages of Non-probability Sampling
It is much cheaper than probability sampling.
It is acceptable when the level of accuracy of the research results is not of utmost
importance.
Less research time is required than probability samples.
It often produces samples quite similar to the population of interest when conducted
properly.
Disadvantages of Non-probability Sampling
You cannot calculate sampling error. Thus, the minimum required sample size cannot
be calculated, which means that you (the researcher) may sample too few or too many
members of the population of interest.
You do not know the degree to which the sample is representative of the population
from which it was drawn.
The research results cannot be projected (generalized) to the total population of interest
with any degree of confidence.
Probability Sampling Methods
Probability sampling is the scientific method of selecting samples according to some laws
of chance in which each unit in the population has some definite pre-assigned probability
of being selected in the sample. The different types of probability sampling are simple
random sampling, stratified random sampling, systematic random sampling, cluster
sampling and multistage sampling.
Simple Random Sampling
In simple random sampling, each unit of the population has an equal chance of being
included in the sample. A common way of drawing such a sample is the lottery method:
the units of the population are numbered on slips of paper, which are made as
homogeneous as possible in shape, size, colour, etc. These slips are then put in a bag and
thoroughly shuffled and then r slips are drawn one by one. The r candidates
corresponding to numbers on the slips drawn will constitute a random sample.
This method of selecting a simple random sample is independent of the properties of
population. Generally in place of slips you can use cards also. We make one card
corresponding to one unit of population by writing on it the number assigned to that
particular unit of population. The pack of cards is a miniature of population for sampling
purposes. The cards are shuffled a number of times and then a card is drawn at random
from them. This is one of the most reliable methods of selecting a random sample.
Merits and Limitations of Simple Random Sampling
Merits
1. Since sample units are selected at random providing equal chance to each and every
unit of population to be selected, the element of subjectivity or personal bias is
completely eliminated. Therefore, we can say that simple random sample is more
representative of population than purposive or judgement sampling.
2. You can ascertain the efficiency of the estimates of the parameters by considering the
sampling distribution of the statistic (estimates).
For example: one measure of precision is related to sample size. The sample mean is an
unbiased estimate of the population mean, and it becomes a more efficient estimate of the
population mean as the sample size increases.
Limitations
1. The selection of a simple random sample requires an up-to-date frame of the population
from which samples are to be drawn. However, it is often impossible to have knowledge of
each and every unit of the population if the population happens to be very large. This
restricts the use of simple random sampling.
2. A simple random sample may result in the selection of the sampling units, which are
widely spread geographically and in such a case the administrative cost of collecting the
data may be high in terms of time and money.
3. For a given precision, simple random sample usually requires larger sample size as
compared to stratified random sampling which we will be studying next.
The limitation of simple random sampling will be clear from an example: purely by
chance, a simple random sample may fall mainly in one part of the population, so that
some randomly selected samples prove very non-random in appearance. This type of
problem can be reduced by the use of Stratified Random Sampling, in which the
population is divided into different strata. Now, we will move into the details of stratified
random sampling.
Merits of Stratified Random Sampling
1. More Representative
Since the population is first divided into homogeneous strata and units are drawn from
every stratum, no significant part of the population goes unrepresented.
2. Greater Accuracy
Stratified sampling provides estimates with increased precision. Moreover, stratified
sampling enables us to obtain the results of known precision for each stratum.
3. Administrative Convenience
As compared with simple random sample, the stratified random samples are more
concentrated geographically. Accordingly, the time and money involved in collecting the
data and interviewing the individuals may be considerably reduced and the supervision of
the field work could be allocated with greater ease and convenience.
Systematic Random Sampling
If a complete and up-to-date list of sampling units is available, you can also
employ a common technique of sample selection known as systematic
sampling.
In systematic sampling you select the first unit at random, the rest being automatically
selected according to some predetermined pattern involving regular spacing of units.
Now let us assume that the population size is N. We number all the sampling units from 1
to N in some order and a sample of size n is drawn in such a way that
N = nk i.e. k = N/n , where k, usually called the sampling interval, is an integer. In
systematic random sampling we draw a number randomly, let us suppose that the number
drawn is i and selecting the unit corresponding to this number and every kth unit
subsequently. Thus the systematic sample of size n will consist of the units
i, i+k, i+2k, - - - - - - - - - - - - , i+ (n-1)k.
The random number i is called the random start and its value determines the whole
sample.
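A minimal Python sketch of this selection scheme, assuming N is an exact multiple of n so that the interval k = N/n is an integer (the function name is our own):

import random

def systematic_sample(N, n):
    k = N // n                              # sampling interval
    i = random.randint(1, k)                # random start
    return [i + j * k for j in range(n)]    # i, i+k, i+2k, ..., i+(n-1)k

print(systematic_sample(N=100, n=10))       # e.g. [7, 17, 27, 37, 47, 57, 67, 77, 87, 97]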
Merits and Demerits of Systematic Random Sampling
Merits
I. Systematic sampling is operationally more convenient than simple random sampling or
stratified random sampling. It saves time and work.
II. This sampling is more efficient than simple random sampling, provided the frame (the
list from which you have drawn the sample units) is arranged wholly at random.
Demerits
I. The main disadvantage of systematic sampling is that systematic samples are not, in
general, random samples, since the requirement in merit II is rarely fulfilled.
II. If N is not a multiple of n, then the actual sample size is different from that required,
and sample mean is not an unbiased estimate of the population mean.
Cluster Sampling
In this type of sampling you divide the total population, depending upon the problem
under study, into some recognizable sub-divisions, termed clusters, and a simple random
sample of n clusters is drawn. The individuals in the selected clusters constitute the
sample.
Notes
Clusters should be as small as possible consistent with the cost and limitations of the
survey.
The number of sampling units in each cluster should be approximately the same.
Thus cluster sampling is not to be recommended if we have sampling areas in the cities
where there are private residential houses, business and industrial complexes, apartment
buildings, etc., with widely varying number of persons or households.
Multistage Sampling
One better way of selecting a sample is to resort to sub-sampling within the clusters,
instead of enumerating all the sampling units in the selected cluster. This technique is
called two-stage sampling, clusters being termed as primary units and the units within the
clusters being termed as secondary units. This technique can be generalized to multistage
sampling. We regard population as a number of primary units each of which is further
composed of secondary stage units and so on, till we ultimately reach a stage where
desired sampling units are obtained. In multi-stage sampling each stage reduces the
sample size.
Merits and Limitations
Merits:
i. Multistage sampling is more flexible as compared to other methods. It is simple to
carry out and results in administrative convenience by permitting the field work to be
concentrated and yet covering large area.
ii. It saves a lot of operational cost as we need the second stage frame only for those units
which are selected in the first stage sample.
Limitations:
i. It is generally less efficient than a suitable single-stage sampling of the same size.
This brings us to the end of our discussion on sampling techniques.
Thus, in a nutshell, we can say that non-probabilistic sampling methods such as
convenience sampling, judgement sampling and quota sampling are sometimes used,
although the representativeness of such a sample cannot be ensured; whereas probabilistic
sampling assigns a known, nonzero probability of selection to each unit of the population,
and in this sense it yields a representative sample of the population.
Points to Ponder
Sampling is based on two premises. One is that there is enough similarity among the
elements in a population that a few of these elements will adequately represent the
characteristic of the total population.
The second premise is that while some elements in a sample underestimate the
population value, others overestimate the value.
The result of these offsetting tendencies is that a sample mean is generally a good
estimate of the population mean.
A good sample has both accuracy & precision. An accurate sample is one in which there is
little or no bias or systematic variance. A sample with adequate precision is one that has a
sampling error that is within acceptable limits.
A variety of sampling techniques is available, of which probability sampling is based on
random selection, a controlled procedure that ensures that each population element is
given a known nonzero chance of selection.
In contrast non-probability selection is not random. When each sample element is drawn
individually from the population at large, it is unrestricted sampling.
Sampling Distribution
The process of generalizing the sample results of the population is referred to as
statistical inference. Here, we shall use certain sample statistics (such as the sample
mean, the sample proportion, etc.) in order to estimate and draw inferences about the true
population parameters. For example, in order to be able to use the sample mean to
estimate the population mean, we should examine every possible sample (and its mean)
that could have occurred in the process of selecting one sample of a certain size. If this
selection of all possible samples actually were to be done, the distribution of the results
would be referred to as a sampling distribution. Although, in practice, only one such
sample is actually selected, the concept of sampling distributions must be examined so
that probability theory and its distribution can be used in making inferences about the
population parameter values.
Sampling theory has made it possible to deal effectively with these problems. However,
before we discuss in detail about them from the standpoint of sampling theory, it is
necessary to understand the central limit theorem and the following three probability
distributions, their characteristics and relations:
(1) The population (universe) distribution,
(2) The sample distribution, and
(3) The sampling distribution.
Central Limit Theorem: The Central Limit Theorem, first introduced by De Moivre
during the early eighteenth century, happens to be the most important theorem in
statistics. According to this theorem, if we select a large number of simple random
samples, say, from any population distribution and determine the mean of each sample,
the distribution of these sample means will tend to be described by the normal probability
distribution with mean μ and variance σ²/n. This is true even if the population
distribution itself is not normal. Or, in other words, we say that the sampling distribution
of sample means approaches to a normal distribution, irrespective of the distribution of
population from where sample is taken and approximation to the normal distribution
becomes increasingly close with increase in sample size. Symbolically, the theorem can
be explained as follows:
Given n independent random variables X1, X2, X3, ..., Xn, which have the same
distribution (no matter what the distribution), the sum
X = X1 + X2 + X3 + ... + Xn
is approximately a normal variate. The mean μ and variance σ² of X are
μ = μ1 + μ2 + μ3 + ... + μn = nμi
σ² = σ1² + σ2² + σ3² + ... + σn² = nσi²
where μi and σi² are the mean and variance of each Xi.
The utility of this theorem is that it requires virtually no conditions on distribution
patterns of the individual random variable being summed. As a result, it furnishes a
practical method of computing approximate probability values associated with sums of
arbitrarily distributed independent random variables. This theorem helps to explain why a
vast number of phenomena show approximately a normal distribution. Let us consider a
case when the population is skewed: the skewness of the sampling distribution of means is
inversely proportional to the square root of the sample size. When n = 16, the sampling
distribution of means will exhibit only one-fourth as much skewness as the population;
when n = 100, the skewness becomes one-tenth as much, i.e., as the sample size increases,
the skewness decreases.
As a practical consequence, the normal curve will serve as a satisfactory model when
samples are small and population is close to a normal distribution, or when samples are
large and population is markedly skewed. Because of its theoretical and practical
significance, this theorem is considered as most remarkable theoretical formulation of all
probability laws.
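The theorem is easy to see by simulation. The rough Python sketch below draws samples from a markedly skewed (exponential) population with mean 1 and variance 1; as n grows, the mean of the sample means stays near 1 while their variance shrinks like σ²/n:

import random
import statistics

random.seed(1)

def population_draw():
    return random.expovariate(1.0)   # skewed population: mean 1, variance 1

for n in (1, 16, 100):
    means = [statistics.fmean(population_draw() for _ in range(n)) for _ in range(5000)]
    # the variance of the sample means should be close to sigma^2 / n = 1/n
    print(n, round(statistics.fmean(means), 3), round(statistics.variance(means), 4))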
The Population (Universe) Distribution
When we talk of population distribution, we assume that we have investigated the
population and have full knowledge of its mean and standard deviation. For example, a
company might have manufactured 1,00,000 tyres of cars in the year 2004. Suppose it
contacts all those who had bought these tyres and gathers information about the life of
these tyres. On the basis of the information obtained, the mean of the population, which is
also called the true mean, symbolized by μ, and its standard deviation, symbolized by σ,
can be worked out. These Greek letters μ and σ are used for these measures to emphasise
their difference from the corresponding measures taken from a sample. It may be noted
that such measures characterizing a population are called population parameters.
The shape of the distribution of the life of tyres may then be plotted. The population mean
is computed as μ = ΣX/N, where N is the number of elements in the population. We rarely
have an opportunity to use this formula, since most of the populations we study are not
totally accessible; they either are too large, perhaps even infinite, or would be destroyed
in the process of measurement.
Population variance and sample variance
The population variance is a measure of the spread of the population. Suppose we want to
choose between two investment plans and are told that both have mean earnings of 10%
per annum; we might conclude that they were equally good. However, suppose we learn
that plan A has a variance twice as large as plan B. This gives us additional information
on which to base a choice. A population variance can be computed from ungrouped data
or from data that are grouped into a frequency or relative frequency distribution if the
population is of the accessible variety. For ungrouped data, a population variance is
defined to be
σ² = Σ(X - μ)² / N
where μ is the population mean and N is the population size.
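A minimal sketch of this definition for an accessible population (the data values are purely illustrative):

def population_variance(values):
    # sigma^2 = sum of (x - mu)^2 over the population, divided by N
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

annual_returns = [8, 10, 12, 9, 11]          # illustrative returns (%) for a small population
print(population_variance(annual_returns))   # 2.0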
Chapter 10
Hypothesis Testing
Introduction
A hypothesis is an assumption about the population parameter to be tested based on
sample information. The statistical testing of hypothesis is the most important technique
in statistical inference. Hypothesis tests are widely used in business and industry for
making decisions. It is here that probability and sampling theory play an ever-increasing role in constructing the criteria on which business decisions are made. Very
often in practice we are called upon to make decisions about population on the basis of
sample information. For example, we may wish to decide on the basis of sample data
whether a new medicine is really effective in curing a disease, whether one training
procedure is better than another, etc. Such decisions are called statistical decisions. In
other words, a hypothesis is the assumption that we make about the population
parameter. This can be any assumption about a population parameter not necessarily
based on statistical data. For example it can also be based on the gut feel of a manager.
Managerial hypotheses are based on intuition; the market place decides whether the
manager's intuitions were in fact correct.
In fact managers propose and test hypotheses all the time. For example:
1. A manager who says "if we drop the price of this car model by Rs 15,000, we'll increase
sales by 25,000 units" is stating a hypothesis. To test it in reality, we have to wait until the
end of the year and count sales.
2. A manager who estimates that sales per territory will grow on average by 30% in the
next quarter is also stating an assumption, or hypothesis. How would the manager go
about testing this assumption? Suppose he has 70 territories under him.
One option for him is to audit the results of all 70 territories and determine whether the
average growth is greater than or less than 30%. This is a time consuming and expensive
procedure.
Another way is to take a sample of territories and audit sales results for them.
Once we have our sales growth figure, it is likely that it will differ somewhat from our
assumed rate. For example we may get a sample rate of 27%. The manager is then faced
with the problem of determining whether his assumption or hypothesized rate of growth
of sales is correct or the sample rate of growth is more representative.
To test the validity of our assumption about the population we collect sample data and
determine the sample value of the statistic. We then determine whether the sample data
support our hypothesized assumption regarding the average sales growth.
What is Hypothesis?
In attempting to reach decisions, it is useful to make assumptions or guesses about the
populations involved. Such assumptions, which may or may not be true, are called
statistical hypothesis and in general are statements about the probability distributions of
the population. The hypothesis is made about the value of some parameter, but the only
facts available to estimate the true parameter are those provided by a sample. If the
sample statistic differs from the hypothesis made about the population parameter, a
decision must be made as to whether or not this difference is significant. If it is, the
hypothesis is rejected. If not, it must be accepted. Hence, the term "tests of hypothesis".
Now, if θ is the parameter of the population and θ̂ is the estimate of θ obtained from a
random sample drawn from the population, then the difference between θ and θ̂ should be
small. In fact, there will be some difference between θ and θ̂ because θ̂ is based on
sample observations and is different for different samples. Such a difference is known as a
difference due to sampling fluctuations. If the difference between θ and θ̂ is large, then
the probability that it is exclusively due to sampling fluctuations is small. A difference
which is caused by sampling fluctuations is called an insignificant difference, and a
difference due to some other reason is known as a significant difference. A
significant difference arises due to the fact that either the sampling procedure is not
purely random or the sample is not from the given population.
The null hypothesis and the alternative hypothesis are constructed so that if one is true, the other is false, and vice versa.
The rejection of the null hypothesis indicates that the differences have statistical
significance and the acceptance of the null hypothesis indicates that the differences are
due to chance. As against the null hypothesis, the alternative hypothesis specifies those
values that the researcher believes to hold true. The alternative hypothesis may embrace
the whole range of values rather than single point.
Set up a suitable significance level. Having set up a hypothesis, the next step is to select
a suitable level of significance. The confidence with which an experimenter rejects or
retains null hypothesis depends on the significance level adopted. The level of
significance, usually denoted by "α", is generally specified before any samples are
drawn, so that results obtained will not influence our choice. Though any level of
significance can be adopted, in practice, we either take 5 per cent or 1 per cent level of
significance. When we take 5 per cent level of significance then there are about 5
chances out of 100 that we would reject the null hypothesis when it should be accepted,
i.e., we are about 95% confident that we have made the right decision. When we test a
hypothesis at a 1 per cent level of significance, there is only one chance out of 100 that
we would reject the null hypothesis when it should be accepted, i.e., we are about 99%
confident that we have made the right decision. When the null hypothesis is rejected at
α = 0.05, the test result is said to be "significant". When the null hypothesis is rejected at
α = 0.01, the test result is said to be "highly significant".
Determination of a suitable test statistic. The third step is to determine a suitable test
statistic and its distribution. Many of the test statistics that we shall encounter will be of
the following form:
Test statistic = (observed sample statistic - hypothesized parameter value) / (standard error of the statistic)
Determine the critical region. It is important to specify, before the sample is taken,
which values of the test statistic will lead to a rejection of Ho and which lead to
acceptance of Ho. The former is called the critical region. The value of α, the level of
significance, indicates the importance that one attaches to the consequences associated
with incorrectly rejecting Ho. It can be shown that when the level of significance is α,
the optimal critical region for a two-sided test consists of that α/2 per cent of the area in
the right-hand tail of the distribution plus that α/2 per cent in the left-hand tail. Thus,
establishing a critical region is similar to determining a 100(1 - α)% confidence interval.
In general, one uses a level of significance of α = 0.05, indicating that one is willing to
accept a 5 per cent chance of incorrectly rejecting Ho.
The probability of committing a type I error is designated as "α" and is called the level of
significance. Therefore,
α = Pr [Type I error] = Pr [Rejecting Ho | Ho is true]
(1 - α) must be the complement of α:
(1 - α) = Pr [Accepting Ho | Ho is true].
This probability (1 - α) corresponds to the concept of a 100(1 - α)% confidence interval.
Our efforts would obviously be to have a small probability of making a type I error.
Hence the objective is to construct the test to minimise α.
Similarly, the probability of committing a type II error is designated by β. Thus
β = Pr [Type II error] = Pr [Accepting Ho | Ho is false]
and (1 - β) = Pr [Rejecting Ho | Ho is false].
These outcomes can be summarized as follows:

The decision is:    Ho is true                 Ho is false
Accept Ho           1 - α (correct decision)   β (Type II error)
Reject Ho           α (Type I error)           1 - β (correct decision)
Sum                 1.0                        1.0
Note that the probability of each decision outcome is a conditional probability and
the elements in the same column sum to 1.0, since the events with which they are
associated are complementary. However, α and β are not independent of each other, nor
are they independent of the sample size n. When n is fixed, if α is lowered then β normally
rises, and vice versa. If n is increased, it is possible for both α and β to decrease. Since
increasing the sample size involves money and time, one should decide how
much additional money and time one is willing to spare on increasing the sample size in
order to reduce the sizes of α and β.
In order for any tests of hypothesis or rules of decisions to be good, they must be
designed so as to minimise errors of decision. However, this is not a simple matter, since
for a given sample size, an attempt to decrease one type of error is accompanied in
general by an increase in the other type of error. The probability of making a type I error is
fixed in advance by the choice of level of significance employed in the test. We can
make the type I error as small as we please, by lowering the level of significance. But by
doing so, we increase the chance of accepting a false hypothesis, i. e., of making a type
II error. It follows that it is impossible to minimise both errors simultaneously. In the
long run, errors of type I are perhaps more likely to prove serious in research
programmes in social sciences than are errors of type II. In practice, one type of error
may be more serious than the other and so a compromise should be reached in favour of
limitations of the more serious error. The only way to reduce both types of error is to
increase the sample size that may or may not be possible.
One-Tailed and Two-Tailed Tests
Basically, there are three kinds of problems of tests of hypothesis. They include:
(i) two-tailed tests, (ii) right-tailed test, and (iii) left-tailed test.
A two-tailed test is one where the hypothesis about the population mean is rejected for
values of the sample statistic falling into either tail of the sampling distribution. When the
hypothesis about the population mean is rejected only for values of the sample statistic
falling into one of the tails of the sampling distribution, it is known as a one-tailed test. If
it is the right tail, it is called a right-tailed test, or a one-sided alternative to the right; if it
is the left tail, it is a one-sided alternative to the left, called a left-tailed test.
For example, Ho: μ = 100 tested against H1: μ > 100 or H1: μ < 100 is a one-tailed test,
since H1 specifies that μ lies on a particular side of 100. The same null hypothesis tested
against H1: μ ≠ 100 is a two-tailed test, since μ can be on either side of 100. The following
diagrams would make it clearer:
The following table gives critical values of Z for both one-tailed and two-tailed tests at
various levels of significance. Critical values of Z for other levels of significance are
found by use of the table of normal curve areas :
Level of significance            0.10        0.05        0.01        0.005       0.002

Critical value of z for         -1.28       -1.645      -2.33       -2.58       -2.88
one-tailed tests               or 1.28     or 1.645    or 2.33     or 2.58     or 2.88

Critical value of z for        -1.645      -1.96       -2.58       -2.81       -3.08
two-tailed tests              and 1.645   and 1.96    and 2.58    and 2.81    and 3.08
Let us test the hypothesis at the 100α% level of significance. From tables of areas under
the standard normal curve corresponding to a given α, we can find an ordinate z_α such that
Pr [ |Z| > z_α ] = α
Pr [ -z_α ≤ Z ≤ z_α ] = 1 - α
If α = 0.01, then z_α = 2.58, and if α = 0.05, then z_α = 1.96, and so on.
If the difference between θ̂ and θ is more than z_α times the standard error of θ̂, the
difference is regarded as significant and Ho is rejected at the 100α% level of
significance; if the difference between θ̂ and θ is less than or equal to z_α times the
standard error of θ̂, the difference is insignificant and Ho is accepted at the 100α% level
of significance.
σx̄ = σ/√N ≈ s/√N (s is used when σ is unknown)
At the 5% level of significance, the critical value of z for a two-tailed test is 1.96. If the
computed value of z is greater than +1.96 or less than -1.96, then reject Ho; otherwise
accept Ho.
s1² and s2² can be used if the values of σ1² and σ2² are unknown.
Illustration 2: You are working as a purchase manager for a company. Two
manufacturers of electric bulbs have supplied the following information to you:

                             Company A    Company B
Mean life (in hours)         1300         1288
Standard deviation (hours)   82           93
Sample size                  100          100
Which brand of bulb are you going to purchase if you desire to take a risk of 5%?
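A sketch of the usual large-sample solution: the z statistic for the difference of two means, with s1² and s2² standing in for the unknown population variances (variable names are our own):

import math

mean_a, sd_a, n_a = 1300, 82, 100    # Company A
mean_b, sd_b, n_b = 1288, 93, 100    # Company B

# standard error of the difference between the two sample means
se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
z = (mean_a - mean_b) / se

print(round(z, 2))   # about 0.97, inside +/-1.96, so the difference in mean life
                     # is not significant at the 5% level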
If the standard deviation of the population is unknown (and therefore has to be estimated
from a sample), the sampling distribution of the mean derived from large samples will
still be approximately normally distributed; but if the sample size is small (say 30 or less),
then the sample statistic will follow a t-distribution.
The Student's t-distribution obtained by W.S. Gosset was published under the pen
name of "Student" in the year 1908. It is reported that Gosset was a statistician for a
brewery, and that the management did not want him to publish his scholarly theoretical
work under his real name and bring shame to his employer. Consequently, he selected
the pen name of Student.
The study of statistical inference with the small samples is called small sampling
theory or exact sampling theory. We shall discuss in detail the "t" and "F" distributions.
These two distributions are defined in terms of number of degrees of freedom. It is
appropriate at this stage to clarify this concept.
Degrees of freedom: The number of degrees of freedom can be interpreted as the
number of useful items of information generated by a sample of given size with respect
to the estimation of a given population parameter. Thus, a sample of size 1 generates
one piece of useful information if one is estimating the population mean, but none, if
one is estimating the population variance. In order to learn about the variance, one needs
at least a sample of size n ≥ 2. The number of degrees of freedom, in general, is the total
number of observations minus the number of independent constraints imposed on the
observations.
Suppose the expression X = X1 + X2 + X3 involves four values. We can arbitrarily
assign values to any three of these four (for example, 15 = X1 + 2 + 8), but the
value of the fourth is then automatically determined (here, X1 = 5).
In this example, there are 3 degrees of freedom. If n is the number of observations
and k is the number of independent constants (the number of constants that have to be
estimated from the original data) then n - k is the number of degrees of freedom.
If we consider samples of size n drawn from a normal (or approximately normal)
population with mean μ, and if for each sample we compute t = (x̄ - μ)/(s/√n), using the
sample mean x̄ and sample standard deviation s, the distribution of t can be obtained. The
probability density function of the t-distribution, with ν = n - 1 degrees of freedom, is
f(t) = [ Γ((ν + 1)/2) / (√(νπ) Γ(ν/2)) ] * (1 + t²/ν)^(-(ν + 1)/2)
Properties of t-Distribution
(5) The t-distribution is more platykurtic (less peaked at the centre and higher in the tails)
than the normal distribution.
(6) The t-distribution has a greater dispersion than the standard normal distribution. As n
gets larger, the t-distribution approaches the normal form; when n is as large as 30, the
difference between the two distributions is very small. This is the relation between the
t-distribution and the standard normal distribution.
Chi-Square Distributions
The chi-square is a continuous probability distribution. Although this theoretical
probability distribution is usually not a direct model of a population distribution, it has
many uses when we are trying to answer questions about populations. For example, the
chi-square distribution can be used to decide whether or not a set of data fits a specified
theoretical probability model; such a test is called a goodness-of-fit test.
Goodness-of-fit tests
Goodness-of-Fit Test with a Specified Parameter
Example: Each day a salesperson calls on 5 prospective customers and she records
whether or not the visit results in a sale. For a period of 100 days her record is as follows:
Number of sales:   0    1    2    3    4    5
Frequency:        15   21   40   14    6    4
A marketing researcher feels that a call results in a sale about 35% of the time, so he
wants to see if this sampling of the salesperson's efforts fits a theoretical binomial
distribution for 5 trials with 0.35 probability of success, b( y; 5, 0.35). This binomial
distribution has the following probabilities and leads to the following expected values for
100 days of records:

Number of sales, y:   0       1       2       3       4       5
Probability, b(y):    0.116   0.312   0.336   0.181   0.049   0.005
Expected frequency:   11.6    31.2    33.6    18.1    4.9     0.5

Since the last category has an expected value of less than 1, he combines the last two
categories to perform the goodness-of-fit test.
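A sketch of his computation: the expected counts come from b(y; 5, 0.35), the last two categories are pooled, and the chi-square statistic sums (observed - expected)²/expected; the result would then be compared with a chi-square table value.

import math

observed = [15, 21, 40, 14, 6, 4]    # sales per day over 100 days
n_days, p = 100, 0.35

expected = [n_days * math.comb(5, y) * p ** y * (1 - p) ** (5 - y) for y in range(6)]

# pool the last two categories, since the final expected count is below 1
obs = observed[:4] + [observed[4] + observed[5]]
exp = expected[:4] + [expected[4] + expected[5]]

chi_square = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(round(chi_square, 2))   # compare with the chi-square table value for the test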
t-Tests
If random samples of size less than 30 are taken from a normal distribution, and the
samples are used to estimate the variance, then the statistic
t = (x̄ - μ) / (s/√n)
is not normally distributed. The probabilities in the tails of this distribution are greater
than for the standard normal distribution.
Example:
Using a t Distribution to Test a Hypothesis about μ
The sports physiologist would like to test H0: μ = 17 against Ha: μ ≠ 17 for female
marathon runners. In a random sample of 8 female runners, he computes the sample mean
x̄ and the sample standard deviation s. Since n = 8, the degrees of freedom are v = 7, and
at α = 0.05 the null hypothesis will be rejected if |t| ≥ t0.025,7 = 2.365. The test statistic is

t = (x̄ - 17) / (s/√8),

and its computed value exceeds 2.365. Thus he rejects the null hypothesis and concludes
that for women the distance until stress is more than 17 miles.
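The mechanics of such a test can be sketched in Python. Since the sample figures are not
reproduced above, the mean and standard deviation below are illustrative assumptions,
not the physiologist's data:

import math
from scipy import stats

x_bar, s, n = 19.0, 2.0, 8     # assumed sample mean, std deviation, sample size
mu_0 = 17.0                    # hypothesized mean distance (miles)

t = (x_bar - mu_0) / (s / math.sqrt(n))      # test statistic, here about 2.83
t_crit = stats.t.ppf(1 - 0.025, df=n - 1)    # two-tailed critical value, 2.365
print(abs(t) >= t_crit)                      # True -> reject H0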
It is possible to make inference about another type of mean, the mean of the difference
between two matched groups. For example, the mean difference between pretest scores
and post-test scores for a certain course or the mean difference in reaction time when the
same subjects have received a certain drug or have not received the drug might be
desired. In such situations, the experimenter will have two sets of sample data (in the
examples just given, pretest/post-test or received/did not receive); however, both sets are
obtained from the same subjects. Sometimes the matching is done in other ways, but the
object is always to remove extraneous variability from the experiment. For example,
identical twins might be used to control for genetically caused variability or two types of
seeds are planted in identical plots of soil under identical conditions to control for the
effect of environment on plant growth. If the experimenter is dealing with two matched
groups, the two sets of sample data contain corresponding members; thus he has,
essentially, one set consisting of pairs of data. Inference about the mean difference
between these two dependent groups can be made by working with the differences within
the pairs and using a t distribution with n - 1 degrees of freedom in which n is the number
of pairs.
Example: Matched-Pair t Test
Two types of calculators are compared to determine if there is a difference in the time
required to perform a certain common statistical calculation. Twelve students chosen at
random are given drills with both calculators so that they are familiar with the operation
of each type. Then the time they take to complete the calculation on each device is
measured in seconds (which calculator they are to use first is determined by some random
procedure to control for any additional learning during the first calculation). Working
with the twelve within-pair differences and a t distribution with n - 1 = 11 degrees of
freedom, the matched-pair design removes student-to-student variability and is therefore
better able to detect any difference due to the calculators. If possible, a design involving
two dependent samples that can be analyzed by a matched-pair t test is preferable to two
independent samples.
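A matched-pair analysis of this kind can be sketched in Python with scipy; the twelve
pairs of times below are invented for illustration, since the original data table is not
reproduced here:

from scipy import stats

# Hypothetical times (seconds) for the same 12 students on each calculator
calc_a = [23, 18, 29, 22, 33, 20, 17, 25, 27, 30, 25, 27]
calc_b = [19, 18, 24, 23, 31, 22, 16, 23, 24, 26, 24, 28]

# The matched-pair t test works with the within-pair differences, df = 11
t, p_value = stats.ttest_rel(calc_a, calc_b)
print(t, p_value)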
F-Tests
Inference about two variances
There are situations, of course, in which the variances of the two populations under
consideration are different. The variability in the weights of elephants is certainly
different from the variability in the weights of mice, and in many experiments, even
though we do not have these extremes, the treatments may affect the variances as well as
the means.
The null hypothesis H0: σ1² = σ2² is tested by using a statistic that is in the form of a
ratio rather than a difference; the statistic is s1²/s2². Intuitively, if the variances are equal,
this ratio should be approximately equal to 1, so values that differ greatly from 1 indicate
inequality.
It has been found that the statistic s1²/s2² from two normal populations with equal
variances follows a theoretical distribution known as an F distribution. The density
functions for F distributions are known, and we can get some understanding of their
nature by listing some of their properties. Let us call a random variable that follows an F
distribution F; then the following properties exist:
1. F > 0.
2. The density function of F is not symmetrical.
3. F depends on an ordered pair of degrees of freedom v1 and v2; that is, there is a
different F distribution for each ordered pair (v1, v2). (v1 corresponds to the degrees of
freedom of the numerator of s1²/s2² and v2 to those of the denominator.)
4. If α is the area under the density curve to the right of the value Fα,v1,v2, then
Fα,v1,v2 = 1/F1-α,v2,v1
5. The F distribution is related to the t distribution:
Fα,1,v2 = (tα/2,v2)²
Table A.12 in the Appendix gives upper critical values for F if α = 0.050, 0.025, 0.010,
0.005, 0.001. Lower-tail values can be found using property 4 above.
Example Testing for the Equality of Two Variances
Both rats and mice carry ectoparasites that can transmit disease organisms to humans. To
determine which of the two rodents presents the greater health hazard in a certain area, a
public health officer traps (presumably at random) both and counts the number of
ectoparasites each carries. The data are presented first in side-by-side stem-and-leaf plots
and then as side-by-side box-and-whisker plots.
He wants to test for the equality of means with a group comparison t test. He assumes
that these discrete counts are approximately normally distributed, but because he is
studying animals of different species, sizes, and body surface areas, he has some doubts
about the equality of the variances in the two populations, and the box plots seem to
support that concern. Thus he first must test
H0: σ1² = σ2² against Ha: σ1² ≠ σ2²
with the test statistic F = s1²/s2² = 43.4/13.0 = 3.34. Since n1 = 31 and n2 = 9, the degrees
of freedom for the numerator are v1 = n1 - 1 = 30 and for the denominator v2 = n2 - 1 = 8.
From the table,
F0.05,30,8 = 3.079 and F0.05,8,30 = 2.266,
thus the region of rejection at α = 0.10 is F ≥ F0.05,30,8 = 3.079 and F ≤ F0.95,30,8 =
1/F0.05,8,30 = 1/2.266 = 0.441.
Since the computed F equals 3.34, the null hypothesis is rejected, and the public health
officer concludes that the variances are unequal. Since one of the sample sizes is small,
he may not perform the usual t test for two independent samples.
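The same F test can be reproduced in Python (assuming scipy); the variances and sample
sizes are those of the example above:

from scipy import stats

s1_sq, n1 = 43.4, 31   # rats: sample variance and sample size
s2_sq, n2 = 13.0, 9    # mice: sample variance and sample size

F = s1_sq / s2_sq                                 # test statistic, 3.34
upper = stats.f.ppf(0.95, n1 - 1, n2 - 1)         # F(0.05, 30, 8), about 3.08
lower = 1.0 / stats.f.ppf(0.95, n2 - 1, n1 - 1)   # 1/F(0.05, 8, 30), about 0.44
print(F >= upper or F <= lower)                   # True -> reject H0 at alpha = 0.10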
One-tailed tests of hypotheses involving the F distribution can also be performed, if
desired, by putting the entire probability of a Type I error in the appropriate tail. Central
confidence intervals on σ1²/σ2² are found as follows:

(s1²/s2²)(1/Fα/2,v1,v2) ≤ σ1²/σ2² ≤ (s1²/s2²)Fα/2,v2,v1

Although the public health officer cannot perform the usual t test for two independent
samples because of the unequal variances and the small sample size, there are
approximation methods available. One such test is called the Behrens-Fisher, or t′, test
for two independent samples; it uses adjusted degrees of freedom.
Chapter -10
Linear Programming
Organizations today are working in a highly competitive and dynamic external
environment. Not only has the number of decisions required to be made increased
tremendously, but the time period within which these have to be made has also shortened
considerably. Decisions can no longer be taken on the basis of personal experience or
gut feeling alone. This has resulted in a need for the application of appropriate scientific
methods in decision making.
The name operations research (O.R.) is taken directly from the context in which it was
developed and applied. Subsequently, it came to be known by several other names such as:
1. Management science,
2. Decision science,
3. Quantitative methods and
4. Operational analysis.
After the war, the success of operations research in the military provided a much needed
boost to the discipline. Industry during that period was struggling to cope with increasing
complexity; there were complex decision problems whose solutions were neither
apparent nor forthcoming.
The successful implementation of operations research techniques during the war was
probably the most important event the industry was waiting for. This paved the way for
the application of OR to business and industry. As business requirements changed,
newer and better operations research techniques evolved.
Another factor which has significantly contributed to the development of OR during the
last few decades is the development of high-speed computers capable of performing a
large number of operations in a very short time period. Since the 1960s, there has been a
rapid increase in the areas in which operations research has found acceptability. Apart
from industry and business, OR also finds applicability in areas such as:
1. Regional planning,
2. Telecommunications,
3. Crime investigation,
4. Public transportation and
5. Medical sciences.
Operations research has now become one of the most important tools in decision-making
and is currently being taught under various management and business programs.
Due to the fast pace at which it has developed and gained widespread acceptance,
professional societies devoted to the cause of operations research and its allied activities
have been founded world-wide, e.g. the Institute of Management Sciences, founded in
1953, seeks to integrate scientific knowledge with the management of an industrial house
by applying quantitative methodology to the functional aspects of management.
Critical Path Method (CPM) and Project Evaluation and Review Technique (PERT) were
developed in 1958. These are extensively used in scheduling and monitoring complex
and lengthy projects prone to time and cost over-runs. PERT is now considered an
important management technique and finds applicability in such diverse areas as:
1. Construction projects,
2. Ship-building projects,
3. Transportation projects and
4. Military projects.
A large number of business and industrial houses adopted the methodology of operations
research techniques by early 1970s.
The first use of OR techniques in India was in the year 1949 at Hyderabad, where an
independent operations research unit was set up at the Regional Research Institute to
identify, evaluate and solve problems related to:
1. Planning,
2. Purchases and
3. Proper maintenance of stores.
Besides being too lengthy, this definition has also been criticized because it focuses on
complex problems and large systems, giving the impression that OR is a highly
sophisticated and technical approach suitable only for very large and complex
organizations.
OR is an experimental and applied science devoted to observing, understanding and
predicting the behavior of purposeful man-machine systems, and operations research
workers are actively engaged in applying this knowledge to practical problems in
business, government and society.
2. Simulation Models
It is very similar to management's trial-and-error approach to decision-making. To
simulate is to duplicate the features of the problem in a working model, which is then
solved using well-known OR techniques. The results so obtained are tested for
sensitivity, after which they are applied to the original problem. By simulating the
characteristics and features of the organisational problem on a model, the various
decisions can be evaluated and the risks inherent in actually implementing them are
drastically cut down or eliminated. Simulation models are normally used for those kinds
of problems or situations which cannot be studied or understood by any other technique.
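The idea can be illustrated with a minimal Monte Carlo sketch in Python; the stocking
problem and all figures below are assumed for illustration, not taken from the text:

import random

random.seed(1)
stock, cost, price = 30, 6.0, 10.0   # units stocked, unit cost, selling price

profits = []
for _ in range(10_000):              # simulate 10,000 trading days
    demand = random.randint(10, 50)  # assumed uniform daily demand
    sold = min(demand, stock)        # cannot sell more than is stocked
    profits.append(sold * price - stock * cost)

print(sum(profits) / len(profits))   # estimated average daily profit

Repeating the run for different values of stock lets the decision-maker compare stocking
policies without the risk of trying each one in practice.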
3. Inventory Models
The inventory models are primarily concerned with the optimal stock or inventory
policies of the organisation. Inventory problems deal with the determination of optimum
levels of different inventory items and ordering policies, optimizing a pre-specified
standard of effectiveness. It is concerned with the factors such as:
1. demand per unit time,
2. cost of placing orders,
3. costs incurred while keeping the goods in inventory,
4. stock-out costs and
5. costs of lost sales etc.
If a customer demands a certain quantity of a product which is not available, the result is
a lost sale. On the other hand, excess inventories mean blocked working capital, which is
the life blood of modern business. Similarly, in the case of raw materials, shortage of
even a very small item may cause bottlenecks in production and the entire assembly line
may come to a halt. Inventory models are also useful in dealing with quantity discounts
and multiple products. These models can be of two types,
1. deterministic and
2. probabilistic,
and are used in calculating various important decision variables such as:
1. re-order quantity,
2. lead-time,
3. economic order quantity and
4. the pessimistic, optimistic and most likely levels of stock keeping.
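For instance, the classic deterministic economic order quantity (EOQ) model balances
ordering cost against holding cost with EOQ = √(2DS/H). A minimal Python sketch,
with assumed figures:

import math

D = 12000   # assumed annual demand (units per year)
S = 150.0   # assumed cost of placing one order (Rs.)
H = 4.0     # assumed cost of holding one unit for a year (Rs.)

eoq = math.sqrt(2 * D * S / H)               # economic order quantity
orders_per_year = D / eoq
total_cost = (D / eoq) * S + (eoq / 2) * H   # ordering cost + holding cost
print(round(eoq), round(orders_per_year, 1), round(total_cost))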
4. Network Models
Networking models are extensively used in planning, scheduling and controlling complex
projects which can be represented in the form of a network of various activities and
sub-activities.
Two of the most important and commonly used networking models are
1. Critical Path Method (CPM) and
2. Programme Evaluation & Review Technique (PERT).
PERT is the better known and more extensively applied of the two. It involves finding
the time requirements of a given project and allocating scarce resources to complete the
project as scheduled, i.e. within the planned stipulated time and with minimum cost.
5. Sequencing Models
Sequencing models deal with the selection of the most appropriate or optimal sequence
in which a series of jobs can be performed on different machines so as to maximize the
operational efficiency of the system. For example, consider a job shop where X jobs are
required to be processed on Y machines. Different jobs require different amounts of
time on different machines and each job must be processed on all the machines. In what
order should the jobs be processed so as to minimize the total processing time of all the
jobs? There are several variations of the same problem which can be evaluated by
sequencing models with different kinds of optimization criteria. Hence, sequencing is
primarily concerned with those problems in which the efficiency of operations depends
solely upon the sequence of performing a series of jobs.
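One well-known special case, not worked out in the text, is the two-machine flow shop,
which can be sequenced optimally by Johnson's rule. A minimal Python sketch with
assumed processing times:

# Johnson's rule: jobs whose shortest time is on machine 1 go as early as
# possible; jobs whose shortest time is on machine 2 go as late as possible.
jobs = {"J1": (3, 6), "J2": (7, 2), "J3": (4, 7), "J4": (5, 3), "J5": (2, 8)}

front, back = [], []
for job, (m1, m2) in sorted(jobs.items(), key=lambda kv: min(kv[1])):
    if m1 <= m2:
        front.append(job)    # shortest time on machine 1 -> schedule early
    else:
        back.insert(0, job)  # shortest time on machine 2 -> schedule late

print(front + back)          # optimal order: ['J5', 'J1', 'J3', 'J4', 'J2']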
6. Competitive Problems Models
Competitive problems deal with making decisions under conflict caused by opposing
interests or under competition.
Many problems related to business, such as bidding for the same contract, competing for
market share and negotiating with labour unions and other associations, involve intense
competition. Game theory is the OR technique used in such situations, where only one of
the two or more players can win.
However, the competitive model has yet to find widespread industrial and business
acceptability. Its biggest drawback is that it is too idealistic in outlook and fails to take
into consideration the actual reality and other related factors within which an
organisation has to operate.
7. Queuing or Waiting Line Models
Any problem that involves waiting before the required service can be provided is termed
a queuing or waiting-line problem. These models seek to ascertain the various important
characteristics of queuing systems such as:
1. average time spent in line by a customer,
2. average length of the queue etc.
Waiting-line models find very wide applicability across virtually every organisation and
in our daily life. Examples of queuing or waiting-line situations are:
1. waiting for service in a bank,
2. waiting lists in schools,
3. waiting for purchases etc.
These models aim at minimizing the cost of providing service. Most realistic waiting-line
problems are extremely complex, and often simulation is used to analyze such situations.
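As a sketch of such characteristics, the steady-state formulas for the simplest
single-server (M/M/1) queue can be evaluated directly; the arrival and service rates
below are assumed for illustration:

lam = 8.0    # assumed average arrivals per hour
mu = 10.0    # assumed average customers served per hour (must exceed lam)

rho = lam / mu                   # utilisation of the server
L = rho / (1 - rho)              # average number of customers in the system
Lq = rho**2 / (1 - rho)          # average length of the queue
W = 1 / (mu - lam)               # average time spent in the system (hours)
Wq = lam / (mu * (mu - lam))     # average waiting time in the queue (hours)
print(rho, L, Lq, W, Wq)         # 0.8, 4.0, 3.2, 0.5, 0.4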
8. Replacement Models
These models are concerned with determining the optimal time required to replace
equipment or machinery that deteriorates or fails. Hence it seeks to formulate the optimal
replacement policy of an organization.
For example, when should an old machine in the factory be replaced with a newer one,
or at what interval should an old car be replaced? In all such cases there exists an
economic trade-off between the increasing and the decreasing cost functions.
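A minimal Python sketch of this trade-off, with assumed purchase, running and resale
figures: the machine should be replaced in the year with the lowest average annual cost.

purchase = 60_000                                          # assumed purchase price (Rs.)
running = [5_000, 6_500, 8_000, 10_000, 13_000, 17_000]    # running cost, years 1..6
resale = [42_000, 30_000, 20_400, 14_400, 9_650, 7_000]    # resale value, years 1..6

best = None
for year in range(1, len(running) + 1):
    total = purchase - resale[year - 1] + sum(running[:year])
    avg = total / year                 # average annual cost if replaced this year
    if best is None or avg < best[1]:
        best = (year, avg)
print(best)                            # replace where average annual cost is lowest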
Step VI
Implementation stage / establishing control mechanisms.
4. Alternative Courses of Action.
There must be alternative courses of action to choose between: raw materials can be
purchased from different suppliers, the finished goods can be sold to various markets, and
production can be done with the help of different machines.
5. Non-Negative Restrictions.
Since a negative value of any physical quantity has no meaning, all the variables must
assume non-negative values. If some of the variables are unrestricted in sign, certain
mathematical tools can enforce the non-negativity restriction without altering the original
information contained in the problem.
6. Linearity Criterion.
The relationship among the various decision variables must be directly proportional, i.e.
both the objective function and the constraints must be expressed in terms of linear
equations or inequalities. For example, if one of the factor inputs (resources like material,
labour, plant capacity etc.) increases, the final output should increase proportionately.
Such linear equations and inequalities can be represented graphically as straight lines.
7. Additivity.
It is assumed that the total profitability and the total amount of each resource utilized
will be exactly equal to the sum of the respective individual amounts. Thus the functions
of the activities must be additive, and interaction among the activities of the resources
must not exist.
8. Mutually Exclusive Criterion.
All decision parameters and variables are assumed to be mutually exclusive. In other
words, the occurrence of any one variable rules out the simultaneous occurrence of the
others.
9. Divisibility.
Variables may be assigned fractional values, i.e. they need not always be whole numbers.
If a fraction of a product cannot be produced, an integer-programming problem exists.
Thus, continuous values of the decision variables and resources must be permissible in
obtaining an optimal solution.
10. Certainty
It is assumed that conditions of certainty exist, i.e. all the relevant parameters or
coefficients in the Linear Programming model are fully and completely known and do
not change during the period. However, such an assumption may not hold good at all
times.
11. Finiteness.
Linear Programming assumes the presence of a finite number of activities and constraints,
without which it is not possible to obtain the best or optimal solution. Now it is time to
examine the advantages as well as the limitations of Linear Programming.
4. Multiplicity of Goals.
The long-term objectives of an organisation are not confined to a single goal. An
organisation, at any point of time in its operations, has a multiplicity of goals, or a goals
hierarchy, all of which must be attained on a priority basis for its long-term growth.
Some common goals are profit maximization or cost minimization, retaining market
share, maintaining a leadership position and providing quality service to the consumers.
In cases where the management has conflicting, multiple goals, the Linear Programming
model fails to provide an optimal solution, the reason being that under Linear
Programming techniques there is only one goal which can be expressed in the objective
function. Hence, in such circumstances, the given problem has to be solved by the help of
a different mathematical programming technique called Goal Programming.
5. Flexibility.
Once a problem has been properly quantified in terms of the objective function and the
constraint equations and the tools of Linear Programming are applied to it, it becomes
very difficult to incorporate any changes in the system arising on account of any change
in the decision parameters. Hence, it lacks the desired operational flexibility.
The basic model of Linear Programming:
Linear Programming is a mathematical technique for generating and selecting the optimal
or best solution for a given objective function. Technically, Linear Programming may be
formally defined as a method of optimizing (i.e. maximizing or minimizing) a linear
function subject to a number of constraints stated in the form of linear inequalities.
Mathematically, the problem of Linear Programming may be stated as the optimization
of a linear objective function of the following form:

Z = c1x1 + c2x2 + ... + cixi + ... + cnxn

subject to linear constraints of the form:

a11x1 + a12x2 + a13x3 + ... + a1ixi + ... + a1nxn >= or <= b1
a21x1 + a22x2 + a23x3 + ... + a2ixi + ... + a2nxn >= or <= b2
...
and x1 >= 0, x2 >= 0, ..., xn >= 0.

These last conditions are called the non-negativity constraints. From the above, it is clear
that an LP problem has:
(i) a linear objective function which is to be maximized or minimized;
(ii) various linear constraints which are simply the algebraic statement of the limits of the
resources or inputs at the disposal;
(iii) non-negativity constraints.
Linear Programming is one of the few mathematical tools that can be used to provide
solution to a wide variety of large, complex managerial problems.
(ii) By putting the values of the corner-point co-ordinates into the objective
function, calculate the profit (or the cost) at each of the corner points.
(iii) In a maximisation problem, the optimal solution occurs at the corner point
which gives the highest profit.
(iv) In a minimisation problem, the optimal solution occurs at the corner point
which gives the lowest cost.
(b) Iso-Profit (or Iso-Cost) Method. The term iso-profit signifies that any combination
of points on the same line produces the same profit as any other combination on that line.
The various steps involved in this method are given below.
(i) Selecting a specific figure of profit or cost, an iso-profit or iso-cost line is
drawn so that it lies within the shaded area.
(ii) This line is moved parallel to itself, farther from or closer to the origin, up to
the point after which any further movement would cause the line to fall totally
outside the feasible region.
(iii) The optimal solution lies at the point of the feasible region which is touched
by the highest possible iso-profit or the lowest possible iso-cost line.
(iv) The co-ordinates of the optimal point (x, y) are calculated with the help of
simultaneous equations and the optimal profit or cost is ascertained.
Example: A retired person wants to invest up to an amount of Rs. 30,000 in fixed-income
securities. His broker recommends investing in two bonds: bond A yielding 7% per
annum and bond B yielding 10% per annum. After some consideration he decides to
invest at most Rs. 12,000 in bond B and at least Rs. 6,000 in bond A. He also wants the
amount invested in bond A to be at least equal to the amount invested in bond B. What
should the broker recommend if the investor wants to maximize his return on investment?
Solve graphically.
Solution. Designate the decision variables x1 and x2 as the amounts invested in bond A
and bond B respectively. Then the appropriate mathematical formulation of the given
problem as an LP model is:
Maximize (total return) Z = 0.07x1 + 0.10x2
subject to the constraints
x1 + x2 <= 30,000
x1 >= 6,000
x2 <= 12,000
x1 - x2 >= 0
x1 >= 0, x2 >= 0
Plotting the constraints graphically as shown. The shaded portion represents the feasible
region and the corner points of the feasible region are A, B, C, D and E.
The values of the objective function at the corner points are summarized in the following
table:

Corner Point   Coordinates (x1, x2)   Objective function Z = 0.07x1 + 0.10x2
A              (6,000; 0)             420
B              (6,000; 6,000)         1,020
C              (12,000; 12,000)       2,040
D              (18,000; 12,000)       2,460
E              (30,000; 0)            2,100
Thus the maximum value of Z is Rs. 2,460 and it occurs when x1 = 18,000 and x2 =
12,000. Hence, the person should invest Rs. 18,000 in bond A and Rs. 12,000 in bond B.
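The graphical result can be cross-checked with a solver such as scipy's linprog (which
minimises, so the returns are negated); this is a verification sketch, not part of the
original solution:

from scipy.optimize import linprog

c = [-0.07, -0.10]    # negated returns, since linprog minimises
A_ub = [[1, 1],       # x1 + x2 <= 30,000 (total funds available)
        [0, 1],       # x2 <= 12,000 (at most Rs. 12,000 in bond B)
        [-1, 1]]      # x1 >= x2, rewritten as -x1 + x2 <= 0
b_ub = [30_000, 12_000, 0]
bounds = [(6_000, None), (0, None)]   # x1 >= 6,000; x2 >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)                # [18000. 12000.] and 2460.0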
Example: X Ltd wishes to purchase a maximum of 3,600 units of a product. Two types of
the product, A and B, are available in the market. Product A occupies a space of 3 cubic
feet and costs Rs. 9 per unit, whereas product B occupies a space of 1 cubic foot and
costs Rs. 13 per unit. The budgetary constraints of the company do not allow spending
more than Rs. 39,000. The total availability of space in the company's godown is 6,000
cubic feet. The profit margins on products A and B are Rs. 3 and Rs. 4 per unit
respectively. Formulate this as a linear programming model and solve using the graphical
method. You are required to ascertain the best possible combination of purchases of A
and B so that the total profits are maximized.
Step VIII. Finding the Optimal Solution. Always keep in mind two things:
For >= constraints the feasible region will be the area which lies above the constraint
lines, and for <= constraints it will lie below the constraint lines. This is useful in
identifying the feasible region.
According to a theorem on linear programming, an optimal solution to a problem (if it
exists) is found at a corner point of the solution space.
Step IX. At corner points (O, A, B, C), find the profit value from the
objective function. The point which maximizes the profit is the optimal
point.
Corner Point   Co-ordinates   Objective function Z = 3x1 + 4x2   Value
O              (0, 0)         Z = 0 + 0                          0
A              (0, 3000)      Z = 0 + 4 x 3000                   12000
C              (2000, 0)      Z = 3 x 2000 + 0                   6000
For point B, solve the equations 9x1 + 13x2 = 39,000 and 3x1 + x2 = 6,000 (the two lines
which intersect at B):
3x1 + x2 = 6,000      ...(1)
9x1 + 13x2 = 39,000   ...(2)
On solving we get x1 = 1,300 and x2 = 2,100, so at B the profit is Z = 3(1,300) +
4(2,100) = 12,300, which exceeds the values at O, A and C. Hence B is the optimal point.
Example (the furniture problem): a firm makes tables and chairs; let x be the number of
tables and y the number of chairs produced per week. The data are:

                    Tables (x)      Chairs (y)      Constraint
Number produced
per week            x               y               cannot be negative: x >= 0, y >= 0
Carpentry           4 hr/table      3 hr/chair      maximum 240 hrs per week: 4x + 3y <= 240
Finishing           2 hr/table      1 hr/chair      maximum 100 hrs per week: 2x + y <= 100
Profit              $70 per table   $50 per chair
The total profit ($) for the week is given by the objective function
P = 70x + 50y
When the simplex method is used in the furniture problem, the objective function
is written in terms of four variables. If the problem has a solution, then the
solution occurs at one of the vertices of a region in four-dimensional space. We
start at one of the vertices and check the neighbouring vertices to see which ones
provide a better solution. We then move to one of the vertices that give a better
solution. The process is repeated until the target vertex is reached.
The first step of the simplex method requires that each inequality be converted into an
equation. Less-than-or-equal-to inequalities are converted to equations by including slack
variables. Suppose s1 carpentry hours and s2 finishing hours remain unused in a week.
The constraints become:
4x + 3y + s1 = 240 or 4x + 3y + 1s1 + 0s2 = 240
2x + y + s2 = 100 or 2x + y + 0s1 + 1s2 = 100
As unused hours result in zero profit, the slack variables can be included in the objective
function with zero coefficients:
P = 70x + 50y + 0s1 + 0s2
The problem can now be considered as solving a system of 3 linear equations involving
the 5 variables x, y, s1, s2, P in such a way that P has the maximum value:
4x + 3y + 1s1 + 0s2 + 0P = 240
2x + y + 0s1 + 1s2 + 0P = 100
-70x - 50y + 0s1 + 0s2 + 1P = 0
The system of linear equations can be written as a 3 x 6 augmented matrix:

  x     y    s1    s2    P  |  RHS
  4     3     1     0    0  |  240
  2     1     0     1    0  |  100
-70   -50     0     0    1  |    0
The slack variables s1 and s2 form the initial solution mix. The initial solution assumes
that all available hours are unused, i.e. the slack variables take the largest possible values.
Variables in the solution mix are called basic variables. Each basic variable has a column
consisting of all 0s except for a single 1. All variables not in the solution mix take the
value 0.
The simplex method uses a four-step process (based on the Gauss-Jordan method for
solving a system of linear equations) to go from one tableau, or vertex, to the next. In this
process, a basic variable in the solution mix is replaced by another variable previously
not in the solution mix. The value of the replaced variable is set to 0.
Step 1
Select the pivot column (determine which variable to enter into the solution mix). Choose
the column with the most negative element in the objective function row.
Basic Variables |   x     y    s1    s2    P  |  RHS
s1              |   4     3     1     0    0  |  240
s2              |   2     1     0     1    0  |  100
P               | -70   -50     0     0    1  |    0

The pivot column is the x column (most negative entry, -70).
x should enter into the solution mix because each unit of x (a table) contributes a profit
of $70 compared with only $50 for each unit of y (a chair).
Step 2
Select the pivot row (determine which variable to replace in the solution mix). Divide the
last element in each row by the corresponding element in the pivot column. The pivot
row is the row with the smallest non-negative result.
Basic Variables |   x     y    s1    s2    P  |  RHS    Ratio
s1              |   4     3     1     0    0  |  240    240/4 = 60
s2              |   2     1     0     1    0  |  100    100/2 = 50  <- pivot row
P               | -70   -50     0     0    1  |    0
S2 should be replaced by x in the solution mix. 60 tables can be made with 240 unused
carpentry hours but only 50 tables can be made with the 100 unused finishing hours.
Therefore we decide to make 50 tables.
Step 3
Calculate new values for the pivot row. Divide every number in the row by the pivot
number.
R2 / 2:
Basic Variables |   x     y    s1    s2    P  |  RHS
s1              |   4     3     1     0    0  |  240
x               |   1    1/2    0    1/2   0  |   50
P               | -70   -50     0     0    1  |    0
Step 4
Use row operations to make all numbers in the pivot column equal to 0 except for the
pivot number which remains as 1.
R1 - 4 x R2 and R3 + 70 x R2:

Basic Variables |   x     y    s1    s2    P  |  RHS
s1              |   0     1     1    -2    0  |   40
x               |   1    1/2    0    1/2   0  |   50
P               |   0   -15     0    35    1  | 3500
If 50 tables are made, then the unused carpentry hours are reduced by 200 hours (4
h/table multiplied by 50 tables); the value changes from 240 hours to 40 hours. Making
50 tables results in the profit being increased by $3500 ($70 per table multiplied by 50
tables); the value changes from $0 to $3500.
The new tableau represents the solution x = 50, y = 0, P = $3500, i.e. the vertex (50, 0).
The existence of 40 unused carpentry hours suggests that a more profitable solution can
be found. For each table removed from the solution, 4 carpentry hours and 2 finishing
hours are made available. If 2 unused carpentry hours are also taken from the 40
available, then 2 chairs can be made with these 6 carpentry hours and 2 finishing hours.
Therefore, if 1 table is replaced by 2 chairs, the marginal increase in profit is $30 (2 x $50
less $70).
Now repeat the steps until there are no negative numbers in the last row.
Step 1
Select the pivot column. y should enter into the solution mix.
Basic Variables |   x     y    s1    s2    P  |  RHS
s1              |   0     1     1    -2    0  |   40
x               |   1    1/2    0    1/2   0  |   50
P               |   0   -15     0    35    1  | 3500

The pivot column is now the y column (most negative entry, -15).
Each unit of y (a chair) added to the solution contributes a marginal increase in profit of
$15.
Step 2
Select the pivot row. s1 should be replaced by y in the solution mix.
Basic Variables |   x     y    s1    s2    P  |  RHS    Ratio
s1              |   0     1     1    -2    0  |   40    40/1 = 40  <- pivot row
x               |   1    1/2    0    1/2   0  |   50    50/(1/2) = 100
P               |   0   -15     0    35    1  | 3500
40 chairs is the maximum number that can be made with the 40 unused carpentry hours.
Step 3
Calculate new values for the pivot row. As the pivot number is already 1, there is no need
to calculate new values for the pivot row.
Step 4
Use row operations to make all numbers in the pivot column equal to 0 except for the
pivot number.
R2 - 1/2 x R1 and R3 + 15 x R1:

Basic Variables |   x     y    s1    s2    P  |  RHS
y               |   0     1     1    -2    0  |   40
x               |   1     0   -1/2   3/2   0  |   30
P               |   0     0    15     5    1  | 4100
If 40 chairs are made, then the number of tables is reduced by 20 (1/2 table per chair
multiplied by 40 chairs); the value changes from 50 tables to 30 tables. The replacement
of 20 tables by 40 chairs results in the profit being increased by $600 ($15 per chair
multiplied by 40 chairs); the value changes from $3500 to $4100.
The new tableau represents the solution x = 30, y = 40, i.e. the vertex (30, 40).
As the last row contains no negative numbers, this solution gives the maximum value of
P. The maximum profit of $4100 occurs when 30 tables and 40 chairs are made. There
are no unused hours.
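The simplex result can likewise be cross-checked with scipy's linprog; again this is a
verification sketch rather than part of the original worked example:

from scipy.optimize import linprog

c = [-70, -50]            # negated profits, since linprog minimises
A_ub = [[4, 3],           # carpentry: 4x + 3y <= 240
        [2, 1]]           # finishing: 2x + y <= 100
b_ub = [240, 100]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)    # [30. 40.] and 4100.0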
Example: X Ltd wishes to purchase a maximum of 3,600 units of a product. Two types of
the product, A and B, are available in the market. Product A occupies a space of 3 cubic
feet and costs Rs. 9 per unit, whereas product B occupies a space of 1 cubic foot and
costs Rs. 13 per unit. The budgetary constraints of the company do not allow spending
more than Rs. 39,000. The total availability of space in the company's godown is 6,000
cubic feet. The profit margins on products A and B are Rs. 3 and Rs. 4 per unit
respectively.
Formulate this as a linear programming model and solve using the graphical method. You
are required to ascertain the best possible combination of purchases of A and B so that
the total profits are maximized.
Solution: Let x1 = number of units of product A and
x2 = number of units of product B.
Then the problem can be formulated as an LP model as follows:
Objective function:
Maximise Z = 3x1 + 4x2
Constraint equations:
x1 + x2 <= 3600 (maximum units constraint)
3x1 + x2 <= 6000 (storage area constraint)
9x1 + 13x2 <= 39000 (budgetary constraint)
x1 >= 0, x2 >= 0 (non-negativity)
Step I. Treating all the constraints as equality, the first constraint is
x1+ x2=3600
Step II. Determine the set of the points which satisfy the constraint:
x1 + x2 = 3600
This can easily be done by verifying whether the origin (0, 0) satisfies the constraint.
Here 0 + 0 <= 3600 holds, hence all the points on the origin side of the line (below it)
satisfy the constraint.
Step III. The 2nd constraint is: 3x1 + x2 <= 6000.
Step IX. At corner points (O, A, B, C), find the profit value from the objective function.
The point which maximizes the profit is the optimal point.