0% found this document useful (0 votes)
42 views181 pages

P RQIl YGZ6 T3 Qhesp 2 VKT OWf 5 Xpy CBT H737 SN ZR1 Z

Uploaded by

anujboy322
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views181 pages

P RQIl YGZ6 T3 Qhesp 2 VKT OWf 5 Xpy CBT H737 SN ZR1 Z

Uploaded by

anujboy322
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 181

Basic Statistics

Lesson 1
Introduction Of Statistics, Scope And Types Of
Data

Multiple Choice Questions


0
Basic Statistics

Course Name Basic Statistics


Introduction Of Statistics, Scope And
Lesson 1
Types Of Data
Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Objective of the Lesson:


1. Origin and growth of statistics
2. Importance and characteristics of statistics
3. Limitations and scope of statistics
4. Frequency distribution and its types

Glossary of Terms: Statistics, Scope, Frequency, Frequency Curve, etc

Introduction of Statistics, Scope and Types of Data

1.1 Introduction: Modern age is the age of science which


requires that every aspect, whether it pertains to natural
phenomena, politics, economics or any other field, should
be expressed in an unambiguous and precise form. A
phenomenon expressed in ambiguous and vague terms
might be difficult to understand in proper perspective.
Therefore, in order to provide an accurate and precise
explanation of a phenomenon or a situation, figures are
often used. The statement that prices in a country are
increasing conveys only an incomplete information about
the nature of the problem. However, if the figures of prices
of various years are also provided, we are in a better
position to understand the nature of the problem. In
addition to this, these figures can also be used to compare
the extent of price changes in a country vis-a-vis the changes
in prices of some other country. Using these figures, it might
be possible to estimate the possible level of prices at some
future date so that some policy measures can be suggested
to tackle the problem. The subject which deals with such
type of figures, called data, is known as Statistics.
The word ‘Statistics’ is probably derived from the Latin word
‘status’ or the Italian word ‘statista’ or the German word

2
Basic Statistics

‘statistik’, each of which means a ‘political state’. The word


‘Statistics’ is used in singular as well as in plural sense. As a
plural, statistics may be defined as the numerical data
relating to an aggregate of individuals and as a singular it is
defined as the science of collection, organization,
presentation, analysis and interpretation of numerical data.
1.2 Definition: Statistics has been defined differently by
various authors from time to time. One can find more than
hundred definitions in the literature but no definition can
give a complete picture of the fast growing subject of
Statistics. Important definitions are given below.
 It is the branch of science which deals with the collection,
classification and tabulation of numerical facts as the basis
for explanations, description and comparison of
phenomenon by Lovitt
 The science which deals with the collection, analysis and
interpretation of numerical data by Corxton and Cowden.
 The science of statistics is the method of judging collective,
natural or social phenomenon form the results obtained
from the analysis or enumeration or collection of estimates
by Kings.
 Statistics is a science of estimates and probabilities-Boddington

 Statistics is a branch of science, which provides tools


(techniques) for decision making in the face of uncertainty
(Probability) by Wallis and Roberts and this is the modern
definition of statistics which covers the entire body of
statistics.
 According to Sir R.A. Fisher “The science of Statistics is

3
Basic Statistics

essentially a branch of applied mathematics and may be


regarded as mathematics applied to observational data”.
Fisher’s definition is most exact in the sense that it covers all
aspects and fields of Statistics viz. Collection, Organization,
Presentation, Analysis and Interpretation of data.The credit
for application of statistics to diverse fields of biological
sciences goes to Sir R.A. Fisher (1890-1962) who is also
known as “Father of Modern Statistics”.
1.3 Scope of Statistics:

During last few decades statistics has penetrated into almost


all sciences like agriculture, biology, business, social,
engineering, medical, etc. Statistical methods are commonly
used for analyzing and interpreting experimental data. Also,
wide and varied applications have lead to the growth of
many new branches of statistics such as Industrial Statistics,
Biometrics, Biostatistics, Agricultural Statistics and the most
recently developed Statistical Bioinformatics.
From definitions we may conclude that the Statistics is the
science that transforms data into information and role of
statisticians is to serve science and society through the
development, understanding, and dissemination of state-
of-the art techniques for collecting, presenting, analyzing,
and drawing inferences from data. In brief we may
summarize the scope of statistics as follows:
(a) Statistics
has great significance in the field of physical and
natural sciences. It is used in propounding and verifying
scientific laws. Statistics is often used in agricultural and
biological research for efficient planning of experiments and
for interpreting experimental data.

4
Basic Statistics

(b) Statistics is of vital importance in economic planning.


Priorities of planning are determined on the basis of the
statistics related to the resource base of the country and the
short-term and long-term needs of the country.
(c) Statisticaltechniques are used to study the various
economic phenomena such as wages, price analysis,
analysis of time series, demand analysis etc.
(d) Successful business executives make use of statistical
techniques for studying the needs and future prospects of
their products. The formulation of a production plan in
advance is a must, which cannot be done in absence of the
relevant details and their proper analysis, which in turn
requires the services of a trained statistician.
(e) In industry, the statistical tools are very helpful in the
quality control and assessment. In particular, the inspection
plans and control charts are of immense importance and are
widely used for quality control purposes.
1.4. Limitations of Statistics:

i) Statistical methods are best applicable to quantitative data.

ii) Statistical decisions are subject to certain degree of error.

iii)Statistical
laws do not deal with individual observations
but with a group of observations.
iv) Statistical conclusions are true on an average.

v) Statistics is liable to be misused. The misuse of statistics


may arise because of the use of statistical tools by
inexperienced and untrained persons.
5
Basic Statistics

vi) Statistical
results may lead to fallacious conclusions if
quoted out of context or manipulated.
1.5 Concepts, Definitions, Frequency Distributions & Frequency Curves
1.5.1 Raw Data: The data collected by an investigator which have
not been organized numerically and used by anybody else.
1.5.2 Array: An arrangement of raw numerical data in ascending or
descending order of magnitude. The data can also be classified into
Primary data and Secondary data.
1.5.3 Primary data: The data collected directly from the original
source is called the primary data i.e. the data collected for the first
time. The primary data may be collected by:
1.Direct interview method
2.Through mail
3.Through designed experiments

1.5.3.1 Direct interview method: In this method the investigator


contacts the units/individuals and has personal interview. The
information is recorded on the questionnaire or schedule. This
information will be more reliable and correct but more expenditure
may be involved and more time will be spent as the person himself
will be going from place to place to collect the data.
1.5.3.2 Through mail: The data may be collected through
correspondence. The questionnaire or schedules are sent by mail
with the instructions for filling the same and return. It is less costly
to get the data by mail. The main drawback of this method is the
poor response. Usually the response by mail in surveys has been
found to be about 40%.
1.5.3.3 Through designed experiments: Data are generated as
outcome of the research conducted by the investigator himself.

6
Basic Statistics

1.5.4 Secondary Data: Sometimes we find that the data which we


need had already been collected by some agencies for their study
or the data are available in the published records. We may make
use of such collected data, which is known as secondary data. The
data, which have already been collected by some agency and have
been processed or used at least once are called secondary data.
Secondary data may be collected from organizations or private
agencies, government records, journals etc.
1.5.5 Variable: It is a common characteristics in biological science. A
quantitative and qualitative characteristic that varies from
observation to observations in the same group is called a variable.
In case of quantitative variables, observations are made using
interval scales whereas in case of qualitative variables nominal
scales are used. Conventionally, the quantitative variables are
termed as variables and qualitative variables are termed as
attributes. Thus, yields of a crop, available nitrogen in soil, daily
temperature, number of leaves per plant and number of eggs laid
by insects are all variables. A quantity that varies from individual to
individual is called a variable, e.g. height, weight etc. Variables are
of two types i.e. discrete variable and continuous variable
1.5.6 Discrete Variable: A variable that takes only specific values in
a given range, usually the integral values e.g. number of students in
a college, number of petals in a flower, number of tillers in a plant
etc.
1.5.7 Continuous Variable: A variable which can theoretically
assume any value between two given values is called a continuous
variable. A continuous variable can take any value within a certain
range, for example yield of a crop, height of plants and birth rates
etc.
1.6 Classification of data on the basis of Scales
7
Basic Statistics

Four levels or scales of Data measurement are:

i) Nominal Scale: Lowest level where only names are meaningful


ii) Ordinal Scale: Ordinal adds an order to the names.
iii) Interval Scale: Interval adds meaningful differences

iv) Ratio Scale: Ratio adds a zero so that ratios are meaningful.

Frequency: The number of times an individual item is repeated in a


series is called its frequency. In case of grouped data, the number
of observations lying in any class is known as the frequency of that
class.
Frequency Distribution: It is tabular arrangement of data values
along with their along with their frequencies.
Cumulative Frequency (less than type): The cumulative frequency
corresponding to any value or class is the number of observations
less than or equal to that value or upper limit of that class. It may
also be defined as the total of all frequencies up to the value or the
class. On similar lines we can define more than type cumulative
frequencies.
Relative Frequency: The relative frequency of a class is the
frequency of the class divided by the total frequency of all the
classes and is generally expressed as a percentage.
Frequency of the class
Relative Frequency =
Total frequency of all classes

Rules for Constructing a Frequency Distribution:

The following points should be borne in mind while tabulating or


classifying an observed frequency distribution.
8
Basic Statistics

1. The classes should be well defined and non-overlapping.

2. As far as possible the class interval should be of equal width.

3. The classes should be exhaustive i.e. the range of the classes


should cover the entire range of the data.
4. As a general rule, the number of classes should be between 10 and
15 and never more than 20 and not less than 5. However the exact
number depends upon the data in hand.
5. Open-ended classes should be avoided.

Note: It is not necessary to choose the smallest value as the lower


limit of the lowest class or the largest value as the upper limit of
the highest class.
Struge’s formula: A numerical formula as suggested by H.A. Struge
may be used for determining approximately the class size and the
number of classes. According to this formula the number of classes
(k) is given
k = 1+ 3.322 log10 N, where N is the number of observations. Then
class size is determined as
Largest value − Smallest test value
Class width (h) =
Number of Classes
Range
=
k
Ungrouped or discrete
frequency distribution:

When the number of observations in the data is small, then


the listing of the frequency of occurrence against the value
of variable is called the discrete frequency distribution. For
example raw data showing the number of children of 20

9
Basic Statistics

families:
2, 0, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4,1, 2, 2, 3

The number of children can be considered as the variables X


and the frequency of occurrence can be listed as below:

Number of 0 1 2 3 4
children
Frequency 3 4 6 4 3

Grouped (Continuous) frequency distribution:

When the data is very large it becomes necessary to


condense the data into a suitable number of class interval
of the variable along with the corresponding frequencies.
The following two methods of classification are used.
a) Exclusive Method: In this method, the upper limit of any
class interval is kept the same as the lower limit of the just
higher class or there is no gap between upper limit of class
and lower limit of just class. It is continuous distribution.
For example:

Class Frequenc
y
0-10 2
10-20 4
20-30 5

10
Basic Statistics

30-40 3
40-50 1

b) Inclusive method: There will be a gap between the upper


limit of any class and the lower limit of just higher class. It is
discontinuous distribution.
For example:

Class Frequenc
y
0-9 2
10-19 4
20-29 5
30-39 3
40-49 1

One can convert discontinuous distribution to continuous


distribution by subtracting half of the gap (0.5 in this case)
from lower limit and by adding the same quantity to the
upper limit.
Example 1: Construct a frequency distribution
table for he following data: 25, 32, 45, 8, 24, 42,
22, 12, 9, 15, 26, 35, 23, 41, 47, 18, 44, 37,
27, 46, 38, 24, 43, 6, 10, 21, 36, 45, 22, 18

Solution:
Number of observation (N) =

11
Basic Statistics

30
Number of classes(k) = 1 + 3.322 log 30 = 5.9 = 6(approx)

𝐌𝐚𝐱𝐢𝐦𝐮𝐦 𝐕𝐚𝐥𝐮𝐞 − 𝐌𝐢𝐧𝐢𝐦𝐮𝐦 𝐕𝐚𝐥𝐮𝐞


𝐂𝐥𝐚𝐬𝐬 𝐬𝐢𝐳𝐞 (𝐡) =
𝐍𝐮𝐦𝐛𝐞𝐫 𝐨𝐟 𝐂𝐥𝐚𝐬𝐬𝐞𝐬
𝟒𝟔 − 𝟔
= ≅𝟕
𝟔

Inclusive Method:

Cla Tally Fre


ss Mar qu
ks enc
y
6 - 11 IIII 4

12 - 17 I 1
18 - 23 IIIII II 7
24 - 29 IIIII 5
30 - 35 II 2
36 - 41 IIII 4
42 - 47 IIIII II 7

Exclusive method:

Cla Tally Fre

12
Basic Statistics

ss Mar qu
ks enc
y
5.5 – IIII 4
11.5
11.5 – I 1
17.5
17.5 – IIIII II 7
23.5
23.5 – IIIII 5
29.5
29.5 – II 2
35.5
35.5 – IIII 4
41.5
41.5 – IIIII II 7
47.5

13
Basic Statistics

Lesson 2
Measures of Central Tendency

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 2 Measures Of Central Tendency


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Objectives of the Lesson:


1. Characteristics for ideal averages
2. Various measures of Central tendency
3. Merit and Demerits of Various measures of central
tendency
4. Deciles and Percentiles
Glossary of Terms: Mean, Median, Mode, Harmonic Mean, Geometric
Mean, etc.

In the study of a population with respect to one in which we are


interested we may get a large number of observations. It is not possible to
grasp any idea about the characteristic when we look at all the
observations. So it is better to get one number for one group. That number
must be a good representative one for all the observations to give a clear
picture of that characteristic. Such representative number can be a central
value for all these observations. This central value is called a measure of
central tendency or an average or a measure of locations.
3.1 Types of Averages:
There are five averages. Among them mean, median and mode are called
simple averages and the other two averages geometric mean and
harmonic mean are called special averages.
3.2 Characteristics for a good or an ideal average:
The following properties should possess for an ideal average.
1. It should be rigidly defined.
2. It should be easy to understand and compute.
3. It should be based on all items in the data.
4. Its definition shall be in the form of a mathematical formula.
5. It should be capable of further algebraic treatment.

2
Basic Statistics

6. It should have sampling stability.


7. It should be capable of being used in further statistical computations
or processing.
3.3 Arithmetic mean
The arithmetic mean (or, simply average or mean) of a set of numbers is
obtained by dividing the sum of numbers of the set by the number of
numbers. If the variable x assumes n values 𝑥1 , 𝑥2 … 𝑥𝑛 then the mean, is
given by
𝑥1 + 𝑥2 + 𝑥3 + ⋯ + 𝑥𝑛 ∑𝑥
𝑋̅ = =
𝑛 𝑛
Example 1: Calculate the mean for 2, 4, 6, 8, and 10.

Solution:
2 + 4 + 6 + 8 + 10 30
𝑀𝑒𝑎𝑛 = = =6
5 5
Direct method : If the observations 𝑥1 , 𝑥2 … 𝑥𝑛 have
frequencies 𝑓1 , 𝑓2 , 𝑓3 , … , 𝑓𝑛 respectively, then the mean is given by :
(𝑓1 𝑥1 + 𝑓2 𝑥2 + … + 𝑓𝑛 𝑥𝑛 ∑𝑓𝑖 𝑥𝑖
𝑀𝑒𝑎𝑛(𝑋̅ ) =) =
𝑓1 + 𝑓2 + ⋯ + 𝑓𝑛 ∑𝑓𝑖
This method of finding the mean is called the direct method.
Example 2: Given the following frequency distribution, calculate the
arithmetic mean
Marks (x) 50 55 60 65 70 75
No of Students (f) 2 5 4 4 5 5
Solution:

Marks (x) 50 55 60 65 70 75 Total

3
Basic Statistics

No of Students 2 5 4 4 5 5 25
(f)

fx 100 275 240 260 350 375 1600

(𝑓1 𝑥1 + 𝑓2 𝑥2 + … + 𝑓𝑛 𝑥𝑛 ∑𝑓𝑖 𝑥𝑖
𝑀𝑒𝑎𝑛(𝑋̅) =) =
𝑓1 + 𝑓2 + ⋯ + 𝑓𝑛 ∑𝑓𝑖

1600
= = 64
25

(ii) Short cut method: In some problems, where the number of variables is
large or the values of xi or fi are larger, then the calculations become
tedious. To overcome this difficulty, we use short cut or deviation method
in which an approximate mean, called assumed mean is taken. This
assumed mean is taken preferably near the middle, say A, and the
deviation di = xi − A is calculated for each variable Then the mean is
given by the formula:
∑𝑓𝑖 𝑥𝑖
𝑀𝑒𝑎𝑛(𝑋̅) = 𝐴 +
∑𝑓𝑖

Mean for a grouped frequency distribution

Example 3: Given the following frequency distribution, calculate the arithmetic mean

Marks (x) 50 55 60 65 70 75
No of Students (f) 2 5 4 4 5 5

Solution:

x f fx d=x- fd
A
50 2 100 -10 -20

4
Basic Statistics

55 5 275 -5 -25
60 4 240 0 00
65 4 260 +5 20
70 5 350 +10 50
75 5 375 +15 75
25 1600 100

By Direct method:

(𝑓1 𝑑1 + 𝑓2 𝑑2 + … + 𝑓𝑛 𝑑𝑛 ) ∑𝑓𝑖 𝑑𝑖
𝑀𝑒𝑎𝑛(𝑋̅ ) =) =
𝑓1 + 𝑓2 + ⋯ + 𝑓𝑛 ∑𝑓𝑖

1600
= = 64
25

By Short-cut method:

∑𝑓𝑖 𝑑𝑖
𝑀𝑒𝑎𝑛(𝑋̅ ) = 𝐴 +
𝑁

100
= 60 + = 60 + 4 = 64
25

Mean for a grouped frequency distribution

Find the class mark or mid-value x, of each class, as

𝑙𝑜𝑤𝑒𝑟 𝑙𝑖𝑚𝑖𝑡 + 𝑢𝑝𝑝𝑒𝑟 𝑙𝑖𝑚𝑖𝑡


𝑥𝑖 = 𝑐𝑙𝑎𝑠𝑠 𝑚𝑎𝑟𝑘𝑠 = ( )
2

Then

∑𝑓𝑖 𝑥𝑖 ∑𝑓𝑖 𝑑𝑖
𝑋̅ = 𝑜𝑟 𝑋̅ = 𝐴 + , 𝑑𝑖 = 𝑥𝑖 − 𝐴
∑𝑓𝑖 ∑𝑓𝑖

Example 4: Following is the distribution of persons according to different income groups.


Calculate arithmetic mean.

Inco 0 1 2 3 4 5 6
me - 0 0 0 0 0 0

5
Basic Statistics

SR 1 - - - - - -
(100) 0 2 3 4 5 6 7
0 0 0 0 0 0
Numb 6 8 1 1 7 4 3
er of 0 2
perso
ns

Solution:

Income Number Mid (𝑥𝑖 − 𝐴) fd


𝑑𝑖 =
C.I of X ℎ

Persons
(f)
0-10 6 5 -3 -
18
10-20 8 15 -2 -
16
20-30 10 25 -1 -
10
30-40 12 A 0 0
=35
40-50 7 45 1 7
50-60 4 55 2 8
60-70 3 65 3 9
Total 50 -
20
∑𝑓𝑖 𝑑𝑖 (𝑥𝑖 − 𝐴)
𝑋̅ = 𝐴 + × ℎ, 𝑑𝑖 =
∑𝑓𝑖 ℎ

−20
𝐴+ × 10 = 35 − 4 = 31
50

3.3.1 Merits and demerits of Arithmetic mean:


6
Basic Statistics

Merits:
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate
and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate even if some of the details of the data are
lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.
Demerits:
1. It cannot be obtained by inspection nor located through a frequency
graph.
2. It cannot be in the study of qualitative phenomena not capable of
numerical measurement i.e. Intelligence, beauty, honesty etc.,
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions, if the details of the data from
which it is computed are not given.

3.4 Harmonic mean (H.M.):


Harmonic mean of a set of observations is defined as the reciprocal of the
arithmetic average of the reciprocal of the given values. If 𝑥1 , 𝑥2 , . . , 𝑥𝑛 are
n observations,
𝑛
𝐻𝑀 =
∑𝑛𝑖=1(1/𝑥𝑖 )
For a frequency distribution

7
Basic Statistics

𝑛
𝐻𝑀 = 𝑛
1
∑ 𝑓( )
𝑖=1 𝑥𝑖

Example 5: From the given data calculate H. M. 5, 10, 17, 24, and 30.
Solution:
x 1
𝑥
5 0.2000
10 0.1000
17 0.0588
24 0.0417
30 0.0333
Total 0.4338

Hence,
𝑛
𝐻𝑀 =
∑𝑛𝑖=1(1/𝑥𝑖 )
5
= = 11.52
0.4338
Example 6: The marks secured by some students of a class are given below.
Calculate the harmonic mean.

Marks 20 21 22 23 24 25
Number of Students 4 2 7 1 3 1

Solution:

Marks No of 1 𝑓𝑖
𝑥𝑖 𝑥𝑖
x students
20 4 0.0500 0.2000
21 2 0.0476 0.0952

8
Basic Statistics

22 7 0.0454 0.3178
23 1 0.0435 0.0435
24 3 0.0417 0.1251
25 1 0.0400 0.0400
18 0.8216

Hence,
𝑛
𝐻𝑀 =
∑𝑛𝑖=1 𝑓(1/𝑥𝑖 )
6
= 21.91
0.8216
3.5 Geometric Mean (G.M.):
The geometric mean of a series containing n observations is the nth root
of the product of the values. If 𝑥1 , 𝑥2 , … , 𝑥𝑛 are observations then

G.M. = 𝑛√𝑥1 . 𝑥2 … . . 𝑥𝑛
1
(𝑥1 . 𝑥2 … . 𝑥𝑛 )𝑛
1
𝐿𝑜𝑔 𝐺. 𝑀. = (log 𝑥1 + log 𝑥2 + ⋯ + log 𝑥𝑛 )
𝑛
∑ log 𝑥𝑖
𝐿𝑜𝑔 𝐺. 𝑀. =
𝑛
∑ log 𝑥𝑖
𝐺. 𝑀. = 𝐴𝑛𝑡𝑖𝑙𝑜𝑔
𝑛
Example 7: Calculate the geometric mean (G.M.) of the following series of
monthly income of a batch of families 180, 250, 490, 1400, 1050.
Solution:
x Log x
180 2.2553

9
Basic Statistics

250 2.3979
490 2.6902
1400 3.1461
1050 3.0212
13.5107
∑ log 𝑥𝑖 13.5107
𝐺. 𝑀. = 𝐴𝑛𝑡𝑖𝑙𝑜𝑔 = 𝐴𝑛𝑡𝑖𝑙𝑜𝑔 = 𝐴𝑛𝑡𝑖𝑙𝑜𝑔 2.70 = 503.6
𝑛 5

Example 8: Calculate the average income per head from the data given below .Use
geometric mean.
Class of Number Monthly
people of income
families per head
(SR)
Landlords 2 5000
Cultivators 100 400
Landless – labours 50 200
Money – lenders 4 3750
Office Assistants 6 3000
Shop keepers 8 750
Carpenters 6 600
Weavers 10 300

Solution:

Class of Annua Numbe Log x f logx


people l r of
income families
( SR) (f)
X
Landlords 5000 2 3.699 7.398
0

10
Basic Statistics

Cultivator 400 100 2.602 260.21


s 1 0
Landless – 200 50 2.301 115.05
labours 0 0
Money – 3750 4 3.574 14.296
lenders 0
Office 3000 6 3.477 20.863
Assistants 1
Shop 750 8 2.875 23.200
keepers 1 8
Carpenter 600 6 2.778 16.669
s 2
Weavers 300 10 2.477 24.771
1
186 482.25
7

∑ fi log xi
G. M. = Antilog
𝑛
482.257
= 𝐴𝑛𝑡𝑖𝑙𝑜𝑔 = 𝐴𝑛𝑡𝑖𝑙𝑜𝑔(2.5928)
186
= 391.50
Combined Mean:
If the arithmetic averages and the number of items in two or more related
groups are known, the combined or the composite mean of the entire
group can be obtained by
𝑛1 𝑥̅1 + 𝑛2 𝑥̅ 2 + … + 𝑛𝑛 𝑥̅𝑛
𝐶𝑜𝑚𝑏𝑖𝑛𝑒𝑑 𝑀𝑒𝑎𝑛, 𝑋̅ =
𝑛1 + 𝑛2 + ⋯ + 𝑛𝑛

Example 9: Find the combined mean for the data given below:

11
Basic Statistics

𝑛1 = 20; 𝑥̅1 = 4; 𝑛2 = 30 𝑎𝑛𝑑 𝑥̅ 2 = 3


Solution:
𝑛1 𝑥̅1 + 𝑛2 𝑥̅ 2 4 × 20 + 3 × 30 80 + 90
𝐶𝑜𝑚𝑏𝑖𝑛𝑒𝑑 𝑀𝑒𝑎𝑛, 𝑋̅ = = =
𝑛1 + 𝑛2 20 + 30 50
170
= = 3.4
50

3.6 Positional Averages (Median and Mode):


These averages are based on the position of the given observation in a
series, arranged in an ascending or descending order. The magnitude or
the size of the values does matter as was in the case of arithmetic mean. It
is because of the basic difference that the median and mode are called the
positional measures of an average.
3.6.1 Median:
The median is the middle value of a distribution i.e., median of a
distribution is the value of the variable which divides it into two equal
parts. It is the value of the variable such that the number of observations
above it is equal to the number of observations below it.
Ungrouped or Raw data:
Arrange the given values in the increasing or decreasing order. If the
numbers of values are odd, median is the middle value. If the numbers of
values are even, median is the mean of middle two values.
By formula,
𝑁+1 th
Median, Md = ( ) item
2

When odd numbers of values are given:-


Example 10: Find median for the following data

12
Basic Statistics

25, 18, 27, 10, 8, 30, 42, 20, 53


Solution:
Arranging the data in the increasing order 8, 10, 18, 20, 25, 27, 30, 42, 53
Here, numbers of observations are odd (N= 9)
𝑁+1 𝑡ℎ 9+1 th
Hence, Median, Md = ( ) item = ( ) item= (5)th item
2 2

The middle value is the 5th item i.e., 25 is the median value.
When even numbers of values are given:-
Example 11: Find median for the following data
5, 8, 12, 30, 18, 10, 2, 22
Solution:
Arranging the data in the increasing order 2, 5, 8, 10, 12, 18, 22, 30
Here median is the mean of the middle two items (i.e) mean of (10, 12) i.
10+12
e., ( ) = 11
2

Example 12: The following table represents the marks obtained by a batch
of 10 students in certain class tests in statistics and Accountancy.
Serial No 1 2 3 4 5 6 7 8 9 10
Marks (Statistics) 53 55 52 32 30 60 47 46 35 28
Marks (Accountancy) 57 45 24 31 25 84 43 80 32 72
Solution: For such question, median is the most suitable measure of central tendency. The marks in
the two subjects are first arranged in increasing order as follows:

Serial No 1 2 3 4 5 6 7 8 9 10
Marks in Statistics 28 30 32 35 46 47 52 53 55 60
Marks in Accountancy 24 25 31 32 43 45 57 72 80 84
46 + 47
Median value for Statistics = (Mean of 5th and 6th items) = = ( ) = 46.5
2

13
Basic Statistics

Median value for Accountancy = (Mean of 5th and 6th items)


43 + 45
= ( ) = 44
2
Therefore, the level of knowledge in Statistics is higher than that in
Accountancy.
Grouped Data:
In a grouped distribution, values are associated with frequencies.
Grouping can be in the form of a discrete frequency distribution or a
continuous frequency distribution. Whatever may be the type of
distribution, cumulative frequencies have to be calculated to know the
total number of items.
Discrete Series:
𝐒𝐭𝐞𝐩𝟏: Find cumulative frequencies.
N+1
𝐒𝐭𝐞𝐩 𝟐: Find ( )
2
N+1
𝐒𝐭𝐞𝐩 𝟑: See in the cumulative frequencies the value just greater than ( )
2
𝐒𝐭𝐞𝐩𝟒: Then the corresponding value of x will be median.
Example 13: The following data are pertaining to the number of members
in a family. Find median size of the family.
Number of members x 1 2 3 4 5 6 7 8 9 10 11 12
Frequency F 1 3 5 6 10 13 9 5 3 2 2 1
Solution:

X f cf
1 1 1
2 3 4
3 5 9

14
Basic Statistics

4 6 15
5 10 25
6 13 38
7 9 47
8 5 52
9 3 55
10 2 57
11 2 59
12 1 60
N= 60

N+1 60 + 1
Median = Size of ( ) th item = Size of ( ) th item
2 2
= 30.5th item.
The cumulative frequency just greater than 30.5 is 38 and the value of x
corresponding to 38 is 6. Hence the median size is 6 members per family.
Note:
It is an appropriate method because a fractional value given by mean does
not indicate the average number of members in a family.

Continuous Series:
The steps given below are followed for the calculation of median in
continuous series.
Step1: Find cumulative frequencies.
N
Step 2: Find ( )
2

15
Basic Statistics

N
Step3: See in the cumulative frequency the value first greater than ( ) Then the co
2
class interval is called the Median Class. Then apply the formula for Median
𝑁
− 𝑐𝑓
2
𝑀𝑑 = 𝑙 + ×ℎ
𝐹
Where,
𝑙 = 𝑙𝑜𝑤𝑒𝑟 𝑙𝑖𝑚𝑖𝑡 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑒𝑑𝑖𝑎𝑛 𝑐𝑙𝑎𝑠𝑠
𝛴𝑓𝑖 = 𝑛 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑂𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛𝑠
𝑓 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑒𝑑𝑖𝑎𝑛 𝑐𝑙𝑎𝑠𝑠
ℎ = 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑒𝑑𝑖𝑎𝑛 𝑐𝑙𝑎𝑠𝑠 (𝑎𝑠𝑠𝑢𝑚𝑖𝑛𝑔 𝑐𝑙𝑎𝑠𝑠 𝑠𝑖𝑧𝑒 𝑡𝑜 𝑏𝑒 𝑒𝑞𝑢𝑎𝑙)
𝑐𝑓 = 𝑐𝑢𝑚𝑢𝑙𝑎𝑡𝑖𝑣𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑚𝑒𝑑𝑖𝑎𝑛 𝑐𝑙𝑎𝑠𝑠.
𝑁 = 𝑇𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦.
Note:
If the class intervals are given in inclusive type convert them into
exclusive type and call it as true class interval and consider lower limit
in this.
3.7 Quartiles:
The quartiles divide the distribution in four parts. There are three quartiles.
The second quartile divides the distribution into two halves and therefore
is the same as the median. The first (lower) quartile (Q1) marks off the first
one-fourth, the third (upper) quartile (Q3) marks off the three-fourth.
Raw or ungrouped data:
First arrange the given data in the increasing order and use the formula for
Q1 and Q3 then quartile deviation, Q.D. is given by
𝑄3 − 𝑄1
𝑄. 𝐷. =
2
16
Basic Statistics

N+1 N+1
where, Q1 = ( ) th item and Q3 = 3 ( ) th item
4 4
Example 14: Compute quartiles for the data given below 25,18, 30, 8, 15,
5, 10, 35, 40, 45
Solution:
5, 8, 10, 15, 18, 25, 30, 35, 40, 45
𝑁+1
𝑄1 = ( ) 𝑡ℎ 𝑖𝑡𝑒𝑚
4
10 + 1
= ( ) 𝑡ℎ 𝑖𝑡𝑒𝑚
4
= (2.75)𝑡ℎ 𝑖𝑡𝑒𝑚.
3
= 2𝑛𝑑 𝑖𝑡𝑒𝑚 + ( ) (3𝑟𝑑 𝑖𝑡𝑒𝑚 − 2𝑛𝑑 𝑖𝑡𝑒𝑚)
4
3
= 8 + ( ) (10 − 8)
4
3
= 8 + ( )×2
4
= 9.5
𝑁 + 1 𝑡ℎ
𝑄3 = 3 ( ) 𝑖𝑡𝑒𝑚
4
= 3(2.75)𝑡ℎ 𝑖𝑡𝑒𝑚.
= 8.25𝑡ℎ 𝑖𝑡𝑒𝑚
3
= 8𝑡ℎ 𝑖𝑡𝑒𝑚 + ( ) (9𝑡ℎ 𝑖𝑡𝑒𝑚 − 8𝑡ℎ 𝑖𝑡𝑒𝑚)
4
3
= 35 + ( ) (40 − 35)
4
= 35 + 1.25

17
Basic Statistics

= 36.25
Discrete Series:
𝐒𝐭𝐞𝐩 𝟏: Find cumulative frequencies
N+1
𝐒𝐭𝐞𝐩 𝟐: Find ( )
4
N+1
𝐒𝐭𝐞𝐩 𝟑: See in the cumulative frequencies, the value just greater than ( ) then
4
corresponding value of x is Q1
N+1
𝐒𝐭𝐞𝐩 𝟒: Find 3 ( )
4
N+1
B See in the cumulative frequencies, the value just greater than 3 ( ) then
4
the corresponding value of x is Q3 .

Example 15: Compute quartiles for the data given below.

X 5 8 12 15 19 24 30
f 4 3 2 4 5 2 4

Solution:

x f c.f
5 4 4
8 3 7
12 2 9
15 4 13

18
Basic Statistics

19 5 18
24 2 20
30 4 24
Total 24

𝑁 + 1 𝑡ℎ
𝑄1 = ( ) 𝑖𝑡𝑒𝑚
4
24 + 1 𝑡ℎ
= ( ) 𝑖𝑡𝑒𝑚
4
25 𝑡ℎ
= ( ) 𝑖𝑡𝑒𝑚
4
= 6.25𝑡ℎ = 8
𝑁 + 1 𝑡ℎ
𝑄3 = 3 ( ) 𝑖𝑡𝑒𝑚
4
= (3 × 6.25)𝑡ℎ 𝑖𝑡𝑒𝑚
= 18.75𝑡ℎ 𝑖𝑡𝑒𝑚 = 24
Continuous Series:
Step 1: Find cumulative frequencies;
N
𝐒𝐭𝐞𝐩 𝟐: Find ( )
4
𝑁
Step 3: See in the cumulative frequencies, the value just greater( ), then
4
the corresponding class interval is called first quartile class.
𝑁
Step 4: Find 3 ( ), See in the cumulative frequencies the value just greater
4
𝑁
than 3 ( ) then the corresponding class interval is called 3rd quartile
4
class. Then apply the respective formulae

19
Basic Statistics

𝑁
− 𝑐𝑓1
4
𝑄1 = 𝑙1 + × ℎ1
𝑓1
𝑁
3 ( ) − 𝑐𝑓3
4
𝑄3 = 𝑙3 + × ℎ3
𝑓3
𝑤ℎ𝑒𝑟𝑒 𝑙1 = 𝑙𝑜𝑤𝑒𝑟 𝑙𝑖𝑚𝑖𝑡 𝑜𝑓 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
𝑓1 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
ℎ1 = 𝑤𝑖𝑑𝑡ℎ 𝑜𝑓 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
𝑐𝑓1 = 𝑐. 𝑓. 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
𝑙3 = 1𝑜𝑤𝑒𝑟 𝑙𝑖𝑚𝑖𝑡 𝑜𝑓 𝑡ℎ𝑒 3𝑟𝑑 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
𝑓3 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 3𝑟𝑑 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
ℎ3 = 𝑤𝑖𝑑𝑡ℎ 𝑜𝑓 𝑡ℎ𝑒 3𝑟𝑑 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
𝑐𝑓3 = 𝑐. 𝑓. 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 3𝑟𝑑 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝑐𝑙𝑎𝑠𝑠
3.8 Deciles:
These are the values, which divide the total number of observation into 10
equal parts. These are 9 deciles D1, D2…D9. These are all called first decile,
second decile…etc.
Deciles for Raw data or ungrouped data
Example 16: Compute D5 for the data given below 5, 24, 36, 12, 20, 8.
Solution: Arranging the given values in the increasing order 5, 8, 12, 20, 24,
36

𝑁 + 1 𝑡ℎ
𝐷5 = 5 ( ) 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛
10
6 + 1 𝑡ℎ
= 5( ) 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛
10

20
Basic Statistics

= (3.5)𝑡ℎ 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛
1
= 3𝑟𝑑 𝑖𝑡𝑒𝑚 + ( ) [ 4𝑡ℎ 𝑖𝑡𝑒𝑚 – 3𝑟𝑑 𝑖𝑡𝑒𝑚]
2
1
= 12 + ( ) [ 20 – 12]
2
= 16.
Deciles for Grouped data:
Same as quartile.

3.9 Percentiles:
The percentile values divide the distribution into 100 parts each containing
1 percent of the cases. The percentile (Pk) is that value of the variable up
to which lie exactly k% of the total number of observations.
Relationship:
𝑃25 = 𝑄1 ; 𝑃50 = 𝐷5 = 𝑄2 = 𝑀𝑒𝑑𝑖𝑎𝑛 𝑎𝑛𝑑 𝑃75 = 𝑄3

Percentile for Raw Data or Ungrouped Data:


Example 17: Calculate P15 for the data given below: 5, 24, 36 , 12 , 20 , 8.
Solution: Arranging the given values in the increasing order. 5, 8, 12, 20,
24, 36

𝑁 + 1 𝑡ℎ
𝑃15 = 15 ( ) 𝑖𝑡𝑒𝑚
100
6 + 1 𝑡ℎ
= 15 ( ) 𝑖𝑡𝑒𝑚
100
= (1.05)𝑡ℎ 𝑖𝑡𝑒𝑚

21
Basic Statistics

= 1𝑠𝑡 𝑖𝑡𝑒𝑚 + 0.5(2𝑛𝑑 𝑖𝑡𝑒𝑚 – 1𝑠𝑡 𝑖𝑡𝑒𝑚)


= 5 + 0.5(8 − 5)
= 5.15
Percentile for Grouped Data:
Example 18: Find P53 for the following frequency distribution.
Class Interval 0-5 5-10 10-15 15-20 20-25 25- 30-35 35-40
30
Frequency 5 8 12 16 20 10 4 3
Solution:

Class Interval Frequency cf


0-5 5 5
5-10 8 13
10-15 12 25
15-20 16 41
20-25 20 61
25-30 10 71
30-35 4 75
35-40 3 78
Total 78
53𝑁
− 𝑐𝑓 41.34 − 41
100
𝑃53 = 𝑙 + × ℎ = 20 + × 5 = 20.085
𝑓 20
3.10 Mode:
The mode or modal value of a distribution is that value of the variable for
which the frequency is the maximum. It refers to that value in a distribution
which occurs most frequently. It shows the center of concentration of the
frequency in around a given value. Therefore, where the purpose is to

22
Basic Statistics

know the point of the highest concentration it is preferred. It is, thus, a


positional measure.
Its importance is very great in marketing studies where a manager is
interested in knowing about the size, which has the highest concentration
of items. For example, in placing an order for shoes or ready-made
garments the modal size helps because these sizes and other sizes around
in common demand.
Computation of the mode:
Ungrouped or Raw Data:
For ungrouped data or a series of individual observations, mode is often
found by mere inspection.
Example 19: 2, 7, 10, 15, 10, 17, 8, 10, 2
𝑀𝑜𝑑𝑒 = 𝑀0 = 10
In some cases the mode may be absent while in some cases there may be
more than one mode
Grouped Data:
For Discrete distribution, see the highest frequency and corresponding
value of X is mode.
Continuous distribution:
See the highest frequency then the corresponding value of class interval is
called the modal class. Then apply the following formula:
𝑓 − 𝑓1
𝑀𝑜𝑑𝑒, 𝑀𝑜 = 𝑙 + ×ℎ
2𝑓 − 𝑓1 − 𝑓2
𝑊ℎ𝑒𝑟𝑒, 𝑙 = 𝑙𝑜𝑤𝑒𝑟 𝑙𝑖𝑚𝑖𝑡 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑜𝑑𝑎𝑙 𝑐𝑙𝑎𝑠𝑠
𝑓 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑜𝑑𝑎𝑙 𝑐𝑙𝑎𝑠𝑠

23
Basic Statistics

𝑓1 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑚𝑜𝑑𝑎𝑙 𝑐𝑙𝑎𝑠𝑠


𝑓2 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑓𝑜𝑙𝑙𝑜𝑤𝑖𝑛𝑔 𝑡ℎ𝑒 𝑚𝑜𝑑𝑎𝑙 𝑐𝑙𝑎𝑠𝑠.
ℎ = 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑚𝑜𝑑𝑎𝑙 𝑐𝑙𝑎𝑠𝑠
Remarks:
If (2𝑓1 − 𝑓0 − 𝑓2 ) comes out to be zero, then mode is obtained by the
following formula taking absolute differences within vertical lines;
𝑓 − 𝑓1
𝑀𝑜𝑑𝑒, 𝑀0 = ×h
|𝑓 − 𝑓1 | + |𝑓 – 𝑓2 |
If mode lies in the first class interval, then f is taken as zero.
The computation of mode poses no problem in distributions with open-
end classes, unless the modal value lies in the open-end class.
Example 20: Calculate mode for the following:
CI f
0-50 5
50-100 14
100-150 40
150-200 91
200-250 150
250-300 87
300-350 60
350-400 38
400 and above 15

Solution:
The highest frequency is 150 and corresponding class interval is 200 – 250,
which is the modal class.

24
Basic Statistics

𝐻𝑒𝑟𝑒, 𝑙 = 200; 𝑓 = 150; 𝑓1 = 91; 𝑓2 = 87 𝑎𝑛𝑑 ℎ = 50


𝑓 − 𝑓1
𝑀𝑜𝑑𝑒, 𝑀𝑜 = 𝑙 + ×ℎ
2𝑓 − 𝑓1 − 𝑓2
150−91
= 200 + × 50
2×150−91−87

= 200 + 24.18
= 224.18
Determination of Modal class:
For a frequency distribution modal class corresponds to the maximum
frequency. But in any one (or more) of the following cases-
If the maximum frequency is repeated
If the maximum frequency occurs in the beginning or at the end of the
distribution
If there are irregularities in the distribution, the modal class is determined
by the method of grouping.
Steps for Calculation:
We prepare a grouping table with 6 columns
1) In column I, we write down the given frequencies;
2) Column II is obtained by combining the frequencies two by two;
3) Leave the 1st frequency and combine the remaining frequencies two
by two and write in column III;
4) Column IV is obtained by combining the frequencies three by three;
5) Leave the 1st frequency and combine the remaining frequencies
three by three and write in column V;
6) Leave the 1st and 2nd frequencies and combine the remaining
frequencies three by three and write in column VI.

25
Basic Statistics

Mark the highest frequency in each column. Then form an analysis table to
find the modal class. After finding the modal class, use the formula to
calculate the modal value.

3.11 Empirical Relationship between Averages


In a symmetrical distribution the three simple averages mean = median =
mode. For a moderately asymmetrical distribution, the relationship
between them are brought by Prof. Karl Pearson as
𝑀𝑜𝑑𝑒 = 3 𝑀𝑒𝑑𝑖𝑎𝑛 − 2 𝑀𝑒𝑎𝑛
Example 21: If the mean and median of a moderately asymmetrical series
are 26.8 and 27.9 respectively, what would be its most probable mode?
Solution:
Using the empirical formula
𝑀𝑜𝑑𝑒 = 3 𝑚𝑒𝑑𝑖𝑎𝑛 − 2 𝑚𝑒𝑎𝑛
= 3 × 27.9 − 2 × 26.8 = 30.1

26
Basic Statistics

Lesson 3
Measures of Dispersion

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 3 Measures Of Dispersion


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Lesson-3
Objectives of the lesson:
1. Characteristics of good measure of Dispersion
2. Various absolute and relative measures of Dispersion
3. Mean deviation, Standard deviation and Coefficient of
variation
Glossary of Terms: Dispersion, Range, Quartile Deviation, Mean Deviation,
Standard Deviation, Coefficient of Variation etc.
4.1 Introduction:
The measures of central tendency serve to locate the center of the
distribution, but they do not reveal how the items are spread out on either
side of the center. This characteristic of a frequency distribution is commonly
referred to as dispersion. In a series all the items are not equal. There is
difference or variation among the values. The degree of variation is evaluated
by various measures of dispersion. Small dispersion indicates high uniformity
of the items, while large dispersion indicates less uniformity. For example
consider the following marks of two students.

Student I Student II
68 85
75 90
65 80
67 25
70 65

2
Basic Statistics

Both have got a total of 345 and an average of 69 each. The fact is that the
second student has failed in one paper. When the averages alone are
considered, the two students are equal. But first student has less variation
than second student. Less variation is a desirable characteristic.
4.2 Characteristics of a good measure of dispersion:
An ideal measure of dispersion is expected to possess the following properties
1. It should be rigidly defined
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself for algebraic manipulation.
5. It should be simple to understand and easy to calculate

4.3 Absolute and Relative Measures:


There are two kinds of measures of dispersion, namely
1) Absolute measure of dispersion and
2) Relative measure of dispersion.
Absolute measure of dispersion indicates the amount of variation in a set of
values in terms of units of observations. For example, when rainfalls on
different days are available in mm, any absolute measure of dispersion gives
the variation in rainfall in mm. On the other hand relative measures of
dispersion are free from the units of measurements of the observations. They
are pure numbers. They are used to compare the variation in two or more
sets, which are having different units of measurements of observations.
The various absolute and relative measures of dispersion are listed below.

3
Basic Statistics

Absolute measure Relative measure


Range Co-efficient of Range
Quartile deviation Co-efficient of Quartile deviation
Mean deviation Co-efficient of Mean deviation
Standard deviation Co-efficient of variation

4.3.1 Range and coefficient of Range:


Range:
This is the simplest possible measure of dispersion and is defined as the
difference between the largest and smallest values of the variable.
In symbols, 𝑅𝑎𝑛𝑔𝑒 = 𝐿 – 𝑆.
Where L = Largest value. S = Smallest value.
In individual observations and discrete series, L and S are easily identified. In
continuous series, the following two methods are followed,
Method 1:
L = Upper boundary of the highest class
S = Lower boundary of the lowest class
Method 2:
L = Mid value of the highest class.
S = Mid value of the lowest class.
Co-efficient of Range:
𝐿−𝑆
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑅𝑎𝑛𝑔𝑒 =
𝐿+𝑆

4
Basic Statistics

Example 1: Find the value of range and its co-efficient for the following data.
7, 9, 6, 8, 11, 10, 4
Solution:
𝐿 = 11, 𝑆 = 4.
𝑅𝑎𝑛𝑔𝑒 = 𝐿 – 𝑆
= 11 − 4 = 7
L−S 11 − 4 7
Co − efficient of Range = = = = 0.4667
L+S 11 + 4 15
Example 2: Calculate range and its co efficient from the following
distribution.
Size: 60- 63 63- 66 66- 69 69- 72 72- 75
Number: 5 18 42 27 8

Solution:
𝐿 = 𝑈𝑝𝑝𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑐𝑙𝑎𝑠𝑠 = 75
𝑆 = 𝐿𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑙𝑜𝑤𝑒𝑠𝑡 𝑐𝑙𝑎𝑠𝑠 = 60
𝑅𝑎𝑛𝑔𝑒 = 𝐿 – 𝑆 = 75 – 60 = 15
𝐿−𝑆 75 − 60 15
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑅𝑎𝑛𝑔𝑒 = = = = 0.1111
𝐿+𝑆 75 + 60 135
4.3.2 Quartile Deviation and Co efficient of Quartile Deviation:
Quartile Deviation (Q.D.):
Definition: Quartile Deviation is half of the difference between the first and
third quartiles. Hence, it is called Semi Inter Quartile Range.
𝑄3 − 𝑄1
𝐼𝑛 𝑠𝑦𝑚𝑏𝑜𝑙, 𝑄. 𝐷. =
2

5
Basic Statistics

Among the quartiles Q1, Q2 and Q3, the range Q3- Q1 is called inter quartile
𝑄3 − 𝑄1
range and , semi inter quartile range.
2

Co-efficient of Quartile Deviation:


𝑄3 − 𝑄1
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑄𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 =
𝑄3 + 𝑄1
Example 3: Find the Quartile Deviation for the following data:
391, 384, 591, 407, 672, 522, 777, 733, 1490, 2488
Solution: Arrange the given values in ascending order.
384, 391, 407, 522, 591, 672, 733, 777, 1490, 2488
𝑁+1 10 + 1
𝑃𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑜𝑓 𝑄1 𝑖𝑠 = = 2.75𝑡ℎ 𝑖𝑡𝑒𝑚
4 4
𝑄1 = 2𝑛𝑑 𝑣𝑎𝑙𝑢𝑒 + 0.75 (3𝑟𝑑 𝑣𝑎𝑙𝑢𝑒 – 2𝑛𝑑 𝑣𝑎𝑙𝑢𝑒)
= 391 + 0.75 (407 – 391)
= 391 + 0.75 × 16
= 391 + 12
= 403
𝑁+1
𝑃𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑜𝑓 𝑄3 𝑖𝑠 3 ( ) = 3 × 2.75 = 8.25𝑡ℎ 𝑖𝑡𝑒𝑚
4
𝑄3 = 8𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 + 0.25 (9𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 – 8𝑡ℎ 𝑣𝑎𝑙𝑢𝑒)
= 777 + 0.25 (1490 – 777)
= 777 + 0.25 (713)
= 777 + 178.25 = 955.25

6
Basic Statistics

𝑄3 − 𝑄1 955.25 − 403 552.25


𝑄. 𝐷. = = = = 276.125
2 2 2
Example 4: Weekly wages of labors are given below. Calculate Q.D. and
Coefficient of Q.D.

Weekly Wage 100 200 400 500 600


(Rs.)
No. of Weeks 5 8 21 12 6
Solution:

Weekly No. of Cum.


Wage Weeks No. of
(Rs.) Weeks
100 5 5
200 8 13
400 21 34
500 12 46
600 6 52
Total N=52

𝑁+1 52 + 1
𝑃𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑜𝑓 𝑄1 𝑖𝑠 = = 13.25𝑡ℎ 𝑖𝑡𝑒𝑚
4 4
𝑄1 = 13𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 + 0.25 (14𝑡ℎ 𝑉𝑎𝑙𝑢𝑒 – 13𝑡ℎ 𝑣𝑎𝑙𝑢𝑒)
= 13𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 + 0.25 (400 – 200)
= 200 + 0.25 (400 – 200)
= 200 + 0.25 (200)
= 200 + 50 = 250

7
Basic Statistics

𝑁+1
𝑃𝑜𝑠𝑖𝑡𝑖𝑜𝑛 𝑜𝑓 𝑄3 𝑖𝑠 3 ( ) = 3 × 13.25 = 39.25𝑡ℎ 𝑖𝑡𝑒𝑚
4
𝑄3 = 39𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 + 0.75 (40𝑡ℎ 𝑣𝑎𝑙𝑢𝑒 – 39𝑡ℎ 𝑣𝑎𝑙𝑢𝑒)
= 500 + 0.75 (500 – 500)
= 500 + 0.75 × 0
= 500
𝑄3 − 𝑄1 500 − 250 250
𝑄. 𝐷. = = = = 125
2 2 2
𝑄3 − 𝑄1 500 − 250
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑄𝑢𝑎𝑟𝑡𝑖𝑙𝑒 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 = =
𝑄3 + 𝑄1 500 + 250
250
= = 0.33
750
Example 5: For the date given below, give the quartile deviation and
coefficient of quartile deviation.
X 351 – 501 651 801– 951–
: 500 – – 950 1100
650 800
f 48 189 88 4 28
:
Solution:

x F True Cumulative
class frequency
Intervals
351- 48 350.5- 48
500 500.5
501- 189 500.5- 237
650 650.5

8
Basic Statistics

651- 88 650.5- 325


800 800.5
801- 47 800.5- 372
950 950.5
951- 28 950.5- 400
1100 1100.5
Total N = 400
𝑁
𝑆𝑖𝑛𝑐𝑒, = 100
4
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒, 𝑄1 𝐶𝑙𝑎𝑠𝑠 𝑖𝑠 500.5 − 650.5
𝑛
𝐻𝑒𝑛𝑐𝑒, 𝑙1 = 500.5; = 100; 𝑐𝑓1 = 48; 𝑓1 = 189; ℎ1 = 150
4
𝑁
− 𝑐𝑓1
4
𝑄1 = 𝑙1 + × ℎ1
𝑓1
100 − 48
= 500.5 + × 150 = 541.77
189
𝑁𝑜𝑤, 𝑓𝑜𝑟 𝑄3
𝑁
3 ( ) = 3 × 100 = 300
4
𝐻𝑒𝑛𝑐𝑒, 𝑄3 𝐶𝑙𝑎𝑠𝑠 𝑖𝑠 650.5 − 800.5
𝑁
𝑙3 = 650.5; 3 ( ) = 300; 𝑐𝑓3 = 237; 𝑓3 = 88; ℎ3 = 150
4
𝑁
3 ( ) − 𝑐𝑓3
4
𝑄3 = 𝑙3 + × ℎ3
𝑓3
300 − 237
= 650.5 + × 150 = 757.89
88
𝑄3 − 𝑄1 757.89 − 541.77 216.12
𝑄. 𝐷. = = = = 108.06
2 2 2
Q3 − Q1
Co − efficient of Quartile Deviation =
Q3 + Q1
9
Basic Statistics

757.89 − 541.77 216.12


= =
757.89 + 541.77 1299.66
= 0.1663
4.3.3 Mean Deviation and Coefficient of Mean Deviation:
Mean Deviation: The range and quartile deviation are not based on all
observations. They are positional measures of dispersion. They do not show
any scatter of the observations from an average. The mean deviation is
measure of dispersion based on all items in a distribution.
Definition:
Mean deviation is the arithmetic mean of the deviations of a series computed
from any measure of central tendency; i.e., the mean, median or mode, all
the deviations are taken as positive i.e., signs are ignored.
We usually compute mean deviation about any one of the three averages
mean, median or mode. Sometimes mode may be ill defined and as such
mean deviation is computed from mean and median. Median is preferred as
a choice between mean and median. But in general practice and due to wide
applications of mean, the mean deviation is generally computed from mean.
M.D can be used to denote mean deviation.
Coefficient of mean deviation:
Mean deviation calculated by any measure of central tendency is an absolute
measure. For the purpose of comparing variation among different series, a
relative mean deviation is required. The relative mean deviation is obtained
by dividing the mean deviation by the average used for calculating mean
deviation.
Mean Deviation (M. D. )
Coefficient of Mean Deviation =
Mean or Median or Mode

10
Basic Statistics

If the result is desired in percentage,


Mean Deviation (M. D. )
The coefficient of mean deviation = × 100
Mean or Median or Mode
Computation of mean deviation – Individual Series:
1) Calculate the average mean, median or mode of the series.
2) Take the deviations of items from average ignoring signs and denote
these deviations by |D|.
3) Compute the total of these deviations, i.e., Σ |D|
4) Divide this total obtained by the number of items.
Symbolically,
∑|𝐷|
𝑀. 𝐷. =
𝑛
Example 6: Calculate mean deviation from mean and median for the
following data:
100, 150, 200, 250, 360, 490, 500, 600, and 671.
Also calculate co- efficient of M.D.
Solution:
∑𝑋
𝑀𝑒𝑎𝑛 =
𝑛
100 + 150 + 200 + 250 + 360 + 490 + 500 + 600 + 671
=
9
3321
= = 369
9
Now arrange the data in ascending order
100, 150, 200, 250, 360, 490, 500, 600, 671

11
Basic Statistics

N + 1 th 9 + 1 th
Median, Md = Value of ( ) item = ( ) item = 5th item
2 2
= 360
X 𝑫 |𝑫|
= |𝒙 = |𝒙 − 𝑴𝒅 |
− 𝑴𝒆𝒂𝒏|
100 269 260
150 219 210
200 169 160
250 119 110
360 9 0
490 121 130
500 131 140
600 231 240
671 302 311
3321 1570 1561
∑|𝐷| 1570
𝑀. 𝐷. 𝑓𝑟𝑜𝑚 𝑚𝑒𝑎𝑛 = = = 174.44
𝑛 9
𝑀𝑒𝑎𝑛 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 (𝑀. 𝐷. ) 174.44
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑀. 𝐷. = = = 0.47
𝑀𝑒𝑎𝑛 369
∑|𝐷| 1561
𝑀. 𝐷. 𝑓𝑟𝑜𝑚 𝑚𝑒𝑑𝑖𝑎𝑛 = = = 173.44
𝑛 9
𝑀𝑒𝑎𝑛 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 (𝑀. 𝐷. ) 173.44
𝐶𝑜 − 𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑀. 𝐷. = = = 0.48
𝑀𝑒𝑑𝑖𝑎𝑛 360
Mean Deviation- Discrete Series:
∑ 𝑓 |𝐷|
𝑀. 𝐷. =
𝑛
Example:7
Compute Mean deviation from mean and median from the following data:

12
Basic Statistics

Height in cms 158 159 160 161 162 163 164 165 166
No. of persons 15 20 32 35 33 22 20 10 8
Also compute coefficient of mean deviation.

Solution:
Height No. of 𝒅 𝒇𝒅 |𝑫| = |𝑿 − 𝒎𝒆𝒂𝒏| 𝒇|𝑫|
X persons = 𝒙
f − 𝑨,
𝑨
= 𝟏𝟔𝟐
158 15 -4 - 3.51 52.65
60
159 20 -3 - 2.51 50.20
60
160 32 -2 - 1.51 48.32
64
161 35 -1 - 0.51 17.85
35
162 33 0 0 0.49 16.17
163 22 1 22 1.49 32.78
164 20 2 40 2.49 49.80
165 10 3 30 3.49 34.90
166 8 4 32 4.49 35.92
195 - 338.59
95
∑𝑓 𝑑 −95
𝑀𝑒𝑎𝑛 = 𝐴 + = 162 + = 162 – 0.49 = 161.51
𝑁 195
∑ 𝑓 |𝐷| 338.59
𝑀. 𝐷. = = = 1.74
𝑛 195
𝑀𝐷 1.74
𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑀. 𝐷. = = = 0.0108
𝑀𝑒𝑎𝑛 161.51

13
Basic Statistics

Height No. of c.f. |𝑋 − 𝑀𝑒𝑑𝑖𝑎𝑛| 𝑓 |𝐷|


x persons
(f)
158 15 15 3 45
159 20 35 2 40
160 32 67 1 32
161 35 102 0 0
162 33 135 1 33
163 22 157 2 44
164 20 177 3 60
165 10 187 4 40
166 8 195 5 40
195 334

𝑁 + 1 𝑡ℎ 195 + 1 𝑡ℎ
𝑀𝑒𝑑𝑖𝑎𝑛 = 𝑆𝑖𝑧𝑒 𝑜𝑓 ( ) 𝑖𝑡𝑒𝑚 = ( ) 𝑖𝑡𝑒𝑚 = (96)𝑡ℎ 𝑖𝑡𝑒𝑚 = 161
2 2
∑ 𝑓 |𝐷| 334
𝑀. 𝐷. = = = 1.71
𝑛 195
𝑀𝐷 1.71
𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑀. 𝐷. = = = 0.0106
𝑀𝑒𝑑𝑖𝑎𝑛 161
Mean Deviation-Continuous Series:
The method of calculating mean deviation in a continuous series same as the discrete series.
In continuous series we have to find out the mid points of the various classes and take deviation
of these points from the average selected. Thus
∑ 𝑓 |𝐷 |
𝑀. 𝐷. =
𝑛
Example 8: Find out the mean deviation from mean and median from the following series.

Age in years No of persons


0-10 20
10-20 25

14
Basic Statistics

20-30 32
30-40 40
40-50 42
50-60 35
60-70 10
70-80 8
Also compute co-efficient of mean deviation.

Solution:

𝑚−𝐴
𝑑= ,
𝐴 = 35,𝑐𝑐
𝐷
X M F = 10 fd 𝑓|𝐷|
= |𝑚 − 𝑥̅ |

0- 5 20 -3 -60 31.5 630.0


10
10- 15 25 -2 -50 21.5 537.5
20
20- 25 32 -1 -32 11.5 368.0
30
30- 35 40 0 0 1.5 60.0
40
40- 45 42 1 42 8.5 357.0
50
50- 55 35 2 70 18.5 647.5
60
60- 65 10 3 30 28.5 285.0
70
70- 75 8 4 32 38.5 308.0

15
Basic Statistics

80
212 32 3193.0

∑𝑓𝑑
𝑀𝑒𝑎𝑛 = 𝐴 + ×𝐶
𝑁
32
= 35 + × 10 = 36.5
212
∑ 𝑓 |𝐷| 3193
𝑀. 𝐷. = = = 15.06
𝑛 212
Calculation of median and M.D. from median:
X M f Cf |𝐷 | f|D|
= |𝑚
− 𝑀𝑑 |
0-10 5 20 20 32.25 645.00
10-20 15 25 45 22.25 556.25
20-30 25 32 77 12.25 392.00
30-40 35 40 117 2.25 90.00
40-50 45 42 159 7.75 325.50
50-60 55 35 194 17.75 621.25
60-70 65 10 204 27.75 277.50
70-80 75 8 212 37.75 302.00
Total N=212 3209.50

𝑁 212
= = 106
2 2
𝑁
𝑙 = 30; = 106; 𝑐𝑓 = 77; 𝑓 = 40; ℎ = 10
2
𝑁
− 𝑐𝑓 106 − 77
2
𝑀𝑒𝑑𝑖𝑎𝑛, 𝑀𝑑 = 𝑙 + × ℎ = 30 + × 10 = 37.25
𝑓 40

16
Basic Statistics

∑ 𝑓 |𝐷| 3209.50
𝑀. 𝐷. = = = 15.14
𝑛 212
𝑀𝐷 15.14
𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑀. 𝐷. = = = 0.41
𝑀𝑒𝑑𝑖𝑎𝑛 37.25
4.3.3.1 Merits and Demerits of M.D:
Merits:
1. It is simple to understand and easy to compute.
2. It is rigidly defined.
3. It is based on all items of the series.
4. It is not much affected by the fluctuations of sampling.
5. It is less affected by the extreme items.
6. It is flexible, because it can be calculated from any average.
7. It is better measure of comparison.
4.3.3.2 Demerits:
1. It is not a very accurate measure of dispersion.
2. It is not suitable for further mathematical calculation.
3. It is rarely used. It is not as popular as standard deviation.
4. Algebraic positive and negative signs are ignored. It is mathematically unsound and
illogical.

4.3.4 Standard Deviation and Coefficient of variation:


Standard Deviation:
Karl Pearson introduced the concept of standard deviation in 1893. It is the most important
measure of dispersion and is widely used in many statistical formulae. Standard deviation is also
called Root-Mean Square Deviation. The reason is that it is the square–root of the mean of
the squared deviation from the arithmetic mean. It provides accurate result. Square of standard
deviation is called Variance.
Definition:
It is defined as the positive square-root of the arithmetic mean of the Square of the deviations
of the given observation from their arithmetic mean. The standard deviation is denoted by the

17
Basic Statistics

Greek letter  (sigma).


Calculation of Standard deviation-Individual Series:
There are two methods of calculating Standard deviation in an individual series.
a) Deviations taken from Actual mean; and
b) Deviation taken from Assumed mean
(a) Deviation taken from Actual mean:
This method is adopted when the mean is a whole number.

Steps:

1. Find out the actual mean of the series ( x̅ )


2. Find out the deviation of each value from the mean 𝑋 = (𝑥 − 𝑥̅ )
3. Square the deviations and take the total of squared deviations x2
Σ𝑋 2
4. Divide the total (Σ𝑋 2 ) by the number of observation, 𝑁
Σ𝑋 2
5. The square root of , is standard deviation.
𝑁

∑𝑥 2 ∑(x−x̅)2
Thus, Standard Deviation (SD or σ2 ) = √ = √
𝑁 𝑁

(b) Deviations taken from assumed mean:


This method is adopted when the arithmetic mean is fractional value. Taking deviations from fractional
value would be a very difficult and tedious task. To save time and labour, we apply short– cut method;
deviations are taken from an assumed mean. The formula is:

∑𝑑2 ∑𝑑 2
𝑆𝐷 𝑜𝑟 𝜎 = √ −( )
𝑁 𝑁

Where d-stands for the deviation from assumed mean = (𝑥 − 𝐴)

Steps:

Assume any one of the item in the series as an average (A)

1. Find out the deviations from the assumed mean; i.e., X-A denoted by d and also the total of the
deviations Σd

18
Basic Statistics

2. Square the deviations; i.e., 𝑑2 and add up the squares of deviations, i.e, ∑𝑑2
3. Then substitute the values in the following formula:

∑𝑑2 ∑𝑑 2
𝑆𝐷 𝑜𝑟 𝜎 = √ −( )
𝑁 𝑁

Example 9: Calculate the standard deviation from the following data. 14, 22, 9, 15, 20, 17, 12, 11

Solution:

Deviations from actual mean.

Value (x) d = x − x̅ (x − x̅)2


14 -1 1
22 7 49
9 -6 36
15 0 0
20 5 25
17 2 4
12 -3 9
11 -4 16
120 140
∑𝑥 120
𝑀𝑒𝑎𝑛, 𝑥̅ = = = 15
𝑁 8

∑𝑥 2 (𝑥 − 𝑥̅ )2 140
𝑇ℎ𝑢𝑠, 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 (𝑆𝐷 𝑜𝑟 𝜎) = √ =√ =√ = √17.5 = 4.18
𝑁 𝑁 8

Example 10:The table below gives the marks obtained by 10 students in statistics. Calculate standard
deviation.

Student Nos 1 2 3 4 5 6 7 8 9 10
Marks 43 48 65 57 31 60 37 48 78 59
Solution: (Deviations from assumed mean)

19
Basic Statistics

Nos. Marks 𝑑 = 𝑋 − 𝐴, (𝐴 𝑑2
(x) = 57)
1 43 -14 196
2 48 -9 81
3 65 8 64
4 57 0 0
5 31 -26 676
6 60 3 9
7 37 -20 400
8 48 -9 81
9 78 21 441
10 59 2 4
n= d=-44 d2
10 =1952

∑𝑑2 ∑𝑑 2 195.2 44 2
𝑆𝐷 𝑜𝑟 𝜎 = √ −( ) = √ − ( ) = √195.2 − 19.36 = √175.84 = 13.26
𝑁 𝑁 10 10

Calculation of standard deviation:

Discrete Series:

There are three methods for calculating standard deviation in discrete series:

a) Actual mean methods: If the actual mean in fractions, the calculation takes lot of time and labour;
and as such this method is rarely used in practice.
b) Assumed mean method: Here deviations are taken not from an actual mean but from an assumed
mean. Also this method is used, if the given variable values are not in equal intervals.
c) Step-deviation method: If the variable values are in equal intervals, then we adopt this method.
Example 11:Calculate Standard deviation from the following data.

X: 20 22 25 31 35 40 42 45

20
Basic Statistics

f 5 12 15 20 25 14 10 6
Solution:

Deviations from assumed mean

X F 𝑑 𝑑2 fd 𝑓𝑑 2
= 𝑥 − 𝐴,
(𝐴
= 31)
20 5 -11 121 -55 605
22 12 -9 81 -108 972
25 15 -6 36 -90 540
31 20 0 0 0 0
35 25 4 16 100 400
40 14 9 81 126 1134
42 10 11 121 110 1210
45 6 14 196 84 1176
N=107 fd=167 fd2=
6037

∑𝑓𝑑2 ∑𝑑 2 6037 167 2


𝑆𝐷 𝑜𝑟 𝜎 = √ −( ) = √ −( ) = √56.42 − 2.44 = √53.98 = 7.35
∑𝑓 ∑𝑓 107 107

Calculation of Standard Deviation –Continuous Series:

In the continuous series the method of calculating standard deviation is almost the same as in a discrete
series. But in a continuous series, mid-values of the class intervals are to be found out. The step- deviation
method is widely used.

Coefficient of Variation:

The Standard deviation is an absolute measure of dispersion. It is expressed in terms of units in which the
original figures are collected and stated. The standard deviation of heights of students cannot be
compared with the standard deviation of weights of students, as both are expressed in different units, i.e
heights in centimeter and weights in kilograms. Therefore the standard deviation must be converted into

21
Basic Statistics

a relative measure of dispersion for the purpose of comparison. The relative measure is known as the
coefficient of variation.

The coefficient of variation is obtained by dividing the standard deviation by the mean and multiplies it
by 100. Symbolically,

𝑆𝐷 𝜎
𝐶𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 𝑜𝑓 𝑉𝑎𝑟𝑖𝑎𝑡𝑖𝑜𝑛 (𝐶𝑉) = × 100 = × 100
𝑀𝑒𝑎𝑛 𝑋

If we want to compare the variability of two or more series, we can use C.V. The series or groups of data
for which the C.V. is greater indicate that the group is more variable, less stable, less uniform, less
consistent or less homogeneous. If the C.V. is less, it indicates that the group is less variable, more stable,
more uniform, more consistent or more homogeneous.

Example 12: In two factories A and B located in the same industrial area, the average weekly wages (in
SR) and the standard deviations are as follows:

Factory Average Standard Deviation No. of


(x) (σ) workers
A 34.5 5 476
B 28.25 4.5 524

Which factory A or B pays out a larger amount as weekly wages?

Which factory A or B has greater variability in individual wages?

Solution:

𝐺𝑖𝑣𝑒𝑛 𝑁1 = 476; 𝑋̅1 = 34.5 𝑎𝑛𝑑 𝜎1 = 5

𝑁2 = 524, 𝑋̅2 = 28.5, 𝜎2 = 4.5

1. Total wages paid by factory A


= 34.5 × 476
= SR16.422

22
Basic Statistics

Total wages paid by factory B


= 28.5 × 524
= SR. 14,934
Therefore factory A pays out larger amount as weekly wages.
2. C. V. of distribution of weekly wages of factory A and B are
σ1 5
CV (A) = ̅ × 100 = × 100 = 14.49
X1 34.5
σ2 4.5
CV (B) = ̅ × 100 = × 100 = 15.79
X2 28.5
Factory B has greater variability in individual wages, since C.V. of factory B is
greater than C.V of factory A.

Example 13: Prices of a particular commodity in five years in two cities are
given below:
Price in city A Price in city B
20 10
22 20
19 18
23 12
16 15
Which city has more stable prices?

Solution: Actual mean method

City A City B
Pric Deviations from Pri Deviation
dx2 s from Y̅ Dy2
es X̅ = 20 dx ces
(X)
20 0 0 (Y)
10 = 15-5 dy 25

23
Basic Statistics

22 2 4 20 5 25
19 -1 1 18 3 9
23 3 9 12 -3 9
16 -4 16 15 0 0

Σx=100 Σdx=0 Σdx2=30 Σy=75 Σdy=0 Σdy2 =68

City A:
∑X 100
Mean = = = 20;
N 5

∑dx2 30
SD (σ) = √ = √ = √6 = 2.45
N 5

SD 2.45
CV (City A) = = × 100 = × 100 = 12.25%
Mean 20
City B:
∑𝑋 75
𝑀𝑒𝑎𝑛 = = = 15;
𝑁 5

∑𝑑𝑥 2 68
𝑆𝐷 (𝜎) = √ = √ = √13.6 = 3.69
𝑁 5

𝑆𝐷 3.69
𝐶𝑉 (𝐶𝑖𝑡𝑦 𝐵) = = × 100 = × 100 = 24.6%
𝑀𝑒𝑎𝑛 15
Therefore, City A had more stable prices than City B, because the coefficient
of variation is less in City A.

24
Basic Statistics

25
Basic Statistics

Lesson 4
Probability Theory and Distribution

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 4 Probability Theory and Distribution


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Lesson-4
Objectives of the Lesson:
1. Probability – Basic concepts
2. Equally likely, mutually exclusive,
independent event
3. Additive and Multiplicative laws
4. Normal Distribution and its properties
Glossary of Terms: Sample Space, Event, Addison Law, Conditional
Probability, Normal Distribution etc.
4.1 Introduction:
The concept of probability is difficult to define in precise terms. In ordinary
language, the word probable means likely (or) chance. Generally the word,
probability, is used to denote the happening of a certain event, and the
likelihood of the occurrence of that event, based on past experiences. By
looking at the clear sky, one will say that there will not be any rain today.
On the other hand, by looking at the cloudy sky or overcast sky, one will
say that there will be rain today. In the earlier sentence, we aim that there
will not be rain and in the latter we expect rain. On the other hand a
mathematician says that the probability of rain is ‘0’ in the first case and
that the probability of rain is ‘1’ in the second case. In between 0 and 1,
there are fractions denoting the chance of the event occurring. In ordinary
language, the word probability means uncertainty about happenings. In
Mathematics and Statistics, a numerical measure of uncertainty is
provided by the important branch of statistics – called theory of
probability. Thus we can say, that the theory of probability describes
certainty by 1 (one), impossibility by 0 (zero) and uncertainties by the co-
efficient which lies between 0 and 1.
Trial and Event An experiment which, though repeated under essentially
identical (or) same conditions does not give unique results but may result
Basic Statistics

in any one of the several possible outcomes. Performing an experiment is


known as a trial and the outcomes of the experiment are known as events.
Example 1: Seed germination – either germinates or does not germinates
are events. In a lot of 5 seeds none may germinate (0), 1 or 2 or 3 or 4 or
all 5 may germinate.
Sample space (S)
A set of all possible outcomes from an experiment is called sample space.
For example, a set of five seeds are sown in a plot, none may germinate, 1,
2, 3, 4 or all five may germinate. i.e the possible outcomes are {0, 1, 2, 3, 4,
5. The set of numbers is called a sample space. Each possible outcome (or)
element in a sample space is called sample point.
Exhaustive Events
The total number of possible outcomes in any trial is known as exhaustive
events (or) exhaustive cases.
Example 1:
When pesticide is applied a pest may survive or die. There are two
exhaustive cases namely ( survival, death)
In throwing of a die, there are six exhaustive cases, since anyone of the 6
faces. 1, 2, 3, 4, 5, 6 may come uppermost.
In drawing 2 cards from a pack of cards the exhaustive number of cases is
52
C2, since 2 cards can be drawn out of 52 cards in 52C2 ways

Trial Random Experiment Total number of Sample Space


trials
(1) One pest is exposed to pesticide 21=2 {S,D}
(2) Two pests are exposed to 22=4 {SS, SD, DS,
pesticide DD}
Basic Statistics

(3) Three pests are exposed to 23=8 {SSS, SSD, SDS,


pesticide DSS, SDD,
DSD,DDS, DDD}
(4) One set of three seeds 41= 4 {0,1,2,3}
(5) Two sets of three seeds 42=16 {0,1},{0,2},{0,3}
etc

Favourable Events
The number of cases favourable to an event in a trial is the number of
outcomes which entail the happening of the event.
Example 2:
1. When a seed is sown if we observe non germination of a seed, it is a
favourable event. If we are interested in germination of the seed then
germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or) incompatible if the happening
of any one of the events excludes (or) precludes the happening of all the
others i.e.) if no two or more of the events can happen simultaneously in
the same trial. (i.e.) The joint occurrence is not possible.
Example 3:
In observation of seed germination the seed may either germinate or it will
not germinate. Germination and non germination are mutually exclusive
events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if taking in to consideration
all the relevant evidences, there is no reason to expect one in preference
to the others. (i.e.) Two or more events are said to be equally likely if each
one of them has an equal chance of occurring.
Basic Statistics

Independent Events
Several events are said to be independent if the happening of an event is
not affected by the happening of one or more events.
Example
When two seeds are sown in a pot, one seed germinates. It would not
affect the germination or non germination of the second seed. One event
does not affect the other event.
Dependent Events
If the happening of one event is affected by the happening of one or more
events, then the events are called dependent events.
Example 4:
If we draw a card from a pack of well shuffled cards, if the first card drawn
is not replaced then the second draw is dependent on the first draw.
Note: In the case of independent (or) dependent events, the joint
occurrence is possible.
4.2 Definition of Probability
4.2.1 Mathematical (or) Classical (or) a-priori Probability
If an experiment results in ‘n’ exhaustive cases which are mutually
exclusive and equally likely cases out of which ‘m’ events are favourable to
the happening of an event ‘A’, then the probability ‘p’ of happening of ‘A’
is given by
𝐹𝑎𝑣𝑜𝑢𝑟𝑎𝑏𝑙𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑎𝑠𝑒 𝑚
𝑝 = 𝑃(𝐴) = =
𝐸𝑥ℎ𝑎𝑢𝑠𝑡𝑖𝑣𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑎𝑠𝑒𝑠 𝑛
Note
1. If m = 0 ⇒ P(A) = 0, then ‘A’ is called an impossible event. (i.e.) also
by P(ϕ) = 0.
Basic Statistics

2. If m = n ⇒ P(A) = 1, then ‘A’ is called assure (or) certain event.


3. The probability is a non-negative real number and cannot exceed
unity (i.e.) lies between 0 to 1.
4. The probability of non-happening of the event ‘A’ (i.e.) P(A̅ ) It is
denoted by ‘q’.
𝑛−𝑚 𝑚
𝑃 (𝐴̅) = = 1 − = 1 − 𝑃(𝐴)
𝑛 𝑛
⇒𝑞 = 1−𝑝
⇒𝑝+𝑞 = 1
𝑜𝑟 𝑃(𝐴) + 𝑃(𝐴̅) = 1
4.2.2 Statistical (or) Empirical Probability (or) a-posteriori Probability
If an experiment is repeated a number (n) of times, an event ‘A’ happens
‘m’ times then the statistical probability of ‘A’ is given by
𝑚
𝑝 = 𝑃(𝐴) = lim
𝑛→∞ 𝑛

4.2.3 Axioms for Probability


1. The probability of an event ranges from 0 to 1. If the event cannot take
place its probability shall be ‘0’ if it certain, its probability shall be ‘1’.
𝐿𝑒𝑡 𝐸1 , 𝐸2 , … , 𝐸𝑛 𝑏𝑒 𝑎𝑛𝑦 𝑒𝑣𝑒𝑛𝑡𝑠, 𝑡ℎ𝑒 𝑃(𝐸𝑖 ≥ 0)
2. The probability of the entire sample space is ‘1’. (i.e.) P(S) = 1.
𝑛

, Total Probability, ∑ 𝑃(𝐸𝑖 ) = 1


𝑖=1
3. If A and B are mutually exclusive (or)disjoint events then the
probability of occurrence of either A (or) B denoted by P(AUB) shall be
given by
𝑃(𝐴 ∪ 𝐵) = 𝑃 (𝐴) + 𝑃(𝐵)
𝑃(𝐸1 ∪ 𝐸2 ∪ … .∪ 𝐸𝑛) = 𝑃 (𝐸1) + 𝑃 (𝐸2) + … … + 𝑃 (𝐸𝑛)
If E1, E2, …., En are mutually exclusive events.
Basic Statistics

Example 5: Two dice are tossed. What is the probability of getting (i) Sum
6 (ii) Sum 9?
Solution
When 2 dice are tossed. The exhaustive number of cases is 36 ways.
(i) Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
5
𝑓𝑎𝑣𝑜𝑢𝑟𝑎𝑏𝑙𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑎𝑠𝑒𝑠 = 5, 𝑃(𝑆𝑢𝑚 6) =
36
(ii) Sum 9 = {(3,6), (4,5), (5,4), (6,3)}
∴ Favourable number of cases = 4
4 1
𝑃(𝑆𝑢𝑚 9) = =
36 9
Example 6: A card is drawn from a pack of cards. What is a probability of
getting (i) a king (ii) a spade (iii) a red card (iv) a numbered card?
Solution
52
There are 52 cards in a pack. One can be selected in 𝐶1ways.
52
∴ Exhaustive number of cases is = 𝐶1 = 52.
(i) A king
There are 4 kings in a pack.
One king can be selected in 4 𝐶1ways.
4
∴ Favourable number of cases is = 𝐶1 = 4
4 1
Hence the probability of getting a king = =
52 13

(ii) A spade
There are 13 kings in a pack.
One spade can be selected in 13C1 ways.
∴ Favourable number of cases is = 13C1 = 13
13 1
Hence the probability of getting a spade = =
52 4
Basic Statistics

(iii) A red card


There are 26 kings in a pack.
One red card can be selected in 26C1 ways.
∴ Favourable number of cases is = 26C1 = 26
26 1
Hence the probability of getting a red card = =
52 2
(iv)A numbered card
There are 36 kings in a pack.
One numbered card can be selected in 36C1 ways.
∴ Favourable number of cases is = 36C1 = 36
36
Hence the probability of getting a numbered card =
52
Example 7: What is the probability of getting 53 Sundays when a leap
year selected at random?
Solution
A leap year consists of 366 days.
This has 52 full weeks and 2 days remained.
The remaining 2 days have the following possibilities.
(i) Sun. Mon
(ii) Mon, Tues
(iii) Tues, Wed
(iv) Wed, Thurs
(v) Thurs, Fri
(vi) Fri, Sat
(vii) Sat, Sun.
In order that a lap year selected at random should contain 53 Sundays, one
of the 2 over days must be Sunday.
Exhaustive number of cases is = 7
∴ Favourable number of cases is = 2
∴ Required Probability is =
Basic Statistics

4.3 Conditional Probability:


Two events A and B are said to be dependent, when B can occur only
when A is known to have occurred (or vice versa). The probability
attached to such an event is called the conditional probability and is
denoted by P (A/B) (read it as: A given B) or, in other words, probability of
A given that B has occurred.
𝑃(𝐴 ∩ 𝐵) 𝑃(𝐴𝐵)
𝑃(𝐴/𝐵) = =
𝑃 (𝐵) 𝑃 (𝐵)
If two events A and B are dependent, then the conditional probability of B
given A is,
𝑃(𝐴 ∩ 𝐵) 𝑃(𝐴𝐵)
𝑃(𝐵/𝐴) = =
𝑃 (𝐴) 𝑃(𝐴)

There are two important theorems of probability namely,


1. The addition theorem on probability
2. The multiplication theorem on probability.
4.3.1 Addition Theorem on Probability
(i) Let A and B be any two events which are not mutually exclusive
𝑃 (𝐴 𝑜𝑟 𝐵) = 𝑃 (𝐴 ∪ 𝐵) = 𝑃 (𝐴 + 𝐵)
= 𝑃 (𝐴) + 𝑃 (𝐵) – 𝑃 (𝐴 ∩ 𝐵)(𝑜𝑟)
= 𝑃 (𝐴) + 𝑃 (𝐵) – 𝑃 (𝐴𝐵)
Basic Statistics

Proof

Let us take a random experiments with a sample space S of N sample


points. The by the definition of probability,

𝒏(𝑨 ∪ 𝑩) 𝒏(𝑨 ∪ 𝑩)
𝑷(𝑨 ∪ 𝑩) = =
𝒏(𝑺) 𝑵

From the diagram, using the axiom for the mutually exclusive events, we
write
𝑛(𝐴) + 𝑛(𝐴̅ ∩ 𝐵)
𝑃(𝐴 ∪ 𝐵) =
𝑁
Adding and subtracting 𝑛(𝐴 ∩ 𝐵) in the numinator
𝑛(𝐴) + 𝑛(𝐴̅ ∩ 𝐵) + 𝑛(𝐴 ∩ 𝐵) − 𝑛(𝐴 ∩ 𝐵)
=
𝑁
𝑛(𝐴) + 𝑛(𝐵) − 𝑛(𝐴 ∩ 𝐵)
𝑁
𝑛(𝐴) 𝑛(𝐵) 𝑛(𝐴 ∩ 𝐵)
= + −
𝑁 𝑛 𝑁
𝑃 (𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)
(iii) Lat A and B be any two events which are mutually excusive
Basic Statistics

events then
𝑃 (𝐴 𝑜𝑟 𝐵) = 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴 + 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)

Proof:

We know thaty, 𝑛(𝐴 ∪ 𝐵) = 𝑛(𝐴) + 𝑛(𝐵)


𝑛(𝐴 ∪ 𝐵)
𝑃(𝐴 ∪ 𝐵) =
𝑛
𝑛(𝐴) + 𝑛(𝐵)
𝑛
𝑛(𝐴) 𝑛(𝐵)
= +
𝑛 𝑛
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
Note
(1) In the case of 3 events, (not mutually exclusive events)
𝑃(𝐴 𝑜𝑟 𝐵 𝑜𝑟 𝐶 ) = 𝑃(𝐴 ∪ 𝐵 ∪ 𝐶 ) = 𝑃(𝐴 + 𝐵 + 𝐶 )
= 𝑃(𝐴) + 𝑃 (𝐵) + 𝑃(𝐶 ) − 𝑃 (𝐴 ∩ 𝐵) − 𝑃 (𝐵 ∩ 𝐶 ) − 𝑃(𝐴 ∩ 𝐶 )
+ 𝑃(𝐴 ∩ 𝐵 ∩ 𝐶)
(2) In the case of 3 events (mutually exclusive events)
𝑃(𝐴 𝑜𝑟 𝐵 𝑜𝑟 𝐶 ) = 𝑃(𝐴 ∪ 𝐵 ∪ 𝐶 ) = 𝑃(𝐴 + 𝐵 + 𝐶 )
= 𝑃(𝐴) + 𝑃(𝐵) + 𝑃(𝐶 )
Example : Using the additive law of probability we can find the probability
that in one roll of a die, we will obtain either a one-spot or a six-spot. The
probability of obtaining a onespot is 1/6. The probability of obtaining a six-
Basic Statistics

spot is also 1/6. The probability of rolling a die and getting a side that has
both a one-spot with a six-spot is 0. There is no side on a die that has both
these events. So substituting these values into the equation gives the
following result
1 1 2 1
+ − 0 = = = 0.3333
6 6 6 3
Finding the probability of drawing a 4 of hearts or a 6 or any suit using the
additive law of probability would give the following:
1 4 5
+ −0= = 0.0962
52 52 52
There is only a single 4 of hearts, there are 4 sixes in the deck and there
isn't a single card that is both the 4 of hearts and a six of any suit.
Now using the additive law of probability, you can find the probability of
drawing either a king or any club from a deck of shuffled cards. The
equation would be completed like this:
4 13 1 16
+ − = = 0.3077
52 52 52 52
There are 4 kings, 13 clubs, and obviously one card is both a king and a
club. We don't want to count that card twice, so you must subtract one of
it's occurrences away to obtain the result.
4.3.2 Multiplication Theorem on Probability
(i) If A and B be any two events which are not independent then
𝑃(𝐴 𝑎𝑛𝑑 𝐵) = 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴𝐵) = 𝑃(𝐴). 𝑃(𝐵/𝐴)
= 𝑃(𝐵). 𝑃(𝐴/𝐵)
Where P(B/A) and P(A/B) are the conditional probability of B given A
and A given B, respectively.
Proof
Let n is the total number of events
Basic Statistics

n (A) is the number of events in A


n (B) is the number of events in B
n (A∪ B) is the number of events in (A∪ B)
n (A∩B) is the number of events in (A∩B)
𝑛(𝐴 ∩ 𝐵)
𝑃(𝐴 ∩ 𝐵) =
𝑛
𝑛(𝐴 ∩ 𝐵) 𝑛(𝐴)
×
𝑛 𝑛(𝐴)
𝑛(𝐴) 𝑛(𝐴 ∩ 𝐵)
×
𝑛 𝑛(𝐴)
𝑃 (𝐴 ∩ 𝐵) = 𝑃(𝐴). 𝑃(𝐵/𝐴) (1)
𝑛(𝐴 ∩ 𝐵)
𝑃(𝐴 ∩ 𝐵) =
𝑛
𝑛(𝐴 ∩ 𝐵) 𝑛(𝐵)
×
𝑛 𝑛(𝐵)
𝑛(𝐵) 𝑛(𝐴 ∩ 𝐵)
×
𝑛 𝑛(𝐵)
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴). 𝑃(𝐴/𝐵) (2)
(ii) If A and B be any two events which are independent, then,
P(B/A) = P(B) and P(A/B) = P(A)
𝑃 (𝐴 𝑎𝑛𝑑 𝐵) = 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴𝐵) = 𝑃(𝐴). 𝑃(𝐵)
Note
(i) In the case of 3 events (dependent)
𝑃 (𝐴 ∩ 𝐵 ∩ 𝐶 ) = 𝑃(𝐴). 𝑃(𝐵/𝐴). 𝑃(𝐶/𝐴𝐵)
(ii) In the case of 3 events (independent)
𝑃 (𝐴 ∩ 𝐵 ∩ 𝐶 ) = 𝑃 (𝐴). 𝑃(𝐵). 𝑃(𝐶)
Basic Statistics

Example 8:
So in finding the probability of drawing a 4 and then a 7 from a well shuffled
deck of cards, this law would state that we need to multiply those separate
probabilities together. Completing the equation above gives
4 4 16
𝑃(4 𝑎𝑛𝑑 7) = × = = 0.0059
52 52 2704
Given a well shuffled deck of cards, what is the probability of drawing a
Jack of Hearts, Queen of Hearts, King of Hearts, Ace of Hearts, and 10 of
Hearts?
1 1 1 1
𝑃(10, 𝐽, 𝑄, 𝐾, 𝐴 𝑜𝑓 ℎ𝑒𝑎𝑟𝑡𝑠) = × × × = 0,000000026
52 52 52 52
In any case, given a well shuffled deck of cards, obtaining this assortment
of cards, drawing one at a time and returning it to the deck would be highly
unlikely (it has an exceedingly low probability).
4.4 Normal distribution
Continuous Probability distribution is normal distribution. It is also known
as error law or Normal law or Laplacian law or Gaussian distribution. Many
of the sampling distribution like student-t, f distribution and χ 2
distribution.
Definition
A continuous random variable x is said to be a normal distribution with
parameters μ and σ2, if the density function is given by the probability law
1 1 𝑥−𝜇 2
𝑓(𝑥 ) = 𝑒 −2( 𝜎 ) ; −∞ < 𝑥 < ∞, = ∞ < 𝜇 < ∞, 𝜎 > 0
𝜎√2𝜋
Note
The mean m and standard deviation s are called the parameters of Normal
distribution. The normal distribution is expressed by X ~ N(μ , σ2)
Basic Statistics

4.4.1 Condition of Normal Distribution


Normal distribution is a limiting form of the binomial distribution under the
following conditions.
1. n, the number of trials is indefinitely large ie., n→∞ and
2. Neither p nor q is very small.
3. Normal distribution can also be obtained as a limiting form of Poisson
distribution with parameter
4. Constants of normal distribution are mean = μ , variation =σ2 , Standard
deviation = σ .
4.4.2 Normal probability curve
The curve representing the normal distribution is called the normal
probability curve. The curve is symmetrical about the mean (m), bell-
shaped and the two tails on the right and left sides of the mean extends to
the infinity. The shape of the curve is shown in the following figure.

4.4.3 Properties of normal distribution


1. The normal curve is bell shaped and is symmetric at x = μ .
2. Mean, median, and mode of the distribution are coincide
i.e., Mean = Median = Mode = μ
3. It has only one mode at x = μ (i.e., unimodal)
4. The points of inflection are at x = μ ± σ
Basic Statistics

1
5. The maximum ordinate occurs at x = μ and its value is =
𝜎 √2𝜋

6. Area Property
𝑃(𝜇 − 𝜎 < 𝑥 < 𝜇 + 𝜎) = 0.6826

𝑃(𝜇 – 2𝜎 < 𝑥 < 𝜇 + 2𝜎) = 0.9544


𝑃(𝜇 – 3𝜎 < 𝑥 < 𝜇 + 3𝜎) = 0.9973
4.4.4 Standard Normal distribution
Let X be random variable which follows normal distribution with mean m
𝑋−𝜇
and variance σ2 .The standard normal variate is defined as 𝑍 = which
𝜎
follows
standard normal distribution with mean 0 and standard deviation 1 i.e., Z
1 1 2
~ N(0,1). standard normal distribution is given by 𝜙(𝑍) = 𝑒 −2(𝑧)
√2𝜋

The advantage of the above function is that it doesn’t contain any


parameter. This enables us to compute the area under the normal
probability curve.
Note
Property of 𝝓(𝒁)
1. 𝜙(−𝑍) = 1 − 𝜙(𝑍)
2. 𝑃(2 ≤ 𝑍 ≤ 𝑏) = 𝜙(𝑏) − 𝜙(𝑎)
Example 9: In a normal distribution whose mean is 12 and standard
deviation is 2. Find the probability for the interval from x = 9.6 to x = 13.8
Solution
Given that Z~ N (12, 4)
9.6 − 12 13 − 8 − 12
𝑃(1.6 ≤ 𝑍 ≤ 13.8) = 𝑃 ( ≤𝑍≤ )
2 2
Basic Statistics

= 𝑃(−1.2 ≤ 𝑍 ≤ 0) + 𝑃(0 ≤ 𝑍 ≤ 0.9)


= 𝑃(0 ≤ 𝑍 ≤ 1.2) + 𝑃(0 ≤ 𝑍
≤ 0.9) [𝑏𝑦 𝑢𝑠𝑖𝑛𝑔 𝑠𝑦𝑚𝑚𝑒𝑡𝑟𝑖𝑐 𝑝𝑟𝑜𝑝𝑒𝑟𝑡𝑦]
= 0.3849 + 0.3159
= 0.7008
When it is converted to percentage (ie) 70% of the observations are
covered between 9.6 to 13.8.
Example 10: For a normal distribution whose mean is 2 and standard
deviation 3. Find the value of the variate such that the probability of the
variate from the mean to the value is 0.4115
Solution:
Given that Z~ N (2, 9)
To find X1:
We have 𝑃(2 ≤ 𝑍 ≤ 𝑋1 ) = 0.4115
2−2 𝑋−𝜇 𝑋1 −2
𝑃( ≤ ≤ ) = 0.4115
3 𝜎 3
𝑋1 −2
𝑃 (0 ≤ 𝑍 ≤ 𝑍1 ) = 0.4115, where 𝑍1 =
3

[From the normal table where 0.4115 lies is rthe value of Z1]
Form the normal table we have Z1=1.35
𝑋1 − 2
∴ 1.35 =
3
⇒ 3(1.35) + 2 = 𝑋1
= 𝑋1 = 6.05
(i.e) 41 % of the observation converged between 2 and 6.05
Basic Statistics
Basic Statistics

Lesson 5
Sampling Theory

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 5 Sampling Theory


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Lesson-5
Objectives of the Lesson:
1. Sampling-basic concepts
2. Sampling methods
3. Simple random sampling
4. Stratified random sampling
Glossary of Terms: Sampling, Simple random Sampling, Stratified
Sampling, Census Survey etc.
5.1 Basic Terminology of Sampling Theory:

Population (Universe)
Population means aggregate of all possible units. It need not be
human population. It may be population of plants, population of insects,
population of fruits, etc.
Finite population
When the number of observation can be counted and is definite, it is
known as finite population
 No. of plants in a plot.
 No. of farmers in a village.
 All the fields under a specified crop.
Infinite population
When the number of units in a population is innumerably large, that we
cannot count all of them, it is known as infinite population.
 The plant population in a region.
 The population of insects in a region.
Frame
Basic Statistics

A list of all units of a population is known as frame.


Parameter
A summary measure that describes any given characteristic of the
population is known as parameter. Population are described in terms of
certain measures like mean, standard deviation etc. These measures of the
population are called parameter and are usually denoted by Greek letters.
For example, population mean is denoted by μ, standard deviation by σ
and variance by σ2.
Sample
A portion or small number of unit of the total population is known as
sample.
 All the farmers in a village(population) and a few farmers(sample)
 All plants in a plot is a population of plants.
 A small number of plants selected out of that population is a sample
of plants

Statistic
A summary measure that describes the characteristic of the sample is
known as statisitic. Thus sample mean, sample standard deviation etc is
statistic. The statistic is usually denoted by roman letter.
𝑥̅ − 𝑠𝑎𝑚𝑝𝑙𝑒 𝑚𝑒𝑎𝑛
𝑠 − 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛
The statistic is a random variable because it varies from sample to sample.
Sampling
The method of selecting samples from a population is known as sampling.
Basic Statistics

5.2 Sampling technique


There are two ways in which the information is collected during statistical
survey.
They are
 Census survey
 Sampling survey
5.2.1 Census
It is also known as population survey and complete enumeration survey.
Under census survey the information are collected from each and every
unit of the population or universe.
5.2.2 Sample survey
A sample is a part of the population. Information are collected from only a
few units of a population and not from all the units. Such a survey is known
as sample survey. Sampling technique is universal in nature, consciously or
unconsciously it is adopted in everyday life.
For example:
1. A handful of rice is examined before buying a sack.
2. We taste one or two fruits before buying a bunch of grapes.
3. To measure root length of plants only a portion of plants are selected
from a plot.
5.3 Need for sampling
The sampling methods have been extensively used for a variety of
purposes and in great diversity of situations.
In practice it may not be possible to collected information on all units of a
population due to various reasons such as
1. Lack of resources in terms of money, personnel and equipment.
2. The experimentation may be destructive in nature. Eg- finding out
Basic Statistics

the germination percentage of seed material or in evaluating the


efficiency of an insecticide the experimentation is destructive.
3. The data may be wasteful if they are not collected within a time limit.
The census survey will take longer time as compared to the sample
survey. Hence for getting quick results sampling is preferred.
Moreover a sample survey will be less costly than complete
enumeration.
4. Sampling remains the only way when population contains infinitely
many number of units.
5. Greater accuracy.
5.4 Sampling methods
The various methods of sampling can be grouped under
1. Probability sampling or random sampling
2. Non-probability sampling or non random sampling
5.5 Random sampling
Under this method, every unit of the population at any stage has equal
chance (or) each unit is drawn with known probability. It helps to estimate
the mean, variance etc of the population.
Under probability sampling there are two procedures
1. Sampling with replacement (SWR)
2. Sampling without replacement (SWOR)
When the successive draws are made with placing back the units selected
in the preceding draws, it is known as sampling with replacement. When
such replacement is not made it is known as sampling without
replacement.
When the population is finite sampling with replacement is adopted
otherwise SWOR is adopted.
Mainly there are many kinds of random sampling. Some of them are.
Basic Statistics

1. Simple Random Sampling


2. Systematic Random Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5.5 Simple Random sampling (SRS)
The basic probability sampling method is the simple random sampling. It is
the simplest of all the probability sampling methods. It is used when the
population is homogeneous.
When the units of the sample are drawn independently with equal
probabilities. The sampling method is known as Simple Random Sampling
(SRS). Thus if the population consists of N units, the probability of selecting
any unit is 1/N.
A theoretical definition of SRS is as follows
Suppose we draw a sample of size n from a population of size N. There are
N
Cn possible samples of size n. If all possible samples have an equal
probability 1/NCn of being drawn, the sampling is said be simple random
sampling.
There are two methods in SRS
1. Lottery method
2. Random no. table method
5.5.1 Lottery method
This is most popular method and simplest method. In this method all the
items of the universe are numbered on separate slips of paper of same size,
shape and color. They are folded and mixed up in a drum or a box or a
container. A blindfold selection is made. Required number of slips is
selected for the desired sample size. The selection of items thus depends
on chance.
Basic Statistics

For example, if we want to select 5 plants out of 50 plants in a plot, we


number the 50 plants first. We write the numbers from 1-50 on slips of the
same size, role them and mix them. Then we make a blindfold selection of
5 plants. This method is also called unrestricted random sampling because
units are selected from the population without any restriction. This
method is mostly used in lottery draws. If the population is infinite, this
method is inapplicable. There is a lot of possibility of personal prejudice if
the size and shape of the slips are not identical.
5.5.2 Random number table method
As the lottery method cannot be used when the population is infinite, the
alternative method is using of table of random numbers.
There are several standard tables of random numbers. But the credit for
this technique goes to Prof. LHC. Tippet (1927). The random number table
consists of 10,400 four-figured numbers. There are various other random
numbers. They are fishers and Yates (19380 comprising of 15,000 digits
arranged in twos. Kendall and B.B Smith (1939) consisting of 1, 00,000
numbers grouped in 25,000 sets of 4 digit random
numbers, Rand corporation (1955) consisting of 2, 00,000 random
numbers of 5 digits each etc.,
5.5.3 Merits
1. There is less chance for personal bias.
2. Sampling error can be measured.
3. This method is economical as it saves time, money and labour.
5.5.4 Demerits
1. It cannot be applied if the population is heterogeneous.
2. This requires a complete list of the population but such up-to-date
lists are not available in many enquires.
3. If the size of the sample is small, then it will not be a representative
Basic Statistics

of the population.
5.6 Stratified Sampling
When the population is heterogeneous with respect to the characteristic
in which we are interested, we adopt stratified sampling.
When the heterogeneous population is divided into homogenous sub-
population, the sub-populations are called strata. From each stratum a
separate sample is selected using simple random sampling. This sampling
method is known as stratified sampling.
We may stratify by size of farm, type of crop, soil type, etc.
The number of units to be selected may be uniform in all strata (or) may
vary from stratum to stratum.
There are four types of allocation of strata
1. Equal allocation
2. Proportional allocation
3. Neyman’s allocation
4. Optimum allocation

 If the number of units to be selected is uniform in all strata it is known


as equal allocation of samples.
 If the number of units to be selected from a stratum is proportional
to the size of the stratum, it is known as proportional allocation of
samples.
 When the cost per unit varies from stratum to stratum, it is known
as optimum allocation.
 When the costs for different strata are equal, it is known as
Neyman’s allocation.
5.6.1 Merits
Basic Statistics

1. It is more representative.
2. It ensures greater accuracy.
3. It is easy to administrate as the universe is sub-divided.
5.6.2 Demerits
1. To divide the population into homogeneous strata, it requires more
money, time and statistical experience which is a difficult one.
2. If proper stratification is not done, the sample will have an effect of
bias.

References:
1. Cochran, W.G. (1977), Sampling techniques, Wiley Eastern Limited.
2. Des Raj and Chandok. P. ( 1998 ), Sampling Theory. Narosa Publishing
House. New Deihi.
3. Murthy, M.N. ( 1967), Sampling Theory and methods. Statistical
Publishing Society. Calcutta.
Basic Statistics

Lesson 6
Testing of Hypothesis

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 6 Testing of Hypothesis


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Lesson-6
Objectives of the Lesson:
1. Test of significance and Basic concepts
2. Null hypothesis, alternative hypothesis and level of
significance
3. Standard error and its importance
4. Steps in testing of hypothesis with different tests
6.1 Sampling Distribution
By drawing all possible samples of same size from a population we can
calculate the statistic, for example, x̅ for all samples. Based on this we can
construct a frequency distribution and the probability distribution of x̅. Such
probability distribution of a statistic is known a sampling distribution of that
statistic. In practice, the sampling distributions can be obtained theoretically
from the properties of random samples.
6.2 Standard Error
As in the case of population distribution the characteristic of the sampling
distributions are also described by some measurements like mean & standard
deviation. Since a statistic is a random variable, the mean of the sampling
distribution of a statistic is called the expected valued of the statistic. The SD
of the sampling distributions of the statistic is called standard error of the
Statistic. The square of the standard error is known as the variance of the
statistic. It may be noted that the standard deviation is for units whereas the
standard error is for the statistic.
6.3 Theory of Testing Hypothesis
6.3.1 Hypothesis
Hypothesis is a statement or assumption that is yet to be proved.
Basic Statistics

6.3.2 Statistical Hypothesis


When the assumption or statement that occurs under certain conditions is
formulated as scientific hypothesis, we can construct criteria by which a
scientific hypothesis is either rejected or provisionally accepted. For this
purpose, the scientific hypothesis is translated into statistical language. If the
hypothesis in given in a statistical language it is called a statistical hypothesis.
For eg:-
The yield of a new paddy variety will be 3500 kg per hectare – scientific
hypothesis.
In Statistical language if may be stated as the random variable (yield of paddy)
is distributed normally with mean 3500 kg/ha.
6.3.3 Simple Hypothesis
When a hypothesis specifies all the parameters of a probability distribution,
it is known as simple hypothesis. The hypothesis specifies all the parameters,
i.e µ and σ of a normal distribution.
Eg:-
The random variable x is distributed normally with mean µ=0 & SD=1 is a
simple hypothesis. The hypothesis specifies all the parameters (µ & σ) of a
normal distributions.
6.3.4 Composite Hypothesis
If the hypothesis specific only some of the parameters of the probability
distribution, it is known as composite hypothesis. In the above example if only
the µ is specified or only the σ is specified it is a composite hypothesis.
6.3.5 Null Hypothesis - Ho
Basic Statistics

Consider for example, the hypothesis may be put in a form ‘paddy variety A
will give the same yield per hectare as that of variety B’ or there is no
difference between the average yields of paddy varieties A and B. These
hypotheses are in definite terms. Thus these hypothesis form a basis to work
with. Such a working hypothesis in known as null hypothesis. It is called null
hypothesis because if nullities the original hypothesis, that variety A will give
more yield than variety B.
The null hypothesis is stated as ‘there is no difference between the effect of
two treatments or there is no association between two attributes (ie) the two
attributes are independent. Null hypothesis is denoted by H0.
Eg:-
There is no significant difference between the yields of two paddy varieties
(or) they give same yield per unit area. Symbolically, H0: µ1=µ2.
6.3.6 Alternative Hypothesis
When the original hypothesis is µ1>µ2 stated as an alternative to the null
hypothesis is known as alternative hypothesis. Any hypothesis which is
complementary to null hypothesis is called alternative hypothesis, usually
denoted by H1.
Eg:-
There is a significance difference between the yields of two paddy
varieties.
Symbolically,
𝐻1 : µ1 ≠ µ2 (𝑡𝑤𝑜 𝑠𝑖𝑑𝑒𝑑 𝑜𝑟 𝑑𝑖𝑟𝑒𝑐𝑡𝑖𝑜𝑛𝑙𝑒𝑠𝑠 𝑎𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑖𝑣𝑒)
If the statement is that A gives significantly less yield than B (or) A gives
significantly more yield than B. Symbolically,
Basic Statistics

𝐻1 : µ1 < µ2 (𝑜𝑛𝑒 𝑠𝑖𝑑𝑒𝑑 𝑎𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑖𝑣𝑒 − 𝑙𝑒𝑓𝑡 𝑡𝑎𝑖𝑙𝑒𝑑)


𝐻1 : µ1 > µ2 (𝑜𝑛𝑒 𝑠𝑖𝑑𝑒𝑑 𝑎𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑖𝑣𝑒 − 𝑟𝑖𝑔ℎ𝑡 𝑡𝑎𝑖𝑙𝑒𝑑)
6.4 Testing of Hypothesis
Once the hypothesis is formulated we have to make a decision on it. A
statistical procedure by which we decide to accept or reject a statistical
hypothesis is called testing of hypothesis.
6.5 Sampling Error
From sample data, the statistic is computed and the parameter is estimated
through the statistic. The difference between the parameter and the statistic
is known as the sampling error.
6.6 Test of Significance
Based on the sampling error the sampling distributions are derived. The
observed results are then compared with the expected results on the basis of
sampling distribution. If the difference between the observed and expected
results is more than specified quantity of the standard error of the statistic, it
is said to be significant at a specified probability level. The process up to this
stage is known as test of significance.
6.7 Decision Errors
By performing a test we make a decision on the hypothesis by accepting or
rejecting the null hypothesis Ho. In the process we may make a correct
decision on Ho or commit one of two kinds of error.
 We may reject Ho based on sample data when in fact it is true. This
error in decisions is known as Type I error.
 We may accept Ho based on sample data when in fact it is not true. It
is known as Type II error.
Basic Statistics

Accept Ho Reject Ho
Ho is true Correct Decision Type I error
Ho is false Type II error Correct Decision
The relationship between type I & type II errors is that if one increases the
other will decrease. The probability of type I error is denoted by α. The
probability of type II error is denoted by β. The correct decision of rejecting
the null hypothesis when it is false is known as the power of the test. The
probability of the power is given by 1-β.
6.8 Critical Region
The testing of statistical hypothesis involves the choice of a region on the
sampling distribution of statistic. If the statistic falls within this region, the null
hypothesis is rejected: otherwise it is accepted. This region is called critical
region.
Let the null hypothesis be 𝐻0 : µ1 = µ2 and its alternative be 𝐻1 : µ1 ≠ µ2 .
Suppose Ho is true.
Based on sample data it may be observed that statistic (𝑥̅1 − 𝑥̅ 2 ) follows
a normal distribution given by
(𝑥̅1 − 𝑥̅ 2 ) − (𝜇1 − 𝜇2 )
𝑍=
𝑆𝐸 (𝑥̅1 − 𝑥̅ 2 )
We know that 95% values of the statistic from repeated samples will fall in
the range (𝑥̅1 − 𝑥̅ 2 ) ± 1.96 𝑡𝑖𝑚𝑒𝑠 𝑆𝐸(𝑥̅1 − 𝑥̅ 2 ). This is represented by a
diagram
Basic Statistics

The border line value ±1.96 is the critical value or tabular value of Z. The area
beyond the critical values (shaded area) is known as critical region or region
of rejection. The remaining area is known as region of acceptance.
If the statistic falls in the critical region we reject the null hypothesis and, if it
falls in the region of acceptance we accept the null hypothesis.
In other words if the calculated value of a test statistic (Z, t, χ 2 etc) is more
than the critical value in magnitude it is said to be significant and we reject
Ho and otherwise we accept Ho. The critical values for the t and are given in
the form of readymade tables. Since the criticval values are given in the form
of table it is commonly referred as table value. The table value depends on
the level of significance and degrees of freedom.
Example: Zcal < Ztab -We accept the H0 and conclude that there is no significant
difference between the means
Test Statistic
The sampling distribution of a statistic like Z, t, & χ2 are known as test statistic.
Generally, in case of quantitative data
𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 − 𝑃𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟
𝑇𝑒𝑠𝑡 𝑠𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐𝑠 =
𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐸𝑟𝑟𝑜𝑟 (𝑆𝑡𝑎𝑡𝑖𝑠𝑡𝑖𝑐 )
Note
The choice of the test statistic depends on the nature of the variable (ie)
qualitative or quantitative, the statistic involved (i.e) mean or variance and
the sample size, (i.e) large or small.
6.9 Level of Significance
Basic Statistics

𝛼 𝛼
The probability that the statistic will fall in the critical region is + =𝛼.
2 2
This is nothing but the probability of committing type I error. Technically the
probability of committing type I error is known as level of Significance.
6.10 One and two tailed test
The nature of the alternative hypothesis determines the position of the
critical region. For example, if 𝐻1 𝑖𝑠 𝜇1 ≠ 𝜇2 it does not show the direction
and hence the critical region falls on either end of the sampling distribution.
If H1 is 𝜇1 < 𝜇2 𝑜𝑟 𝜇1 > 𝜇2 the direction is known. In the first case the
critical region falls on the left of the distribution whereas in the second case
it falls on the right side.
6.10.1 One tailed test – When the critical region falls on one end of the
sampling distribution, it is called one tailed test.
6.10.2 Two tailed test – When the critical region falls on either end of the
sampling distribution, it is called two tailed test.
For example, consider the mean yield of new paddy variety (μ 1) is compared
with that of a ruling variety (μ2). Unless the new variety is more promising
that the ruling variety in terms of yield we are not going to accept the new
variety. In this case 𝐻1 : 𝜇1 > 𝜇2 for which one tailed test is used. If both the
varieties are new our interest will be to choose the best of the two. In this
case 𝐻1 : 𝜇1 ≠ 𝜇2 for which we use two tailed test.
Degrees of freedom
The number of degrees of freedom is the number of observations that are
free to vary after certain restriction have been placed on the data. If there are
n observations in the sample, for each restriction imposed upon the original
observation the number of degrees of freedom is reduced by one.
Basic Statistics

The number of independent variables which make up the statistic is known as


the degrees of freedom and is denoted by (Nu)
Steps in testing of hypothesis
The process of testing a hypothesis involves following steps.
1. Formulation of null & alternative hypothesis.
2. Specification of level of significance.
3. Selection of test statistic and its computation.
4. Finding out the critical value from tables using the level of significance,
sampling distribution and its degrees of freedom.
5. Determination of the significance of the test statistic.
6. Decision about the null hypothesis based on the significance of the test
statistic.
7. Writing the conclusion in such a way that it answers the question on
hand.
8.11 Large sample theory
The sample size n is greater than 30 (n≥30) it is known as large sample. For
large samples the sampling distributions of statistic are normal (Z test). A
study of sampling distribution of statistic for large sample is known as large
sample theory.
6.12 Small sample theory
If the sample size n ils less than 30 (n<30), it is known as small sample. For
small samples the sampling distributions are t, F and χ2 distribution. A study
of sampling distributions for small samples is known as small sample theory.
6.13 Test of Significance
The theory of test of significance consists of various test statistic. The theory
had been developed
Basic Statistics

under two broad heading


1. Test of significance for large sample
Large sample test or Asymptotic test or Z test (n≥30)
2. Test of significance for small samples(n<30)
Small sample test or Exact test-t, F and χ2.
It may be noted that small sample tests can be used in case of large samples
also.
6.13.1 Large sample test
Large sample test are
1. Sampling from attributes
2. Sampling from variables
6.13.2 Sampling from attributes
There are two types of test for attributes
1. Test for single proportion
2. Test for equality of two proportions
6.13.2.1 Test for single proportion
In a sample of large size n, we may examine whether the the sample would
have come from a population having a specified proportion P=P0. For testing
We may proceed as follows
1. Null Hypothesis (H0)
H0: The given sample would have come from a population with specified
proportion P=P0
2. Alternative Hypothesis(H1)
H1 : The given sample may not be from a population with specified
proportion
Basic Statistics

P≠P0 (Two Sided)


P>P0(One sided-right sided)
P<P0(One sided-left sided)
3. Test statistic
|𝑝 − 𝑃 |
𝑍=
𝑃𝑄

𝑛

It follows a standard normal distribution with μ=0 and σ2=1


4. Level of Significance
The level of significance may be fixed at either 5% or 1%
5. Expected vale or critical value
In case of test statistic Z, the expected value is
Ze = 1.96 at 5% level
2.58 at 1% level Two tailed test
Ze = 1.65 at 5% level
2.33 at 1% level One tailed test
6. Inference
If the observed value of the test statistic Zo exceeds the table value Ze
we reject the Null Hypothesis Ho otherwise accept it.
6.13.2.2 Test for equality of two proportions
Given two sets of sample data of large size n1 and n2 from attributes. We may
examine whether the two samples come from the populations having the
same proportion. We may proceed as follows:
Basic Statistics

1. Null Hypothesis (Ho)


Ho: The given two sample would have come from a population having
the same proportion P1=P2
2. Alternative Hypothesis (H1)
H1 : The given two sample may not be from a population with specified
proportion P1≠P2 (Two Sided)
P1>P2(One sided-right sided)
P1<P2(One sided-left sided)
3. Test statistic
|(𝑝1 − 𝑝2 ) − (𝑃1 − 𝑃2 )|
𝑍=
𝑃1 𝑄1 𝑃 𝑄
√ + 2 2
𝑛1 𝑛2

When P1and P2 are not known, then


|𝑝1 − 𝑝2 |
𝑍=
𝑝1 𝑞1 𝑝2 𝑞2
√𝑛 + 𝑛
1 2

for heterogeneous population Where q1 = 1-p1 and q2 = 1-p2


|𝑝1 − 𝑝2 |
𝑍=
1 1
√𝑝𝑞 ( + ) 𝑛1 𝑛2

for homogeneous population


p= combined or pooled estimate.
𝑛1 𝑝1 + 𝑝2 𝑛2
𝑝=
𝑛1 + 𝑛2
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected vale
The expected value is given by
Basic Statistics

Ze = 1.96 at 5% level
2.58 at 1% level Two tailed test
Ze = 1.65 at 5% level
2.33 at 1% level One tailed test
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may
reject the Null Hypothesis H0 otherwise accept it.
6.13.3 Sampling from variable
In sampling for variables, the test are as follows
1. Test for single Mean
2. Test for single Standard Deviation
3. Test for equality of two Means
4. Test for equality of two Standard Deviation
6.13.3.1 Test for single Mean
In a sample of large size n, we examine whether the sample would have come
from a population having a specified mean
1. Null Hypothesis (H0)
H0: There is no significance difference between the sample mean ie.,
µ=µo
or
The given sample would have come from a population having a
specified mean ie., µ=µ0
2. Alternative Hypothesis(H1)

H1 : There is significance difference between the sample mean is.


Basic Statistics

µ≠µ0 or µ>µ0 or µ<µo

3. Test statistics

|𝑥̅ − 𝜇 |
𝑍= 𝜎
√𝑛

When population variance is not known, it may be replaced by its estimate

(∑𝑥)2
|𝑥̅ − 𝜇 | ∑𝑥 2 −
𝑍= , 𝑤ℎ𝑒𝑟𝑒 𝑠 = √ 𝑛
𝑠
𝑛−1
√𝑛

4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected vale
The expected value is given by
Ze = 1.96 at 5% level
2.58 at 1% level Two tailed test
Ze = 1.65 at 5% level
2.33 at 1% level One tailed test
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may
reject the Null Hypothesis H0 otherwise accept it.
6.13.3.2 Test for equality of two Means
Given two sets of sample data of large size n1 and n2 from variables. We may
examine whether the two samples come from the populations having the
same mean. We may proceed as follows
Basic Statistics

1. Null Hypothesis (H0)


Ho: There is no significance difference between the sample mean ie.,
µ=µ0
or
The given sample would have come from a population having a
specified mean ie., µ1=µ2
2. Alternative Hypothesis (H1)
H1: There is significance difference between the sample mean ie., µ=µo
ie., µ1≠µ2 or µ1<µ2 or µ1>µ2
3. Test statistic
When the population variances are known and unequal (i.e) 𝜎12 ≠ 𝜎22
|(𝑥̅1 − 𝑥̅ 2 ) − (𝜇1 − 𝜇2 )|
𝑍=
𝜎12 𝜎22
√ +
𝑛1 𝑛2

When 𝜎12 = 𝜎22


|(𝑥̅1 − 𝑥̅ 2 )| (𝑛1 𝜎12 + 𝑛2 𝜎22 )
𝑍= , 𝑤ℎ𝑒𝑟𝑒 𝜎 =
1 1 (𝑛1 + 𝑛2 )
𝜎√ +
𝑛1 𝑛2

The equality of variances can be tested by using F test.


When population variance is unknown, they may be replaced by their
estimates 𝑠12 𝑎𝑛𝑑 𝑠22
|(𝑥̅1 − 𝑥̅ 2 )|
𝑍= , 𝑤ℎ𝑒𝑛 𝑠12 ≠ 𝑠22
𝑠2 𝑠22
√1 +
𝑛1 𝑛2
Basic Statistics

When 𝑠12 = 𝑠22


|(𝑥̅1 − 𝑥̅ 2 )| (𝑛1 𝑠12 + 𝑛2 𝑠22 )
𝑍= , 𝑤ℎ𝑒𝑟𝑒 𝑠 =
1 1 (𝑛1 + 𝑛2 )
𝑠√ +
𝑛1 𝑛2

4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected vale
The expected value is given by
The expected value is given by
Ze = 1.96 at 5% level
2.58 at 1% level Two tailed test
Ze = 1.65 at 5% level
2.34 at 1% level One tailed test
6. Inference
If the observed value of the test statistic Z exceeds the table value Z we
may reject the Null Hypothesis H0 otherwise accept it.
Basic Statistics

Lesson 7
T-Test and F-Test

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 7 T-Test and F-Test


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Objectives of the lesson:

1. Applications or uses of t-test


2. T-test for single mean
3. T-test for two means
4. T-test for paired sample
5. F-Test and its applications
Glossary of the lesson: Test-Statistic, Level of Significance, Independent

Sample etc.

7.1 Student’s t test

|x̅−μ|
When the sample size is smaller, the ratio Z = s will follow t distribution
√n

and not the standard normal distribution. Hence the test statistic is given as
|x̅−μ|
t= s which follows normal distribution with mean 0 and unit standard
√n

deviation. This follows a t distribution with (n-1) degrees of freedom which


can be written as t(n-1) d.f.
This fact was brought out by Sir William Gossest and Prof. R.A Fisher. Sir
William Gossest published his discovery in 1905 under the pen name Student
and later on developed and extended by Prof. R.A Fisher. He gave a test known
as t-test.

7.2 Applications (or) uses

1 .To test the single mean in single sample case

2.To test the equality of two means in double sample case.


Basic Statistics

• Independent samples(Independent t test)

• Dependent samples (Paired t test)

2. To test the significance of observed correlation coefficient.

3. To test the significance of observed partial correlation coefficient.

4. To test the significance of observed regression coefficient.

7.2.1 Test for single Mean


1. Form the null hypothesis
Ho: µ=µ (i.e) There is no significance difference between the sample mean
and the population mean
2. Form the Alternate hypothesis
H1: µ≠µo (or µ>µo or µ<µo) ie., There is significance difference between the
sample mean and the population mean
3. Level of Significance
The level may be fixed at either 5% or 1%

4. Test statistic

x̅ − μ
t=| s |
√n

Which follows t distribution with (n-1) degrees of freedom where

2 (∑x)2

x =
∑xi
and s = √∑x − n
n n−1
Basic Statistics

5. Find the table value of t corresponding to (n-1) d.f. and the specified level
of significance.
6. Inference
If t < ttab we accept the null hypothesis H0. We conclude that there is no
significant difference sample mean and population mean

(or) if t > ttab we reject the null hypothesis H0. (ie) we accept the alternative
hypothesis and conclude that there is significant difference between the
sample mean and the population mean.

Example 1
Based on field experiments, a new variety of green gram is expected to
given a yield of 12.0 quintals per hectare. The variety was tested on 10
randomly selected farmer’s fields. The yield (quintals/hectare) were
recorded as
14.3,12.6,13.7,10.9,13.7,12.0,11.4,12.0,12.6,13.1.
Do the results conform to the expectation?
Solution

1. Null hypothesis

H0: =12.0

(i.e) the average yield of the new variety of green gram is 12.0
quintals/hectare.

2. Alternative Hypothesis:

H1:μ ≠ 12.0
Basic Statistics

(i.e) the average yield is not 12.0 quintals/hectare, it may be less or more
than 12 quintals
/ hectare

3. Level of significance: 5 %

4. Test statistic:

x̅ − μ
t=| s |
√n

From the given data

∑x = 126.3 , ∑x 2 = 1605.77

∑x 126.3
x̅ = = = 12.63
n 10

2 (∑x)2

s= √∑x − n
=√
1605.77 − 1595.169
=√
10.601
= 1.0853
n−1 9 9
s 1.0853
= = 0.3432
√n √10

x̅ − μ
Now, t = | s |
√n

12.63 − 12
t= = 1.836
0.3432

Table value for t corresponding to 5% level of significance and 9 d.f. is 2.262

(two tailed test)


5. Inference
Basic Statistics

t < ttab

We accept the null hypothesis H0

We conclude that the new variety of green gram will give an average

yield of 12 quintals/hectare.
Note
Before applying t test in case of two samples the equality of their
variances
has to be tested by using F-test
s12
F = 2 ∼ Fn1 −1,n2−1 d, f, if s12 > s22
s2
or
s12
F = 2 ∼ Fn2 −1,n1−1 d, f, if s12 < s22
s2
where s12 is the variance of the first sample whose size in n1
s22 is the variance of the second sample whose size is n2

It may be noted that the numerator is always the greater variance. The
critical value for F is read from the F table corresponding to a specified
d.f. and level of significance Inference
F <Ftab

We accept the null hypothesis H0.(i.e) the variances are equal otherwise
the variances are unequal.
7.2.2 Test for equality of two means (Independent Samples)
Given two sets of sample observation x11, x12, x13 … x1n , and
Basic Statistics

x21, x22, x23 … x2n of sizes n1 and n2 respectively from the normal
population.
1. Using F-Test , test their variances

(i) Variances are Equal


H0 : µ1 = µ2
H1 : µ1 ≠ µ2 (or µ1 < µ2 or µ1 > µ2 )

2. Test statistics
|(x̅1 − x̅2 )|
t=
1 1
√s 2 (n + n )
1 2

(∑xi )2 (∑xi )2
[∑xi2 − ] + [∑xi2 − ]
2 n1 n2
where the combined variance s =
n1 + n2 − 2
The test statistics t follows a t distribution with n1 + n2 -2 d.f.
ii) Variance are unequal and n1=n2
|(x̅1 − x̅2 )|
t=
1 1
√s 2 (n + n )
1 2

𝑛1 +𝑛2
It follows a t distribution with ( ) − 1 𝑑. 𝑓.
2

(i) Variances are unequal and n1≠n2

(𝑥̅1 − 𝑥̅2 )
𝑡 = || ||
𝑠 2 𝑠 2
√( 1 + 2 )
𝑛1 𝑛2

This statistic follows neither t nor normal distribution but it follows


Basic Statistics

Behrens-Fisher d distribution. The Behrens – Fisher test is laborious one.


An alternative simple method has been suggested by Cochran & Cox. In
this method the critical value of t is altered as tw (i.e) weighted t
𝑠2 𝑠2
𝑡1 ( 1 ) + 𝑡2 ( 2 )
𝑛1 𝑛2
𝑡𝑥 =
𝑠12 𝑠22
+
𝑛1 𝑛2

where t1is the critical value for t with (n1-1) d.f. at a dspecified level of

significance and t2 is the critical value for t with (n2-1) d.f. at a specified
level of significance and

Example 2
In a fertilizer trial the grain yield of paddy (Kg/plot) was observed as follows
Under ammonium chloride 42,39,38,60 &41 kgs
Under urea 38, 42, 56, 64, 68, 69,& 62 kgs.
Find whether there is any difference between the sources of nitrogen?

Solution
𝐻𝑜 : µ1 = µ2 (i.e) there is no significant difference in effect between the
sources of nitrogen.
𝐻1 : µ1 ≠ µ2 (i.e) there is a significant difference between the two sources
Level of significance = 5%
Before we go to test the means first we have to test their variances by using
F-test. F-test
𝐻0 : 𝜎12 = 𝜎22

𝐻1 : 𝜎12 ≠ 𝜎22
Basic Statistics

(∑𝑥1 )2
∑𝑥12 −
𝑛1
𝑠12 = = 82.5
𝑛1 − 1

(∑𝑥2 )2
∑𝑥22 −
𝑛1
𝑠22 = = 154.33
𝑛2 − 1


𝑠12
𝐹 = 2 ∼ 𝐹𝑛2 −1,𝑛1−1 𝑑, 𝑓, 𝑖𝑓 𝑠12 < 𝑠22
𝑠2
154.33
𝐹= = 1.8707
32.5

Ftab(6,4) d.f. = 6.16

F < Ftab

We accept the null hypothesis H0. (i.e) the variances are equal. Use the test

statistic
|(𝑥̅1 − 𝑥̅2 )|
𝑡=
1 1
√𝑠 2 (𝑛 + 𝑛 )
1 2

(∑𝑥1 )2 (∑𝑥2 )2
[∑𝑥12 − ]+ [∑𝑥22 − ] 330 + 992
𝑛1 𝑛2
𝑠= = = 125.6
𝑛1 + 𝑛2 − 1 10
|(44 − 57)|
𝑡= = 1.98
1 1
√125.7 ( + )
7 75

The degrees of freedom is 5+7-2= 10. For 5 % level of significance, table

value of t is 2.228
Basic Statistics

Inference:

t <ttab

We accept the null hypothesis H0

We conclude that the two sources of nitrogen do not differ significantly with

regard to the grain yield of paddy.

7.3 F-Test:

A large number of research experiments are conducted to examine the effect


of various factors on the production and quality attributes of milk and milk
products. F-test is used either for testing the hypothesis about the equality of
two population variances or the equality of two or more population means.
The equality of two population means was dealt with t-test. Besides a t-test,
we can also apply F-test for testing equality of two population means. Sir
Ronald A. Fisher defined a statistic Z which is based upon ratio of two sample
variances. In this lesson we will consider the distribution of ratio of two
sample variances which was worked out by G.W. Snedecor.

7.3.1 F-Statistic:

Let X1i (i=1,2,..,n1) be a random sample of size n1from the first population with
variance σ12 and X2j (j=1,2,….,n2) be another independent random sample of
size n2 from the second normal population with variance σ22. The F- statistic
is defined as the ratio of estimates of two variances as given below:
Basic Statistics

where, S12 > S22 and are unbiased estimates of population variances which
are given by:

It follows Snedecor’s F- distribution with (n1-1, n2-1) d.f. i.e., F~F (n1 - 1, n2 -
1). Further, if X is a χ2-variate with n1 d.f. and Y is another independent χ2-
variate with n2 d.f., then F-statistic is defined as:

i.e. F-statistic is the ratio of two independent Chi-square variates divided by


their respective degrees of freedom. This statistic follows G.W. Snedecor F
distribution with (n1,n2) d.f. The sampling distribution of F-statistic does not
involve any population parameter and depends only on the degrees of
freedom n1and n2.

7.4 Application of F- Distribution

F-distribution has a number of applications in statistics, some of which are


given below

a) F-test for equality of population variances


b) F test for testing equality of several population means
Basic Statistics

Lesson 8
Ch-Square Distribution

Content
0
Basic Statistics

Course Name Basic Statistics

Lesson 8 Ch-Square Distribution


Content Creator Name Dr. Vinay Kumar
Chaudhary Charan Singh Haryana
University/College Name
Agricultural University,Hisar
Course Reviewer Name Dr Dhaneshkumar V Patel
Unagadh Agricultural
University/college Name
University,Junagadh

1
Basic Statistics

Objectives of the lesson:


1. Test for goodness of fit
2. Conditions for validity of chi-square test
3. Chi-Square (χ2) test for independence of attributes
4. Yates correction for continuity
Glossary of the lesson: Test-Statistic, Level of Significance, Goodness
of fit, Yates correction etc.
8.1 χ2 distribution:
In case of attributes we cannot employ the parametric tests such as F and
t. Instead we have to apply χ2 test. When we want to test whether a set of
observed values are in agreement with those expected on the basis of
some theories or hypothesis then χ2 statistic provides a measure of
agreement between such observed and expected frequencies.
The χ2 test has a number of applications. It is used to
1. Test the independence of attributes
2. Test the goodness of fit
3. Test the homogeneity of variances
4. Test the homogeneity of correlation coefficients
5. Test the equality of several proportions.
6. In genetics it is applied to detect linkage. Applications
8.2 χ2 – test for goodness of fit
A very powerful test for testing the significance of the discrepancy between
theory and experiment was given by Prof. Karl Pearson in 1900 and is
known as “chi-square test of goodness of fit “.
If 𝑂𝑖 (𝑖 = 1,2, … . . , 𝑛) is a set of observed (experimental frequencies) and
𝐸𝑖 (𝑖 = 1,2, … . . , 𝑛) is the corresponding set of expected (theoretical or
hypothetical) frequencies, then,
Basic Statistics

𝑛
2
(𝑂𝑖 − 𝐸𝑖 )2
𝜒 =∑
𝐸𝑖
𝑖=1

It follow a χ2 distribution with n-1 d.f.. In case of χ2 only one tailed test is
used.
For Example :In plant genetics, our interest may be to test whether the
observed segregation ratios deviate significantly from the mendelian
ratios. In such situations we want to test the agreement between the
observed and theoretical frequency, such test is called as test of goodness
of fit.
8.3 Conditions for the validity of the χ² test
The χ² test is an approximate test valid for large values of n. For the
validity of the χ² test of goodness of fit between theory and experiment, the
following conditions must be satisfied:
1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, should be linear,
   e.g., ΣOᵢ = ΣEᵢ.
3. N, the total frequency, should be reasonably large, say greater than 50.
4. No theoretical cell frequency should be less than 5. If any theoretical
   cell frequency is < 5, then for the application of the chi-square test it
   is pooled with the preceding or succeeding frequency so that the pooled
   frequency is at least 5, and the degrees of freedom are adjusted for the
   classes lost in pooling.
Example 1:
The number of yeast cells counted in a haemocytometer is compared with the
theoretical values given below. Does the experimental result support the
theory?

No. of yeast cells in the square   Observed frequency   Expected frequency
0                                  103                  106
1                                  143                  141
2                                  98                   93
3                                  42                   41
4                                  8                    14
5                                  6                    5

Solution
H0: The experimental results support the theory.
H1: The experimental results do not support the theory.
Level of significance = 5%
Test statistic:

χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ = 3.1779

Oᵢ    Eᵢ    Oᵢ − Eᵢ   (Oᵢ − Eᵢ)²   (Oᵢ − Eᵢ)²/Eᵢ
103   106   −3        9            0.0849
143   141   2         4            0.0284
98    93    5         25           0.2688
42    41    1         1            0.0244
8     14    −6        36           2.5714
6     5     1         1            0.2000
400   400                          3.1779

Table value
χ² for 6 − 1 = 5 d.f. at the 5% level of significance = 11.070
Inference
As χ²cal < χ²tab,
we accept the null hypothesis, i.e., there is good correspondence between
theory and experiment.
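
The same goodness-of-fit test can be sketched in Python with scipy; the
frequencies are those of Example 1, and scipy.stats.chisquare reproduces the
hand calculation:

# Chi-square goodness-of-fit test for the yeast-cell data of Example 1.
from scipy.stats import chisquare

observed = [103, 143, 98, 42, 8, 6]
expected = [106, 141, 93, 41, 14, 5]     # both sets total 400

# chisquare returns the statistic and the one-tailed p-value;
# d.f. = k - 1 = 5 since no parameters were estimated from the data.
chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {chi2:.4f}, p = {p:.4f}")   # chi-square ≈ 3.1779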
8.4 Chi-square (χ²) test for independence of attributes
At times we may consider two characteristics or attributes simultaneously. Our
interest will then be to test the association between these two attributes.
Example: An entomologist may be interested to know the effectiveness of
different concentrations of a chemical in killing insects. The concentrations
of the chemical form one attribute. The state of the insects, 'killed or not
killed', forms the other attribute. The results of this experiment can be
arranged in the form of a contingency table. In general, one attribute may be
divided into m classes A1, A2, …, Am and the other attribute may be divided
into n classes B1, B2, …, Bn. The contingency table will then have m × n cells
and is termed an m × n contingency table:

B\A        A1   A2   …  Aj   …  Am   Row total
B1         O11  O12  …  O1j  …  O1m  r1
B2         O21  O22  …  O2j  …  O2m  r2
…
Bi         Oi1  Oi2  …  Oij  …  Oim  ri
…
Bn         On1  On2  …  Onj  …  Onm  rn
Col total  c1   c2   …  cj   …  cm   n = Σri = Σcj

where the Oij are the observed frequencies.


The expected frequency corresponding to Oij is calculated as Eij = (ri cj)/n.
The χ² statistic is computed as

χ² = Σᵢ₌₁ⁿ Σⱼ₌₁ᵐ (Oij − Eij)² / Eij

where
Oij – observed frequencies
Eij – expected frequencies
n – number of rows
m – number of columns
It can be verified that ΣOij = ΣEij.

This χ² is distributed as χ² with (n − 1)(m − 1) d.f.


8.4.1 2 × 2 contingency table:
When the number of rows and the number of columns are both equal to 2, the
table is termed a 2 × 2 contingency table. It has the following form:

            A1          A2          Row total
B1          a           b           a + b = r1
B2          c           d           c + d = r2
Col total   a + c = c1  b + d = c2  a + b + c + d = n

where a, b, c and d are the cell frequencies, c1 and c2 are the column totals,
r1 and r2 are the row totals, and n is the total number of observations.
In the case of a 2 × 2 contingency table, χ² can be found directly using the
short-cut formula

χ² = n(ad − bc)² / (c1 c2 r1 r2)

The d.f. associated with χ² is (2 − 1)(2 − 1) = 1.
8.5 Yates correction for continuity
If any one of the cell frequencies is < 5, we use the Yates correction to make
χ² continuous. The correction is made by adding 0.5 to the least cell frequency
and adjusting the other cell frequencies so that the column and row totals
remain the same. Suppose the first cell frequency is to be corrected; then the
contingency table will be as follows:

            A1          A2          Row total
B1          a + 0.5     b − 0.5     a + b = r1
B2          c − 0.5     d + 0.5     c + d = r2
Col total   a + c = c1  b + d = c2  a + b + c + d = n

Equivalently, working with the original frequencies, the corrected χ² statistic
is

χ² = n(|ad − bc| − n/2)² / (c1 c2 r1 r2)

The d.f. associated with χ² is (2 − 1)(2 − 1) = 1.
Example 2:
The severity of a disease and blood group were studied in a research project.
The findings for 1500 patients are given in the following table, known as an
m × n contingency table. Is the severity of the condition associated with
blood group?

Severity of a disease classified by blood group in 1500 patients

Condition   O     A     B     AB    Total
Severe      51    40    10    9     110
Moderate    105   103   25    17    250
Mild        384   527   125   104   1140
Total       540   670   160   130   1500

Solution
H0: The severity of the disease is not associated with blood group.
H1: The severity of the disease is associated with blood group.

Test statistic:

χ² = Σᵢ Σⱼ (Oij − Eij)² / Eij

The d.f. associated with χ² is (3 − 1)(4 − 1) = 6.

Calculation with expected frequencies Eij = (row total × column total)/1500:

Oᵢ     Eᵢ      Oᵢ − Eᵢ   (Oᵢ − Eᵢ)²   (Oᵢ − Eᵢ)²/Eᵢ
51     39.6    11.4      129.96       3.2818
40     49.1    −9.1      82.81        1.6866
10     11.7    −1.7      2.89         0.2470
9      9.5     −0.5      0.25         0.0263
105    90.0    15.0      225.00       2.5000
103    111.7   −8.7      75.69        0.6776
25     26.7    −1.7      2.89         0.1082
17     21.7    −4.7      22.09        1.0180
384    410.4   −26.4     696.96       1.6982
527    509.2   17.8      316.84       0.6222
125    121.6   3.4       11.56        0.0951
104    98.8    5.2       27.04        0.2737
Total                                 12.2347

The table value of χ² for 6 d.f. at the 5% level of significance is 12.59.

Inference
As χ²cal < χ²tab,
we accept the null hypothesis: the severity of the disease has no association
with blood group.
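
Example 2 can be checked in Python with scipy.stats.chi2_contingency, which
computes the expected frequencies and the statistic from the observed table
(small differences from the hand value arise only from rounding of the
expected frequencies above):

# Chi-square test of independence for the blood-group data of Example 2.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[ 51,  40,  10,   9],
                  [105, 103,  25,  17],
                  [384, 527, 125, 104]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.4f}, d.f. = {dof}, p = {p:.4f}")
# chi-square ≈ 12.23 with 6 d.f.; p > 0.05, so H0 is retained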
Example 3:
In order to determine the possible effect of a chemical treatment on the rate
of germination of cotton seeds, a pot culture experiment was conducted. The
results are given below.

Chemical treatment and germination of cotton seeds

                     Germinated   Not germinated   Total
Chemically treated   118          22               140
Untreated            120          40               160
Total                238          62               300
Does the chemical treatment improve the germination rate of cotton seeds?

Solution
H0: The chemical treatment does not improve the germination rate of cotton
seeds.
H1: The chemical treatment improves the germination rate of cotton seeds.
Level of significance = 1%
Test statistic:

χ² = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)] with 1 d.f.

χ² = 300(118 × 40 − 22 × 120)² / (140 × 160 × 238 × 62) = 3.927

Table value
χ² for 1 d.f. at the 1% level of significance = 6.635
Inference
As χ² < χ²tab,
we accept the null hypothesis: the chemical treatment does not significantly
improve the germination rate of cotton seeds.

Example 4:
In an experiment on the effect of a growth regulator on fruit setting in
muskmelon the following results were obtained. Test whether fruit setting in
muskmelon and the application of the growth regulator are independent at the
1% level.

          Fruit set   Fruit not set   Total
Treated   16          9               25
Control   4           21              25
Total     20          30              50

Solution
H0: Fruit setting in muskmelon does not depend on the application of the
growth regulator.
H1: Fruit setting in muskmelon depends on the application of the growth
regulator.
Level of significance = 1%
After the Yates correction the adjusted frequencies are:

          Fruit set   Fruit not set   Total
Treated   15.5        9.5             25
Control   4.5         20.5            25
Total     20          30              50

Test statistic (applying the corrected formula to the original frequencies):

χ² = n(|ad − bc| − n/2)² / [(a + b)(c + d)(a + c)(b + d)]

χ² = 50(|16 × 21 − 9 × 4| − 25)² / (25 × 25 × 20 × 30)
   = 50(300 − 25)² / 375000 = 10.08

(The same value is obtained by applying the uncorrected formula to the
adjusted frequencies, since 15.5 × 20.5 − 9.5 × 4.5 = 275 = |ad − bc| − n/2.)
Table value
χ² for 1 d.f. at the 1% level of significance is 6.635.

Inference
As χ² > χ²tab,
we reject the null hypothesis. Fruit setting in muskmelon is influenced by the
growth regulator: application of the growth regulator increases fruit setting
in muskmelon.
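
Example 4 can likewise be checked in Python; for 2 × 2 tables,
scipy.stats.chi2_contingency applies the Yates continuity correction by
default:

# Chi-square test with Yates correction for the muskmelon data of Example 4.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[16,  9],
                  [ 4, 21]])

chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(f"chi-square = {chi2:.2f}, d.f. = {dof}, p = {p:.4f}")
# chi-square ≈ 10.08 with 1 d.f.; p < 0.01, so H0 is rejected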
Lesson 9
Correlation and Regression

Course Name: Basic Statistics
Lesson 9: Correlation and Regression
Content Creator Name: Dr. Vinay Kumar
University/College Name: Chaudhary Charan Singh Haryana Agricultural University, Hisar
Course Reviewer Name: Dr Dhaneshkumar V Patel
University/College Name: Junagadh Agricultural University, Junagadh

Lesson-9

Objectives of the Lecture:

1. Methods of measuring the correlation
2. Properties of correlation
3. Spearman rank correlation and its properties
4. Regression and properties of regression coefficients

Glossary of terms: Correlation, Scatter Diagram, Regression,
Regression Coefficient etc.

9.1 Introduction:
The term correlation is used by the common man without knowing that he is
making use of the term correlation. For example, when parents advise their
children to work hard so that they may get good marks, they are correlating
good marks with hard work.
The study of the characteristics of only one variable, such as height, weight,
age, marks or wages, is known as univariate analysis. The statistical analysis
of the relationship between two variables is known as bivariate analysis.
Sometimes the variables may be inter-related. In the health sciences we study
the relationship between blood pressure and age, consumption level of some
nutrient and weight gain, total income and medical expenditure, etc. The
nature and strength of such relationships may be examined by correlation and
regression analysis.
Thus correlation refers to the relationship between two or more variables,
e.g., the relation between the heights of father and son, yield and rainfall,
wages and price index, shares and debentures, etc.
Correlation is a statistical analysis which measures and analyses the degree
or extent to which two variables fluctuate with reference to each other. The
word relationship is important: it indicates that there is some connection
between the variables, and correlation measures the closeness of that
relationship. Correlation does not indicate a cause and effect relationship.
Price and supply, income and expenditure are correlated.

9.2 Uses of correlation:

1. It is used in the physical and social sciences.
2. It is useful for economists to study the relationship between variables
   like price and quantity. Businessmen estimate costs, sales, prices, etc.,
   using correlation.
3. It is helpful in measuring the degree of relationship between variables
   like income and expenditure, price and supply, supply and demand, etc.
4. Sampling error can be calculated.
5. It is the basis for the concept of regression.
9.3 Scatter Diagram
To investigate whether there is any relation between the variables X and Y we
use a scatter diagram. Let (x1, y1), (x2, y2), …, (xn, yn) be n pairs of
observations. If the variables X and Y are plotted along the X-axis and Y-axis
respectively in the x–y plane of a graph sheet, the resultant diagram of dots
is known as a scatter diagram. From the scatter diagram we can say whether
there is any correlation between x and y, whether it is positive or negative,
and whether the correlation is linear or curvilinear.
9.3.1 Types of Correlation
Positive correlation: Both variables tend to increase (or decrease) together.
Negative correlation: The two variables tend to change in opposite directions,
with one increasing while the other decreases.
No correlation: There is no apparent (linear) relationship between the two
variables.
Nonlinear relationship: The two variables are related, but the relationship
results in a scatter diagram that does not follow a straight-line pattern.

9.4 Classification of Correlation:


Correlation is classified into various types. The most important ones are
i) Positive and negative.
ii) Linear and non-linear.
iii) Partial and total.
iv) Simple and Multiple.

9.4.1 Positive and Negative Correlation:


It depends upon the direction of change of the variables. If the two
variables tend to move together in the same direction (i.e) an increase in
the value of one variable is accompanied by an increase in the value of the
other, (or) a decrease in the value of one variable is accompanied by a
decrease in the value of other, then the correlation is called positive or

direct correlation. Price and supply, height and weight, yield and rainfall,
are some examples of positive correlation.
If the two variables tend to move together in opposite directions so that
increase (or) decrease in the value of one variable is accompanied by a
decrease or increase in the value of the other variable, then the
correlation is called negative (or) inverse correlation. Price and demand,
yield of crop and price, are examples of negative correlation.
9.4.2 Linear and Non-linear correlation:
If the ratio of change between the two variables is constant then there will
be linear correlation between them.
Consider the following:

X  2  4  6  8   10  12
Y  3  6  9  12  15  18

Here the ratio of change between the two variables is the same. If we plot
these points on a graph we get a straight line.
If the amount of change in one variable does not bear a constant ratio to the
amount of change in the other, then the relation is called curvilinear (or
non-linear) correlation. The graph will be a curve.
9.4.3 Simple and Multiple correlation:
When we study only two variables, the relationship is simple correlation, for
example, quantity of money and price level, or demand and price. In multiple
correlation we study more than two variables simultaneously. The relationship
between the price, demand and supply of a commodity is an example of multiple
correlation.
9.4.4 Partial and total correlation:
The study of two variables excluding some other variable is called partial
correlation. For example, we may study price and demand, eliminating the
supply side. In total correlation all facts are taken into account.
9.5 Computation of correlation:
When there exists some relationship between two variables, we have to measure
the degree of relationship. This measure is called the measure of correlation
(or) correlation coefficient, and it is denoted by 'r'.
Co-variation:
The covariance between the variables X and Y is defined as

Cov(X, Y) = Σ(X − X̄)(Y − Ȳ)/N = Σxy/N

where X̄ is the mean of X, Ȳ is the mean of Y, and x and y are the deviations
of X and Y from their respective means.
Karl Pearson's Coefficient of Correlation:
This is the most widely used method in practice and is known as the Pearsonian
coefficient of correlation. It is denoted by 'r'. The formulas for calculating
'r' are:

r = Cov(X, Y) / (σx σy), where σx = SD of X and σy = SD of Y

r = Σxy / (N σx σy)

r = Σxy / √(Σx² Σy²)

The third formula is easy to calculate, and with it it is not necessary to
calculate the standard deviations of the x and y series separately.
9.6 Properties of Correlation Coefficient:
Property 1: Correlation coefficient lies between –1 and +1.
Property 2: ‘r’ is independent of change of origin and scale.
Property 3: It is a pure number independent of units of measurement.
Property 4: Independent variables are uncorrelated but the converse is
not true.
Property 5: Correlation coefficient is the geometric mean of two
regression coefficients.
Property 6: The correlation coefficient of x and y is symmetric. rxy = ryx.


9.7 Limitations:
1. The correlation coefficient assumes a linear relationship, regardless of
   whether that assumption is correct or not.
2. Extreme values of the variables unduly influence the correlation
   coefficient.
3. The existence of correlation does not necessarily indicate a cause–effect
   relation.

9.8 Interpretation:
The following rules help in interpreting the value of 'r':
1. When r = 1, there is a perfect positive relationship between the variables.
2. When r = −1, there is a perfect negative relationship between the variables.
3. When r = 0, there is no relationship between the variables.
4. If r is close to +1 or −1, it signifies a high degree of (positive or
   negative) correlation between the two variables. If r is near zero, e.g.,
   0.1, −0.1 or 0.2, there is little correlation.
Example 1:
Find Karl Pearson's coefficient of correlation from the following data on the
heights of father (x) and son (y).

X  64  65  66  67  68  69  70
Y  66  67  65  68  70  68  72

Comment on the result.
Solution:

x     y     (x − x̄)   (x − x̄)²   (y − ȳ)   (y − ȳ)²   (x − x̄)(y − ȳ)
64    66    −3         9           −2         4           6
65    67    −2         4           −1         1           2
66    65    −1         1           −3         9           3
67    68    0          0           0          0           0
68    70    1          1           2          4           2
69    68    2          4           0          0           0
70    72    3          9           4          16          12
469   476   0          28          0          34          25

Mean of X = ΣX/N = 469/7 = 67; Mean of Y = ΣY/N = 476/7 = 68.
Hence, Karl Pearson's coefficient of correlation is

r = Σxy / √(Σx² Σy²) = 25 / √(28 × 34) = 25 / √952 = 25 / 30.85 = 0.81

Since r = +0.81, the variables are highly positively correlated, i.e., tall
fathers have tall sons.
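
The calculation of Example 1 can be sketched in Python using the deviation
formula r = Σxy/√(Σx² Σy²):

# Karl Pearson's coefficient of correlation for the father-son heights.
import numpy as np

x = np.array([64, 65, 66, 67, 68, 69, 70])
y = np.array([66, 67, 65, 68, 70, 68, 72])

dx, dy = x - x.mean(), y - y.mean()          # deviations from the means
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
print(f"r = {r:.2f}")                        # r ≈ 0.81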
Example:
Calculate the coefficient of correlation from the following data.

X  1  2  3   4   5   6   7   8   9
Y  9  8  10  12  11  13  14  16  15

Example:
Calculate Pearson's coefficient of correlation.

X  45  55  56  58  60  65  68  70  75  80  85
Y  56  50  48  60  62  64  65  70  74  82  90

9.9 Rank Correlation

Rank correlation is studied when no assumption about the parameters of the
population is made. The method is based on ranks. It is useful for studying
qualitative attributes like honesty, colour, beauty, intelligence, character,
morality, etc. The individuals in the group are arranged in order, thereby
obtaining for each individual a number showing his or her rank in the group.
This method was developed by Charles Edward Spearman in 1904. It is defined as

ρ = 1 − 6ΣD² / (N³ − N)

where ρ (rho) = rank correlation coefficient;
ΣD² = sum of squares of the differences between the pairs of ranks; and
N = number of pairs of observations.
The value of ρ lies between −1 and +1. If ρ = +1, there is complete agreement
in the order of ranks and the direction of the ranks is the same. If ρ = −1,
there is complete disagreement in the order of ranks and they are in opposite
directions.
Computation for tied observations: There may be two or more items having equal
values. In such a case the same rank is to be given to each; the ranking is
said to be tied. In such circumstances an average rank is given to each of the
tied items. For example, if a value is repeated twice at the 5th rank, the
common rank assigned to each item is (5 + 6)/2 = 5.5, the average of the ranks
5 and 6 that the two items would otherwise occupy.
If the ranks are tied, a correction factor (1/12)(m³ − m) is applied for each
group of tied values. The formula then becomes

ρ = 1 − 6[ΣD² + (1/12)(m³ − m) + (1/12)(m³ − m) + …] / (N³ − N)

where m is the number of items whose ranks are common; the correction term is
repeated for every group of tied observations.
Example 2:
In a marketing survey the prices of tea and coffee in a town, based on
quality, were found to be as shown below. Is there any relation between the
prices of tea and coffee?

Price of tea     88   90   95   70   60   75   50
Price of coffee  120  134  150  115  110  140  100

Price of tea  Rank  Price of coffee  Rank  D   D²
88            3     120              4     1   1
90            2     134              3     1   1
95            1     150              1     0   0
70            5     115              5     0   0
60            6     110              6     0   0
75            4     140              2     2   4
50            7     100              7     0   0
                                         ΣD² = 6

ρ = 1 − 6ΣD²/(N³ − N) = 1 − (6 × 6)/(7³ − 7) = 1 − 36/336 = 1 − 0.1071 = 0.8929

The relation between the prices of tea and coffee is positive at 0.89. Based
on quality, the association between the price of tea and the price of coffee
is highly positive.

Example 3:
In an evaluation of answer scripts the following marks were awarded by two
examiners.

1st examiner  88  95  70  60  50  75  80  85
2nd examiner  84  90  88  55  48  82  85  75

Do you agree that the evaluation by the two examiners is fair?

Solution:

x    R1  y    R2  D  D²
88   2   84   4   2  4
95   1   90   1   0  0
70   6   88   2   4  16
60   7   55   7   0  0
50   8   48   8   0  0
75   5   82   5   0  0
80   4   85   3   1  1
85   3   75   6   3  9
                ΣD² = 30

ρ = 1 − 6ΣD²/(N³ − N) = 1 − (6 × 30)/(8³ − 8) = 1 − 180/504 = 1 − 0.357 = 0.643

ρ = 0.643 shows fairness in awarding marks, in the sense that there is
reasonable uniformity between the two examiners in evaluating the answer
scripts.
Example 4:
Rank correlation for tied observations. The following are the marks obtained
by 10 students in a class in two tests.

Student  A   B   C   D   E   F   G   H   I   J
Test 1   70  68  67  55  60  60  75  63  60  72
Test 2   65  65  80  60  68  58  75  62  60  70

Calculate the rank correlation coefficient between the marks of the two tests.

Solution:

Student  Test 1  R1  Test 2  R2    D     D²
A        70      3   65      5.5   −2.5  6.25
B        68      4   65      5.5   −1.5  2.25
C        67      5   80      1.0   4.0   16.00
D        55      10  60      8.5   1.5   2.25
E        60      8   68      4.0   4.0   16.00
F        60      8   58      10.0  −2.0  4.00
G        75      1   75      2.0   −1.0  1.00
H        63      6   62      7.0   −1.0  1.00
I        60      8   60      8.5   −0.5  0.25
J        72      2   70      3.0   −1.0  1.00
                                   ΣD² = 50.00

In Test 1 the mark 60 is repeated three times; in Test 2 the marks 60 and 65
are each repeated twice. Hence m = 3, m = 2 and m = 2.

ρ = 1 − 6[ΣD² + (1/12)(m³ − m) + (1/12)(m³ − m) + (1/12)(m³ − m)] / (N³ − N)
  = 1 − 6[50 + (1/12)(3³ − 3) + (1/12)(2³ − 2) + (1/12)(2³ − 2)] / (10³ − 10)
  = 1 − 6[50 + 2 + 0.5 + 0.5]/990 = 1 − (6 × 53)/990 = 0.68

Interpretation: There is uniformity in the performance of the students in the
two tests.
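
Example 4 can be checked with scipy.stats.spearmanr; with ties scipy assigns
average ranks and computes Pearson's r on the ranks, so the result can differ
slightly from the hand formula above:

# Spearman rank correlation for the two tests of Example 4.
from scipy.stats import spearmanr

test1 = [70, 68, 67, 55, 60, 60, 75, 63, 60, 72]
test2 = [65, 65, 80, 60, 68, 58, 75, 62, 60, 70]

rho, p = spearmanr(test1, test2)
print(f"rho = {rho:.3f}, p = {p:.4f}")   # rho ≈ 0.69, close to the 0.68 above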

9.10 Regression
Regression is the functional relationship between two variables, of which one
may represent cause and the other effect. The variable representing the cause
is known as the independent variable and is denoted by X; it is also known as
the predictor or regressor variable. The variable representing the effect is
known as the dependent variable and is denoted by Y; it is also known as the
predicted variable. The relationship between the dependent and the independent
variable may be expressed as a function, and such a functional relationship is
termed regression. When there are only two variables the functional
relationship is known as simple regression, and if the relation between the
two variables is a straight line it is known as simple linear regression. When
there are more than two variables and one of the variables is dependent upon
the others, the functional relationship is known as multiple regression. The
regression line is of the form y = a + bx, where a is a constant (intercept)
and b is the regression coefficient (slope). The values of 'a' and 'b' can be
calculated by the method of least squares, using the formulas:
The regression equation of y on x is y = a + bx.
The regression coefficient of y on x is

b = [Σxy − (Σx Σy)/n] / [Σx² − (Σx)²/n]

and a = ȳ − b x̄.
The regression line indicates the average value of the dependent variable Y
associated with a particular value of the independent variable X.
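
A minimal sketch of these least-squares formulas in Python, on hypothetical
data, is:

# Fitting the regression line y = a + b x by the formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b = (np.sum(x * y) - np.sum(x) * np.sum(y) / n) / (np.sum(x**2) - np.sum(x)**2 / n)
a = y.mean() - b * x.mean()                  # a = ybar - b * xbar
print(f"y = {a:.3f} + {b:.3f} x")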
9.11 Assumptions
1. The x's are non-random or fixed constants.
2. At each fixed value of X the corresponding values of Y have a normal
   distribution about a mean.
3. For any given x, the variance of Y is the same.
4. The values of y observed at different levels of x are completely
   independent.
9.12 Properties of Regression Coefficients
1. The correlation coefficient is the geometric mean of the two regression
   coefficients.
2. Regression coefficients are independent of change of origin but not of
   scale.
3. If one regression coefficient is greater than unity, then the other must be
   less than unity, but not vice versa; i.e., both regression coefficients can
   be less than unity but both cannot be greater than unity: if b1 > 1 then
   b2 < 1, and if b2 > 1 then b1 < 1.
4. If one regression coefficient is positive the other must be positive (in
   this case the correlation coefficient is the positive square root of the
   product of the two regression coefficients), and if one regression
   coefficient is negative the other must be negative (in this case the
   correlation coefficient is the negative square root of their product);
   i.e., if b1 > 0 then b2 > 0, and if b1 < 0 then b2 < 0.
5. If θ is the angle between the two regression lines, then

tan θ = (1 − r²) σx σy / [r (σx² + σy²)]

Lesson 10
ANOVA and CRD

Course Name: Basic Statistics
Lesson 10: ANOVA and CRD
Content Creator Name: Dr. Vinay Kumar
University/College Name: Chaudhary Charan Singh Haryana Agricultural University, Hisar
Course Reviewer Name: Dr Dhaneshkumar V Patel
University/College Name: Junagadh Agricultural University, Junagadh

Lesson-10
Objectives of the lesson:
1. Assumption of the ANOVA
2. One way Classification of ANOVA
3. Two way Classification of ANOVA
Glossary of the lesson: Test-Statistic, Variance, ANOVA, Source of variation
etc.
10.1 Introduction:
In hypothesis testing, we test the significance of difference between two
sample means. For this, one test statistic employed was the t-test where
we assumed that the two populations from which the samples were drawn
had the same variance. But in real life, there may be situations when
instead of comparing two sample means, a researcher has to compare
three or more than three sample means (specifically, more than two). A
researcher may have to test whether the three or more sample means
computed from the three populations are equal. In other words, the null
hypothesis can be that three or more population means are equal as
against the alternative hypothesis that these population means are not
equal. For example, suppose that a researcher wants to measure work
attitude of the employees in four organizations. The researcher has
prepared a questionnaire consisting of 10 questions for measuring the
work attitude of employees. A five-point rating scale is used with 1 being
the lowest score and 5 being the highest score. So, an employee can score
10 as the minimum score and 50 as the maximum score. The null
hypothesis can be set as all the means are equal (i.e., there is no difference
in the degree of work attitude of the employees) as against the alternative
hypothesis that at least one of the means is different from the others
(there is a significant difference in the degree of work attitude of the
employees). In this situation, the analysis of variance technique is used. "In
its simplest form analysis of variance may be regarded as an extension or
development of the t-test." The analysis of variance technique makes use of
the F-distribution (F-statistic). Some more examples are presented below


where we are required to test the equality of means of three or more
populations. For example, whether:
(1) The average life of light bulbs being produced in three different
plants is the same.
(2) All the three varieties of fertilizers have the same impact on the yield
of rice.
(3) The level of satisfaction among the participants in all IIMs is the
same.
(4) The impact of training on salesmen trained in three institutes is the
same.
(5) The service time of a transaction is the same on four different
counters in a service unit.
(6) The average price of different commodities in four different retail
outlets is the same.
(7) Performance of salesmen in four zones is the same.
The term 'analysis of variance' is used since the technique involves first
finding out the total variation among the observations in the collected data,
then assigning causes or components of variation to various factors, and
finally drawing conclusions about the equality of means. Thus analysis of
variance, or ANOVA, can be defined as a technique of testing hypotheses about
the significant difference in several population means. This statistical
technique was developed by R. A. Fisher. The main purpose of analysis of
variance is to detect differences among various population means based on the
information gathered from the samples (sample means) of the respective
populations.


10.2 Assumptions of Analysis of Variance


Analysis of variance is based on some assumptions.
(i) Each sample is a simple random sample.
(ii) Populations from which the samples are selected are normally
distributed. However, in case of large samples this assumption is
not required.
(iii) Samples are independent.
(iv) The population variances are identical, i.e., σ1² = σ2² = ⋯ = σk².
10.3 Computation of Test Statistic:
The beauty of the technique of analysis of variance is that it performs the
test of equality of more than two population means by actually analyzing
the variance. In simple terms, ANOVA decomposes the total variation into
two components of variation, namely, variation between the samples
known as the mean square between samples and variation within the
samples known as the mean square within samples. The variance ratio
denoted by F is given by:
F = (Mean square between samples or groups) / (Mean square within samples or groups)
If the calculated value of F is greater than the critical value of F, we
reject the null hypothesis. If the calculated value of F is less than the
critical value of F, we retain (accept) the null hypothesis.
10.3.1 Analysis of variance table (ANOVA):
The table showing the sources of variation, the sum of squares, degrees of
freedom, mean squares and the formula for the F statistic is known as
ANOVA table.


10.4 Classification of Analysis of Variance


ANOVA is mainly carried on under the following two classifications:
(i) One-way classification
(ii) Two- way classification
Variance and its different components may be obtained in each of the two
types of classification by:
(a) Direct Method, (b) Short-cut Method

10.4.1 One-Way Classification


Many business applications involve experiments in which different
populations (or groups) are classified with respect to only one attribute of
interest such as (i) percentage of marks secured by students in a course,
(ii) flavor preference of ice-cream by customers, (iii) yield of crop due to
varieties of seeds, and so on. In all such cases observations in the sample
data are classified into several groups based on a single attribute (i.e.,
criterion) and are termed one-way classification of sample data.
Under this one-way classification we set up the null hypothesis H0: μ1 = μ2 =
⋯ = μk, where μ1, μ2, …, μk are the arithmetic means of the populations from
which the k samples are drawn.
The alternative hypothesis is H1: not all the μi are equal, i.e., at least one
population mean differs from the others.
After formulating the null hypothesis and the alternative hypothesis one needs
to calculate the following by using any one of the above-mentioned methods:
(i) Variance between the samples.
(ii) Variance within the samples.
(iii) Total variance, by summing (i) and (ii), which may also be calculated
directly for verification of the calculations.
(iv) F-ratio (or F-statistic).
Thus in this process the total variance can be divided into two additive and
independent parts:

Total variance = Variance between samples + Variance within samples

After calculating the test statistic F, it should be compared with the
critical value of F at a specified level of significance α for (k − 1, N − k)
degrees of freedom, and on the basis of this comparison the decision to accept
or reject the null hypothesis is taken.
(a) Direct Method
I. Calculation of variance between samples
It is the sum of squares of the deviations of the means of various samples
from the grand mean. The procedure of calculating the variance between
the samples is as shown below:

Observation  Sample 1  Sample 2  …  Sample j  …  Sample k
1            x11       x12       …  x1j       …  x1k
2            x21       x22       …  x2j       …  x2k
…            …         …            …            …
i            xi1       xi2       …  xij       …  xik
…            …         …            …            …
n            xn1       xn2       …  xnj       …  xnk
Total        T1        T2        …  Tj        …  Tk
A.M.         x̄1        x̄2        …  x̄j        …  x̄k

Tj = Σᵢ xij, T = Σⱼ Tj, x̄j = Tj/nj and x̿ = T/N


The steps in calculating the variance between samples are:
(i)   First calculate the mean of each sample, i.e., x̄1, x̄2, …, x̄k of all
      the k samples.
(ii)  Next calculate the grand mean by using the formula

      x̿ = T / N

      where T = grand total of all observations and N = total number of
      observations in all k samples. (When all samples are of equal size, x̿
      is also the simple average of the sample means.)
(iii) Calculate the difference between the mean of each sample and the grand
      mean, i.e., x̄1 − x̿, x̄2 − x̿, …, x̄k − x̿.
(iv)  Square the deviations obtained in step (iii), multiply each by the
      number of items in the corresponding sample, and then add. This total
      gives the sum of squares of the deviations between the samples (or
      between the columns), denoted by SSB or SSC:

      SSB = Σⱼ₌₁ᵏ nⱼ (x̄ⱼ − x̿)²

      where k is the number of groups or samples being compared, nj the
      number of observations in group j, x̄j the sample mean of group j, and
      x̿ the grand mean.
(v)   Finally, divide the total obtained in step (iv) by the degrees of
      freedom, which are one less than the number of samples: if there are k
      samples, ν = k − 1. The result is called the mean square, denoted by
      MSB or MSC; it measures the variation attributable to differences
      between the sample means.

      MSB = SSB / (k − 1)

II. Calculation of variance within the samples:
This is usually referred to as the sum of squares within samples. The variance
within samples measures the differences within the samples due to chance. It
is usually denoted by SSW or SSE. The steps involved are:
(i)   Calculate the mean values x̄1, x̄2, …, x̄k of all k samples.
(ii)  Calculate the deviations of the various observations of the k samples
      from the mean values of the respective samples.
(iii) Square all the deviations obtained in (ii) and find the total of these
      squared deviations. This total gives the sum of squares of deviations
      within the samples, or the sum of squares due to error:

      SSW or SSE = Σᵢ Σⱼ (xij − x̄j)²

      where xij is the ith observation in the jth group, x̄j the sample mean
      of the jth group, k the number of groups being compared, and N the
      total number of observations in all the groups.
(iv)  As the last step, divide the total squared deviations obtained in step
      (iii) by the degrees of freedom to obtain the mean square. The degrees
      of freedom are the difference between the total number of observations
      and the number of samples: with N observations and k samples,
      ν = N − k.

      MSW = SSE / (N − k)
III. Calculation of the total sum of squares: The total variation is equal to
the sum of the squared differences between each observation (sample value) and
the grand mean x̿, and is referred to as SST. It satisfies

SST = SSB + SSE
IV. Calculation of the test statistic F:

When the null hypothesis is true, both mean squares MSB (MSC) and MSW (MSE)
are independent unbiased estimates of the same population variance σ². Hence
the test statistic is

F = MSB/MSW or F = MSC/MSE

which follows the F-distribution with (k − 1, N − k) degrees of freedom.
F is the ratio of the greater variance to the smaller variance. Generally the
variance between the samples (MSB or MSC) is greater than the variance within
the samples (MSW or MSE). But if MSW > MSB then the reverse ratio is used,
i.e.,

F = MSW/MSB or F = MSE/MSC

V. Conclusion:
Compare the calculated value of F with the critical (tabulated) value of F for
(k − 1, N − k) degrees of freedom at the specified level of significance,
usually the 5% or 1% level. If the calculated value is greater than the
tabulated value, the null hypothesis H0 is rejected and we conclude that the
population means are not all equal. Otherwise, the null hypothesis is
accepted.
For analyzing variance in case of one way classification the following table
known as the Analysis of Variance Table (or ANOVA Table) is constructed.

ANOVA Table

Sources of       Sum of        Degrees of    Mean squares       Test
variation        squares (SS)  freedom (df)  (MS)               statistic
Between samples  SSB           k − 1         MSB = SSB/(k − 1)  F = MSB/MSW
Within samples   SSW           N − k         MSW = SSW/(N − k)
Total            SST           N − 1

(b) Short-cut Method: The calculation of the F-statistic (variance ratio) by
the direct method is very time consuming. In practice a short-cut method based
on the sums of squares of the individual values (observations) is usually
used. The computational work is much reduced in this method, and the method is
more convenient when some or all of the sample means and the grand mean are
fractional. The steps involved in the calculation of the variance ratio are:
(i)   Calculate the grand total of all observations in the samples:

      T = Σx1 + Σx2 + ⋯ + Σxk

(ii)  Calculate the correction factor CF = T²/N, where N = total number of
      observations in all samples.
(iii) Find the sum of the squares of all observations in the k samples and
      subtract CF from this sum to obtain the total sum of squares of
      deviations SST:

      SST = (Σx1² + Σx2² + ⋯ + Σxk²) − CF

(iv)  Obtain the sum of squares between samples as

      SSB = Σⱼ (Σxⱼ)²/nⱼ − CF, where nj is the size of the jth sample,

      and SSE = SST − SSB.
(v)   MSB, MSW and F are obtained as in the direct method, and the decision
      either to accept or to reject the null hypothesis is taken exactly as
      in the direct method.
Example 1: The following data give the yield on 12 plots of land in three
samples of 4 plots each, under three varieties of fertilizers A, B and C.

A   B   C
25  20  24
22  17  26
24  16  30
21  19  20

Test whether there is any significant difference in the average yields of land
under the three varieties of fertilizers.
Solution: First we set up the null and alternative hypotheses.
Null hypothesis H0: There is no significant difference in the average yields
under the three varieties.
Alternative hypothesis H1: There is a significant difference in the average
yields under the three varieties.
We calculate the sample means, the variance between the samples and the
variance within the samples by using the direct method.

       Sample I  Sample II  Sample III
       X1        X2         X3
       25        20         24
       22        17         26
       24        16         30
       21        19         20
Total  92        72         100

x̄1 = ΣX1/n1 = 92/4 = 23, x̄2 = ΣX2/n2 = 72/4 = 18, x̄3 = ΣX3/n3 = 100/4 = 25

Grand mean x̿ = (x̄1 + x̄2 + x̄3)/3 = (23 + 18 + 25)/3 = 22

SSB = n1(x̄1 − x̿)² + n2(x̄2 − x̿)² + n3(x̄3 − x̿)²
    = 4(23 − 22)² + 4(18 − 22)² + 4(25 − 22)²
    = 4 + 64 + 36 = 104

Degrees of freedom, ν1 = k − 1 = 2
Now MSB = mean square between the samples = SSB/ν1 = 104/2 = 52
Calculation for SSW

Sample I            Sample II           Sample III
X1   (X1 − x̄1)²    X2   (X2 − x̄2)²    X3   (X3 − x̄3)²
25   4              20   4              24   1
22   1              17   1              26   1
24   1              16   4              30   25
21   4              19   1              20   25
     10                  10                  52

SSW = sum of squares within the samples
    = Σ(X1 − x̄1)² + Σ(X2 − x̄2)² + Σ(X3 − x̄3)²
    = 10 + 10 + 52 = 72
The degrees of freedom, ν2 = N − k = 12 − 3 = 9
MSW = 72/9 = 8
Now we prepare the analysis of variance table.

Analysis of Variance Table (ANOVA)
Sources of       Sum of        Degrees of    Mean          Test statistic
variation        squares (SS)  freedom (df)  squares (MS)
Between samples  104           2             52            F = 52/8 = 6.5
Within samples   72            9             8
Total            176           11

The critical value of F for ν1 = 2 and ν2 = 9 at the 5% level, i.e.,
F0.05(2,9), is 4.26.
Since the calculated value of F (6.5) is greater than the critical value
F0.05(2,9), we reject the null hypothesis at the 5% level and conclude that
the difference in the average yields under the three varieties is significant.
Solution of the above problem by the short-cut method:
After formulating the null and alternative hypotheses as in the direct method,
one can proceed with the short-cut method as follows:

Calculation of T, SST, SSB, SSW

Sample I      Sample II     Sample III
X1    X1²     X2    X2²     X3    X3²
25    625     20    400     24    576
22    484     17    289     26    676
24    576     16    256     30    900
21    441     19    361     20    400
92    2126    72    1306    100   2552

T = sum of all the values in the three samples
  = ΣX1 + ΣX2 + ΣX3 = 92 + 72 + 100 = 264

Correction factor = T²/N = (264)²/12 = 5808

SST = total sum of squares
    = (ΣX1² + ΣX2² + ΣX3²) − T²/N
    = (2126 + 1306 + 2552) − 5808 = 176

SSB = sum of squares between the samples
    = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 − T²/N
    = 2116 + 1296 + 2500 − 5808 = 104
Degrees of freedom = k − 1 = 2

SSW = sum of squares within the samples
    = SST − SSB = 176 − 104 = 72
Degrees of freedom = N − k = 12 − 3 = 9
Now, proceeding with the ANOVA table as in the direct method, one arrives at
the same conclusion.
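
The whole of Example 1 can be reproduced in one call with
scipy.stats.f_oneway, which carries out a one-way ANOVA directly on the
samples:

# One-way ANOVA for the fertilizer data of Example 1.
from scipy.stats import f_oneway

A = [25, 22, 24, 21]
B = [20, 17, 16, 19]
C = [24, 26, 30, 20]

F, p = f_oneway(A, B, C)
print(f"F = {F:.2f}, p = {p:.4f}")   # F = 6.5; p < 0.05, so H0 is rejected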

10.5 Design of Experiments


Choice of treatments, the method of assigning treatments to experimental units
and the arrangement of experimental units in different patterns are known as
designing an experiment. We study the effect of changes in one variable on
another variable; for example, how the application of various doses of
fertilizer affects the grain yield. The variable whose change we wish to study
is known as the response variable. The variable whose effect on the response
variable we wish to study is known as a factor.
Treatment: The objects of comparison in an experiment are defined as
treatments. Examples are the varieties tried in a trial, or different
chemicals.
Experimental unit: The object to which treatments are applied, or the basic
object on which the experiment is conducted, is known as the experimental
unit. Example: a piece of land, an animal, etc.
Experimental error: The responses from experimental units receiving the same
treatment may not be the same even under similar conditions. These variations
in response may be due to various reasons: factors like heterogeneity of soil,
climatic factors and genetic differences, etc. (known as extraneous factors),
may also cause variation. The variation in response caused by extraneous
factors is known as experimental error.
Our aim in designing an experiment is to minimize the experimental error.

10.6 Basic principles


To reduce the experimental error we adopt certain principles known as
basic principles of experimental design. The basic principles are
1) Replication,
2) Randomization and
3) Local control
10.6.1 Replication
The repeated application of the treatments is known as replication. When a
treatment is applied only once we have no means of knowing about the variation
in the results of that treatment; only when we repeat it several times can we
estimate the experimental error.
With the help of the experimental error we can determine whether the obtained
differences between treatment means are real or not. When the number of
replications is increased, the experimental error is reduced.
10.6.2 Randomization
When all the treatments have equal chance of being allocated to different
experimental units it is known as randomization.
If our conclusions are to be valid, treatment means and differences among
treatment means should be estimated without any bias. For this purpose
we use the technique of randomization.
10.6.3 Local Control
Experimental error is based on the variations from experimental unit to
experimental unit. This suggests that if we group the homogenous
experimental units into blocks, the experimental error will be reduced
considerably. Grouping of homogenous experimental units into blocks is
known as local control of error.
In order to have valid estimate of experimental error the principles of
replication and randomization are used.
In order to reduce the experimental error, the principles of replication and
local control are used.
In general to have precise, valid and accurate result we adopt the basic
principles.
10.7 Completely Randomized Design (CRD)
CRD is the basic single factor design. In this design the treatments are
assigned completely at random so that each experimental unit has the
same chance of receiving any one treatment. But CRD is appropriate only
when the experimental material is homogeneous. As there is generally
large variation among experimental plots due to many factors CRD is not
preferred in field experiments.
In laboratory experiments and greenhouse studies it is easy to achieve
homogeneity of experimental materials and therefore CRD is most useful
in such experiments.


10.7.1 Layout of a CRD


Completely randomized Design is the one in which all the experimental
units are taken in a single group which are homogeneous as far as possible.
The randomization procedure for allotting the treatments to various units
will be as follows.
Step 1: Determine the total number of experimental units.
Step 2: Assign a plot number to each of the experimental units starting
from left to right for all rows.
Step 3: Assign the treatments to the experimental units by using random
numbers.
The statistical model for CRD with one observation per unit is

Yij = μ + ti + eij

where
μ = overall mean effect
ti = true effect of the ith treatment
eij = error term of the jth unit receiving the ith treatment
The arrangement of data in a CRD is as follows:

Treatments  T1    T2    …  Ti    …  Tk
            y11   y21      yi1      yk1
            y12   y22      yi2      yk2
            …     …        …        …
            y1r1  y2r2     yiri     ykrk
Total       Y1    Y2       Yi       Yk    GT

(GT – grand total)


The null hypothesis is
H0: μ1 = μ2 = ⋯ = μk, i.e., there is no significant difference between the
treatments,
and the alternative hypothesis is
H1: not all the μi are equal, i.e., there is a significant difference between
at least two of the treatments.
The different steps in forming the analysis of variance table for a CRD are:

C.F. = (GT)²/n, where n = total number of observations

Total SS = TSS = Σᵢ Σⱼ yij² − C.F.

Treatment SS = TrSS = Y1²/r1 + Y2²/r2 + ⋯ + Yk²/rk − C.F. = Σᵢ Yi²/ri − C.F.

Error SS = ESS = Σᵢ Σⱼ yij² − Σᵢ Yi²/ri = TSS − TrSS
Form the following ANOVA table and calculate the F value:

Source of   d.f.   SS     MS                    F
variation
Treatment   t − 1  TrSS   TrMS = TrSS/(t − 1)   TrMS/EMS
Error       n − t  ESS    EMS = ESS/(n − t)
Total       n − 1

6. Compare the calculated F with the critical value of F corresponding to the
treatment degrees of freedom and error degrees of freedom, so that acceptance
or rejection of the null hypothesis can be determined.
7. If the null hypothesis is rejected, that indicates there are significant
differences between the different treatments.
8. Calculate the C.D. value:

C.D. = SE(d) × t, where SE(d) = √(EMS (1/ri + 1/rj))


ri = number of replications for treatment i


rj = number of replications for treatment j and
t is the critical t value for error degrees of freedom at specified level of
significance, either 5% or 1%.

10.7.2 Advantages of a CRD


1. Its layout is very easy.
2. There is complete flexibility in this design i.e. any number of
treatments and replications for each treatment can be tried.
3. Whole experimental material can be utilized in this design.
4. This design yields maximum degrees of freedom for experimental
error.
5. The analysis of data is simplest as compared to any other design.
6. Even if some values are missing the analysis can be done.

10.7.3 Disadvantages of a CRD


1. It is difficult to find homogeneous experimental units in all respects
and hence CRD is seldom suitable for field experiments as compared
to other experimental designs.
2. It is less accurate than other designs.

Lesson 11
RBD and LSD

Course Name: Basic Statistics
Lesson 11: RBD and LSD
Content Creator Name: Dr. Vinay Kumar
University/College Name: Chaudhary Charan Singh Haryana Agricultural University, Hisar
Course Reviewer Name: Dr Dhaneshkumar V Patel
University/College Name: Junagadh Agricultural University, Junagadh

Lesson-11
Objectives of the lesson:
1. Layout of RBD
2. Analysis, Merits, Demerits of RBD
3. Layout of LSD
4. Analysis, Merits, Demerits of LSD
Glossary of the lesson: RBD, LSD, Variance, ANOVA, Source of variation etc.

11.1 Randomized Blocks Design (RBD)


When the experimental material is heterogeneous, it is grouped into
homogeneous sub-groups called blocks. As each block consists of the entire set
of treatments, a block is equivalent to a replication.
If the fertility gradient runs in one direction, say from north to south or
from east to west, then the blocks are formed across (at right angles to) the
gradient. Such an arrangement of grouping heterogeneous units into homogeneous
blocks is known as a randomized blocks design. Each block consists of as many
experimental units as the number of treatments. The treatments are allocated
randomly to the experimental units within each block independently, such that
each treatment occurs exactly once per block. The number of blocks is chosen
to be equal to the number of replications for the treatments.
The analysis of variance model for RBD is

Yij = μ + ti + rj + eij

where
μ = the overall mean
ti = the ith treatment effect
rj = the jth replication (block) effect
eij = the error term for the ith treatment and jth replication

11.2 Analysis of RBD

The results of an RBD can be arranged in a two-way table according to the
replications (blocks) and treatments. There will be r × t observations in
total, where r stands for the number of replications and t for the number of
treatments. The data are arranged in a two-way table by representing
treatments in rows and replications in columns.

Treatment  1    2    3    …  r    Total
1          y11  y12  y13  …  y1r  T1
2          y21  y22  y23  …  y2r  T2
3          y31  y32  y33  …  y3r  T3
…
t          yt1  yt2  yt3  …  ytr  Tt
Total      R1   R2   R3   …  Rr   G.T.

In this design the total variance is divided into three sources of variation:
between replications, between treatments, and error.

CF = (GT)²/n, where n = rt
Total SS = TSS = ΣΣ yij² − CF
Replication SS = RSS = ΣRj²/t − CF
Treatment SS = TrSS = ΣTi²/r − CF
Error SS = ESS = Total SS − Replication SS − Treatment SS
The skeleton ANOVA table for an RBD with t treatments and r replications:

Sources of variation  d.f.            SS    MS    F-value
Replication           r − 1           RSS   RMS   RMS/EMS
Treatment             t − 1           TrSS  TrMS  TrMS/EMS
Error                 (r − 1)(t − 1)  ESS   EMS
Total                 rt − 1          TSS

CD = SE(d) × t, where SE(d) = √(2EMS/r)

and t = critical value of t for a specified level of significance and the
error degrees of freedom.
Based on the CD value the various treatment means can be compared.
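
A minimal sketch of the RBD sums of squares in Python, on a hypothetical
layout of t = 3 treatments (rows) by r = 4 blocks (columns), is:

# RBD analysis: partitioning TSS into treatment, block and error components.
import numpy as np

y = np.array([[25, 22, 24, 21],
              [20, 17, 16, 19],
              [24, 26, 30, 20]], dtype=float)
t, r = y.shape

CF   = y.sum()**2 / (t * r)
TSS  = (y**2).sum() - CF
TrSS = (y.sum(axis=1)**2).sum() / r - CF   # treatment totals, each over r blocks
RSS  = (y.sum(axis=0)**2).sum() / t - CF   # block totals, each over t treatments
ESS  = TSS - TrSS - RSS

TrMS, RMS, EMS = TrSS/(t-1), RSS/(r-1), ESS/((r-1)*(t-1))
print(f"F(treatment) = {TrMS/EMS:.2f}, F(block) = {RMS/EMS:.2f}")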
11.2.1 Advantages of RBD
The precision is higher in RBD: the amount of information obtained is greater
than with CRD. RBD is more flexible, and its statistical analysis is simple
and easy. Even if some values are missing, the analysis can still be done by
using the missing plot technique.

11.2.2 Disadvantages of RBD


When the number of treatments is increased, the block size increases. If the
block size is large, maintaining homogeneity within blocks is difficult, and
hence this design may not be suitable when the number of treatments is large.
11.3 Latin Square Design
When the experimental material is divided into rows and columns and the
treatments are allocated such that each treatment occurs exactly once in each
row and once in each column, the design is known as a Latin Square Design
(LSD).
In LSD the treatments are usually denoted by A, B, C, D, etc.
For a 5 × 5 LSD the arrangements may be:

Square 1     Square 2     Square 3
A B C D E    A B C D E    A B C D E
B A E C D    B A D E C    B C D E A
C D A E B    C E A B D    C D E A B
D E B A C    D C E A B    D E A B C
E C D B A    E D B C A    E A B C D

11.3.1 Statistical Analysis:

The ANOVA model for LSD is

Yijk = μ + ri + cj + tk + eijk

where
ri is the ith row effect,
cj is the jth column effect,
tk is the kth treatment effect, and
eijk is the error term.

The analysis of variance table for LSD is as follows:

Sources of   d.f.            SS    MS    F
variation
Rows         t − 1           RSS   RMS   RMS/EMS
Columns      t − 1           CSS   CMS   CMS/EMS
Treatments   t − 1           TrSS  TrMS  TrMS/EMS
Error        (t − 1)(t − 2)  ESS   EMS
Total        t² − 1          TSS

F table value: F[(t−1), (t−1)(t−2)] degrees of freedom at the 5% or 1% level
of significance.

The steps to calculate the above sums of squares are as follows:

Correction factor (CF) = (GT)²/t²

Total sum of squares (TSS) = Σ(yijk)² − CF

Row sum of squares (RSS) = (1/t) Σᵢ Ri² − CF

Column sum of squares (CSS) = (1/t) Σⱼ Cj² − CF

Treatment sum of squares (TrSS) = (1/t) Σₖ Tk² − CF

Error sum of squares (ESS) = TSS − RSS − CSS − TrSS


These results can be summarized in the form of the analysis of variance table.
Calculation of SE, SE(d) and CD values:

SE = √(EMS/r), where r is the number of rows

SE(d) = √2 × SE

CD = SE(d) × t

where t = table value of t for a specified level of significance and the
error degrees of freedom.
Using the CD value a bar chart can be drawn and the conclusion written.
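
A minimal sketch of the LSD sums of squares in Python, on a hypothetical
4 × 4 Latin square, is:

# LSD analysis: row, column and treatment sums of squares for a 4 x 4 square.
import numpy as np

y = np.array([[10, 14, 12, 11],            # observations, row x column layout
              [13, 11, 10, 12],
              [12, 10, 14, 13],
              [11, 12, 13, 10]], dtype=float)
trt = np.array([["A", "B", "C", "D"],      # treatment letter of each cell
                ["B", "A", "D", "C"],
                ["C", "D", "A", "B"],
                ["D", "C", "B", "A"]])
t = y.shape[0]

CF   = y.sum()**2 / t**2
TSS  = (y**2).sum() - CF
RSS  = (y.sum(axis=1)**2).sum() / t - CF
CSS  = (y.sum(axis=0)**2).sum() / t - CF
TrSS = sum(y[trt == k].sum()**2 for k in "ABCD") / t - CF
ESS  = TSS - RSS - CSS - TrSS

EMS = ESS / ((t - 1) * (t - 2))
print(f"F(treatment) = {(TrSS / (t - 1)) / EMS:.2f}")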
11.3.2 Advantages
• LSD is more efficient than RBD or CRD. This is because the double grouping
  (rows and columns) results in a smaller experimental error.
• When missing values are present, the missing plot technique can be used and
  the data analysed.

11.3.3 Disadvantages
• This design is not as flexible as RBD or CRD, as the number of treatments is
  limited to the number of rows and columns. LSD is seldom used when the
  number of treatments is more than 12, and it is not suitable for fewer than
  five treatments.

Because of these limitations on the number of treatments, LSD is not widely
used in agricultural experiments.
Note: The number of sources of variation is two for CRD, three for RBD and
four for LSD.
