Lesson 1
Introduction Of Statistics, Scope And Types Of
Data
iii) Statistical laws do not deal with individual observations but with a group of observations.
iv) Statistical conclusions are true on an average.
vi) Statistical results may lead to fallacious conclusions if quoted out of context or manipulated.
1.5 Concepts, Definitions, Frequency Distributions & Frequency Curves
1.5.1 Raw Data: Data collected by an investigator which have not been organized numerically and have not been used by anybody else.
1.5.2 Array: An arrangement of raw numerical data in ascending or
descending order of magnitude. The data can also be classified into
Primary data and Secondary data.
1.5.3 Primary data: The data collected directly from the original
source is called the primary data i.e. the data collected for the first
time. The primary data may be collected by:
1. Direct interview method
2. Through mail
3. Through designed experiments
iv) Ratio Scale: The ratio scale has a true zero point, so ratios of values are meaningful.
families:
2, 0, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3

Number of children   0   1   2   3   4
Frequency            3   4   6   4   3
Class     Frequency
0-10      2
10-20     4
20-30     5
30-40     3
40-50     1

Class     Frequency
0-9       2
10-19     4
20-29     5
30-39     3
40-49     1
Solution:
Number of observations (N) = 30
Number of classes (k) = 1 + 3.322 log 30 = 5.9 ≈ 6 (approx.)
Inclusive Method:
Class      Tally       Frequency
12 - 17    I           1
18 - 23    IIIII II    7
24 - 29    IIIII       5
30 - 35    II          2
36 - 41    IIII        4
42 - 47    IIIII II    7
Exclusive method:
Class Marks    Tally       Frequency
5.5 – 11.5     IIII        4
11.5 – 17.5    I           1
17.5 – 23.5    IIIII II    7
23.5 – 29.5    IIIII       5
29.5 – 35.5    II          2
35.5 – 41.5    IIII        4
41.5 – 47.5    IIIII II    7
Lesson 2
Measures of Central Tendency
Content
Solution:
Mean = (2 + 4 + 6 + 8 + 10)/5 = 30/5 = 6
Direct method: If the observations x1, x2, …, xn have frequencies f1, f2, …, fn respectively, then the mean is given by:
Mean (X̄) = (f1x1 + f2x2 + … + fnxn)/(f1 + f2 + … + fn) = Σfixi/Σfi
This method of finding the mean is called the direct method.
Example 2: Given the following frequency distribution, calculate the
arithmetic mean
Marks (x) 50 55 60 65 70 75
No of Students (f) 2 5 4 4 5 5
Solution:
Marks (x)             50    55    60    65    70    75    Total
No. of Students (f)    2     5     4     4     5     5    25

Mean (X̄) = (f1x1 + f2x2 + … + fnxn)/(f1 + f2 + … + fn) = Σfixi/Σfi = 1600/25 = 64
(ii) Short cut method: In some problems, where the number of observations is large or the values of xi or fi are large, the calculations become tedious. To overcome this difficulty, we use the short cut or deviation method, in which an approximate mean, called the assumed mean A, is taken, preferably near the middle of the data, and the deviation di = xi − A is calculated for each observation. Then the mean is given by the formula:
Mean (X̄) = A + Σfidi/Σfi
Example 3: Given the following frequency distribution, calculate the arithmetic mean
Marks (x) 50 55 60 65 70 75
No of Students (f) 2 5 4 4 5 5
Solution: (A = 60)
x       f     fx      d = x − A    fd
50      2     100     -10          -20
55      5     275      -5          -25
60      4     240       0            0
65      4     260      +5           20
70      5     350     +10           50
75      5     375     +15           75
Total   25    1600                  100
By Direct method:
Mean (X̄) = (f1x1 + f2x2 + … + fnxn)/(f1 + f2 + … + fn) = Σfixi/Σfi = 1600/25 = 64
By Short-cut method:
Mean (X̄) = A + Σfidi/N = 60 + 100/25 = 60 + 4 = 64
Then
X̄ = Σfixi/Σfi   or   X̄ = A + Σfidi/Σfi,   where di = xi − A
Income – SR (100)    0–10   10–20   20–30   30–40   40–50   50–60   60–70
Number of persons      6      8      10      12       7       4       3
Solution: (A = 35, h = 10)
Income    Persons (f)   Mid-value (x)   d = (x − A)/h    fd
0-10           6              5              -3          -18
10-20          8             15              -2          -16
20-30         10             25              -1          -10
30-40         12          A = 35              0            0
40-50          7             45               1            7
50-60          4             55               2            8
60-70          3             65               3            9
Total         50                                          -20

X̄ = A + (Σfidi/Σfi) × h,   where di = (xi − A)/h
X̄ = 35 + (−20/50) × 10 = 35 − 4 = 31
Merits:
1. It is rigidly defined.
2. It is easy to understand and easy to calculate.
3. If the number of items is sufficiently large, it is more accurate
and more reliable.
4. It is a calculated value and is not based on its position in the series.
5. It is possible to calculate even if some of the details of the data are
lacking.
6. Of all averages, it is affected least by fluctuations of sampling.
7. It provides a good basis for comparison.
Demerits:
1. It cannot be obtained by inspection nor located through a frequency
graph.
2. It cannot be used in the study of qualitative phenomena which are not capable of
numerical measurement, e.g., intelligence, beauty, honesty, etc.
3. It can ignore any single item only at the risk of losing its accuracy.
4. It is affected very much by extreme values.
5. It cannot be calculated for open-end classes.
6. It may lead to fallacious conclusions, if the details of the data from
which it is computed are not given.
HM = n / Σ(i=1 to n) fi(1/xi)
Example 5: From the given data calculate H. M. 5, 10, 17, 24, and 30.
Solution:
x        1/x
5        0.2000
10       0.1000
17       0.0588
24       0.0417
30       0.0333
Total    0.4338

Hence,
HM = n / Σ(1/xi) = 5/0.4338 = 11.52
Example 6: The marks secured by some students of a class are given below.
Calculate the harmonic mean.
Marks 20 21 22 23 24 25
Number of Students 4 2 7 1 3 1
Solution:
Marks (x)   No. of students (f)   1/x       f(1/x)
20                 4              0.0500    0.2000
21                 2              0.0476    0.0952
22                 7              0.0454    0.3178
23                 1              0.0435    0.0435
24                 3              0.0417    0.1251
25                 1              0.0400    0.0400
Total             18                        0.8216

Hence,
HM = N / Σ fi(1/xi) = 18/0.8216 = 21.91
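As a quick check of Examples 5 and 6, here is a small illustrative Python sketch of the harmonic mean (the helper name is an assumption, not from the text):

```python
# Harmonic mean: HM = N / sum(f_i / x_i); with all f_i = 1 this is the simple HM.
def harmonic_mean(values, freqs=None):
    freqs = freqs or [1] * len(values)
    return sum(freqs) / sum(f / x for f, x in zip(freqs, values))

print(round(harmonic_mean([5, 10, 17, 24, 30]), 2))               # about 11.52-11.53
print(round(harmonic_mean([20, 21, 22, 23, 24, 25],
                          [4, 2, 7, 1, 3, 1]), 2))                # about 21.9
```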
3.5 Geometric Mean (G.M.):
The geometric mean of a series containing n observations is the nth root
of the product of the values. If 𝑥1 , 𝑥2 , … , 𝑥𝑛 are observations then
G.M. = n-th root of (x1 · x2 · … · xn) = (x1 · x2 · … · xn)^(1/n)
Log G.M. = (1/n)(log x1 + log x2 + … + log xn) = Σ log xi / n
G.M. = Antilog (Σ log xi / n)
Example 7: Calculate the geometric mean (G.M.) of the following series of
monthly income of a batch of families 180, 250, 490, 1400, 1050.
Solution:
x        Log x
180      2.2553
250      2.3979
490      2.6902
1400     3.1461
1050     3.0212
Total    13.5107

G.M. = Antilog (Σ log xi / n) = Antilog (13.5107/5) = Antilog 2.7021 = 503.6
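The logarithmic route can be reproduced in Python; a minimal sketch (names are illustrative) is:

```python
# Geometric mean via logarithms: GM = antilog( sum(log10 x_i) / n )
import math

incomes = [180, 250, 490, 1400, 1050]
log_sum = sum(math.log10(x) for x in incomes)
gm = 10 ** (log_sum / len(incomes))
print(round(gm, 1))   # about 503.6-503.7, matching Example 7
```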
Example 8: Calculate the average income per head from the data given below .Use
geometric mean.
Class of Number Monthly
people of income
families per head
(SR)
Landlords 2 5000
Cultivators 100 400
Landless – labours 50 200
Money – lenders 4 3750
Office Assistants 6 3000
Shop keepers 8 750
Carpenters 6 600
Weavers 10 300
Solution:
G.M. = Antilog (Σ fi log xi / N) = Antilog (482.257/186) = Antilog 2.5928 = 391.50
Combined Mean:
If the arithmetic averages and the number of items in two or more related
groups are known, the combined or the composite mean of the entire
group can be obtained by
Combined Mean, X̄ = (n1x̄1 + n2x̄2 + … + nnx̄n)/(n1 + n2 + … + nn)
Example 9: Find the combined mean for the data given below:
The middle value is the 5th item i.e., 25 is the median value.
When even numbers of values are given:-
Example 11: Find median for the following data
5, 8, 12, 30, 18, 10, 2, 22
Solution:
Arranging the data in the increasing order 2, 5, 8, 10, 12, 18, 22, 30
Here the median is the mean of the two middle items, i.e., the mean of (10, 12) = (10 + 12)/2 = 11
Example 12: The following table represents the marks obtained by a batch
of 10 students in certain class tests in statistics and Accountancy.
Serial No 1 2 3 4 5 6 7 8 9 10
Marks (Statistics) 53 55 52 32 30 60 47 46 35 28
Marks (Accountancy) 57 45 24 31 25 84 43 80 32 72
Solution: For such a question, the median is the most suitable measure of central tendency. The marks in
the two subjects are first arranged in increasing order as follows:
Serial No 1 2 3 4 5 6 7 8 9 10
Marks in Statistics 28 30 32 35 46 47 52 53 55 60
Marks in Accountancy 24 25 31 32 43 45 57 72 80 84
Median value for Statistics = Mean of 5th and 6th items = (46 + 47)/2 = 46.5
x      f     cf
1      1      1
2      3      4
3      5      9
4      6     15
5     10     25
6     13     38
7      9     47
8      5     52
9      3     55
10     2     57
11     2     59
12     1     60
      N = 60

Median = Size of ((N + 1)/2)th item = ((60 + 1)/2)th item = 30.5th item.
The cumulative frequency just greater than 30.5 is 38, and the value of x
corresponding to 38 is 6. Hence the median size is 6 members per family.
Note:
It is an appropriate method because a fractional value given by mean does
not indicate the average number of members in a family.
Continuous Series:
The steps given below are followed for the calculation of median in
continuous series.
Step 1: Find cumulative frequencies.
Step 2: Find N/2.
Step 3: See, in the cumulative frequencies, the value first greater than N/2; the corresponding class interval is called the median class. Then apply the formula for the median:
Md = l + ((N/2 − cf)/f) × h
Where,
l = lower limit of the median class
f = frequency of the median class
h = size of the median class (assuming class sizes to be equal)
cf = cumulative frequency of the class preceding the median class
N = Σfi = total frequency (number of observations)
Note:
If the class intervals are given in inclusive type convert them into
exclusive type and call it as true class interval and consider lower limit
in this.
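A hedged Python sketch of the grouped-median formula Md = l + ((N/2 − cf)/f) × h follows; the function name is illustrative, and the class data used are the frequencies that appear later in Example 8 of the dispersion lesson:

```python
# Median of a continuous (grouped) frequency distribution.
def grouped_median(lower_limits, freqs, h):
    """lower_limits: lower class boundaries, freqs: class frequencies, h: class width."""
    N = sum(freqs)
    cum = 0
    for l, f in zip(lower_limits, freqs):
        if cum + f >= N / 2:                 # first class whose c.f. reaches N/2
            return l + ((N / 2 - cum) / f) * h
        cum += f

print(grouped_median([0, 10, 20, 30, 40, 50, 60, 70],
                     [20, 25, 32, 40, 42, 35, 10, 8], 10))   # 37.25
```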
3.7 Quartiles:
The quartiles divide the distribution in four parts. There are three quartiles.
The second quartile divides the distribution into two halves and therefore
is the same as the median. The first (lower) quartile (Q1) marks off the first
one-fourth, the third (upper) quartile (Q3) marks off the three-fourth.
Raw or ungrouped data:
First arrange the given data in the increasing order and use the formula for
Q1 and Q3 then quartile deviation, Q.D. is given by
Q.D. = (Q3 − Q1)/2
where Q1 = ((N + 1)/4)th item and Q3 = 3((N + 1)/4)th item
Example 14: Compute quartiles for the data given below 25,18, 30, 8, 15,
5, 10, 35, 40, 45
Solution:
5, 8, 10, 15, 18, 25, 30, 35, 40, 45
Q1 = ((N + 1)/4)th item = ((10 + 1)/4)th item = 2.75th item
   = 2nd item + (3/4)(3rd item − 2nd item)
   = 8 + (3/4)(10 − 8) = 8 + 1.5 = 9.5
Q3 = 3((N + 1)/4)th item = 3(2.75)th item = 8.25th item
   = 8th item + (1/4)(9th item − 8th item)
   = 35 + (1/4)(40 − 35) = 35 + 1.25 = 36.25
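To verify the interpolation in Example 14, here is a minimal Python sketch (the helper name and variables are illustrative, not from the text):

```python
# Q1 and Q3 for ungrouped data using the (N + 1)/4 positional rule.
data = sorted([25, 18, 30, 8, 15, 5, 10, 35, 40, 45])
N = len(data)

def value_at(pos):
    """Interpolate between the items on either side of a fractional 1-based position."""
    lo, frac = int(pos), pos - int(pos)
    return data[lo - 1] + frac * (data[lo] - data[lo - 1])

q1 = value_at((N + 1) / 4)        # 2.75th item -> 9.5
q3 = value_at(3 * (N + 1) / 4)    # 8.25th item -> 36.25
qd = (q3 - q1) / 2                # quartile deviation
print(q1, q3, qd)                 # 9.5 36.25 13.375
```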
Discrete Series:
Step 1: Find cumulative frequencies.
Step 2: Find (N + 1)/4.
Step 3: See, in the cumulative frequencies, the value just greater than (N + 1)/4; the corresponding value of x is Q1.
Step 4: Find 3(N + 1)/4.
Step 5: See, in the cumulative frequencies, the value just greater than 3(N + 1)/4; the corresponding value of x is Q3.
X 5 8 12 15 19 24 30
f 4 3 2 4 5 2 4
Solution:
x      f     c.f.
5      4      4
8      3      7
12     2      9
15     4     13
19     5     18
24     2     20
30     4     24
Total 24

Q1 = ((N + 1)/4)th item = ((24 + 1)/4)th item = (25/4)th item = 6.25th item = 8
Q3 = 3((N + 1)/4)th item = (3 × 6.25)th item = 18.75th item = 24
Continuous Series:
Step 1: Find cumulative frequencies.
Step 2: Find N/4.
Step 3: See, in the cumulative frequencies, the value just greater than N/4; the corresponding class interval is called the first quartile class.
Step 4: Find 3N/4. See, in the cumulative frequencies, the value just greater than 3N/4; the corresponding class interval is called the third quartile class. Then apply the respective formulae:
Q1 = l1 + ((N/4 − cf1)/f1) × h1
Q3 = l3 + ((3N/4 − cf3)/f3) × h3
where l1 = lower limit of the first quartile class
f1 = frequency of the first quartile class
h1 = width of the first quartile class
cf1 = c.f. preceding the first quartile class
l3 = lower limit of the third quartile class
f3 = frequency of the third quartile class
h3 = width of the third quartile class
cf3 = c.f. preceding the third quartile class
3.8 Deciles:
These are the values which divide the total number of observations into 10 equal parts. There are 9 deciles, D1, D2, …, D9, called the first decile, second decile, etc.
Deciles for Raw data or ungrouped data
Example 16: Compute D5 for the data given below 5, 24, 36, 12, 20, 8.
Solution: Arranging the given values in the increasing order 5, 8, 12, 20, 24,
36
D5 = 5((N + 1)/10)th observation = 5((6 + 1)/10)th observation = (3.5)th observation
   = 3rd item + (1/2)(4th item − 3rd item)
   = 12 + (1/2)(20 − 12) = 16.
Deciles for Grouped data:
Same as quartile.
3.9 Percentiles:
The percentile values divide the distribution into 100 parts each containing
1 percent of the cases. The percentile (Pk) is that value of the variable up
to which lie exactly k% of the total number of observations.
Relationship:
𝑃25 = 𝑄1 ; 𝑃50 = 𝐷5 = 𝑄2 = 𝑀𝑒𝑑𝑖𝑎𝑛 𝑎𝑛𝑑 𝑃75 = 𝑄3
P15 = 15((N + 1)/100)th item = 15((6 + 1)/100)th item = (1.05)th item
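The quartile, decile and percentile positions above all follow the same pattern, k(N + 1)/q, for ungrouped data. A small illustrative Python helper (the function name is an assumption, not from the text):

```python
# Positional quantile for ungrouped data using the (N + 1) rule with linear
# interpolation between neighbouring items, as in the worked examples above.
def positional_quantile(data, k, q):
    """k-th q-tile, e.g. k=1, q=4 -> Q1; k=5, q=10 -> D5; k=15, q=100 -> P15."""
    items = sorted(data)
    pos = k * (len(items) + 1) / q          # 1-based fractional position
    lo = int(pos)
    frac = pos - lo
    if lo >= len(items):                    # position beyond the last item
        return items[-1]
    return items[lo - 1] + frac * (items[lo] - items[lo - 1])

print(positional_quantile([5, 24, 36, 12, 20, 8], 5, 10))                  # D5 = 16.0
print(positional_quantile([25, 18, 30, 8, 15, 5, 10, 35, 40, 45], 1, 4))   # Q1 = 9.5
```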
Solution:
The highest frequency is 150 and corresponding class interval is 200 – 250,
which is the modal class.
Mode = 200 + 24.18 = 224.18
Determination of Modal class:
For a frequency distribution, the modal class corresponds to the maximum frequency. But in any one (or more) of the following cases:
i) if the maximum frequency is repeated,
ii) if the maximum frequency occurs in the beginning or at the end of the distribution,
iii) if there are irregularities in the distribution,
the modal class is determined by the method of grouping.
Steps for Calculation:
We prepare a grouping table with 6 columns
1) In column I, we write down the given frequencies;
2) Column II is obtained by combining the frequencies two by two;
3) Leave the 1st frequency and combine the remaining frequencies two
by two and write in column III;
4) Column IV is obtained by combining the frequencies three by three;
5) Leave the 1st frequency and combine the remaining frequencies
three by three and write in column V;
6) Leave the 1st and 2nd frequencies and combine the remaining
frequencies three by three and write in column VI.
Mark the highest frequency in each column. Then form an analysis table to
find the modal class. After finding the modal class, use the formula to
calculate the modal value.
Lesson 3
Measures of Dispersion
Content
Lesson-3
Objectives of the lesson:
1. Characteristics of good measure of Dispersion
2. Various absolute and relative measures of Dispersion
3. Mean deviation, Standard deviation and Coefficient of
variation
Glossary of Terms: Dispersion, Range, Quartile Deviation, Mean Deviation,
Standard Deviation, Coefficient of Variation etc.
4.1 Introduction:
The measures of central tendency serve to locate the center of the
distribution, but they do not reveal how the items are spread out on either
side of the center. This characteristic of a frequency distribution is commonly
referred to as dispersion. In a series all the items are not equal. There is
difference or variation among the values. The degree of variation is evaluated
by various measures of dispersion. Small dispersion indicates high uniformity
of the items, while large dispersion indicates less uniformity. For example
consider the following marks of two students.
Student I Student II
68 85
75 90
65 80
67 25
70 65
Both have got a total of 345 and an average of 69 each. The fact is that the second student has failed in one paper. When the averages alone are considered, the two students are equal. But the first student has less variation than the second student. Less variation is a desirable characteristic.
4.2 Characteristics of a good measure of dispersion:
An ideal measure of dispersion is expected to possess the following properties
1. It should be rigidly defined
2. It should be based on all the items.
3. It should not be unduly affected by extreme items.
4. It should lend itself for algebraic manipulation.
5. It should be simple to understand and easy to calculate
Example 1: Find the value of range and its co-efficient for the following data.
7, 9, 6, 8, 11, 10, 4
Solution:
𝐿 = 11, 𝑆 = 4.
𝑅𝑎𝑛𝑔𝑒 = 𝐿 – 𝑆
= 11 − 4 = 7
Co-efficient of Range = (L − S)/(L + S) = (11 − 4)/(11 + 4) = 7/15 = 0.4667
Example 2: Calculate range and its co efficient from the following
distribution.
Size: 60- 63 63- 66 66- 69 69- 72 72- 75
Number: 5 18 42 27 8
Solution:
𝐿 = 𝑈𝑝𝑝𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑐𝑙𝑎𝑠𝑠 = 75
𝑆 = 𝐿𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑙𝑜𝑤𝑒𝑠𝑡 𝑐𝑙𝑎𝑠𝑠 = 60
𝑅𝑎𝑛𝑔𝑒 = 𝐿 – 𝑆 = 75 – 60 = 15
Co-efficient of Range = (L − S)/(L + S) = (75 − 60)/(75 + 60) = 15/135 = 0.1111
4.3.2 Quartile Deviation and Co efficient of Quartile Deviation:
Quartile Deviation (Q.D.):
Definition: Quartile Deviation is half of the difference between the first and
third quartiles. Hence, it is called Semi Inter Quartile Range.
In symbols, Q.D. = (Q3 − Q1)/2
Among the quartiles Q1, Q2 and Q3, the range Q3 − Q1 is called the inter-quartile range and (Q3 − Q1)/2 the semi inter-quartile range.
Position of Q1 is (N + 1)/4 = (52 + 1)/4 = 13.25th item
Q1 = 13th value + 0.25 (14th value − 13th value)
   = 200 + 0.25 (400 − 200)
   = 200 + 0.25 (200)
   = 200 + 50 = 250
Position of Q3 is 3((N + 1)/4) = 3 × 13.25 = 39.75th item
Q3 = 39th value + 0.75 (40th value − 39th value)
   = 500 + 0.75 (500 − 500)
   = 500 + 0.75 × 0
   = 500
Q.D. = (Q3 − Q1)/2 = (500 − 250)/2 = 250/2 = 125
Co-efficient of Quartile Deviation = (Q3 − Q1)/(Q3 + Q1) = (500 − 250)/(500 + 250) = 250/750 = 0.33
Example 5: For the data given below, compute the quartile deviation and the coefficient of quartile deviation.
X:   351–500   501–650   651–800   801–950   951–1100
f:      48       189        88         4         28

Solution:
x          f      True class intervals   Cumulative frequency
351-500    48     350.5-500.5            48
501-650    189    500.5-650.5            237
Median, Md = Value of ((N + 1)/2)th item = ((9 + 1)/2)th item = 5th item = 360

X       |D| = |x − Mean|   |D| = |x − Md|
100          269               260
150          219               210
200          169               160
250          119               110
360            9                 0
490          121               130
500          131               140
600          231               240
671          302               311
3321        1570              1561

M.D. from mean = Σ|D|/n = 1570/9 = 174.44
Co-efficient of M.D. = Mean Deviation (M.D.)/Mean = 174.44/369 = 0.47
M.D. from median = Σ|D|/n = 1561/9 = 173.44
Co-efficient of M.D. = Mean Deviation (M.D.)/Median = 173.44/360 = 0.48
Mean Deviation – Discrete Series:
M.D. = Σ f|D| / n
Example 7:
Compute Mean deviation from mean and median from the following data:
Height in cms 158 159 160 161 162 163 164 165 166
No. of persons 15 20 32 35 33 22 20 10 8
Also compute coefficient of mean deviation.
Solution: (A = 162)
Height (X)   No. of persons (f)   d = x − A   fd      |D| = |X − Mean|   f|D|
158               15                 -4       -60        3.51            52.65
159               20                 -3       -60        2.51            50.20
160               32                 -2       -64        1.51            48.32
161               35                 -1       -35        0.51            17.85
162               33                  0         0        0.49            16.17
163               22                  1        22        1.49            32.78
164               20                  2        40        2.49            49.80
165               10                  3        30        3.49            34.90
166                8                  4        32        4.49            35.92
Total            195                          -95                       338.59

Mean = A + Σfd/N = 162 + (−95/195) = 162 − 0.49 = 161.51
M.D. = Σ f|D| / n = 338.59/195 = 1.74
Coefficient of M.D. = M.D./Mean = 1.74/161.51 = 0.0108
Median = Size of ((N + 1)/2)th item = ((195 + 1)/2)th item = 98th item = 161
M.D. = Σ f|D| / n = 334/195 = 1.71
Coefficient of M.D. = M.D./Median = 1.71/161 = 0.0106
Mean Deviation – Continuous Series:
The method of calculating the mean deviation in a continuous series is the same as in the discrete series. In a continuous series we have to find out the mid-points of the various classes and take the deviations of these points from the average selected. Thus
M.D. = Σ f|D| / n
Example 8: Find out the mean deviation from the mean and the median for the following series.
X       f
0-10    20
10-20   25
20-30   32
30-40   40
40-50   42
50-60   35
60-70   10
70-80    8
Also compute co-efficient of mean deviation.
Solution:
d = (m − A)/c,  A = 35,  c = 10,  |D| = |m − x̄|
Totals: N = 212,  Σfd = 32,  Σ f|D| = 3193.0

Mean = A + (Σfd/N) × C = 35 + (32/212) × 10 = 36.5
M.D. = Σ f|D| / n = 3193/212 = 15.06
Calculation of median and M.D. from median:
X       m     f      cf     |D| = |m − Md|   f|D|
0-10     5    20      20        32.25         645.00
10-20   15    25      45        22.25         556.25
20-30   25    32      77        12.25         392.00
30-40   35    40     117         2.25          90.00
40-50   45    42     159         7.75         325.50
50-60   55    35     194        17.75         621.25
60-70   65    10     204        27.75         277.50
70-80   75     8     212        37.75         302.00
Total       N = 212                          3209.50

N/2 = 212/2 = 106
l = 30; N/2 = 106; cf = 77; f = 40; h = 10
Median, Md = l + ((N/2 − cf)/f) × h = 30 + ((106 − 77)/40) × 10 = 37.25

M.D. = Σ f|D| / n = 3209.50/212 = 15.14
Coefficient of M.D. = M.D./Median = 15.14/37.25 = 0.41
4.3.3.1 Merits and Demerits of M.D:
Merits:
1. It is simple to understand and easy to compute.
2. It is rigidly defined.
3. It is based on all items of the series.
4. It is not much affected by the fluctuations of sampling.
5. It is less affected by the extreme items.
6. It is flexible, because it can be calculated from any average.
7. It is better measure of comparison.
4.3.3.2 Demerits:
1. It is not a very accurate measure of dispersion.
2. It is not suitable for further mathematical calculation.
3. It is rarely used. It is not as popular as standard deviation.
4. Algebraic positive and negative signs are ignored. It is mathematically unsound and
illogical.
Steps:
Thus, Standard Deviation (SD or σ) = √(Σ(x − x̄)²/N)
SD or σ = √(Σd²/N − (Σd/N)²)
Steps:
1. Find out the deviations from the assumed mean; i.e., X-A denoted by d and also the total of the
deviations Σd
2. Square the deviations; i.e., 𝑑2 and add up the squares of deviations, i.e, ∑𝑑2
3. Then substitute the values in the following formula:
SD or σ = √(Σd²/N − (Σd/N)²)
Example 9: Calculate the standard deviation from the following data. 14, 22, 9, 15, 20, 17, 12, 11
Solution:
Thus, Standard Deviation (SD or σ) = √(Σ(x − x̄)²/N) = √(140/8) = √17.5 = 4.18
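A short Python check of Example 9 (the variable names are illustrative):

```python
# Standard deviation from raw data: sigma = sqrt( sum((x - mean)^2) / N )
import math

data = [14, 22, 9, 15, 20, 17, 12, 11]
mean = sum(data) / len(data)                     # 15.0
sq_dev = sum((x - mean) ** 2 for x in data)      # 140.0
sd = math.sqrt(sq_dev / len(data))               # sqrt(17.5)
print(round(sd, 2))                              # 4.18
```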
Example 10:The table below gives the marks obtained by 10 students in statistics. Calculate standard
deviation.
Student Nos 1 2 3 4 5 6 7 8 9 10
Marks 43 48 65 57 31 60 37 48 78 59
Solution: (Deviations from assumed mean)
Nos.   Marks (x)   d = X − A (A = 57)   d²
1         43              -14           196
2         48               -9            81
3         65                8            64
4         57                0             0
5         31              -26           676
6         60                3             9
7         37              -20           400
8         48               -9            81
9         78               21           441
10        59                2             4
n = 10                 Σd = -44      Σd² = 1952

SD or σ = √(Σd²/N − (Σd/N)²) = √(1952/10 − (−44/10)²) = √(195.2 − 19.36) = √175.84 = 13.26
Discrete Series:
There are three methods for calculating standard deviation in discrete series:
a) Actual mean method: If the actual mean is in fractions, the calculation takes a lot of time and labour;
as such this method is rarely used in practice.
b) Assumed mean method: Here deviations are taken not from an actual mean but from an assumed
mean. Also this method is used, if the given variable values are not in equal intervals.
c) Step-deviation method: If the variable values are in equal intervals, then we adopt this method.
Example 11:Calculate Standard deviation from the following data.
X:   20   22   25   31   35   40   42   45
f:    5   12   15   20   25   14   10    6

Solution: (A = 31)
X     f      d = x − A    d²      fd      fd²
20     5        -11       121     -55      605
22    12         -9        81    -108      972
25    15         -6        36     -90      540
31    20          0         0       0        0
35    25          4        16     100      400
40    14          9        81     126     1134
42    10         11       121     110     1210
45     6         14       196      84     1176
     N = 107                  Σfd = 167  Σfd² = 6037
In the continuous series the method of calculating standard deviation is almost the same as in a discrete
series. But in a continuous series, mid-values of the class intervals are to be found out. The step- deviation
method is widely used.
Coefficient of Variation:
The Standard deviation is an absolute measure of dispersion. It is expressed in terms of units in which the
original figures are collected and stated. The standard deviation of heights of students cannot be
compared with the standard deviation of weights of students, as both are expressed in different units, i.e
heights in centimeter and weights in kilograms. Therefore the standard deviation must be converted into
a relative measure of dispersion for the purpose of comparison. The relative measure is known as the
coefficient of variation.
The coefficient of variation is obtained by dividing the standard deviation by the mean and multiplying the result by 100. Symbolically,
Coefficient of Variation (CV) = (SD/Mean) × 100 = (σ/X̄) × 100
If we want to compare the variability of two or more series, we can use C.V. The series or groups of data
for which the C.V. is greater indicate that the group is more variable, less stable, less uniform, less
consistent or less homogeneous. If the C.V. is less, it indicates that the group is less variable, more stable,
more uniform, more consistent or more homogeneous.
Example 12: In two factories A and B located in the same industrial area, the average weekly wages (in
SR) and the standard deviations are as follows:
Solution:
Example 13: Prices of a particular commodity in five years in two cities are
given below:
Price in city A Price in city B
20 10
22 20
19 18
23 12
16 15
Which city has more stable prices?
             City A                                 City B
Prices (X)   dx = X − X̄ (X̄ = 20)   dx²    Prices (Y)   dy = Y − Ȳ (Ȳ = 15)   dy²
20                 0                0        10               -5             25
22                 2                4        20                5             25
19                -1                1        18                3              9
23                 3                9        12               -3              9
16                -4               16        15                0              0

City A:
Mean = ΣX/N = 100/5 = 20
SD (σ) = √(Σdx²/N) = √(30/5) = √6 = 2.45
CV (City A) = (SD/Mean) × 100 = (2.45/20) × 100 = 12.25%

City B:
Mean = ΣY/N = 75/5 = 15
SD (σ) = √(Σdy²/N) = √(68/5) = √13.6 = 3.69
CV (City B) = (SD/Mean) × 100 = (3.69/15) × 100 = 24.6%
Therefore, City A had more stable prices than City B, because the coefficient
of variation is less in City A.
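The comparison in Example 13 can be reproduced with a few lines of Python (a sketch; the function name is not from the text):

```python
# Coefficient of variation: CV = (SD / mean) * 100, using population SD (divide by N).
import math

def cv(prices):
    mean = sum(prices) / len(prices)
    sd = math.sqrt(sum((p - mean) ** 2 for p in prices) / len(prices))
    return sd / mean * 100

city_a = [20, 22, 19, 23, 16]
city_b = [10, 20, 18, 12, 15]
print(round(cv(city_a), 2), round(cv(city_b), 2))   # about 12.25 and 24.59
# The smaller CV (City A) indicates the more stable prices.
```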
Lesson 4
Probability Theory and Distribution
Content
Lesson-4
Objectives of the Lesson:
1. Probability – Basic concepts
2. Equally likely, mutually exclusive,
independent event
3. Additive and Multiplicative laws
4. Normal Distribution and its properties
Glossary of Terms: Sample Space, Event, Addition Law, Conditional
Probability, Normal Distribution etc.
4.1 Introduction:
The concept of probability is difficult to define in precise terms. In ordinary
language, the word probable means likely (or) chance. Generally the word,
probability, is used to denote the happening of a certain event, and the
likelihood of the occurrence of that event, based on past experiences. By
looking at the clear sky, one will say that there will not be any rain today.
On the other hand, by looking at the cloudy sky or overcast sky, one will
say that there will be rain today. In the earlier sentence, we mean that there
will not be rain and in the latter we expect rain. On the other hand a
mathematician says that the probability of rain is ‘0’ in the first case and
that the probability of rain is ‘1’ in the second case. In between 0 and 1,
there are fractions denoting the chance of the event occurring. In ordinary
language, the word probability means uncertainty about happenings. In
Mathematics and Statistics, a numerical measure of uncertainty is
provided by the important branch of statistics – called theory of
probability. Thus we can say, that the theory of probability describes
certainty by 1 (one), impossibility by 0 (zero) and uncertainties by the co-
efficient which lies between 0 and 1.
Trial and Event: An experiment which, though repeated under essentially identical (or same) conditions, does not give unique results but may result in any one of the several possible outcomes is called a trial, and the outcomes are known as events.
Favourable Events
The number of cases favourable to an event in a trial is the number of
outcomes which entail the happening of the event.
Example 2:
1. When a seed is sown, if we are interested in the non-germination of the seed, then non-germination is the favourable event; if we are interested in the germination of the seed, then germination is the favourable event.
Mutually Exclusive Events
Events are said to be mutually exclusive (or) incompatible if the happening
of any one of the events excludes (or) precludes the happening of all the
others i.e.) if no two or more of the events can happen simultaneously in
the same trial. (i.e.) The joint occurrence is not possible.
Example 3:
In observation of seed germination the seed may either germinate or it will
not germinate. Germination and non germination are mutually exclusive
events.
Equally Likely Events
Outcomes of a trial are said to be equally likely if taking in to consideration
all the relevant evidences, there is no reason to expect one in preference
to the others. (i.e.) Two or more events are said to be equally likely if each
one of them has an equal chance of occurring.
Independent Events
Several events are said to be independent if the happening of an event is
not affected by the happening of one or more events.
Example
When two seeds are sown in a pot, one seed germinates. It would not
affect the germination or non germination of the second seed. One event
does not affect the other event.
Dependent Events
If the happening of one event is affected by the happening of one or more
events, then the events are called dependent events.
Example 4:
If we draw a card from a pack of well shuffled cards, if the first card drawn
is not replaced then the second draw is dependent on the first draw.
Note: In the case of independent (or) dependent events, the joint
occurrence is possible.
4.2 Definition of Probability
4.2.1 Mathematical (or) Classical (or) a-priori Probability
If an experiment results in ‘n’ exhaustive cases which are mutually
exclusive and equally likely cases out of which ‘m’ events are favourable to
the happening of an event ‘A’, then the probability ‘p’ of happening of ‘A’
is given by
p = P(A) = Favourable number of cases / Exhaustive number of cases = m/n
Note
1. If m = 0 ⇒ P(A) = 0, then ‘A’ is called an impossible event. (i.e.) also
by P(ϕ) = 0.
Example 5: Two dice are tossed. What is the probability of getting (i) Sum
6 (ii) Sum 9?
Solution
When 2 dice are tossed. The exhaustive number of cases is 36 ways.
(i) Sum 6 = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
Favourable number of cases = 5;  P(Sum 6) = 5/36
(ii) Sum 9 = {(3, 6), (4, 5), (5, 4), (6, 3)}
∴ Favourable number of cases = 4;  P(Sum 9) = 4/36 = 1/9
Example 6: A card is drawn from a pack of cards. What is a probability of
getting (i) a king (ii) a spade (iii) a red card (iv) a numbered card?
Solution
There are 52 cards in a pack. One card can be selected in 52C1 ways.
∴ Exhaustive number of cases = 52C1 = 52.
(i) A king
There are 4 kings in a pack.
One king can be selected in 4C1 ways.
∴ Favourable number of cases = 4C1 = 4
Hence the probability of getting a king = 4/52 = 1/13
(ii) A spade
There are 13 spades in a pack.
One spade can be selected in 13C1 ways.
∴ Favourable number of cases = 13C1 = 13
Hence the probability of getting a spade = 13/52 = 1/4
Proof
P(A ∪ B) = n(A ∪ B)/n(S) = n(A ∪ B)/N
From the diagram, using the axiom for mutually exclusive events, we write
P(A ∪ B) = [n(A) + n(Ā ∩ B)]/N
Adding and subtracting n(A ∩ B) in the numerator,
= [n(A) + n(Ā ∩ B) + n(A ∩ B) − n(A ∩ B)]/N
= [n(A) + n(B) − n(A ∩ B)]/N
= n(A)/N + n(B)/N − n(A ∩ B)/N
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
(iii) Let A and B be any two events which are mutually exclusive events; then
P(A or B) = P(A ∪ B) = P(A + B) = P(A) + P(B)
Proof:
The probability of rolling a die and getting a one-spot is 1/6, and the probability of getting a six-spot is also 1/6. The probability of rolling a die and getting a side that has both a one-spot and a six-spot is 0. There is no side on a die that has both these events. So substituting these values into the equation gives the following result:
1/6 + 1/6 − 0 = 2/6 = 1/3 = 0.3333
Finding the probability of drawing a 4 of hearts or a 6 or any suit using the
additive law of probability would give the following:
1/52 + 4/52 − 0 = 5/52 = 0.0962
There is only a single 4 of hearts, there are 4 sixes in the deck and there
isn't a single card that is both the 4 of hearts and a six of any suit.
Now using the additive law of probability, you can find the probability of
drawing either a king or any club from a deck of shuffled cards. The
equation would be completed like this:
4/52 + 13/52 − 1/52 = 16/52 = 0.3077
There are 4 kings, 13 clubs, and obviously one card is both a king and a club. We don't want to count that card twice, so we must subtract one of its occurrences to obtain the result.
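The king-or-club calculation can also be confirmed by enumerating a deck in Python (a small illustrative sketch, not part of the original text):

```python
# Additive law check: P(king or club) = P(king) + P(club) - P(king and club).
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [(r, s) for r in ranks for s in suits]             # 52 cards

king_or_club = [c for c in deck if c[0] == 'K' or c[1] == 'clubs']
print(Fraction(len(king_or_club), len(deck)))             # 4/13, i.e. 16/52 = 0.3077
```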
4.3.2 Multiplication Theorem on Probability
(i) If A and B are any two events which are not independent, then
P(A and B) = P(A ∩ B) = P(AB) = P(A) · P(B/A) = P(B) · P(A/B)
where P(B/A) and P(A/B) are the conditional probabilities of B given A and of A given B, respectively.
Proof
Let n be the total number of events.
Example 8:
So in finding the probability of drawing a 4 and then a 7 from a well shuffled
deck of cards, this law would state that we need to multiply those separate
probabilities together. Completing the equation above gives
P(4 and 7) = 4/52 × 4/52 = 16/2704 = 0.0059
Given a well shuffled deck of cards, what is the probability of drawing a
Jack of Hearts, Queen of Hearts, King of Hearts, Ace of Hearts, and 10 of
Hearts?
P(10, J, Q, K, A of hearts) = 1/52 × 1/52 × 1/52 × 1/52 × 1/52 ≈ 0.0000000026
In any case, given a well shuffled deck of cards, obtaining this assortment
of cards, drawing one at a time and returning it to the deck would be highly
unlikely (it has an exceedingly low probability).
4.4 Normal distribution
The normal distribution is a continuous probability distribution. It is also known as the error law, Normal law, Laplacian law or Gaussian distribution. Many of the sampling distributions, like Student's t, F and χ² distributions, are derived from it.
Definition
A continuous random variable x is said to be a normal distribution with
parameters μ and σ2, if the density function is given by the probability law
f(x) = (1/(σ√(2π))) e^(−(1/2)((x − μ)/σ)²);  −∞ < x < ∞,  −∞ < μ < ∞,  σ > 0
Note
The mean μ and standard deviation σ are called the parameters of the Normal
distribution. The normal distribution is expressed by X ~ N(μ, σ²).
5. The maximum ordinate occurs at x = μ and its value is 1/(σ√(2π)).
6. Area Property
𝑃(𝜇 − 𝜎 < 𝑥 < 𝜇 + 𝜎) = 0.6826
[From the normal table, Z1 is the value for which the area 0.4115 lies.]
From the normal table we have Z1 = 1.35
∴ 1.35 = (X1 − 2)/3
⇒ X1 = 3(1.35) + 2 = 6.05
(i.e.) 41% of the observations lie between 2 and 6.05.
Lesson 5
Sampling Theory
Content
Lesson-5
Objectives of the Lesson:
1. Sampling-basic concepts
2. Sampling methods
3. Simple random sampling
4. Stratified random sampling
Glossary of Terms: Sampling, Simple random Sampling, Stratified
Sampling, Census Survey etc.
5.1 Basic Terminology of Sampling Theory:
Population (Universe)
Population means aggregate of all possible units. It need not be
human population. It may be population of plants, population of insects,
population of fruits, etc.
Finite population
When the number of observation can be counted and is definite, it is
known as finite population
No. of plants in a plot.
No. of farmers in a village.
All the fields under a specified crop.
Infinite population
When the number of units in a population is innumerably large, that we
cannot count all of them, it is known as infinite population.
The plant population in a region.
The population of insects in a region.
Frame
Statistic
A summary measure that describes the characteristic of the sample is known as a statistic. Thus the sample mean, sample standard deviation, etc. are statistics. A statistic is usually denoted by a Roman letter.
𝑥̅ − 𝑠𝑎𝑚𝑝𝑙𝑒 𝑚𝑒𝑎𝑛
𝑠 − 𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑑𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛
The statistic is a random variable because it varies from sample to sample.
Sampling
The method of selecting samples from a population is known as sampling.
5.6 Stratified Sampling
When the population is heterogeneous with respect to the characteristic
in which we are interested, we adopt stratified sampling.
When the heterogeneous population is divided into homogenous sub-
population, the sub-populations are called strata. From each stratum a
separate sample is selected using simple random sampling. This sampling
method is known as stratified sampling.
We may stratify by size of farm, type of crop, soil type, etc.
The number of units to be selected may be uniform in all strata (or) may
vary from stratum to stratum.
There are four types of allocation of strata
1. Equal allocation
2. Proportional allocation
3. Neyman’s allocation
4. Optimum allocation
5.6.1 Merits
1. It is more representative.
2. It ensures greater accuracy.
3. It is easy to administer, as the universe is sub-divided.
5.6.2 Demerits
1. To divide the population into homogeneous strata, it requires more
money, time and statistical experience which is a difficult one.
2. If proper stratification is not done, the sample will have an effect of
bias.
Lesson 6
Testing of Hypothesis
Content
Lesson-6
Objectives of the Lesson:
1. Test of significance and Basic concepts
2. Null hypothesis, alternative hypothesis and level of
significance
3. Standard error and its importance
4. Steps in testing of hypothesis with different tests
6.1 Sampling Distribution
By drawing all possible samples of same size from a population we can
calculate the statistic, for example, x̅ for all samples. Based on this we can
construct a frequency distribution and the probability distribution of x̅. Such
probability distribution of a statistic is known as the sampling distribution of that
statistic. In practice, the sampling distributions can be obtained theoretically
from the properties of random samples.
6.2 Standard Error
As in the case of population distribution the characteristic of the sampling
distributions are also described by some measurements like mean & standard
deviation. Since a statistic is a random variable, the mean of the sampling
distribution of a statistic is called the expected value of the statistic. The SD
of the sampling distributions of the statistic is called standard error of the
Statistic. The square of the standard error is known as the variance of the
statistic. It may be noted that the standard deviation is for units whereas the
standard error is for the statistic.
6.3 Theory of Testing Hypothesis
6.3.1 Hypothesis
Hypothesis is a statement or assumption that is yet to be proved.
Consider for example, the hypothesis may be put in a form ‘paddy variety A
will give the same yield per hectare as that of variety B’ or there is no
difference between the average yields of paddy varieties A and B. These
hypotheses are in definite terms. Thus these hypotheses form a basis to work
with. Such a working hypothesis is known as the null hypothesis. It is called the null
hypothesis because it nullifies the original hypothesis, that variety A will give
more yield than variety B.
The null hypothesis is stated as ‘there is no difference between the effect of
two treatments or there is no association between two attributes (ie) the two
attributes are independent. Null hypothesis is denoted by H0.
Eg:-
There is no significant difference between the yields of two paddy varieties
(or) they give same yield per unit area. Symbolically, H0: µ1=µ2.
6.3.6 Alternative Hypothesis
The original hypothesis, for example µ1 > µ2, stated as an alternative to the null hypothesis, is known as the alternative hypothesis. Any hypothesis which is complementary to the null hypothesis is called an alternative hypothesis, usually denoted by H1.
Eg:-
There is a significant difference between the yields of the two paddy
varieties.
Symbolically,
𝐻1 : µ1 ≠ µ2 (𝑡𝑤𝑜 𝑠𝑖𝑑𝑒𝑑 𝑜𝑟 𝑑𝑖𝑟𝑒𝑐𝑡𝑖𝑜𝑛𝑙𝑒𝑠𝑠 𝑎𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑖𝑣𝑒)
If the statement is that A gives significantly less yield than B (or) A gives
significantly more yield than B. Symbolically,
Accept Ho Reject Ho
Ho is true Correct Decision Type I error
Ho is false Type II error Correct Decision
The relationship between type I & type II errors is that if one increases the
other will decrease. The probability of type I error is denoted by α. The
probability of type II error is denoted by β. The correct decision of rejecting
the null hypothesis when it is false is known as the power of the test. The
probability of the power is given by 1-β.
6.8 Critical Region
The testing of statistical hypothesis involves the choice of a region on the
sampling distribution of statistic. If the statistic falls within this region, the null
hypothesis is rejected: otherwise it is accepted. This region is called critical
region.
Let the null hypothesis be 𝐻0 : µ1 = µ2 and its alternative be 𝐻1 : µ1 ≠ µ2 .
Suppose Ho is true.
Based on sample data it may be observed that statistic (𝑥̅1 − 𝑥̅ 2 ) follows
a normal distribution given by
(𝑥̅1 − 𝑥̅ 2 ) − (𝜇1 − 𝜇2 )
𝑍=
𝑆𝐸 (𝑥̅1 − 𝑥̅ 2 )
We know that 95% values of the statistic from repeated samples will fall in
the range (𝑥̅1 − 𝑥̅ 2 ) ± 1.96 𝑡𝑖𝑚𝑒𝑠 𝑆𝐸(𝑥̅1 − 𝑥̅ 2 ). This is represented by a
diagram
The border line value ±1.96 is the critical value or tabular value of Z. The area
beyond the critical values (shaded area) is known as critical region or region
of rejection. The remaining area is known as region of acceptance.
If the statistic falls in the critical region we reject the null hypothesis and, if it
falls in the region of acceptance we accept the null hypothesis.
In other words, if the calculated value of a test statistic (Z, t, χ² etc.) is more
than the critical value in magnitude, it is said to be significant and we reject
H0; otherwise we accept H0. The critical values for Z, t and χ² are given in
the form of ready-made tables. Since the critical values are given in the form
of a table, the critical value is commonly referred to as the table value. The table value depends on
the level of significance and the degrees of freedom.
Example: Zcal < Ztab -We accept the H0 and conclude that there is no significant
difference between the means
Test Statistic
The sampling distribution of a statistic like Z, t, & χ2 are known as test statistic.
Generally, in case of quantitative data
Test statistic = (Statistic − Parameter)/Standard Error (Statistic)
Note
The choice of the test statistic depends on the nature of the variable (ie)
qualitative or quantitative, the statistic involved (i.e) mean or variance and
the sample size, (i.e) large or small.
6.9 Level of Significance
The probability that the statistic will fall in the critical region is α/2 + α/2 = α.
This is nothing but the probability of committing type I error. Technically the
probability of committing type I error is known as level of Significance.
6.10 One and two tailed test
The nature of the alternative hypothesis determines the position of the
critical region. For example, if 𝐻1 𝑖𝑠 𝜇1 ≠ 𝜇2 it does not show the direction
and hence the critical region falls on either end of the sampling distribution.
If H1 is 𝜇1 < 𝜇2 𝑜𝑟 𝜇1 > 𝜇2 the direction is known. In the first case the
critical region falls on the left of the distribution whereas in the second case
it falls on the right side.
6.10.1 One tailed test – When the critical region falls on one end of the
sampling distribution, it is called one tailed test.
6.10.2 Two tailed test – When the critical region falls on either end of the
sampling distribution, it is called two tailed test.
For example, consider the mean yield of new paddy variety (μ 1) is compared
with that of a ruling variety (μ2). Unless the new variety is more promising
that the ruling variety in terms of yield we are not going to accept the new
variety. In this case 𝐻1 : 𝜇1 > 𝜇2 for which one tailed test is used. If both the
varieties are new our interest will be to choose the best of the two. In this
case 𝐻1 : 𝜇1 ≠ 𝜇2 for which we use two tailed test.
Degrees of freedom
The number of degrees of freedom is the number of observations that are
free to vary after certain restriction have been placed on the data. If there are
n observations in the sample, for each restriction imposed upon the original
observation the number of degrees of freedom is reduced by one.
Ze = 1.96 at 5% level; 2.58 at 1% level (two tailed test)
Ze = 1.65 at 5% level; 2.33 at 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may
reject the Null Hypothesis H0 otherwise accept it.
6.13.3 Sampling from variable
In sampling for variables, the tests are as follows:
1. Test for single Mean
2. Test for single Standard Deviation
3. Test for equality of two Means
4. Test for equality of two Standard Deviation
6.13.3.1 Test for single Mean
In a sample of large size n, we examine whether the sample would have come
from a population having a specified mean
1. Null Hypothesis (H0)
H0: There is no significant difference between the sample mean and the population mean, i.e., µ = µ0
or
The given sample would have come from a population having a specified mean, i.e., µ = µ0
2. Alternative Hypothesis(H1)
3. Test statistics
Z = |x̄ − µ| / (σ/√n)
or, when σ is unknown,
Z = |x̄ − µ| / (s/√n),   where s = √[(Σx² − (Σx)²/n)/(n − 1)]
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
The expected value is given by
Ze = 1.96 at 5% level; 2.58 at 1% level (two tailed test)
Ze = 1.65 at 5% level; 2.33 at 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we may
reject the Null Hypothesis H0 otherwise accept it.
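A hedged Python sketch of the large-sample Z test for a single mean follows (the function name and sample values are illustrative placeholders; 1.96 is the two tailed 5% critical value quoted above):

```python
# Large-sample Z test for a single mean: Z = |x_bar - mu0| / (s / sqrt(n)).
import math

def z_test_single_mean(sample, mu0):
    n = len(sample)
    x_bar = sum(sample) / n
    s = math.sqrt((sum(x ** 2 for x in sample) - (sum(sample) ** 2) / n) / (n - 1))
    return abs(x_bar - mu0) / (s / math.sqrt(n))

# Hypothetical sample of 60 observations, testing the specified mean mu0 = 50.
sample = [50 + (i % 7) for i in range(60)]
z = z_test_single_mean(sample, 50)
print(round(z, 2), "reject H0 at 5%" if z > 1.96 else "accept H0 at 5%")
```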
6.13.3.2 Test for equality of two Means
Given two sets of sample data of large size n1 and n2 from variables. We may
examine whether the two samples come from the populations having the
same mean. We may proceed as follows
4. Level of Significance
The level may be fixed at either 5% or 1%
5. Expected value
The expected value is given by
Ze = 1.96 at 5% level; 2.58 at 1% level (two tailed test)
Ze = 1.65 at 5% level; 2.33 at 1% level (one tailed test)
6. Inference
If the observed value of the test statistic Z exceeds the table value Ze we
may reject the Null Hypothesis H0; otherwise we accept it.
Lesson 7
T-Test and F-Test
Content
Sample etc.
When the sample size is small, the ratio (x̄ − μ)/(s/√n) follows a t distribution and not the standard normal distribution. Hence the test statistic is given as
t = |x̄ − μ| / (s/√n),
which follows a t distribution with (n − 1) degrees of freedom.
4. Test statistic
t = |x̄ − μ| / (s/√n)
where x̄ = Σxi/n and s = √[(Σx² − (Σx)²/n)/(n − 1)]
5. Find the table value of t corresponding to (n-1) d.f. and the specified level
of significance.
6. Inference
If t < ttab we accept the null hypothesis H0. We conclude that there is no
significant difference between the sample mean and the population mean
(or) if t > ttab we reject the null hypothesis H0. (ie) we accept the alternative
hypothesis and conclude that there is significant difference between the
sample mean and the population mean.
Example 1
Based on field experiments, a new variety of green gram is expected to
give a yield of 12.0 quintals per hectare. The variety was tested on 10
randomly selected farmer’s fields. The yield (quintals/hectare) were
recorded as
14.3,12.6,13.7,10.9,13.7,12.0,11.4,12.0,12.6,13.1.
Do the results conform to the expectation?
Solution
1. Null hypothesis
H0: μ = 12.0
(i.e) the average yield of the new variety of green gram is 12.0
quintals/hectare.
2. Alternative Hypothesis:
H1:μ ≠ 12.0
(i.e) the average yield is not 12.0 quintals/hectare, it may be less or more
than 12 quintals
/ hectare
3. Level of significance: 5 %
4. Test statistic:
t = |x̄ − μ| / (s/√n)
Σx = 126.3,  Σx² = 1605.77
x̄ = Σx/n = 126.3/10 = 12.63
s = √[(Σx² − (Σx)²/n)/(n − 1)] = √[(1605.77 − 1595.169)/9] = √(10.601/9) = 1.0853
s/√n = 1.0853/√10 = 0.3432
Now, t = |x̄ − μ| / (s/√n) = (12.63 − 12)/0.3432 = 1.836
5. Table value of t for 9 d.f. at the 5% level = 2.262
6. Inference: t < ttab, so we accept H0.
We conclude that the new variety of green gram will give an average yield of 12 quintals/hectare.
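Example 1 can be reproduced with the following Python sketch; it recomputes t exactly as in the steps above (2.262 is the 5% two tailed t value for 9 d.f.):

```python
# One-sample t test for the green gram yields of Example 1.
import math

yields = [14.3, 12.6, 13.7, 10.9, 13.7, 12.0, 11.4, 12.0, 12.6, 13.1]
mu0 = 12.0
n = len(yields)
x_bar = sum(yields) / n
s = math.sqrt((sum(y ** 2 for y in yields) - sum(yields) ** 2 / n) / (n - 1))
t = abs(x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 3))            # about 1.84, below the table value 2.262 for 9 d.f.
```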
Note
Before applying t test in case of two samples the equality of their
variances
has to be tested by using F-test
F = s1²/s2² ~ F(n1−1, n2−1) d.f., if s1² > s2²
or
F = s2²/s1² ~ F(n2−1, n1−1) d.f., if s1² < s2²
where s1² is the variance of the first sample whose size is n1, and s2² is the variance of the second sample whose size is n2.
It may be noted that the numerator is always the greater variance. The critical value for F is read from the F table corresponding to the specified d.f. and level of significance.
Inference
If F < Ftab, we accept the null hypothesis H0, (i.e.) the variances are equal; otherwise the variances are unequal.
7.2.2 Test for equality of two means (Independent Samples)
Given two sets of sample observations x11, x12, x13, …, x1n and x21, x22, x23, …, x2n of sizes n1 and n2 respectively from normal populations.
1. Using F-Test , test their variances
2. Test statistic
i) When the variances are equal:
t = |x̄1 − x̄2| / √[s²(1/n1 + 1/n2)]
where the combined variance
s² = {[Σx1² − (Σx1)²/n1] + [Σx2² − (Σx2)²/n2]} / (n1 + n2 − 2)
The test statistic t follows a t distribution with n1 + n2 − 2 d.f.
ii) When the variances are unequal and n1 = n2 = n:
t = |x̄1 − x̄2| / √[(s1² + s2²)/n]
It follows a t distribution with ((n1 + n2)/2) − 1 d.f.
iii) When the variances are unequal and n1 ≠ n2:
t = |x̄1 − x̄2| / √(s1²/n1 + s2²/n2)
where t1 is the critical value for t with (n1 − 1) d.f. at a specified level of
significance and t2 is the critical value for t with (n2 − 1) d.f. at a specified
level of significance.
Example 2
In a fertilizer trial the grain yield of paddy (Kg/plot) was observed as follows
Under ammonium chloride 42,39,38,60 &41 kgs
Under urea 38, 42, 56, 64, 68, 69,& 62 kgs.
Find whether there is any difference between the sources of nitrogen?
Solution
𝐻𝑜 : µ1 = µ2 (i.e) there is no significant difference in effect between the
sources of nitrogen.
𝐻1 : µ1 ≠ µ2 (i.e) there is a significant difference between the two sources
Level of significance = 5%
Before we test the means, first we have to test their variances by using the F-test.
F-test:
𝐻0 : 𝜎12 = 𝜎22
𝐻1 : 𝜎12 ≠ 𝜎22
s1² = [Σx1² − (Σx1)²/n1] / (n1 − 1) = 82.5
s2² = [Σx2² − (Σx2)²/n2] / (n2 − 1) = 154.33
∴ F = s2²/s1² ~ F(n2−1, n1−1) d.f., since s1² < s2²
F = 154.33/82.5 = 1.8707
F < Ftab
We accept the null hypothesis H0. (i.e) the variances are equal. Use the test
statistic
t = |x̄1 − x̄2| / √[s²(1/n1 + 1/n2)]
s² = {[Σx1² − (Σx1)²/n1] + [Σx2² − (Σx2)²/n2]} / (n1 + n2 − 2) = (330 + 926)/10 = 125.6
t = |44 − 57| / √[125.6 (1/5 + 1/7)] = 1.98
The table value of t for 10 d.f. at the 5% level is 2.228.
Inference:
t < ttab
We conclude that the two sources of nitrogen do not differ significantly with respect to their effect on grain yield.
7.3 F-Test:
7.3.1 F-Statistic:
Let X1i (i=1,2,..,n1) be a random sample of size n1from the first population with
variance σ12 and X2j (j=1,2,….,n2) be another independent random sample of
size n2 from the second normal population with variance σ22. The F- statistic
is defined as the ratio of estimates of two variances as given below:
where S1² > S2², and S1² and S2² are unbiased estimates of the population variances, which
are given by:
It follows Snedecor’s F- distribution with (n1-1, n2-1) d.f. i.e., F~F (n1 - 1, n2 -
1). Further, if X is a χ2-variate with n1 d.f. and Y is another independent χ2-
variate with n2 d.f., then F-statistic is defined as:
Lesson 8
Chi-Square Distribution
Content
χ² = Σ(i=1 to n) (Oi − Ei)²/Ei
It follows a χ² distribution with n − 1 d.f. In the case of χ², only a one tailed test is
used.
For Example :In plant genetics, our interest may be to test whether the
observed segregation ratios deviate significantly from the mendelian
ratios. In such situations we want to test the agreement between the
observed and theoretical frequency, such test is called as test of goodness
of fit.
8.3 Conditions for the validity of χ2 –test
χ2- test is an approximate test for large values of ‘n’ for the validity of χ 2-
test of goodness of fit between theory and experiment, the following
conditions must be satisfied.
1. The sample observations should be independent.
2. Constraints on the cell frequencies, if any, should be linear
Example ∑𝑂𝑖 = ∑𝐸𝑖
3. N, the total frequency, should be reasonably large, say greater than 50.
4. No theoretical cell frequency should be less than 5. If any theoretical
cell frequency is < 5, then for the application of the chi-square test it is
pooled with the preceding or succeeding frequency so that the
pooled frequency is more than 5, and the degrees of freedom are finally
adjusted for those lost in pooling.
Example 1 :
The number of yeast cells counted in a haemocyto meter is compared to
the theoretical value is given below. Does the experimental result support
the theory?
Solution
H0: the experimental results support the theory
H1: the experimental results does not support the theory. Level of
significance=5%
Test Statistic:
χ² = Σ(i=1 to n) (Oi − Ei)²/Ei = 3.1779
Table value:
χ² for (6 − 1) = 5 d.f. at the 5% level of significance = 11.070
Inference
As χ²cal < χ²tab, we accept the null hypothesis; the experimental results support the theory.
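A minimal Python sketch of the goodness-of-fit statistic follows; the observed and expected vectors are illustrative placeholders, since the data table of Example 1 is not reproduced here:

```python
# Chi-square goodness of fit: chi2 = sum((O - E)^2 / E).
def chi_square(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 6-cell counts (placeholders, not the haemocytometer data).
observed = [103, 143, 98, 42, 8, 6]
expected = [106, 141, 93, 41, 13, 6]
print(round(chi_square(observed, expected), 3))
# Compare with the chi-square table value for (number of cells - 1) d.f.
```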
A general n × m contingency table has observed frequencies Oij (i = 1, …, n rows; j = 1, …, m columns), row totals ri, column totals cj, and grand total n = Σri = Σcj.
The expected frequency corresponding to Oij is calculated as Eij = ri cj / n. The χ² is computed as
χ² = Σ(i=1 to n) Σ(j=1 to m) (Oij − Eij)²/Eij
where
Oij – observed frequencies
Eij – Expected frequencies
n - number of rows
m - number of columns
It can be verified that ∑𝑂𝑖𝑗 = ∑𝐸𝑖𝑗
            A1            A2            Row Total
B1          a             b             a + b = r1
B2          c             d             c + d = r2
Col Total   a + c = c1    b + d = c2    a + b + c + d = n

where a, b, c and d are the cell frequencies, c1 and c2 are the column totals, r1 and
r2 are the row totals, and n is the total number of observations.
In the case of a 2 x 2 contingency table, χ² can be found directly using the short
cut formula
χ² = n(ad − bc)² / (c1 c2 r1 r2)
The d.f. associated with χ2 is (2-1)(2-1) = 1
8.5 Yates correction for continuity
If anyone of the cell frequency is < 5, we use Yates correction to make χ2 as
continuous. The Yates correction is made by adding 0.5 to the least cell
frequency and adjusting the other cell frequencies so that the column and
row totals remain same. Suppose, the first cell frequency is to be corrected
then the contingency table will be as follows:
            A1            A2            Row Total
B1          a + 0.5       b − 0.5       a + b = r1
B2          c − 0.5       d + 0.5       c + d = r2
Col Total   a + c = c1    b + d = c2    a + b + c + d = n
Solution
H0: The severity of the disease is not associated with blood group.
H1: The severity of the disease is associated with blood group.
Blood Groups
Condition Total
O A B AB
Severe 51 40 10 9 110
Moderate 105 103 25 17 250
Mild 384 527 125 104 1140
Total 540 670 160 130 1500
Test statistic:
χ² = Σ(i=1 to n) Σ(j=1 to m) (Oij − Eij)²/Eij
χ² = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]  with 1 d.f.
χ² = 300(118 × 40 − 22 × 120)² / (140 × 160 × 62 × 238) = 3.927
Table value
χ² for 1 d.f. at the 1% level of significance = 6.635
Inference
χ²cal < χ²tab
We accept the null hypothesis.
The chemical treatment will not improve the germination rate of cotton
seeds significantly.
Example 4
In an experiment on the effect of a growth regulator on fruit setting in
muskmelon the following results were obtained. Test whether the fruit
Solution
H0:Fruit setting in muskmelon does not depend on the application of
growth regulator.
H1: Fruit setting in muskmelon depend on the application of growth
regulator.
Level of significance = 1%
After Yates correction we have
Fruit set Fruit not set Total
Treated 15.5 9.5 25
Control 4.5 20.5 25
Total 20 30 50
Test statistic
χ² = n(|ad − bc| − n/2)² / [(a + b)(c + d)(a + c)(b + d)]
χ² = 50(|15.5 × 20.5 − 9.5 × 4.5| − 50/2)² / (25 × 25 × 20 × 30) = 8.33
Table value
Lesson 9
Correlation and Regression
Content
Lesson-9
9.1 Introduction:
The term correlation is used by a common man without knowing that he
is making use of the term correlation. For example when parents advice
their children to work hard so that they may get good marks, they are
correlating good marks with hard work.
The study related to the characteristics of only variable such as height,
weight, ages, marks, wages, etc., is known as univariate analysis. The
statistical Analysis related to the study of the relationship between two
variables is known as Bivariate Analysis. Sometimes the variables may be
inter-related. In health sciences we study the relationship between blood
pressure and age, consumption level of some nutrient and weight gain,
total income and medical expenditure, etc. The nature and strength of
relationship may be examined by correlation and Regression analysis.
Thus correlation refers to the relationship between two or more variables, e.g., the relation between the height of father and son, yield and rainfall, wage and price index, shares and debentures, etc.
Correlation is a statistical analysis which measures and analyses the degree or extent to which two variables fluctuate with reference to each other.
direct correlation. Price and supply, height and weight, yield and rainfall,
are some examples of positive correlation.
If the two variables tend to move in opposite directions, so that an increase (or decrease) in the value of one variable is accompanied by a decrease (or increase) in the value of the other, the correlation is called negative or inverse correlation. Price and demand, and yield of crop and price, are examples of negative correlation.
9.4.2 Linear and Non-linear correlation:
If the ratio of change between the two variables is a constant then
there will be linear correlation between them.
Consider the following.
X 2 4 6 8 10 12
Y 3 6 9 12 15 18
Here the ratio of change between the two variables is the same. If we
plot these points on a graph we get a straight line.
If the amount of change in one variable does not bear a constant ratio to the amount of change in the other, the relation is called curvilinear (or non-linear) correlation. The graph will be a curve.
9.4.3 Simple and Multiple correlations:
When we study only two variables, the relationship is simple correlation.
For example, quantity of money and price level, demand and price. But
in multiple correlation we study more than two variables simultaneously. The relationship among the price, demand and supply of a commodity is an example of multiple correlation.
9.4.4 Partial and total correlation:
The study of two variables excluding some other variable is called Partial
correlation. For example, we study price and demand eliminating supply
side. In total correlation all facts are taken into account.
9.5 Computation of correlation:
When there exists some relationship between two variables, we have to measure the degree of that relationship. This measure is called the measure of correlation, or the correlation coefficient.
9.7 Limitations:
1. The correlation coefficient assumes a linear relationship regardless of whether that assumption is correct or not.
2. Extreme values of the variables unduly influence the correlation coefficient.
3. The existence of correlation does not necessarily indicate a cause-and-effect relationship.
9.8 Interpretation:
The following rules help in interpreting the value of r.
1. When r = 1, there is a perfect positive relationship between the variables.
2. When r = −1, there is a perfect negative relationship between the variables.
3. When r = 0, there is no relationship between the variables.
4. If the correlation is close to +1 or −1, it signifies a high degree of correlation (positive or negative) between the two variables.
If r is near to zero, i.e. 0.1, −0.1 or 0.2, there is little correlation.
Example 1:
Find Karl Pearson’s coefficient of correlation from the following data
between height of father (x) and son (y).
X 64 65 66 67 68 69 70
Y 66 67 65 68 70 68 72
Comment on the result.
Solution:
x      y      (x − x̄)   (x − x̄)²   (y − ȳ)   (y − ȳ)²   (x − x̄)(y − ȳ)
64     66     −3          9          −2          4          6
65     67     −2          4          −1          1          2
66     65     −1          1          −3          9          3
67     68      0          0           0          0          0
68     70      1          1           2          4          2
69     68      2          4           0          0          0
70     72      3          9           4         16         12
469    476     0         28           0         34         25
Mean of X = ΣX / N = 469 / 7 = 67;
Mean of Y = ΣY / N = 476 / 7 = 68.
Hence, Karl Pearson's Coefficient of Correlation,
r = Σxy / √(Σx² · Σy²) = 25 / √(28 × 34) = 25 / √952 = 25 / 30.85 = 0.81
𝑆𝑖𝑛𝑐𝑒 𝑟 = + 0.81, the variables are highly positively correlated i. e., tall
fathers have tall sons.
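For readers who want to verify the computation, here is a minimal Python sketch (not part of the original text) that reproduces r ≈ 0.81 for the father and son heights:

# Sketch: Karl Pearson's coefficient of correlation,
# r = sum of products of deviations / sqrt(sum of squared x-deviations * sum of squared y-deviations).
from math import sqrt

x = [64, 65, 66, 67, 68, 69, 70]   # heights of fathers
y = [66, 67, 65, 68, 70, 68, 72]   # heights of sons

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / sqrt(sxx * syy)
print(round(r, 2))   # about 0.81, as obtained above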
Example:
Calculate coefficient of correlation from the following data.
X    1    2    3     4     5     6     7     8     9
Y    9    8    10    12    11    13    14    16    15
Example:
Calculate Pearson’s Coefficient of correlation.
X    45    55    56    58    60    65    68    70    75    80    85
Y    56    50    48    60    62    64    65    70    74    82    90
ρ = 1 − 6ΣD² / (N³ − N)
where, ρ (rho) = rank correlation coefficient;
∑D2 = sum of squares of differences between the pairs of ranks; and
N = number of pairs of observations.
The value of ρ lies between –1 and +1. If ρ = +1, there is complete
agreement in order of ranks and the direction of ranks is also same. If ρ =
-1, then there is complete disagreement in order of ranks and they are in
opposite directions.
Computation for tied observations: Two or more items may have equal values. In such a case the same rank is to be given, and the ranking is said to be tied. In these circumstances an average rank is given to each of the tied items. For example, if a value is repeated twice at the 5th rank, the common rank assigned to each item is (5 + 6) / 2 = 5.5, which is the average of the 5th and 6th ranks.
If the ranks are tied, it is required to apply a correction factor of (1/12)(m³ − m) for each group of tied values. A slightly modified formula is used when there is more than one item having the same value:
ρ = 1 − 6[ΣD² + (1/12)(m³ − m) + (1/12)(m³ − m) + (1/12)(m³ − m) + …] / (N³ − N)
where m is the number of items whose ranks are common; the correction term is repeated as many times as there are groups of tied observations.
Example 2:
In a marketing survey the prices of tea and coffee in a town, based on quality, were found as shown below. Find whether there is any relation between the price of tea and the price of coffee.
Price of tea       88     90     95     70     60     75     50
Price of coffee   120    134    150    115    110    140    100
ΣD² = 6
ρ = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 6) / (7³ − 7) = 1 − 36/336 = 1 − 0.1071 = 0.8929
The correlation between the price of tea and the price of coffee is positive at 0.89. Based on quality, the association between the price of tea and the price of coffee is highly positive.
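A small Python sketch (illustrative only) of the same rank-correlation computation; here rank 1 is given to the highest price, and applying the same direction of ranking to both series gives the same ρ:

# Sketch: Spearman's rank correlation rho = 1 - 6*sum(D^2) / (N^3 - N), no ties.
tea    = [88, 90, 95, 70, 60, 75, 50]
coffee = [120, 134, 150, 115, 110, 140, 100]

def ranks(values):
    # Rank 1 for the largest value; all values here are distinct, so no ties arise.
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

r_tea, r_coffee = ranks(tea), ranks(coffee)
d_sq = sum((a - b) ** 2 for a, b in zip(r_tea, r_coffee))   # sum of D^2 = 6

n = len(tea)
rho = 1 - 6 * d_sq / (n ** 3 - n)
print(round(rho, 4))   # about 0.8929, as in the worked example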
Example 3:
In an evaluation of answer scripts the following marks were awarded by two examiners.

1st examiner    88    95    70    60    50    75    80    85
2nd examiner    84    90    88    55    48    82    85    75
x R1 y R2 D D2
88 2 84 4 2 4
95 1 90 1 0 0
70 6 88 2 4 16
60 7 55 7 0 0
50 8 48 8 0 0
75 5 82 5 0 0
80 4 85 3 1 1
85 3 75 6 3 9
Total                               ΣD² = 30
ρ = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 30) / (8³ − 8) = 1 − 180/504 = 1 − 0.357 = 0.643
ρ = 0.643 indicates a fair degree of agreement between the two examiners, in the sense that there is reasonable uniformity in their evaluation of the answer scripts.
Example 4:
Rank Correlation for tied observations. Following are the marks
obtained by 10 students in a class in two tests.
Students    A     B     C     D     E     F     G     H     I     J
Test 1     70    68    67    55    60    60    75    63    60    72
Test 2     65    65    80    60    68    58    75    63    60    70
Calculate the rank correlation coefficient between the marks of two tests.
Solution:
Student   Test 1   Rank   Test 2   Rank      D       D²
A           70      3.0     65      5.5    −2.5     6.25
B           68      4.0     65      5.5    −1.5     2.25
C           67      5.0     80      1.0     4.0    16.00
D           55     10.0     60      8.5     1.5     2.25
E           60      8.0     68      4.0     4.0    16.00
F           60      8.0     58     10.0    −2.0     4.00
G           75      1.0     75      2.0    −1.0     1.00
H           63      6.0     63      7.0    −1.0     1.00
I           60      8.0     60      8.5    −0.5     0.25
J           72      2.0     70      3.0    −1.0     1.00
                                           ΣD² =   50.00
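The remaining steps of this example are not shown above. The following Python sketch (an illustration, not part of the original text) carries the calculation through from ΣD² = 50 using the tie-correction formula quoted earlier in this lesson:

# Sketch: Spearman's rho with tied ranks, using the correction factor (m^3 - m)/12
# for each group of m tied values, as given in the text.
from collections import Counter

test1 = [70, 68, 67, 55, 60, 60, 75, 63, 60, 72]
test2 = [65, 65, 80, 60, 68, 58, 75, 63, 60, 70]

def average_ranks(values):
    order = sorted(values, reverse=True)          # rank 1 for the largest mark
    ranks = []
    for v in values:
        first = order.index(v) + 1                # first rank position occupied by v
        last = first + order.count(v) - 1         # last rank position occupied by v
        ranks.append((first + last) / 2)          # tied values share the average rank
    return ranks

r1, r2 = average_ranks(test1), average_ranks(test2)
d_sq = sum((a - b) ** 2 for a, b in zip(r1, r2))  # sum of D^2 = 50.0

def tie_correction(values):
    # (m^3 - m)/12 for every group of m tied observations in a series
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

n = len(test1)
correction = tie_correction(test1) + tie_correction(test2)    # 2.0 + 1.0 = 3.0
rho = 1 - 6 * (d_sq + correction) / (n ** 3 - n)
print(round(rho, 3))   # about 0.679 by this formula

With this correction the rank correlation works out to about 0.68, indicating a fairly good agreement between the marks of the two tests.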
9.10 Regression
Regression is the functional relationship between two variables, of which one may represent the cause and the other the effect. The variable representing the cause is known as the independent variable and is denoted by X; X is also known as the predictor or regressor variable. The variable representing the effect is known as the dependent variable and is denoted by Y; Y is also known as the predicted variable. The relationship is expressed by the regression equation of Y on X,
Y = a + bX, where b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b x̄
The regression line indicates the average value of the dependent variable
Y associated with a particular value of independent variable X.
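As an illustration of how the regression line of Y on X is obtained, here is a minimal Python sketch (the data are hypothetical, chosen only for demonstration):

# Sketch: least-squares estimates of the regression of Y on X,
# b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)  and  a = y_bar - b * x_bar.
# The data below are hypothetical, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

print(f"Y = {a:.2f} + {b:.2f} X")   # fitted regression line of Y on X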
9.11 Assumptions
1. The x’s are non-random or fixed constants
2. At each fixed value of X the corresponding values of Y have a
normal distribution about a mean.
3. For any given x, the variance of Y is the same.
4. The values of y observed at different levels of x are completely
independent.
9.12 Properties of Regression coefficients
1. The correlation coefficient is the geometric mean of the two
regression coefficients
Lesson 10
ANOVA and CRD
Content
Lesson-10
Objectives of the lesson:
1. Assumption of the ANOVA
2. One way Classification of ANOVA
3. Two way Classification of ANOVA
Glossary of the lesson: Test-Statistic, Variance, ANOVA, Source of variation
etc.
10.1 Introduction:
In hypothesis testing, we test the significance of difference between two
sample means. For this, one test statistic employed was the t-test where
we assumed that the two populations from which the samples were drawn
had the same variance. But in real life there may be situations when, instead of comparing two sample means, a researcher has to compare more than two sample means. The researcher may have to test whether three or more sample means computed from three or more populations are equal.
hypothesis can be that three or more population means are equal as
against the alternative hypothesis that these population means are not
equal. For example, suppose that a researcher wants to measure work
attitude of the employees in four organizations. The researcher has
prepared a questionnaire consisting of 10 questions for measuring the
work attitude of employees. A five-point rating scale is used with 1 being
the lowest score and 5 being the highest score. So, an employee can score
10 as the minimum score and 50 as the maximum score. The null
hypothesis can be set as all the means are equal (i.e., there is no difference
in the degree of work attitude of the employees) as against the alternative
hypothesis that at least one of the means is different from the others
(there is a significant difference in the degree of work attitude of the
employees). In this situation, the analysis of variance technique is used. “In its simplest form, analysis of variance may be regarded as an extension or development of the t-test.” The analysis of variance technique makes use of the ratio of the variance between samples to the variance within samples:
F = Variance between samples / Variance within samples
After calculating the test statistic F, it is compared with the critical value of F at a specified level of significance for (k − 1, n − k) degrees of freedom, and on the basis of this comparison the decision to accept or reject the null hypothesis is taken.
(a) Direct Method
I. Calculation of variance between samples
It is the sum of squares of the deviations of the means of various samples
from the grand mean. The procedure of calculating the variance between
the samples is as shown below:
Let the observations of the k samples (each of size n) be arranged in columns 1, 2, …, j, …, k. Then
Tⱼ = Σᵢ₌₁ⁿ xᵢⱼ (total of the j-th sample), T = Σⱼ₌₁ᵏ Tⱼ (grand total), x̄ⱼ = Tⱼ / n (mean of the j-th sample) and x̿ = T / nk (grand mean).
When the null hypothesis is true, both mean squares MSB (MSC) and MSW (MSE) are independent unbiased estimates of the same population variance σ². Hence, the test statistic is
F = MSB / MSW   or   F = MSC / MSE
which follows the F-distribution with (k − 1, N − k) degrees of freedom.
F is the ratio of the greater variance to the smaller variance. Generally, the variance between the samples (MSB or MSC) is greater than the variance within the samples (MSW or MSE). But if MSW > MSB, then the reverse ratio of F is used, i.e.
F = MSW / MSB   or   F = MSE / MSC
IV. Conclusion:
Compare the calculated value of F with the critical (tabulated) value of F for (k − 1, N − k) degrees of freedom at the specified level of significance, usually the 5% or 1% level. If the calculated value is greater than the tabulated value, the null hypothesis H0 is rejected and we conclude that all the population means are not equal; otherwise, the null hypothesis is accepted.
For analyzing variance in case of one way classification the following table
known as the Analysis of Variance Table (or ANOVA Table) is constructed.
ANOVA Table
Sources of         Sum of         Degrees of       Mean                   Test
Variation          squares (SS)   Freedom (df)     Squares (MS)           Statistic
Between samples    SSB            k − 1            MSB = SSB / (k − 1)    F = MSB / MSW
Within samples     SSW            N − k            MSW = SSW / (N − k)
Total              SST            N − 1
X₁    (X₁ − X̄₁)²    X₂    (X₂ − X̄₂)²    X₃    (X₃ − X̄₃)²
25      4            20      4            24      1
22      1            17      1            26      1
24      1            16      4            30     25
21      4            19      1            20     25
       10                   10                   52
X₁     X₁²     X₂     X₂²     X₃     X₃²
25      625    20      400    24      576
22      484    17      289    26      676
24      576    16      256    30      900
21      441    19      361    20      400
92     2126    72     1306   100     2552
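As a cross-check on this worked example, the following Python sketch (not part of the original text) carries the one-way ANOVA through to the F value for the three samples tabulated above:

# Sketch: one-way ANOVA (CRD) for the three samples above.
# SSB = sum over samples of n_j*(sample mean - grand mean)^2; SSW = within-sample squared deviations.
samples = [
    [25, 22, 24, 21],   # X1
    [20, 17, 16, 19],   # X2
    [24, 26, 30, 20],   # X3
]

k = len(samples)                               # number of samples
N = sum(len(s) for s in samples)               # total number of observations
grand_mean = sum(sum(s) for s in samples) / N

ssb = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in samples)   # between samples
ssw = sum(sum((x - sum(s) / len(s)) ** 2 for x in s) for s in samples)     # within samples

msb = ssb / (k - 1)
msw = ssw / (N - k)
F = msb / msw
# With these data: SSB = 104, SSW = 72, so F = 52 / 8 = 6.5 with (2, 9) d.f.
print(f"SSB = {ssb}, SSW = {ssw}, F = {F:.2f} with ({k - 1}, {N - k}) d.f.")

The computed F would then be compared with the table value of F for (2, 9) degrees of freedom at the chosen level of significance.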
doses of fertilizer affect the grain yield. The variable whose change we wish to study is known as the response variable. The variable whose effect on the response variable we wish to study is known as a factor.
Treatment: Objects of comparison in an experiment are defined as
treatments. Examples are varieties tried in a trial and different chemicals.
Experimental unit: The object to which treatments are applied or basic
objects on which the experiment is conducted is known as experimental
unit.
Example: piece of land, an animal, etc
Experimental error: The responses from experimental units receiving the same treatment may not be the same even under similar conditions. These variations in responses may be due to various reasons. Extraneous factors such as heterogeneity of soil, climatic factors and genetic differences may also cause variation. The variation in response caused by extraneous factors is known as experimental error. Our aim in designing an experiment is to minimise the experimental error.
C.F. = (GT)² / n, where n = total number of observations
Lesson 11
RBD and LSD
Content
Lesson-11
Objectives of the lesson:
1. Layout of RBD
2. Analysis, Merits, Demerits of RBD
3. Layout of LSD
4. Analysis, Merits, Demerits of LSD
Glossary of the lesson: RBD, LSD, Variance, ANOVA, Source of variation etc.
                              Replication
Treatment     1      2      3      …      r      Total
1             y11    y12    y13    …      y1r    T1
2             y21    y22    y23    …      y2r    T2
3             y31    y32    y33    …      y3r    T3
…
t             yt1    yt2    yt3    …      ytr    Tt
Total         R1     R2     R3     …      Rr     G.T.
In this design the total variance is divided into three sources of variation
viz., between replications, between treatments and error
CF = (GT)² / n
Total SS = TSS = ΣΣ y²ᵢⱼ − CF
Replication SS = RSS = (1/t) ΣRⱼ² − CF
Treatment SS = TrSS = (1/r) ΣTᵢ² − CF
Error SS = ESS = Total SS − Replication SS − Treatment SS
The skeleton ANOVA table for RBD with t treatments and r replications
Sources of variation    d.f.              SS      MS      F-value
Replication             r − 1             RSS     RMS     RMS / EMS
Treatment               t − 1             TrSS    TrMS    TrMS / EMS
Error                   (r − 1)(t − 1)    ESS     EMS
Total                   rt − 1            TSS
CD = SE(d) · t, where SE(d) = √(2 EMS / r)
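A compact Python sketch of the RBD analysis just described (the yield matrix below is hypothetical; rows are treatments and columns are replications):

# Sketch: sums of squares for a Randomised Block Design (RBD),
# following CF, TSS, RSS, TrSS and ESS as defined above.
# data[i][j] = observation for treatment i in replication j (hypothetical values).
data = [
    [12.0, 14.0, 13.0],   # treatment 1 in replications 1..3
    [15.0, 16.0, 14.0],   # treatment 2
    [10.0, 11.0, 12.0],   # treatment 3
    [18.0, 17.0, 19.0],   # treatment 4
]

t = len(data)            # number of treatments
r = len(data[0])         # number of replications
n = t * r
grand_total = sum(sum(row) for row in data)

cf   = grand_total ** 2 / n                                        # correction factor
tss  = sum(x ** 2 for row in data for x in row) - cf               # total SS
rss  = sum(sum(col) ** 2 for col in zip(*data)) / t - cf           # replication SS
trss = sum(sum(row) ** 2 for row in data) / r - cf                 # treatment SS
ess  = tss - rss - trss                                            # error SS

rms, trms, ems = rss / (r - 1), trss / (t - 1), ess / ((r - 1) * (t - 1))
print(f"F(replication) = {rms / ems:.2f}, F(treatment) = {trms / ems:.2f}")

The two F values would then be compared with the table values of F for (r − 1, (r − 1)(t − 1)) and (t − 1, (r − 1)(t − 1)) degrees of freedom respectively.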
Sources of Variation    d.f.              SS      MS      F
Rows                    t − 1             RSS     RMS     RMS / EMS
Columns                 t − 1             CSS     CMS     CMS / EMS
Treatments              t − 1             TrSS    TrMS    TrMS / EMS
Error                   (t − 1)(t − 2)    ESS     EMS
Total                   t² − 1            TSS
F table value
𝐹[(𝑡−1),(𝑡−1)(𝑡−2)] degrees of freedom at 5% or 1% level of significance
Row sum of squares (RSS) = (1/t) Σᵢ₌₁ᵗ Rᵢ² − CF
Column sum of squares (CSS) = (1/t) Σⱼ₌₁ᵗ Cⱼ² − CF
Treatment sum of squares (TrSS) = (1/t) Σₖ₌₁ᵗ Tₖ² − CF
SE = √(EMS / r), where r is the number of rows
SE(d) = √2 × SE
CD = SE(d) × t, where t is the table value of t for a specified level of significance and the error degrees of freedom.
Using CD value the bar chart can be drawn and the conclusion may be
written.
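A minimal Python sketch of the LSD computations (the 4 × 4 layout and yields below are hypothetical; "layout" holds the treatment label of each plot and "yields" the corresponding observations):

# Sketch: sums of squares for a t x t Latin Square Design (LSD),
# with rows, columns and treatments each carrying (t - 1) degrees of freedom.
layout = [                      # treatment applied in each row/column cell (hypothetical)
    ["A", "B", "C", "D"],
    ["B", "C", "D", "A"],
    ["C", "D", "A", "B"],
    ["D", "A", "B", "C"],
]
yields = [                      # observed response in the corresponding cell (hypothetical)
    [10.0, 12.0, 11.0, 13.0],
    [11.0, 12.0, 14.0, 10.0],
    [12.0, 13.0, 10.0, 11.0],
    [14.0, 11.0, 12.0, 13.0],
]

t = len(yields)
n = t * t
grand_total = sum(sum(row) for row in yields)
cf = grand_total ** 2 / n                                            # correction factor

tss = sum(x ** 2 for row in yields for x in row) - cf                # total SS
rss = sum(sum(row) ** 2 for row in yields) / t - cf                  # row SS
css = sum(sum(col) ** 2 for col in zip(*yields)) / t - cf            # column SS

treatment_totals = {}
for row_lab, row_val in zip(layout, yields):
    for lab, val in zip(row_lab, row_val):
        treatment_totals[lab] = treatment_totals.get(lab, 0.0) + val
trss = sum(tot ** 2 for tot in treatment_totals.values()) / t - cf   # treatment SS

ess = tss - rss - css - trss                                         # error SS
ems = ess / ((t - 1) * (t - 2))
print(f"F(treatment) = {(trss / (t - 1)) / ems:.2f} with ({t - 1}, {(t - 1) * (t - 2)}) d.f.")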
11.3.2 Advantages
• LSD is more efficient than RBD or CRD. This is because of double
grouping that will result in small experimental error.
• When missing values are present, the missing plot technique can be used for the analysis.
11.3.3 Disadvantages
• This design is not as flexible as RBD or CRD, as the number of treatments is limited to the number of rows and columns. LSD is seldom used when the number of treatments is more than 12, and it is not suitable for fewer than five treatments.