Business Statistics Class

The document outlines a syllabus for a Business Statistics course, detailing various units covering topics such as the definition and meaning of statistics, measures of central tendency, correlation, regression, and hypothesis testing. It includes a comprehensive breakdown of data collection methods, data presentation techniques, and the characteristics and limitations of statistics. Additionally, the document provides a structured approach to tabulation and graphical representation of data for effective analysis.

Uploaded by

kenedyadriko81
Copyright
© © All Rights Reserved

BUSINESS STATISTICS

SYLLABUS

UNIT   Details
I      Introduction – Meaning and Definition of Statistics – Collection and Tabulation of Statistical Data – Presentation of Statistical Data – Graphs and Diagrams
II     Measures of Central Tendency – Arithmetic Mean, Median and Mode – Harmonic Mean and Geometric Mean
III    Measures of Variation – Quartile Deviation – Mean Deviation – Standard Deviation
IV     Simple Correlation – Scatter Diagram – Karl Pearson's Correlation – Rank Correlation – Regression
V      Testing of Hypothesis – Chi-Square Test, T Test, F Test, ANOVA

Units  Contents

I      Introduction - Meaning and Definition of Statistics
       1.1 Meaning and Definition of Statistics
       1.2 Collection and Tabulation of Statistical Data
       1.3 Presentation of Statistical Data
       1.4 Graphs and Diagrams
II     Measures of Central Tendency
       2.1 Arithmetic Mean
       2.2 Median
       2.3 Mode
       2.4 Geometric Mean
       2.5 Harmonic Mean
III    Measures of Variation
       3.1 Quartile Deviation
       3.2 Mean Deviation
       3.3 Standard Deviation
IV     Simple Correlation
       4.1 Scatter Diagram
       4.2 Karl Pearson's Correlation
       4.3 Rank Correlation
       4.4 Regression
V      Testing of Hypothesis
       5.1 Chi-Square Test
       5.2 T Test
       5.3 F Test
       5.4 ANOVA

UNIT I
INTRODUCTION - MEANING AND DEFINITION OF STATISTICS
1.1 Introduction
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In other words, it is a mathematical discipline concerned with collecting and summarizing data.

The word 'Statistics' is derived from the Latin term 'Status', the Italian term 'Statista', the German term 'Statistik' or the French term 'Statistique', each of which means a political state. The term statistics was applied to mean the facts and figures which were needed by the state in respect of the division of the state, their respective population, birth rate, income and the like.

Meaning of statistics
In the plural sense, statistics refers to numerical facts and figures collected in a systematic manner with a specific purpose in any field of study. In this sense, statistics are aggregates of facts expressed in numerical form.

In the singular sense, statistics refers to a science which comprises methods that are used in the collection, analysis, interpretation and presentation of numerical data. These methods are used to draw conclusions about population parameters.

Definition of statistics
• A. L. Bowley defines statistics as "numerical statements of facts in any department of enquiry placed in relation to each other".
• According to Croxton and Cowden, statistics may be defined as the science of collection, presentation, analysis and interpretation of numerical data.
• Gottfried Achenwall defined statistics as "a collection of noteworthy facts concerning the state, both historical and descriptive".
• According to Yule and Kendall, "Statistics means quantitative data affected to a marked extent by multiplicity of causes."
• "Statistics", as defined by the American Statistical Association (ASA), "is the science of learning from data, and of measuring, controlling and communicating uncertainty."

Nature of statistics
• Statistics is both a science and an art.
• As a science, its methods are generally systematic and based on fundamental ideas and processes.
• It also works as a base for all other sciences.
• As an art, it explores merits and demerits and guides the choice of means to achieve the objective.

Scope of statistics
• Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation and presentation of data.
• It provides tools for predicting and forecasting the economic activities.
• It is useful for an academician, government, business etc.

Characteristics of statistics
• Statistics are aggregates of facts
• They are numerically expressed
• Statistical data are affected by a multiplicity of causes
• They are enumerated according to a reasonable standard of accuracy
• They are collected in a systematic manner
• They are collected for a pre-determined purpose
• They are placed in relation to each other

Uses of statistics
Statistics helps in
• Providing a better understanding
• Exact description
• Efficient planning of a statistical inquiry in any field of study
• Collecting appropriate quantitative data
• Business forecasting
• Decision making
• Quality control
• Search of new ventures
• Study of market
• Study of business cycles
• Useful for planning
• Useful for finding averages
• Useful for bankers, brokers, insurance, etc.

Limitations of statistics
• It is not useful for individual cases
• It ignores qualitative aspects
• It deals with averages only
• Improper use of statistics can be dangerous
• It is only a means, not an end
• It does not distinguish between cause and effect
• Its results are not always dependable

1.2 Collection and tabulation of data

Collection of data


In statistics, data collection is the process of gathering information from all the relevant sources to find a solution to the research problem. It helps to evaluate the outcome of the problem and allows a person to arrive at an answer to the relevant question. Most organizations use data collection methods to make assumptions about future probabilities and trends. Once the data are collected, they must go through the data organization process.
Data can be classified into two types, namely primary data and secondary data. The primary importance of data collection in any research or business process is that it helps to determine many important things about the company, particularly its performance. So, the data collection process plays an important role in all the streams. Depending on the type of data, data collection methods are divided into two categories, namely:
a) Primary Data Collection methods
b) Secondary Data Collection methods

Primary Data Collection Methods


Primary data, or raw data, is information obtained directly from a first-hand source through experiments, surveys or observations. There are several methods to collect this type of data:
Observation Method
Observation method is used when the study relates to behavioural science. This method is planned
systematically. It is subject to many controls and checks. The different types of observations are:
• Structured and unstructured observation
• Controlled and uncontrolled observation
• Participant, non-participant and disguised observation
Interview Method
In this method, data are collected in the form of verbal responses. It is carried out in two ways:
• Personal Interview – In this method, a person known as an interviewer asks questions face to face to the other person. The personal interview can be structured or unstructured, and may take the form of direct investigation, focused conversation, etc.
• Telephonic Interview – In this method, an interviewer obtains information by contacting people on the telephone and asking for their answers or views verbally.
Questionnaire Method
In this method, a set of questions is mailed to the respondent, who should read, reply and subsequently return the questionnaire. The questions are printed in a definite order on the form. A good questionnaire should have the following features:
• Short and simple
• Should follow a logical sequence
• Provide adequate space for answers
• Avoid technical terms
• Should have good physical appearance such as colour, quality of the paper to attract the attention
of the respondent

Schedule method
This method is similar to the questionnaire method with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any have come up. Enumerators should be trained to perform their job with hard work and patience.

Secondary Data Collection Methods


Secondary data is data collected by someone other than the actual user. It means that the information is already available, and someone analyses it. Sources of secondary data include magazines, newspapers, books, journals, etc. Secondary data may be either published or unpublished. Published data are available in various resources including
• Government publications
• Public records
• Historical and statistical documents
• Business documents
• Technical and trade journals
Unpublished data includes
• Diaries
• Letters
• Unpublished biographies, etc.

1.3 Presentation of statistical data

Presentation of data

Presentation of data refers to exhibiting or putting up data in an attractive and useful manner such that it can be easily interpreted. The three main forms of presentation of data are:
a) Textual presentation
b) Tabular presentation
c) Graphical and diagrammatic presentation

a) Textual presentation
When presenting data in this way, researchers use words to describe the relationship between pieces of information. Textual presentation enables researchers to share information that cannot be displayed on a graph. An example of data that researchers present textually is the findings of a study. When researchers want to provide additional context or explanation in their presentation, they may choose this format because, in text, information may appear clearer.
Textual presentation is common for sharing research and presenting new ideas. It includes only paragraphs and words, rather than tables or graphs, to show data.

b) Tabular presentation
Tabular presentation uses a table to share large amounts of information. When using this method, researchers organise and classify the data in rows and columns according to the characteristics of the data. Tabular presentation is useful in comparing data, and it helps visualise information. Researchers use this type of presentation in analysis, after classifying and tabulating the data.

Classification of data
Classification is a process of arranging things or data in groups or classes according to the common
characteristics. It is based on
• Geographical (i.e. on the basis of area or region)
• Chronological (i.e. on a historical basis, with respect to time)
• Qualitative (on the basis of character / attributes)
• Numerical, quantitative (on the basis of magnitude)

1) Geographical Classification
In geographical classification, the classification is based on the geographical regions.
Ex: Sales of the company (in million rupees), region-wise

Region   Sales
North    285
South    300
East     185
West     235

2) Chronological Classification
If the statistical data are classified according to the time of its occurrence, the type of classification is
called chronological classification.
Ex: Sales reported by a departmental store

Month      Sales (Rs. in lakhs)
January    22
February   26
March      32
April      25
May        27
June       30

3) Qualitative Classification
In qualitative classifications, the data are classified according to the presence or absence of attributes in
given units. Thus, the classification is based on some quality characteristics / attributes.
Ex: Sex, Literacy, Education, Class grade etc.
Further, it may be classified as
a) Simple classification b) Manifold classification
i) Simple classification:
If the classification is done into only two classes then classification is known as simple classification.
Ex: a) Population into Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification:
In this classification, the classification is based on more than one attribute at a time.

4) Quantitative Classification
In quantitative classification, the classification is based on quantitative measurements of some characteristic, such as age, marks, income, production, sales etc. The quantitative phenomenon under study is known as a variable, and hence this classification is also called classification by variable.
Ex:
For a 50-mark test, the marks obtained by students are classified as follows:

Marks      No. of students
0 – 10     5
10 – 20    7
20 – 30    10
30 – 40    25
40 – 50    3
Total      50

In this classification, the marks obtained by the students are the variable, and the number of students in each class represents the frequency.

Tabulation of data
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.

Objectives of Tabulation
• To simplify the complex data
• To facilitate comparison
• To economise the space
• To draw valid inference / conclusions
• To help for further analysis

Components of Data Tables

• Table Number: Each table should have a specific table number for ease of access and locating.
• Title: A table must contain a title that clearly tells the readers about the data.
• Head notes: A head note further aids in the purpose of a title and displays more information about the table.
• Stubs: These are titles of the rows in a table.
• Caption: A caption is the title of a column in the data table.
• Body or field: The body of a table is the content of a table in its entirety. Each item in a body is known as a 'cell'.
• Footnotes: Footnotes are rarely used. In effect, they supplement the title of a table if required.

Types of tabulation
In general, tabulation is classified into two types: simple tabulation and complex tabulation. Simple tabulation gives information regarding one or more independent questions; complex tabulation gives information regarding two or more mutually dependent questions.

Simple tabulation
Data are classified based on only one characteristic.

Distribution of marks

Class Marks   No. of students
30 – 40       20
40 – 50       20
50 – 60       10
Total         50

Complex tabulation
Data are classified based on two or more characteristics.
Two-way table: Classification is based on two characteristics.

Class Marks   Number of students
              Boys   Girls   Total
30 – 40       10     10      20
40 – 50       15     5       20
50 – 60       3      7       10
Total         28     22      50
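
As a rough illustration that is not part of the original notes, the row and column totals of the two-way table above can be computed mechanically. The sketch below is in Python, and the variable names are made up for the example.

```python
# Two-way table from the text: class interval -> (boys, girls).
table = {
    "30-40": (10, 10),
    "40-50": (15, 5),
    "50-60": (3, 7),
}

# Row totals: boys + girls within each class interval.
row_totals = {cls: boys + girls for cls, (boys, girls) in table.items()}

# Column totals: sum down each column.
boys_total = sum(boys for boys, _ in table.values())
girls_total = sum(girls for _, girls in table.values())
grand_total = boys_total + girls_total

# Print the table with its marginal totals.
print(f"{'Class':<8}{'Boys':>6}{'Girls':>6}{'Total':>6}")
for cls, (boys, girls) in table.items():
    print(f"{cls:<8}{boys:>6}{girls:>6}{row_totals[cls]:>6}")
print(f"{'Total':<8}{boys_total:>6}{girls_total:>6}{grand_total:>6}")
```

The grand total obtained from the rows should equal the one obtained from the columns, which is a quick check that a two-way table has been tabulated consistently.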

1.4 Graphs and diagrams
Graphical and diagrammatic representations of data are visual aids that can help people understand data
more easily. This method of displaying data uses diagrams, graphs and images. Graphs and diagrams are
both visual representations of data, but they have different purposes and uses.
Graphs
Data can also be effectively presented by means of graphs. A graph consists of curves or straight lines.
Graphs provide a very good method of showing fluctuations and trends in statistical data. Graphs can also
be used to make predictions and forecasts.
• Line graphs
• Histogram
• Frequency Polygon
• Frequency Curve
• Cumulative Frequency Polygon (Ogive)

Line graphs: Show changes and trends in data over time by connecting data points with lines.

Histograms: Similar to column charts, histograms are made up of vertical columns whose length is
proportional to the frequency of a variable.

Frequency polygon
A frequency polygon is a many-sided closed figure. It can be obtained by joining the mid-points of the tops of the rectangles in a histogram.

Frequency Curve

When the frequency polygon is smoothed out into a curve, it becomes a frequency curve. Alternatively, when the mid-points of the classes are plotted against the frequencies, the smooth curve that passes through these points is called a frequency curve.

Cumulative Frequency Polygon (Ogive)
When a curve is based on cumulative frequencies, it is called a cumulative frequency polygon or ogive.

Diagrams

Diagrammatic presentation is a technique of presenting numeric data through pictograms, cartograms, bar diagrams and pie diagrams. It is the most attractive and appealing way to represent statistical data. Diagrams help in visual comparison and give a bird's eye view of the data. Diagrams are classified as:
• Bar diagrams/charts
• Rectangles and sub-divided rectangles
• Pie diagrams/charts
• Scatter plots
• Cartograms
• Pictograms

Bar diagram/Charts

Simple bar chart


This chart consists of vertical or horizontal bars of equal width.

Multiple bar charts or cluster charts


Multiple bar charts represent two or more sets of inter-related data. They facilitate comparison between more than one phenomenon.

Component bar chart or sub-divided bar chart

A component bar chart is an effective technique in which each bar is sub-divided into two or more parts. The component parts are shaded or coloured differently to increase the overall effectiveness of the diagram.

Pie chart
A pie chart is a type of a chart that visually displays data in a circular graph. It is one of the most commonly
used graphs to represent data using the attributes of circles, spheres, and angular data to represent real-
world information.

Scatter plots
A scatter plot is also called a scatter graph, scatter chart, scattergram or scatter diagram. It uses dots to represent values for two different numeric variables. Scatter plots are used to observe relationships between variables.

Cartograms
This includes any type of map that shares the location of a person, place or object. For example, cartograms
help navigate theme parks so you can find attractions, food and gift shops.

Pictograms: This diagram uses images to represent data. For example, a pictogram can show the number of students who prefer each of their favourite games.

Applications of statistics in Business Decisions:

The field of statistics has numerous applications in business. Because of technological advancements, large amounts of data are generated by businesses these days. These data are now being used to make decisions, and better decisions help us improve the running of a department, a company, or the entire economy.
“Statistics is extensively used to enhance Business performance through Analytics”

❖ Marketing: As per Philip Kotler and Gary Armstrong, marketing “identifies customer needs and wants, determines which target markets the organisation can serve best, and designs appropriate products, services and programs to serve these markets.”

Marketing is all about creating and growing customers profitably. Statistics is used in almost every
aspect of creating and growing customers profitably. Statistics is extensively used in making decisions
regarding how to sell products to customers. Also, intelligent use of statistics helps managers to design
marketing campaigns targeted at the potential customers. Marketing research is the systematic and
objective gathering, recording and analysis of data about aspects related to marketing. IMRB
international, TNS India, RNB Research, The Nielson, Hansa Research and Ipsos Indica Research are
some of the popular market research companies in India. Web analytics is about the tracking of online
behaviour of potential customers and studying the behaviour of browsers to various websites. Use of
Statistics is indispensable in forecasting sales, market share and demand for various types of Industrial
products. Factor analysis, conjoint analysis and multidimensional scaling are invaluable tools which
are based on statistical concepts, for designing of products and services based on customer response.

❖ Finance: Uncertainty is the hallmark of the financial world. All financial decisions are based on “expectation”, which is best analysed with the help of the theory of probability and statistical techniques. Probability and statistics are used extensively in designing new insurance policies and in fixing premiums for insurance policies. Statistical tools and techniques are used for analysing and quantifying risk, and also in the valuation of derivative instruments and in comparing the return on investment in two or more instruments or companies. The beta of a stock or equity is a statistical tool for comparing volatility, and is highly useful for the selection of a portfolio of stocks. The most sophisticated traders in today's stock markets are those who trade in “derivatives”, i.e. financial instruments whose price depends on the price of some other underlying asset.

❖ Economics: Statistical data and methods render valuable assistance in the proper understanding of economic problems and the formulation of economic policies. Most economic phenomena and indicators can be quantified and dealt with using statistically sound logic. In fact, Statistics became so integrated with Economics that it led to the development of a new subject called Econometrics, which basically deals with economic issues involving the use of Statistics.

❖ Operations: The field of operations is about transforming various resources into products and services in the place, quantity, cost, quality and time required by the customers. Statistics plays a very useful role at the input stage through sampling inspection and inventory management, in the process stage through statistical quality control and the Six Sigma method, and in the output stage through sampling inspection. The term Six Sigma quality refers to a situation where there are only 3.4 defects per million opportunities.

❖ Human Resource Management or Development: Human Resource departments are inter alia
entrusted with the responsibility of evaluating the performance, developing rating systems, evolving
compensatory reward and training system, etc. All these functions involve designing forms, collecting,
storing, retrieval and analysis of a mass of data. All these functions can be performed efficiently and
effectively with the help of statistics.

❖ Information Systems: Information Technology (IT) and statistics both have similar systematic
approach in problem solving. IT uses statistics in various areas like, optimisation of server time,
assessing performance of a program by finding time taken as well as resources used by the program. It
is also used in testing of the software.

❖ Data Mining: Data Mining is used in almost all fields of business. In Marketing, Data mining can be
used for market analysis and management, target marketing, CRM, market basket analysis, cross
selling, market segmentation, customer profiling and managing web based marketing, etc. In Risk
analysis and management, it is used for forecasting, customer retention, quality control, competitive
analysis and detection of unusual patterns.

In Finance, it is used in corporate planning and risk evaluation, financial planning and asset
evaluation, cash flow analysis and prediction, contingent claim analysis to evaluate assets, cross
sectional and time series analysis, customer credit rating, detecting of money laundering and other
financial crimes. In Operations, it is used for resource planning, for summarising and comparing the
resources and spending. In Retail industry, it is used to identify customer behaviours, patterns and
trends as also for designing more effective goods transportation and distribution policies, etc.

UNIT II
MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a typical value of the entire group or data. It describes the characteristics of the entire mass of data. It reduces the complexity of data and makes them comparable. The human mind is incapable of remembering the entire mass of unwieldy data, so a simple figure, which must be a representative number, is used to describe the series. It is generally called "a measure of central tendency or the average".

A central tendency is a central or typical value for a probability distribution. It may also be called a center or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s. If a large volume of data is summarized and given in one simple term, then it is called the 'central value' or an 'average'. In other words, an average is a single value that represents a group of values.

Characteristics of Ideal Measures:

A measure of central tendency is a typical value around which other figures congregate. An average condenses a frequency distribution into one figure. According to statisticians, an average will be termed good or efficient if it possesses the following characteristics:

➢ It should be rigidly defined. It means that the definition should be so clear that the interpretation
of the definition does not differ from person to person.
➢ It should be easy to understand and simple to calculate.
➢ It should be such that it can be easily determined.
➢ The average of a variable should be based on all the values of the variable. This means that in
the formula for average all the values of the variable should be incorporated.
➢ The value of average should not change significantly along with the change in sample. This
means that the values of the averages of different samples of the same size drawn from the same
population should have small variations.

➢ It should be amenable to algebraic treatment.


➢ It should not be unduly affected by extreme values, i.e. the formula for the average should be such that it does not show large changes due to the presence of one or two very large or very small values of the variable.
➢ It should be properly defined, preferably by a mathematical formula, so that different individuals
working with the same data should get the same answer unless there are mistakes in calculations.
➢ It should be based on all the observations so that if we change the value of any observation, the
value of the average should also be changed.
➢ It should not be unduly affected by extremely large or extremely small values.
➢ It should be capable of algebraic manipulation. By this we mean that if we are given the average
heights for different groups, then the average should be such that we can find the combined
average of all groups taken together.
➢ It should have the quality of sampling stability. That is, it should not be affected by the fluctuations of sampling. For example, if we take ten or twelve samples of twenty students each and find the average height for each sample, we should get approximately the same average height for each sample.

2.1 MEAN

Mean is one of the types of averages. Mean is further divided into three kinds: the arithmetic mean, the geometric mean and the harmonic mean. These kinds are explained as follows:

i) Arithmetic Mean: Simple Arithmetic Average:


A. Individual Observation:
Direct Method:

The arithmetic mean is the most commonly used average. It is generally referred to as the average or simply the mean. It is defined as the value obtained by dividing the sum of the values by their number. It is denoted by $\bar{X}$ (read as X-bar). Therefore, the mean of the values $X_1, X_2, X_3, \ldots, X_n$ is denoted by $\bar{X}$. The formula for the arithmetic mean is:

$\bar{X} = \frac{\sum X}{N}$

Where, $\bar{X}$ = Arithmetic Mean; $\sum X$ = Sum of all the values of the variable, i.e., $X_1 + X_2 + X_3 + \ldots + X_n$; N = Number of observations.

Illustration 1: Calculate mean from the following data:

Roll Numbers 1 2 3 4 5 6 7 8 9 10
Marks 40 50 55 78 58 60 73 35 43 48

Solution: Calculation of mean

Roll Numbers   Marks (X)
1              40
2              50
3              55
4              78
5              58
6              60
7              73
8              35
9              43
10             48
N = 10         ΣX = 540

$\bar{X} = \frac{\sum X}{N} = \frac{540}{10}$ = 54 marks.
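
The direct method above can be checked with a few lines of Python. This is only a minimal sketch using the marks from Illustration 1; the variable names are chosen for the example and do not come from the notes.

```python
# Marks of the ten students from Illustration 1.
marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]

total = sum(marks)   # sum of all values (sigma X)
n = len(marks)       # number of observations (N)
mean = total / n     # direct method: sigma X divided by N

print(f"Sum = {total}, N = {n}, Mean = {mean} marks")
```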

Short cut method:

The arithmetic mean can also be calculated by the short cut method. This method reduces the amount of calculation. The formula is:

$\bar{X} = A + \frac{\sum d}{N}$

Where, $\bar{X}$ = Arithmetic Mean; A = Assumed mean; $\sum d$ = Sum of the deviations (d = X − A); N = Number of items.

Illustration 2: (Solving the previous problem)

Let the assumed mean, A = 50

Roll Numbers   Marks (X)   d = X − A
1              40          -10
2              50          0
3              55          5
4              78          28
5              58          8
6              60          10
7              73          23
8              35          -15
9              43          -7
10             48          -2
N = 10                     Σd = 40

$\bar{X} = A + \frac{\sum d}{N} = 50 + \frac{40}{10}$ = 54 marks.
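
A small Python sketch of the short cut method is given here as an illustration. It also tries two other assumed means (40 and 60, picked arbitrarily for this example) to show that the result does not depend on the choice of A. The function name is hypothetical.

```python
marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]

def shortcut_mean(values, assumed_mean):
    """Mean by the short cut method: A + (sum of deviations from A) / N."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)

# A = 50 is the assumed mean used in Illustration 2.
deviations_from_50 = [x - 50 for x in marks]
print("Sum of deviations:", sum(deviations_from_50))
print("Mean with A = 50:", shortcut_mean(marks, 50))

# Any other assumed mean leads to the same arithmetic mean.
for a in (40, 60):
    print(f"Mean with A = {a}:", shortcut_mean(marks, a))
```

Choosing A close to the true mean only keeps the deviations small and the hand calculation easy; it does not change the answer.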

B. Discrete Series:
Direct Method:
To find the total of the items in a discrete series, the frequency of each value is multiplied by the respective size. The values so obtained are totalled. This total is then divided by the total frequency to obtain the arithmetic mean. The formula is

$\bar{X} = \frac{\sum fx}{N}$

Where, $\bar{X}$ = Arithmetic Mean; $\sum fx$ = the sum of the products of f and x; N = Total frequency.
Illustration 3: Calculate mean from the following data:

Value 1 2 3 4 5 6 7 8 9 10
Frequency 21 30 28 40 26 34 40 9 15 57

Solution: Calculation of Mean

x     f     fx
1     21    21
2     30    60
3     28    84
4     40    160
5     26    130
6     34    204
7     40    280
8     9     72
9     15    135
10    57    570
      Σf = N = 300    Σfx = 1716

$\bar{X} = \frac{\sum fx}{N} = \frac{1716}{300}$ = 5.72

Short cut Method:
Formula:

$\bar{X} = A + \frac{\sum fd}{N}$

Where, $\bar{X}$ = Arithmetic Mean; A = Assumed mean; $\sum fd$ = Sum of total deviations (d = X − A); N = Total frequency.

Illustration 4: (Solving the previous problem)

Let the assumed mean, A = 5

X     f     d = X − A   fd
1     21    -4          -84
2     30    -3          -90
3     28    -2          -56
4     40    -1          -40
5     26    0           0
6     34    1           34
7     40    2           80
8     9     3           27
9     15    4           60
10    57    5           285
      Σf = N = 300      Σfd = +216

$\bar{X} = A + \frac{\sum fd}{N} = 5 + \frac{216}{300}$ = 5 + 0.72 = 5.72
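
For a discrete series, both the direct and the short cut method reduce to weighted sums. The Python sketch below reworks Illustrations 3 and 4 under that reading; the variable names are illustrative only.

```python
# Values and their frequencies from Illustration 3.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [21, 30, 28, 40, 26, 34, 40, 9, 15, 57]

n = sum(freqs)  # total frequency N

# Direct method: sum of f*x divided by N.
sum_fx = sum(f * x for x, f in zip(values, freqs))
direct_mean = sum_fx / n

# Short cut method with assumed mean A = 5: A + (sum of f*d) / N.
assumed_mean = 5
sum_fd = sum(f * (x - assumed_mean) for x, f in zip(values, freqs))
shortcut_mean = assumed_mean + sum_fd / n

print(f"N = {n}, sum fx = {sum_fx}, direct mean = {direct_mean:.2f}")
print(f"sum fd = {sum_fd}, short cut mean = {shortcut_mean:.2f}")
```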

C. Continuous Series
In a continuous frequency distribution, the value of each individual item is unknown. Therefore, it is assumed that the frequencies of each class interval are concentrated at its centre, so the mid-point of each class interval has to be found. In a continuous frequency distribution, the mean can be calculated by any of the following methods:

1. Direct Method
2. Short cut method
3. Step Deviation Method

1. Direct Method: The formula is

$\bar{X} = \frac{\sum fm}{N}$

Where, $\bar{X}$ = Arithmetic Mean; $\sum fm$ = Sum of the products of f and m (m = mid-point of the class interval); N = Total frequency.
Illustration 5: From the following find out the mean:

Class Interval 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50
Frequency 6 5 8 15 7

Solution: Calculation of Mean

Class Interval   Mid Point (m)        Frequency (f)   fm
0 – 10           (0 + 10)/2 = 5       6               30
10 – 20          (10 + 20)/2 = 15     5               75
20 – 30          (20 + 30)/2 = 25     8               200
30 – 40          (30 + 40)/2 = 35     15              525
40 – 50          (40 + 50)/2 = 45     7               315
                                      Σf = N = 41     Σfm = 1145

$\bar{X} = \frac{\sum fm}{N} = \frac{1145}{41}$ = 27.93
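
The direct method for a continuous series can be sketched in Python as below, using the class intervals of Illustration 5. Each class is represented by its mid-point, which is the assumption stated in the text; the variable names are made up for the example.

```python
# Class intervals (lower, upper) and frequencies from Illustration 5.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [6, 5, 8, 15, 7]

# Mid-point of each class: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(freqs)                                      # total frequency N
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
mean = sum_fm / n                                   # sum of f*m divided by N

print("Mid-points:", midpoints)
print(f"N = {n}, sum fm = {sum_fm}, mean = {mean:.2f}")
```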

2. Short cut method:
Formula:

$\bar{X} = A + \frac{\sum fd}{N}$

Where, $\bar{X}$ = Arithmetic Mean; A = Assumed mean; $\sum fd$ = Sum of total deviations (d = m − A); N = Total frequency.

Illustration 6: (Solving the previous problem)

d = m − A; here A = 25

Class Interval   Mid Point (m)        d = m − A        f     fd
0 – 10           (0 + 10)/2 = 5       5 – 25 = -20     6     -120
10 – 20          (10 + 20)/2 = 15     15 – 25 = -10    5     -50
20 – 30          (20 + 30)/2 = 25     25 – 25 = 0      8     0
30 – 40          (30 + 40)/2 = 35     35 – 25 = 10     15    150
40 – 50          (40 + 50)/2 = 45     45 – 25 = 20     7     140
                                      Σf = N = 41      Σfd = +120

$\bar{X} = A + \frac{\sum fd}{N} = 25 + \frac{120}{41}$

= 25 + 2.93

= 27.93

3. Step Deviation Method
Formula:

$\bar{X} = A + \frac{\sum fd'}{N} \times C$

Where, $\bar{X}$ = Arithmetic Mean; A = Assumed mean; $\sum fd'$ = Sum of the products of f and the step deviations d' (d' = d/C);

N = Total frequency; C = Common Factor

Illustration: 7 (Solving the previous problem)

Class Interval    Mid Point (m)        Frequency (f)    d = m - A         d' = d/C    fd'
0 – 10            (0 + 10)/2 = 5        6               5 – 25 = -20      -2          -12
10 – 20           (10 + 20)/2 = 15      5               15 – 25 = -10     -1           -5
20 – 30           (20 + 30)/2 = 25      8               25 – 25 = 0        0            0
30 – 40           (30 + 40)/2 = 35     15               35 – 25 = 10       1           15
40 – 50           (40 + 50)/2 = 45      7               45 – 25 = 20       2           14
                                       Σf = N = 41                                    Σfd' = +12

Here A = 25; C = 10

$\bar{X} = A + \frac{\Sigma fd'}{N} \times C = 25 + \frac{12}{41} \times 10$

= 25 + 2.93

= 27.93
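
The three methods always give the same mean; they differ only in how much arithmetic is done by hand. As a cross-check, here is a minimal Python sketch (the function names are illustrative, not part of the course text) that applies each formula to the midpoints and frequencies of Illustrations 5 to 7.

```python
# Midpoints and frequencies of the classes 0-10, 10-20, ..., 40-50.
midpoints = [5, 15, 25, 35, 45]
frequencies = [6, 5, 8, 15, 7]


def mean_direct(m, f):
    """Direct method: sum(f * m) / N."""
    return sum(fi * mi for fi, mi in zip(f, m)) / sum(f)


def mean_shortcut(m, f, assumed):
    """Short cut method: A + sum(f * d) / N, where d = m - A."""
    total_fd = sum(fi * (mi - assumed) for fi, mi in zip(f, m))
    return assumed + total_fd / sum(f)


def mean_step_deviation(m, f, assumed, common):
    """Step deviation method: A + (sum(f * d') / N) * C, where d' = (m - A) / C."""
    total_fd = sum(fi * (mi - assumed) / common for fi, mi in zip(f, m))
    return assumed + total_fd / sum(f) * common


direct = mean_direct(midpoints, frequencies)
shortcut = mean_shortcut(midpoints, frequencies, assumed=25)
step = mean_step_deviation(midpoints, frequencies, assumed=25, common=10)
print(round(direct, 2), round(shortcut, 2), round(step, 2))  # 27.93 27.93 27.93
```

Any assumed mean A gives the same answer; choosing a midpoint near the centre of the table simply keeps the deviations small.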

2.2 MEDIAN

Median is the value of the item that divides the series into two equal parts, one half containing values greater than it and the other half containing values less than it. The series therefore has to be arranged in ascending or descending order before the median is found; the item that falls in the middle of the ordered series is the median. Hence it is the "middle most" or "most central" value of a set of numbers.

Calculation of Median – Individual Series:


Illustration 1: Find out the median of the following items. X: 10, 15, 9, 25, 19.

Solution: Computation of Median

S. No.    Size in ascending order    Size in descending order
1          9                         25
2         10                         19
3         15                         15
4         19                         10
5         25                          9

Median = Size of $\left(\frac{N+1}{2}\right)^{th}$ item

= Size of $\left(\frac{5+1}{2}\right)^{th}$ item

= 3rd item = 15.

Illustration 2: Find out the median of the following items. X: 8, 10, 5, 9, 12, 11.

Solution: Computation of Median

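S. No.    X
1          5
2          8
3          9
4         10
5         11
6         12

Median = Size of $\left(\frac{N+1}{2}\right)^{th}$ item

= Size of $\left(\frac{6+1}{2}\right)^{th}$ item

= Size of 3.5th item

= Size of $\frac{\text{3rd item} + \text{4th item}}{2} = \frac{9 + 10}{2} = 9.5$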
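
The positional rule used in Illustrations 1 and 2 can be sketched in a few lines of Python. This is only an illustrative helper (the name `median_individual` is an assumption); it sorts the items, finds the (N+1)/2-th position, and averages the two neighbouring items when that position is fractional.

```python
def median_individual(values):
    """Median of an individual series using the (N + 1) / 2 positional rule."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2      # 1-based position of the median
    lower = ordered[int(position) - 1]     # item at the whole-number position
    if position == int(position):          # odd N: a single middle item
        return lower
    upper = ordered[int(position)]         # even N: average the two middle items
    return (lower + upper) / 2


median_odd = median_individual([10, 15, 9, 25, 19])
median_even = median_individual([8, 10, 5, 9, 12, 11])
print(median_odd, median_even)  # 15 9.5
```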

Calculation of Median – Discrete Series

Illustration 3: Locate median from the following:


Size of shoes 5 5.5 6 6.5 7 7.5 8
Frequency 10 16 28 15 30 40 34

Solution: Computation of Median

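Size of shoes    f     c.f.
5                10     10
5.5              16     26
6                28     54
6.5              15     69
7                30     99
7.5              40    139
8                34    173

Median = Size of $\left(\frac{N+1}{2}\right)^{th}$ item

= Size of $\left(\frac{173+1}{2}\right)^{th}$ item

= Size of 87th item

= 7 (the 87th item falls within the cumulative frequency 99, which corresponds to size 7)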

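Calculation of Median – Continuous Series

Illustration 4: Calculate the median of the following table:

Marks        10 – 25    25 – 40    40 – 55    55 – 70    70 – 85    85 – 100
Frequency    6          20         44         26         3          1

Solution: Computation of Median

x            f     c.f.
10 – 25       6      6
25 – 40      20     26
40 – 55      44     70
55 – 70      26     96
70 – 85       3     99
85 – 100      1    100

Median = Size of $\left(\frac{N}{2}\right)^{th}$ item $= \frac{100}{2} = 50$th item, which lies in the class 40 – 55.

L = 40; f = 44; cf = 26; i = 15

Median $= L + \frac{\frac{N}{2} - cf}{f} \times i = 40 + \frac{50 - 26}{44} \times 15$

= 40 + 8.18

= 48.18 marks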
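
The interpolation step can be checked with a short Python sketch. The function name and argument layout are assumptions made for illustration; it presumes equal class widths and locates the median class as the first class whose cumulative frequency reaches N/2.

```python
def median_grouped(lower_limits, width, freqs):
    """Median of a continuous series: L + ((N/2 - cf) / f) * i."""
    half = sum(freqs) / 2
    cumulative = 0                      # cf of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= half:      # this is the median class
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("frequencies must not be empty")


marks_lower = [10, 25, 40, 55, 70, 85]
marks_freq = [6, 20, 44, 26, 3, 1]
median_marks = median_grouped(marks_lower, 15, marks_freq)
print(round(median_marks, 2))  # 48.18
```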

Merits:
1. It is easy to compute and understand.
2. It eliminates the effect of extreme items.
3. The value of median can be located graphically.
4. It is amenable to further algebraic process as it is used in the measurement of dispersion.
5. It can be computed even if the items at the extremes are unknown.
Demerits:
1. For calculating median, it is necessary to arrange the data; other averages do not need any
arrangement.
2. Typical representative of the observations cannot be computed if the distribution of item is
irregular.
3. It is affected more by fluctuation of sampling than the arithmetic mean.

2.3 MODE

Mode is the value which occurs with the greatest frequency in a series. It is derived from the French word "La mode", meaning the fashion. It is the most fashionable or typical value of a distribution, because it is repeated the highest number of times in the series.

Mode or the modal value is defined as the value of the variable which occurs most frequently in a distribution.

Types of Mode:

i) Unimodal:
If there is only one mode in the series, it is called unimodal.

Eg., 10, 15, 20, 25, 18, 12, 15 (Mode is 15)

ii) Bi-modal:
If there are two modes in the series, it is called bi-modal.

Eg., 20, 25, 30, 30, 15, 10, 25 (Modes are 25, 30)

iii) Tri-modal:
If there are three modes in the series, it is called tri-modal.

Eg., 60, 40, 85, 30, 85, 45, 80, 80, 55, 50, 60 (Modes are 60, 80, 85)

iv) Multi-modal:
If there are more than three modes in the series, it is called multi-modal.
Merits:
1. It can be easily ascertained without much mathematical calculation.
2. It is not essential to know all the items in a series to compute mode.
3. Open – end classes do not disturb the position of the mode.
4. Its values can be ascertained graphically as well as empirically.
5. It can be applied to qualitative as well as quantitative data.
6. It is not affected by extreme values, unlike the arithmetic average.
Demerits:
1. The mode becomes less useful as an average when the distribution is bi-modal.
2. It is not suitable for further mathematical treatment.
3. It is stable only when the sample is large.
4. Mode is influenced by the magnitude of the class intervals.

Mode – Individual Series

Illustration 1: Calculate the mode from the following data of the marks obtained by 10 students.

Serial No.        1     2     3     4     5     6     7     8     9     10
Marks obtained    60    77    74    62    77    77    70    68    65    80

Solution:
Marks obtained by 10 students: 60, 77, 74, 62, 77, 77, 70, 68, 65, and 80.

Here 77 is repeated three times.

The modal mark is 77.

DISCRETE SERIES:

A grouping table has six columns.

Column 1: Write the actual frequencies and mark the highest frequency.

Column 2: Frequencies are grouped in twos, adding frequencies of items 1 and 2; 3 and 4; 5 and 6; and so on.

Column 3: Leave the first frequency and then add the remaining in twos.

Column 4: Group the frequencies in threes.

Column 5: Leave the first frequency and group the remaining in threes.

Column 6: Leave the first two frequencies and then group the remaining in threes.

The maximum frequencies in all six columns are marked with a circle and an analysis table is prepared as follows:

1. Put the column numbers on the left-hand side.
2. Put the various probable values of the mode on the right-hand side.
3. Enter the highest marked frequencies by means of a bar in the relevant box corresponding to the values they represent.

Illustration: 2. Calculate the mode from the following:

Size 10 11 12 13 14 15 16 17 18
Frequency 10 12 15 19 20 8 4 3 2

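Solution: Grouping Table

(* marks the maximum in each column; the sizes covered by each group are shown in brackets.)

Column 1 (actual frequencies): 10, 12, 15, 19, 20*, 8, 4, 3, 2
Column 2 (twos, from the first item): 22 (10, 11); 34* (12, 13); 28 (14, 15); 7 (16, 17)
Column 3 (twos, leaving the first item): 27 (11, 12); 39* (13, 14); 12 (15, 16); 5 (17, 18)
Column 4 (threes, from the first item): 37 (10, 11, 12); 47* (13, 14, 15); 9 (16, 17, 18)
Column 5 (threes, leaving the first item): 46* (11, 12, 13); 32 (14, 15, 16)
Column 6 (threes, leaving the first two items): 54* (12, 13, 14); 15 (15, 16, 17)

Analysis Table

              Size of item containing maximum frequency
Column No.    11    12    13    14    15
1                               1
2                   1     1
3                         1     1
4                         1     1     1
5             1     1     1
6                   1     1     1
Total         1     3     5     4     1

The mode is 13, because that size occurs 5 times in the analysis table. By mere inspection we would have said that the mode is 14, because size 14 has the highest frequency (20). The analysis table reveals that this decision would be wrong.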
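
The grouping and analysis tables are mechanical enough to automate. The following Python sketch is one possible reading of the six-column procedure (the function name is an assumption, and when two groups tie for the maximum in a column it counts both, which the text does not discuss). It returns the column totals of the analysis table.

```python
from collections import Counter


def mode_by_grouping(sizes, freqs):
    """Build the analysis-table totals for the six grouping columns."""
    tally = Counter()
    # (group size, number of leading items left out) for columns 1 to 6
    columns = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)]
    for group_size, offset in columns:
        starts = range(offset, len(freqs) - group_size + 1, group_size)
        sums = {s: sum(freqs[s:s + group_size]) for s in starts}
        best = max(sums.values())
        for start, total in sums.items():
            if total == best:
                # every size inside the maximum group gets one mark
                tally.update(sizes[start:start + group_size])
    return tally


sizes = [10, 11, 12, 13, 14, 15, 16, 17, 18]
frequencies = [10, 12, 15, 19, 20, 8, 4, 3, 2]
tally = mode_by_grouping(sizes, frequencies)
modal_size = tally.most_common(1)[0][0]
print(sorted(tally.items()))  # [(11, 1), (12, 3), (13, 5), (14, 4), (15, 1)]
print(modal_size)             # 13
```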

Calculation of Mode – Continuous Series

$Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$

Where,
Z = Mode; L1 = Lower limit of the modal class; f1 = Frequency of the modal class; f0 = Frequency of the class preceding the modal class; f2 = Frequency of the class succeeding the modal class; i = Class interval.

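Illustration: 3. Calculate the mode from the following:

Size of item    Frequency
0 – 5           20
5 – 10          24
10 – 15         32
15 – 20         28
20 – 25         20
25 – 30         16
30 – 35         34
35 – 40         10
40 – 45          8

Solution:

Grouping Table

(* marks the maximum in each column; the classes covered by each group are shown in brackets.)

Column 1 (actual frequencies): 20, 24, 32, 28, 20, 16, 34*, 10, 8
Column 2 (twos, from the first class): 44 (0 – 10); 60* (10 – 20); 36 (20 – 30); 44 (30 – 40)
Column 3 (twos, leaving the first class): 56* (5 – 15); 48 (15 – 25); 50 (25 – 35); 18 (35 – 45)
Column 4 (threes, from the first class): 76* (0 – 15); 64 (15 – 30); 52 (30 – 45)
Column 5 (threes, leaving the first class): 84* (5 – 20); 70 (20 – 35)
Column 6 (threes, leaving the first two classes): 80* (10 – 25); 60 (25 – 40)

Analysis Table

              Size of item containing maximum frequency
Column No.    0 – 5    5 – 10    10 – 15    15 – 20    20 – 25    30 – 35
1                                                                 1
2                                1          1
3                      1         1
4             1        1         1
5                      1         1          1
6                                1          1          1
Total         1        3         5          3          1          1

The modal class is 10 – 15, because it has the highest total (5) in the analysis table.

$Z = L_1 + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times i$

L1 = 10; f1 = 32; f0 = 24; f2 = 28; i = 5

$Z = 10 + \frac{32 - 24}{2(32) - 24 - 28} \times 5 = 10 + \frac{8}{12} \times 5 = 10 + 3.33$

∴ The Mode is 13.33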
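
Once the modal class has been identified, the interpolation formula is a one-line calculation. A minimal Python sketch, with an illustrative function name and the values of Illustration 3:

```python
def mode_grouped(lower, width, f0, f1, f2):
    """Mode of a continuous series: L1 + (f1 - f0) / (2*f1 - f0 - f2) * i."""
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * width


# Modal class 10-15: L1 = 10, i = 5, f0 = 24, f1 = 32, f2 = 28
mode_value = mode_grouped(lower=10, width=5, f0=24, f1=32, f2=28)
print(round(mode_value, 2))  # 13.33
```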

The relationship among mean, median and mode:

The three averages (mean, median and mode) are identical, when the distribution is
symmetrical. In an asymmetrical distribution, the values of mean, median and mode are not
equal. In a moderately asymmetrical distribution the distance between the mean and median
is about one-third of the distance between mean and mode.

Mean - Median = 1/3 (Mean - Mode)

Mode = 3 Median - 2 Mean

Median = Mode + 2/3 (Mean - Mode)

2.4 GEOMETRIC MEAN
The geometric mean, G, of a set of n positive values X1, X2, ..., Xn is the nth root of the product of the n items. Mathematically, the formula for the geometric mean is as follows:
G.M. $= \sqrt[n]{X_1 \cdot X_2 \cdots X_n} = (X_1 \cdot X_2 \cdots X_n)^{1/n}$

G.M = Geometric Mean; n = number of items; X1, X2, X3,….. = are various values

Illustration 1: The geometric mean of the values 2, 4 and 8 is the cube root of 2 × 4 × 8, or $\sqrt[3]{64} = 4$.

In practice, it is difficult to extract higher roots. The geometric mean is, therefore, computed using logarithms. Mathematically, it is represented as follows:

Geometric Mean = Antilog of $\frac{\Sigma \log X}{N}$

Here we assume that all the values are positive; otherwise the logarithms are not defined.

Geometric Mean – Individual Series

Illustration 2: Calculate the geometric mean of the following: 50, 72, 54, 82, 93

Solution: Calculation of Geometric Mean

X        log of X
50       1.6990
72       1.8573
54       1.7324
82       1.9138
93       1.9685
N = 5    Σ log X = 9.1710

G.M. $= \sqrt[5]{50 \times 72 \times 54 \times 82 \times 93}$ or

G.M. = Antilog $\frac{\Sigma \log X}{N}$

G.M. = Antilog $\frac{9.1710}{5}$

= Antilog of 1.8342

= 68.26
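
The logarithm method translates directly into code. This Python sketch (the function name is an assumption) averages the base-10 logarithms and takes the antilog; because it uses full-precision logarithms instead of four-figure tables, the last decimal place can differ slightly from the hand calculation.

```python
import math


def geometric_mean(values):
    """Geometric mean as the antilog of the average of log10(X)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


gm_small = geometric_mean([2, 4, 8])             # cube root of 64
gm_series = geometric_mean([50, 72, 54, 82, 93])
print(round(gm_small, 4), round(gm_series, 1))
```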

Geometric Mean – Discrete Series

G.M. = antilog $\left(\frac{\Sigma f \log x}{N}\right)$

Where, f = frequency; log x = logarithm of each value; N = Total frequencies

Illustration 3: The following table gives the weight of 31 persons in a sample survey.
Calculate geometric mean.

Weight (lbs) 130 135 140 145 146 148 149 150 157
No. of persons 3 4 6 6 3 5 2 1 1

Solution: Calculation of Geometric Mean

Size of item (X)    Frequency (f)    log X     f log X
130                 3                2.1139     6.3417
135                 4                2.1303     8.5212
140                 6                2.1461    12.8766
145                 6                2.1614    12.9684
146                 3                2.1644     6.4932
148                 5                2.1703    10.8515
149                 2                2.1732     4.3464
150                 1                2.1761     2.1761
157                 1                2.1959     2.1959
                    N = Σf = 31                Σf log X = 66.7710

G.M. = antilog $\left(\frac{\Sigma f \log X}{N}\right)$

G.M. = antilog $\left(\frac{66.7710}{31}\right)$ = antilog of 2.1539

G.M. Weight = 142.5 lbs

Geometric Mean – Continuous Series

G.M. = antilog $\left(\frac{\Sigma f \log m}{N}\right)$

Where, f = frequency; m = mid value; log m = logarithm of each mid value; N = Total frequencies

Illustration 4: Find out the geometric mean:

Yield of wheat (maunds)    No. of farms
7.5 – 10.5                  5
10.5 – 13.5                 9
13.5 – 16.5                19
16.5 – 19.5                23
19.5 – 22.5                 7
22.5 – 25.5                 4
25.5 – 28.5                 1

Solution: Calculation of Geometric Mean

Yield of wheat (maunds)    No. of farms (f)    m     log m     f log m
7.5 – 10.5                  5                   9    0.9542     4.7710
10.5 – 13.5                 9                  12    1.0792     9.7128
13.5 – 16.5                19                  15    1.1761    22.3459
16.5 – 19.5                23                  18    1.2553    28.8719
19.5 – 22.5                 7                  21    1.3222     9.2554
22.5 – 25.5                 4                  24    1.3802     5.5208
25.5 – 28.5                 1                  27    1.4314     1.4314
                           N = Σf = 68                         Σf log m = 81.9092

G.M. = antilog $\left(\frac{\Sigma f \log m}{N}\right)$

G.M. = antilog $\left(\frac{81.9092}{68}\right)$ = antilog of 1.2045

= 16.02 maunds

Uses:
• Geometric mean is highly useful in averaging ratios, percentages and rate of increase between two
periods.

• Geometric mean is important in the construction of index numbers.

• In economic and social sciences, where we want to give more weight to smaller items and smaller
weight to large items, geometric mean is appropriate.

• It is the only useful average that can be employed to indicate rate of change.

Merits:

• Every item in the distribution is included in the calculation.


• It can be calculated with mathematical exactness, provided that all the quantities are greater than
zero and positive.
• Large items have less effect on it than the arithmetic average.
• It is amenable to further algebraic manipulation.

Demerits:

• It is very difficult to calculate.


• It is impossible to use it when any item is zero or negative.
• The value of the geometric mean may not correspond with any actual value in the distribution.
• It cannot be used in series in which the end values of the classes are left open.

2.5 HARMONIC MEAN

Harmonic Mean, like the geometric mean, is a measure of central tendency used in solving special types of problems. The harmonic mean is the reciprocal of the arithmetic average of the reciprocals of the values of the various items in the variable. The reciprocal of a number is the value obtained by dividing one by that number.

For example, the reciprocal of 5 is 1/5. The reciprocal can be obtained from logarithm tables.

Harmonic Mean – Individual Series

H.M. $= \frac{N}{\frac{1}{X_1} + \frac{1}{X_2} + \dots + \frac{1}{X_n}}$ or H.M. $= \frac{N}{\Sigma\left(\frac{1}{X}\right)}$

X1, X2, X3, ..., Xn refer to the various items in the observations.

Illustration 5: The monthly incomes of 10 families in rupees are given below. Calculate the harmonic mean.

Family:    1     2     3     4     5      6    7     8      9     10
Income:    85    70    10    75    500    8    42    250    40    36

Solution: Calculation of Harmonic Mean

Family    Income (x)    Reciprocal (1/x)
1          85           1/85 = 0.0118
2          70           1/70 = 0.0143
3          10           1/10 = 0.1000
4          75           1/75 = 0.0133
5         500           1/500 = 0.0020
6           8           1/8 = 0.1250
7          42           1/42 = 0.0238
8         250           1/250 = 0.0040
9          40           1/40 = 0.0250
10         36           1/36 = 0.0278
N = 10                  Σ(1/x) = 0.3470

H.M. $= \frac{N}{\Sigma\left(\frac{1}{x}\right)} = \frac{10}{0.3470}$

= Rs. 28.82

Harmonic Mean – Discrete Series


H.M. $= \frac{N}{\Sigma f\left(\frac{1}{X}\right)}$

Illustration 6: Calculate harmonic mean from the following data.

Size of items: 6 7 8 9 10 11
Frequency: 4 6 9 5 2 8

Solution: Calculation of harmonic mean

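Size of items (x)    Frequency (f)    Reciprocal (1/x)    f(1/x)
6                    4                0.1667              0.6668
7                    6                0.1429              0.8574
8                    9                0.1250              1.1250
9                    5                0.1111              0.5555
10                   2                0.1000              0.2000
11                   8                0.0909              0.7272
                     N = Σf = 34                          Σf(1/x) = 4.1319

H.M. $= \frac{N}{\Sigma f\left(\frac{1}{x}\right)} = \frac{34}{4.1319} = 8.23$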

Harmonic Mean - Continuous Series

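H.M. $= \frac{N}{\Sigma f\left(\frac{1}{m}\right)}$, where m is the mid value of each class.

Illustration 7: Calculate H.M. of the following data:

Size:         0 – 10    10 – 20    20 – 30    30 – 40    40 – 50
Frequency:    5         8          12         6          4

Solution: Calculation of harmonic mean

Size (x)    Frequency (f)    Mid Value (m)    Reciprocal (1/m)    f(1/m)
0 – 10       5                5               1/5 = 0.2000        1.0000
10 – 20      8               15               1/15 = 0.0667       0.5336
20 – 30     12               25               1/25 = 0.0400       0.4800
30 – 40      6               35               1/35 = 0.0286       0.1716
40 – 50      4               45               1/45 = 0.0222       0.0888
            N = Σf = 35                                           Σf(1/m) = 2.274

H.M. $= \frac{N}{\Sigma f\left(\frac{1}{m}\right)} = \frac{35}{2.274} = 15.39$
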
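
All three harmonic mean formulas are the same calculation with different weights, which the following Python sketch makes explicit. The function name and the optional `freqs` argument are assumptions for illustration; for a continuous series the class midpoints are passed as the values.

```python
def harmonic_mean(values, freqs=None):
    """Harmonic mean: N / sum(f * (1 / x)); f defaults to 1 for an individual series."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs strictly positive values")
    return sum(freqs) / sum(f / v for v, f in zip(values, freqs))


hm_discrete = harmonic_mean([6, 7, 8, 9, 10, 11], [4, 6, 9, 5, 2, 8])
hm_continuous = harmonic_mean([5, 15, 25, 35, 45], [5, 8, 12, 6, 4])
print(round(hm_discrete, 2), round(hm_continuous, 2))  # 8.23 15.39
```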
UNIT III

DISPERSION (Measures of Variation)


Dispersion is studied to have an idea of the homogeneity or heterogeneity of the distribution.
Measures of dispersion are the measures of scatter or spread about an average.
Measures of dispersion are called the averages of the second order.

Methods of Measuring Dispersion:

There are various methods of studying variation or dispersion. The important methods are as follows:

1. Range
2. Inter - quartile range
3. Mean Deviation
4. Standard Deviation
5. Lorenz curve

1. Range
Range is the simplest and crudest measure of dispersion. It is a rough measure of dispersion.
It is the difference between the highest and the lowest value in the distribution.

Range = L – S

Where, L = Largest Value; S = Smallest Value.

The relative measure of range is called the Co-efficient of Range.

Co-efficient of Range $= \frac{L - S}{L + S}$

Illustration 1:
Find the range of weights of 7 students from the following.

27, 30, 35, 36, 38, 40, 43


Solution:
Range = L – S

Here L = 43; S = 27

Range = 43 – 27 = 16

Co-efficient of Range $= \frac{L - S}{L + S} = \frac{43 - 27}{43 + 27} = \frac{16}{70} = 0.23$

Practical utility of Range


1. It is used in industries for the statistical quality control of the manufactured product.
2. It is used to study variations in the prices of stocks, shares and other commodities.
3. It facilitates the use of other statistical measures.

Advantages

1. It is the simplest method.
2. It is easy to understand and the easiest to compute.
3. It takes minimum time to calculate and is accurate.

Disadvantages
1. Range is completely dependent on the two extreme values.
2. It is subject to fluctuations of considerable magnitude from sample to sample.
3. Range cannot tell us anything about the character of the distribution.

3.1 Quartile Deviation (Q.D)

Quartile deviation is an absolute measure of dispersion. The co-efficient of quartile deviation is the corresponding relative measure of dispersion.

Three quartiles divide a series into four equal parts. By eliminating the lowest 25% and the highest 25% of the items of a series, we obtain a measure of dispersion based on the distance between the first and the third quartiles, that is, Q3 (third quartile) – Q1 (first quartile). The inter-quartile range is reduced to the semi-inter-quartile range (or quartile deviation) by dividing it by 2.

Inter-quartile range = Q3 – Q1

Semi-inter-quartile range or Quartile deviation (Q.D.) $= \frac{Q_3 - Q_1}{2}$

Coefficient of Quartile deviation $= \frac{Q_3 - Q_1}{Q_3 + Q_1}$

Quartile Deviation – Individual Series

Illustration 2: Find out the value of Quartile Deviation and its coefficient from the following data:

Roll No.    1     2     3     4     5     6     7
Marks       20    28    40    30    50    60    52

Solution: Calculation of Q.D.

Marks arranged in ascending order: 20 28 30 40 50 52 60

Q1 = Size of $\left(\frac{N+1}{4}\right)^{th}$ item

= Size of $\left(\frac{7+1}{4}\right)^{th}$ item

= Size of 2nd item

= 28

Q3 = Size of $3\left(\frac{N+1}{4}\right)^{th}$ item

= Size of $3\left(\frac{7+1}{4}\right)^{th}$ item

= Size of 6th item

= 52

Q.D. $= \frac{Q_3 - Q_1}{2} = \frac{52 - 28}{2}$

= 12

Coefficient of Q.D. $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{52 - 28}{52 + 28} = \frac{24}{80}$

= 0.3
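
The positional rules for Q1 and Q3 can be sketched in Python as below. The function name is illustrative, and the linear interpolation used for fractional positions is an assumption that goes beyond the whole-number positions worked in the text; the sketch also assumes at least three items.

```python
def quartile_deviation(values):
    """Return (Q1, Q3, Q.D., coefficient of Q.D.) for an individual series."""
    ordered = sorted(values)
    n = len(ordered)

    def item(position):
        # 1-based position; interpolate between neighbours if fractional
        whole = int(position)
        fraction = position - whole
        value = ordered[whole - 1]
        if fraction:
            value += fraction * (ordered[whole] - ordered[whole - 1])
        return value

    q1 = item((n + 1) / 4)
    q3 = item(3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


q1, q3, qd, coefficient = quartile_deviation([20, 28, 40, 30, 50, 60, 52])
print(q1, q3, qd, round(coefficient, 2))  # 28 52 12.0 0.3
```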

Quartile Deviation – Discrete Series

Illustration 3: Find out the value of Quartile Deviation and its coefficient from the following data:

Age in years       20    30    40     50     60     70    80
No. of members     3     61    132    153    140    51    3

Solution:

Calculation of Q.D.

x      f      c.f.
20       3      3
30      61     64
40     132    196
50     153    349
60     140    489
70      51    540
80       3    543

Q1 = Value of $\left(\frac{N+1}{4}\right)^{th}$ item

= Value of $\left(\frac{543+1}{4}\right)^{th}$ item = Value of $\left(\frac{544}{4}\right)^{th}$ item

= Value of 136th item = 40 years

Q3 = Value of $3\left(\frac{N+1}{4}\right)^{th}$ item

= Value of $3\left(\frac{543+1}{4}\right)^{th}$ item

= Value of 3(136)th item = Value of 408th item = 60 years

Q.D. $= \frac{Q_3 - Q_1}{2} = \frac{60 - 40}{2} = 10$ years

Coefficient of Q.D. $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{60 - 40}{60 + 40} = 0.2$

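Quartile Deviation – Continuous Series

Illustration 4: Find out the value of Quartile Deviation and its coefficient from the following data:

Wages (Rs.)    30 – 32    32 – 34    34 – 36    36 – 38    38 – 40    40 – 42    42 – 44
Labourers      12         18         16         14         12         8          6

Solution: Calculation of Q.D.

Wages (x)    Labourers (f)    c.f.
30 – 32      12               12
32 – 34      18               30
34 – 36      16               46
36 – 38      14               60
38 – 40      12               72
40 – 42       8               80
42 – 44       6               86

Q1 = Size of $\left(\frac{N}{4}\right)^{th}$ item

= Size of $\left(\frac{86}{4}\right)^{th}$ item

= 21.5th item

Q1 lies in the group 32 – 34

$Q_1 = L + \frac{\frac{N}{4} - cf}{f} \times i = 32 + \frac{21.5 - 12}{18} \times 2 = 32 + 1.06 = 33.06$

Q3 = Size of $3\left(\frac{N}{4}\right)^{th}$ item

= Size of $3\left(\frac{86}{4}\right)^{th}$ item

= 64.5th item

Q3 lies in the group 38 – 40.

$Q_3 = L + \frac{\frac{3N}{4} - cf}{f} \times i = 38 + \frac{64.5 - 60}{12} \times 2 = 38 + 0.75 = 38.75$

Q.D. $= \frac{Q_3 - Q_1}{2} = \frac{38.75 - 33.06}{2} = 2.85$

Coefficient of Q.D. $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{38.75 - 33.06}{38.75 + 33.06} = 0.079$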

Merits:
1. It is simple to calculate.
2. It is easy to understand.
3. The risk of variation due to extreme items is eliminated, as it depends upon the central 50 percent of the items.

Demerits

1. Items below Q1 and above Q3 are ignored.
2. It is not capable of further mathematical treatment.
3. It is affected much by the fluctuations of sampling.
4. It is not calculated from a computed average, but from a positional average.

3.2 Mean Deviation


The mean deviation is also known as the average deviation. It is the average difference between the items in a distribution and the mean, median or mode of that series, counting all such deviations as positive. The median is preferred as the average because the sum of the deviations of items from the median is a minimum when signs are ignored. However, the arithmetic mean is more frequently used in calculating the value of the average deviation; hence it is commonly called the mean deviation.

Mean Deviation – Individual Series

M.D. $= \frac{\Sigma |D|}{N}$, where |D| is the deviation of each item from the mean, median or mode, ignoring signs.

Coefficient of Mean Deviation $= \frac{\text{M.D.}}{\text{Mean or Median or Mode}}$

Illustration 5: Calculate mean deviation from mean and median for the following data:

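100 150 200 250 360 490 500 600 671

Solution: Calculation of Mean Deviation

X            |D| = X – X̄ = X – 369    |D| = X – median = X – 360
100          269                       260
150          219                       210
200          169                       160
250          119                       110
360            9                         0
490          121                       130
500          131                       140
600          231                       240
671          302                       311
ΣX = 3321    Σ|D| = 1570               Σ|D| = 1561

Mean $= \frac{\Sigma X}{N} = \frac{3321}{9} = 369$

Median = Value of $\left(\frac{N+1}{2}\right)^{th}$ item

= Value of $\left(\frac{9+1}{2}\right)^{th}$ item
= Value of 5th item = 360

M.D. from mean $= \frac{\Sigma |D|}{N} = \frac{1570}{9} = 174.44$; M.D. from median $= \frac{\Sigma |D|}{N} = \frac{1561}{9} = 173.44$

Coefficient of M.D. (mean) $= \frac{174.44}{369} = 0.47$; Coefficient of M.D. (median) $= \frac{173.44}{360} = 0.48$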
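
The two columns of absolute deviations can be reproduced with a short Python sketch. The helper name `mean_deviation` is an assumption; the standard-library `statistics` module supplies the mean and median.

```python
import statistics


def mean_deviation(values, centre):
    """Average of the absolute deviations of the values from a chosen centre."""
    return sum(abs(v - centre) for v in values) / len(values)


data = [100, 150, 200, 250, 360, 490, 500, 600, 671]
mean = statistics.mean(data)        # 369
median = statistics.median(data)    # 360

md_mean = mean_deviation(data, mean)
md_median = mean_deviation(data, median)
print(round(md_mean, 2), round(md_median, 2))                  # 174.44 173.44
print(round(md_mean / mean, 2), round(md_median / median, 2))  # 0.47 0.48
```

The deviation from the median is the smaller of the two, which illustrates why the median is the preferred centre for this measure.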


Mean Deviation – Discrete Series

M.D. $= \frac{\Sigma f|D|}{N}$

Illustration 6: Calculate mean deviation from mean from the following data:

X 2 4 6 8 10
F 1 4 6 4 1

Solution: Calculation of Mean Deviation

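x     f     fx    |D| = x – X̄    f|D|
2     1      2    4               4
4     4     16    2               8
6     6     36    0               0
8     4     32    2               8
10    1     10    4               4
N = Σf = 16       Σfx = 96        Σf|D| = 24

Mean $= \frac{\Sigma fx}{N} = \frac{96}{16} = 6$

M.D. from mean $= \frac{\Sigma f|D|}{N} = \frac{24}{16} = 1.5$

Coefficient of M.D. $= \frac{\text{M.D.}}{\text{Mean}} = \frac{1.5}{6} = 0.25$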

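Mean Deviation – Continuous Series

M.D. $= \frac{\Sigma f|D|}{N}$, where $|D| = |m - \bar{X}|$ and m is the mid value of each class.

Illustration 7: Calculate mean deviation from mean from the following data:

Class interval    2 – 4    4 – 6    6 – 8    8 – 10
Frequency         3        4        2        1

Solution:

Calculation of Mean Deviation

x         m    f    fm    |D| = m – X̄    f|D|
2 – 4     3    3     9    2.2             6.6
4 – 6     5    4    20    0.2             0.8
6 – 8     7    2    14    1.8             3.6
8 – 10    9    1     9    3.8             3.8
N = Σf = 10         Σfm = 52              Σf|D| = 14.8

Mean $= \frac{\Sigma fm}{N} = \frac{52}{10} = 5.2$

M.D. from mean $= \frac{\Sigma f|D|}{N} = \frac{14.8}{10} = 1.48$

Coefficient of M.D. $= \frac{\text{M.D.}}{\text{Mean}} = \frac{1.48}{5.2} = 0.28$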

Merits
1. It is clear and easy to understand.
2. It is based on each and every item of the data.
3. It can be calculated from any measure of central tendency and as such is flexible too.
4. It is not disturbed by the values of extreme items as in the case of range.
Demerits:
1. It is not suitable for further mathematical processing.
2. It is rarely used in sociological studies.

3.3 Standard Deviation

Karl Pearson introduced the concept of standard deviation in 1893. Standard deviation is the square root of the mean of the squared deviations from the arithmetic mean. It is therefore also called the Root-Mean-Square Deviation or Mean Error or Mean Square Error. The standard deviation is denoted by the small Greek letter 'σ' (read as sigma).

Standard Deviation – Individual Observation

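Deviation taken from Actual Mean

$\sigma = \sqrt{\frac{\Sigma (X - \bar{X})^2}{N}}$

Illustration 8: Calculate the standard deviation from the following data:

14, 22, 9, 15, 20, 17, 12, 11

Solution: Calculation of standard deviation from actual mean

Values (X)    X²            X – X̄ (X – 15)    (X – X̄)²
14            196           -1                 1
22            484            7                49
9              81           -6                36
15            225            0                 0
20            400            5                25
17            289            2                 4
12            144           -3                 9
11            121           -4                16
ΣX = 120      ΣX² = 1940                      Σ(X – X̄)² = 140

$\bar{X} = \frac{\Sigma X}{N} = \frac{120}{8} = 15$

$\sigma = \sqrt{\frac{\Sigma (X - \bar{X})^2}{N}} = \sqrt{\frac{140}{8}} = \sqrt{17.5} = 4.18$

Alternatively:

We can find the standard deviation by using the values directly, i.e., without computing any deviations.

$\sigma = \sqrt{\frac{\Sigma X^2}{N} - \left(\frac{\Sigma X}{N}\right)^2} = \sqrt{\frac{1940}{8} - \left(\frac{120}{8}\right)^2}$

= √(242.5 − 225)

= √17.5

= 4.18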

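Deviation taken from Assumed Mean

$\sigma = \sqrt{\frac{\Sigma d^2}{N} - \left(\frac{\Sigma d}{N}\right)^2}$

Where d = X – A
Illustration 9:

Calculate the standard deviation from the following data:

30, 43, 45, 55, 68, 69, 75.

Solution:
Calculation of standard deviation from assumed mean

X        d = X – A = X – 55    d²
30       -25                   625
43       -12                   144
45       -10                   100
55         0                     0
68        13                   169
69        14                   196
75        20                   400
N = 7    Σd = 0                Σd² = 1634

$\sigma = \sqrt{\frac{\Sigma d^2}{N} - \left(\frac{\Sigma d}{N}\right)^2} = \sqrt{\frac{1634}{7} - \left(\frac{0}{7}\right)^2} = \sqrt{233.43} = 15.28$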

Standard Deviation – Discrete Series: Actual Mean Method:

$\sigma = \sqrt{\frac{\Sigma fd^2}{N}}$, where $d = x - \bar{X}$

Illustration 10:

Calculate the standard deviation from the following data:

Marks              10    20    30    40    50    60
No. of students    8     12    20    10    7     3

Solution:

Calculation of standard deviation (from actual mean)

x     f     fx     d = x – X̄ = x – 30.8    d²        fd²
10     8     80    -20.8                    432.64    3461.12
20    12    240    -10.8                    116.64    1399.68
30    20    600     -0.8                      0.64      12.80
40    10    400      9.2                     84.64     846.40
50     7    350     19.2                    368.64    2580.48
60     3    180     29.2                    852.64    2557.92
N = Σf = 60    Σfx = 1850                             Σfd² = 10858.40

Mean:

$\bar{X} = \frac{\Sigma fx}{N} = \frac{1850}{60}$

= 30.8

Standard Deviation:

$\sigma = \sqrt{\frac{\Sigma fd^2}{N}} = \sqrt{\frac{10858.40}{60}} = \sqrt{180.97} = 13.45$

Assumed Mean Method:

$\sigma = \sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2}$

Where d = X – A

Illustration 11: (Solving the previous problem)

Solution:

Calculation of standard deviation (from assumed mean)

x     f     d = x – 30    d²     fd      fd²
10     8    -20           400    -160    3200
20    12    -10           100    -120    1200
30    20      0             0       0       0
40    10     10           100     100    1000
50     7     20           400     140    2800
60     3     30           900      90    2700
N = Σf = 60                      Σfd = 50    Σfd² = 10900

$\sigma = \sqrt{\frac{10900}{60} - \left(\frac{50}{60}\right)^2}$

= √(181.67 − 0.69)

= √180.98

= 13.45

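Step Deviation Method

$\sigma = \sqrt{\frac{\Sigma fd'^2}{N} - \left(\frac{\Sigma fd'}{N}\right)^2} \times C$

Where $d' = \frac{X - A}{C}$; C = Common Factor

Illustration 12:

(Solving the previous problem)

Solution:
Calculation of standard deviation (from step deviation)

x     f     d' = (X – 30)/10    d'²    fd'    fd'²
10     8    -2                  4      -16    32
20    12    -1                  1      -12    12
30    20     0                  0        0     0
40    10     1                  1       10    10
50     7     2                  4       14    28
60     3     3                  9        9    27
N = Σf = 60                            Σfd' = 5    Σfd'² = 109

$\sigma = \sqrt{\frac{109}{60} - \left(\frac{5}{60}\right)^2} \times 10$

= √(1.817 − 0.0069) × 10 = √1.8098 × 10 = 1.3453 × 10 = 13.45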

Standard Deviation – Continuous Series

$\sigma = \sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2} \times C$

Where $d = \frac{m - A}{C}$; m = mid value of the class; C = Common Factor

Illustration 13: Compute the standard deviation from the following data:

Class        0 – 10    10 – 20    20 – 30    30 – 40    40 – 50
Frequency    5         8          15         16         6

Solution:

Computation of standard deviation

x          m     f     d = (m – 25)/10    d²    fd     fd²
0 – 10      5     5    -2                 4     -10    20
10 – 20    15     8    -1                 1      -8     8
20 – 30    25    15     0                 0       0     0
30 – 40    35    16     1                 1      16    16
40 – 50    45     6     2                 4      12    24
N = Σf = 50                                     Σfd = 10    Σfd² = 68

$\sigma = \sqrt{\frac{68}{50} - \left(\frac{10}{50}\right)^2} \times 10$

= √(1.36 − 0.04) × 10 = √1.32 × 10 = 1.1489 × 10 = 11.49
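
The step deviation formula can be verified against the definition of the standard deviation with a brief Python sketch. Both function names are illustrative assumptions; the first follows the formula above and the second computes the root-mean-square deviation from the actual mean.

```python
import math


def sd_step_deviation(midpoints, freqs, assumed, common):
    """sigma = sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) * C, with d = (m - A) / C."""
    n = sum(freqs)
    steps = [(m - assumed) / common for m in midpoints]
    sum_fd = sum(f * d for f, d in zip(freqs, steps))
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, steps))
    return math.sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * common


def sd_actual_mean(midpoints, freqs):
    """sigma = sqrt(sum(f * (m - mean)^2) / N), straight from the definition."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    return math.sqrt(sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / n)


mids = [5, 15, 25, 35, 45]
freqs = [5, 8, 15, 16, 6]
sigma_step = sd_step_deviation(mids, freqs, assumed=25, common=10)
sigma_direct = sd_actual_mean(mids, freqs)
print(round(sigma_step, 2), round(sigma_direct, 2))  # 11.49 11.49
```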

Merits:
1. It is rigidly defined and determinate.
2. It is based on all the observations of a series.
3. It is less affected by fluctuations of sampling and hence stable.
4. It is amenable to algebraic treatment and is less affected by fluctuations of sampling than most other measures of dispersion.
5. The standard deviation is more appropriate mathematically than the mean deviation, since the negative signs are removed by squaring the deviations rather than by ignoring them.

Demerits:
1. It lacks wide popularity as it is often difficult to compute; when big numbers are involved, the process of squaring and extracting roots becomes tedious.
2. It attaches more weight to extreme items by squaring them.
3. It is difficult to calculate accurately when a grouped frequency distribution has extreme groups with no definite range.

Uses:
1. Standard deviation is the best measure of dispersion.
2. It is widely used in statistics because it possesses most of the characteristics of an ideal measure of
dispersion.
3. It is widely used in sampling theory and by biologists.
4. It is used in the coefficient of correlation and in the study of symmetrical frequency distributions.

Co - efficient of variation (Relative Standard Deviation)

The standard deviation is an absolute measure of dispersion. The corresponding relative measure is known as the co-efficient of variation. It is used to compare the variability of two or more series. The series for which the co-efficient of variation is greater is said to be more variable or, conversely, less consistent, less uniform, less stable or less homogeneous.

Variance: The square of the standard deviation is called variance.

Variance $= \sigma^2$

Co-efficient of standard deviation $= \frac{\sigma}{\bar{X}}$

Co-efficient of variation (C.V.) $= \frac{\sigma}{\bar{X}} \times 100$

Illustration 14: The following are the runs scored by two batsmen A and B in ten innings:

A 101 27 0 36 82 45 7 13 65 14
B 97 12 40 96 13 8 85 8 56 15

Who is the more consistent batsman?

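Solution: Calculation of Co-efficient of Variation

Batsman A                                     Batsman B
Runs Scored X    dx = X – X̄    dx²           Runs Scored Y    dy = Y – Ȳ    dy²
101               62           3844           97               54           2916
27               -12            144           12              -31            961
0                -39           1521           40               -3              9
36                -3              9           96               53           2809
82                43           1849           13              -30            900
45                 6             36            8              -35           1225
7                -32           1024           85               42           1764
13               -26            676            8              -35           1225
65                26            676           56               13            169
14               -25            625           15              -28            784
ΣX = 390                       Σdx² = 10404   ΣY = 430                      Σdy² = 12762

Batsman A:

$\bar{X} = \frac{390}{10} = 39$; $\sigma = \sqrt{\frac{\Sigma dx^2}{N}} = \sqrt{\frac{10404}{10}} = 32.26$

C.V. $= \frac{\sigma}{\bar{X}} \times 100 = \frac{32.26}{39} \times 100$ = 82.72%

Batsman B:

$\bar{Y} = \frac{430}{10} = 43$; $\sigma = \sqrt{\frac{\Sigma dy^2}{N}} = \sqrt{\frac{12762}{10}} = 35.72$

C.V. $= \frac{\sigma}{\bar{Y}} \times 100 = \frac{35.72}{43} \times 100$ = 83.07%

Batsman A is more consistent in his batting, because the co-efficient of variation of runs is less for him.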
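
The comparison can be reproduced with Python's standard `statistics` module. In this sketch the helper name is an assumption, and `pstdev` is used because the text divides by N and not by N – 1; full-precision arithmetic may differ from the hand-rounded percentages in the second decimal place.

```python
import statistics


def coefficient_of_variation(values):
    """C.V. = (population standard deviation / mean) * 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


runs_a = [101, 27, 0, 36, 82, 45, 7, 13, 65, 14]
runs_b = [97, 12, 40, 96, 13, 8, 85, 8, 56, 15]

cv_a = coefficient_of_variation(runs_a)
cv_b = coefficient_of_variation(runs_b)
more_consistent = "A" if cv_a < cv_b else "B"
print(round(cv_a, 1), round(cv_b, 1), more_consistent)
```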

UNIT IV

SIMPLE CORRELATION

Meaning:

Correlation refers to the relationship between two or more variables. For example, there exists some relationship between the height of a mother and the height of her daughter, between sales and cost, and so on. The detection and analysis of correlation between two statistical variables requires a relationship of some sort which associates the observations in pairs, one of each pair being a value of each of the two variables. The word relationship is important and indicates that there is some connection between the variables under observation. Thus, the association of any two variates is known as correlation.

Significance:
Correlation is useful in the physical and social sciences. Here we study the uses of correlation in business and economics. The significance of the study of correlation is as follows:
➢ Correlation is very useful to economists for studying the relationship between variables such as price and quantity demanded. To businessmen, it helps to estimate costs, sales, prices and other related variables.
➢ Some variables show some kind of relationship; correlation analysis helps in measuring the degree of relationship between variables such as supply and demand, price and supply, income and expenditure, etc.
➢ The relation between variables can be verified and tested for significance with the help of correlation analysis. The effect of correlation is to reduce the range of uncertainty of our prediction.
➢ The coefficient of correlation is a relative measure, so we can compare the relationship between variables which are expressed in different units.
➢ Sampling error can also be calculated.
➢ Correlation is the basis for the concept of regression and ratio of variation.

Types of Correlation:

Correlation is classified into many types, but the important ones are:

1. Positive and Negative Correlation
2. Simple and Multiple Correlation
3. Partial and Total Correlation
4. Linear and Non-linear Correlation

1. Positive and Negative Correlation:

The correlation is said to be positive when the values of two variables move in the same
direction, so that an increase in the value of one variable is accompanied by an increase in the value of
the other variable or a decrease in the value of one variable is followed by a decrease in the value of
the other variable. Example: Height and weight, rainfall and yield of crops, etc.,
The correlation is said to be negative when the values of two variables move in opposite
direction, so that an increase or decrease in the values of one variable is followed by a decrease or
increase in the value of the other. Example: Price and demand, yield of crops and price, etc.,

2. Simple and Multiple Correlation:

When we study only two variables, the relationship is described as simple correlation. Example: the study of the price and demand of an article.

When more than two variables are studied simultaneously, the correlation is said to be multiple correlation. Example: the relationship of price, demand and supply of a commodity.

3. Partial and Total Correlation:

The partial correlation coefficient provides a measure of the relationship between a dependent variable and a particular independent variable when all other variables involved are kept constant, i.e., when the effect of all other variables is removed.

Example: Suppose we study the relationship between the yield of rice per acre and both the amount of rainfall and the amount of fertilizers used. If, in this relationship, we limit our correlation analysis to yield and rainfall, it becomes a problem relating to simple correlation.

4. Linear and Non-linear Correlation :

The correlation is said to be linear, if the amount of change in one variable tends to bear a
constant ratio to the amount of change in the other variable.

The correlation is non-linear, if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other related variable.

4.1 Scatter Diagram Method

Methods of Studying Correlation:

The following correlation methods are used to find out the relationship between two variables.

A. Graphic Method:
i. Scatter diagram (or) Scattergram method.
ii. Simple Graph or Correlogram method.
B. Mathematical Method:
i. Karl Pearson's Coefficient of Correlation.
ii. Spearman's Rank Correlation Coefficient.
iii. Coefficient of Concurrent Deviation.
iv. Method of Least Squares.

A. Graphic Method

i. Scatter diagram (or) Scattergram method

This is the simplest method of finding out whether there is any relationship present between two
variables by plotting the values on a chart, known as scatter diagram. In this method, the given data are
plotted on a graph paper in the form of dots. X variables are plotted on the horizontal axis and Y variables
on the vertical axis. Thus we have the dots and we can know the scatter or concentration of various points.

If the plotted dots fall in a narrow band and the dots are rising from the lower left hand corner to
the upper right-hand corner it is called high degree of positive correlation.

If the plotted dots fall in a narrow band from the upper left hand corner to the lower right
hand corner it is called a high degree of negative correlation.

If the plotted dots lie scattered all over the diagram, there is no correlation between the two variables.

Merits:
1. It is easy to plot, even for a beginner.
2. It is simple to understand.
3. Abnormal values in a sample can be easily detected.
4. Values of some dependent variables can be found out.

Demerits:
1. The degree of correlation cannot be determined.
2. It gives only a rough idea.
3. The method is useful only when the number of terms is small.

ii. Simple Graph Method of Correlation:

In this method separate curves are drawn for separate series on a graph paper. By examining the direction and closeness of the two curves we can infer whether or not the variables are related. If both the curves are moving in the same direction, the correlation is said to be positive. On the other hand, if the curves are moving in opposite directions, the correlation is said to be negative.

Merits:

1. It is easy to plot.
2. It is simple to understand.
3. Abnormal values can easily be detected.
Demerits:
1. This method is useless when the number of terms is very big.
2. The degree of correlation cannot be determined.
B. Mathematical Method:

4.2 Karl Pearson’s Coefficient of Correlation

Karl Pearson, a great biometrician and statistician, introduced a mathematical method for measuring the magnitude of the linear relationship between two variables. This method is the most widely used in practice and is known as the Pearsonian Coefficient of Correlation. It is denoted by the symbol r; the formula for calculating Pearsonian r is:

$r = \frac{\Sigma xy}{\sqrt{\Sigma x^2 \times \Sigma y^2}}$, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$

The value of the coefficient of correlation shall always lie between +1 and -1.

When r = +1, then there is perfect positive correlation between the variables.

When r = -1, then there is perfect negative correlation between the variables.

When r = 0, then there is no relationship between the variables.

Illustration 1: Calculate Karl Pearson's coefficient of correlation from the following data.

X   100   101   102   102   100    99    97    98    96    95
Y    98    99    99    97    95    92    95    94    90    91

Solution: Calculation of coefficient of correlation

X      x = X - 99    x²     Y     y = Y - 95    y²     xy
100         1         1     98         3         9      3
101         2         4     99         4        16      8
102         3         9     99         4        16     12
102         3         9     97         2         4      6
100         1         1     95         0         0      0
99          0         0     92        -3         9      0
97         -2         4     95         0         0      0
98         -1         1     94        -1         1      1
96         -3         9     90        -5        25     15
95         -4        16     91        -4        16     16
∑X = 990         ∑x² = 54   ∑Y = 950         ∑y² = 96   ∑xy = 61

X̄ = ∑X / N = 990 / 10 = 99;  Ȳ = ∑Y / N = 950 / 10 = 95

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \frac{61}{\sqrt{54 \times 96}} = \frac{61}{72} = +0.85 $$
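
The hand calculation above can be checked with a few lines of Python. This is only a minimal sketch of the deviation method; the function name `pearson_r` is an illustrative choice, not part of the course material.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean of X
    dy = [y - mean_y for y in ys]          # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

# Data of Illustration 1
X = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
Y = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]

r = pearson_r(X, Y)
print(round(r, 2))  # 0.85
```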

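Illustration 2: Calculate Karl Pearson's coefficient of correlation from the following data.

X:    6    2   10    4    8
Y:    9   11    5    8    7

Solution: Calculation of coefficient of correlation

X      X²     Y      Y²     XY
6      36     9      81     54
2       4    11     121     22
10    100     5      25     50
4      16     8      64     32
8      64     7      49     56
∑X = 30   ∑X² = 220   ∑Y = 40   ∑Y² = 340   ∑XY = 214

When the actual values are used instead of deviations from the means, r is calculated as:

$$ r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}} = \frac{5(214) - (30)(40)}{\sqrt{\left[5(220) - 30^2\right]\left[5(340) - 40^2\right]}} = \frac{-130}{\sqrt{200 \times 100}} = -0.92 $$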
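
The column totals of Illustration 2 lead directly to r through the raw-sums form of Pearson's formula. The sketch below assumes that form (it is algebraically equivalent to the deviation formula); `pearson_r_raw` is a hypothetical helper name.

```python
from math import sqrt

def pearson_r_raw(xs, ys):
    """Pearson's r computed from the raw sums, without subtracting the means first."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Data of Illustration 2
X = [6, 2, 10, 4, 8]
Y = [9, 11, 5, 8, 7]

r = pearson_r_raw(X, Y)
print(round(r, 2))  # -0.92
```

The negative sign shows that Y tends to fall as X rises.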

4.3 RANK CORRELATION CO-EFFICIENT

Spearman’s Rank Correlation Co-efficient:

In 1904, the famous British psychologist Charles Edward Spearman devised a method of
ascertaining the coefficient of correlation by ranks. Rank correlation is
applicable only to individual observations. This measure is useful in dealing with qualitative characteristics
such as intelligence, beauty, morality, character, etc.

The formula for Spearman's rank correlation, which is denoted by P, is:

$$ P = 1 - \frac{6\sum D^2}{N(N^2 - 1)} \quad \text{or} \quad P = 1 - \frac{6\sum D^2}{N^3 - N} $$

Where, P = Rank co-efficient of correlation

D = Difference of the two ranks

∑D² = Sum of squares of the difference of two ranks

N = Number of paired observations

Like Karl Pearson's coefficient of correlation, the value of P lies between +1 and -1.

Where ranks are given

Illustration 3: Following are the ranks obtained by 10 students in two subjects, Statistics and Mathematics.
To what extent is the knowledge of the students in the two subjects related?

Statistics 1 2 3 4 5 6 7 8 9 10
Mathematics 2 4 1 5 3 9 7 10 6 8

Solution:

Calculation of Spearman's rank correlation coefficient

Rank of Statistics (x)    Rank of Mathematics (y)    D = x - y    D²
        1                          2                    -1         1
        2                          4                    -2         4
        3                          1                    +2         4
        4                          5                    -1         1
        5                          3                    +2         4
        6                          9                    -3         9
        7                          7                     0         0
        8                         10                    -2         4
        9                          6                    +3         9
       10                          8                    +2         4
N = 10                                                        ∑D² = 40

$$ P = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 40}{10(10^2 - 1)} = 1 - \frac{240}{990} $$

= 1 - 0.24

= + 0.76
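
When the ranks are already given, the formula needs only the rank differences. A minimal sketch follows; `spearman_from_ranks` is an illustrative name, and the function assumes there are no tied ranks.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's P = 1 - 6 * sum(D^2) / (N * (N^2 - 1)), for untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Ranks of Illustration 3
statistics_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mathematics_ranks = [2, 4, 1, 5, 3, 9, 7, 10, 6, 8]

p = spearman_from_ranks(statistics_ranks, mathematics_ranks)
print(round(p, 2))  # 0.76
```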

Where Ranks are not given:


Illustration 4:
A random sample of 5 college students is selected and their grades in Mathematics and Statistics
are found to be:

Mathematics 85 60 73 40 90
Statistics 93 75 65 50 80

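Solution:

Calculation of Spearman's rank correlation coefficient

The highest mark in each subject is given rank 1.

Mathematics    Rank (x)    Statistics    Rank (y)    D = x - y    D²
    85            2            93           1           +1         1
    60            4            75           3           +1         1
    73            3            65           4           -1         1
    40            5            50           5            0         0
    90            1            80           2           -1         1
N = 5                                                        ∑D² = 4

$$ P = 1 - \frac{6\sum D^2}{N(N^2 - 1)} = 1 - \frac{6 \times 4}{5(5^2 - 1)} = 1 - \frac{24}{120} = 1 - 0.2 $$

= + 0.8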
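
When only the marks are given, the ranking step can be automated before the formula is applied. The sketch below assumes the convention used in the table (highest value gets rank 1) and assumes all values are distinct; both helper names are hypothetical.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value. Assumes no two values are equal."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Rank both series, then apply P = 1 - 6 * sum(D^2) / (N * (N^2 - 1))."""
    rank_x = rank_descending(xs)
    rank_y = rank_descending(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Data of Illustration 4
mathematics = [85, 60, 73, 40, 90]
statistics_marks = [93, 75, 65, 50, 80]

print(rank_descending(mathematics))       # [2, 4, 3, 5, 1]
print(rank_descending(statistics_marks))  # [1, 3, 4, 5, 2]
p = spearman(mathematics, statistics_marks)
```

For this data `p` comes out at about 0.8, in line with the hand calculation.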

Equal or Repeated Ranks:


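When two or more items have equal values, it is difficult to give ranks to them. In that case the
items are given the average of the ranks they would have received if they were not tied. A slightly different
formula is used when there is more than one item having the same value:

$$ P = 1 - \frac{6\left[\sum D^2 + \frac{1}{12}(m^3 - m) + \frac{1}{12}(m^3 - m) + \cdots\right]}{N^3 - N} $$

m = the number of items whose ranks are common. One correction term (m³ - m)/12 is added for each group of tied values.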

Illustration 5: From the following data calculate the rank correlation coefficient after making adjustment
for tied ranks.

X   48   33   40    9   16   16   65   24   16   57
Y   13   13   24    6   15    4   20    9    6   19

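Solution: Calculation of Spearman's rank correlation coefficient

The smallest value in each series is given rank 1; tied values share the average of their ranks.

X     Rank x    Y     Rank y    D = R(x) - R(y)    D²
48       8      13      5.5           2.5          6.25
33       6      13      5.5           0.5          0.25
40       7      24     10            -3.0          9.00
9        1       6      2.5          -1.5          2.25
16       3      15      7            -4.0         16.00
16       3       4      1             2.0          4.00
65      10      20      9             1.0          1.00
24       5       9      4             1.0          1.00
16       3       6      2.5           0.5          0.25
57       9      19      8             1.0          1.00
N = 10                                        ∑D² = 41

In X, the value 16 occurs three times (m = 3). In Y, the values 13 and 6 each occur twice (m = 2).

$$ P = 1 - \frac{6\left[41 + \frac{1}{12}(3^3 - 3) + \frac{1}{12}(2^3 - 2) + \frac{1}{12}(2^3 - 2)\right]}{10^3 - 10} = 1 - \frac{6(41 + 2 + 0.5 + 0.5)}{990} = 1 - \frac{264}{990} $$

= + 0.733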
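
The tie adjustment is mechanical enough to sketch in code. The helpers below are illustrative only (their names are assumptions): they rank in ascending order as in the table above, give tied values the average of the ranks they span, and add one (m³ - m)/12 term per tied group.

```python
from collections import Counter

def average_ranks(values):
    """Ascending ranks; tied values share the average of the ranks they occupy."""
    ordered = sorted(values)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    counts = Counter(values)
    # A group of m ties starting at position p occupies p .. p+m-1, average p + (m-1)/2.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m ** 3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n ** 3 - n)

# Data of Illustration 5
X = [48, 33, 40, 9, 16, 16, 65, 24, 16, 57]
Y = [13, 13, 24, 6, 15, 4, 20, 9, 6, 19]

p = spearman_with_ties(X, Y)
print(round(p, 3))  # 0.733
```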

Merits:
1. It is simple to understand and easier to apply.
2. It can be used for any type of data, qualitative or quantitative.
3. It is the only method that can be used where we are given the ranks and not the actual data.
4. Even where actual data are given, the rank method can be applied for ascertaining correlation by assigning
ranks to each item.

Demerits:
1. This method is not useful for finding correlation in a grouped frequency distribution.
2. For large samples it is not a convenient method. If the items exceed 30, the calculations become quite
tedious and require a lot of time.
3. It is only an approximate measure, as the actual values are not used in the calculations.

4.4 REGRESSION ANALYSIS

The statistical method which helps us to estimate the unknown value of one variable from the
known value of the related variable is called Regression. The dictionary meaning of the word regression is
"return or going back". In 1877, Sir Francis Galton first introduced the word
'Regression'. The tendency to regress or go back was called by Galton the 'Line of Regression'.
The line describing the average relationship between two variables is known as the line of regression. The
regression analysis confined to the study of only two variables at a time is termed simple regression.
The regression analysis for studying more than two variables at a time is known as multiple regression.

Regression Vs Correlation:

1. Regression: It is a mathematical measure showing the average relationship between two variables.
   Correlation: It is the relationship between two or more variables, which vary in sympathy with each other in the same or the opposite direction.

2. Regression: Here x is treated as a fixed (independent) variable and y as a random variable.
   Correlation: Both x and y are random variables.

3. Regression: It indicates the cause and effect relationship between the variables.
   Correlation: It finds out the degree of relationship between two variables.

4. Regression: It is the prediction of one value, in relationship to the other given value.
   Correlation: It is used for testing and verifying the relation between two variables.

5. Regression: It is an absolute figure.
   Correlation: It is a relative measure. The range of relationship lies between -1 and +1.

6. Regression: There is no such thing as nonsense regression.
   Correlation: There may be nonsense correlation between two variables.

7. Regression: It has wider application, as it studies linear and non-linear relationships between the variables.
   Correlation: It has limited application, because it is confined only to linear relationship between the variables.

8. Regression: It is widely used for further mathematical treatment.
   Correlation: It is not very useful for mathematical treatment.

9. Regression: It explains that the decrease in one variable is associated with the increase in the other variable.
   Correlation: If the coefficient of correlation is positive, then the two variables are positively correlated, and vice versa.

10. Regression: There is a functional relationship between the two variables, so that we may identify the independent and dependent variables.
    Correlation: It is immaterial whether X depends upon Y or Y depends upon X.

Linear Regression:
Linear regression attempts to model the relationship between two variables by fitting a linear
equation to observed data. One variable is considered to be an explanatory variable, and the other is
considered to be a dependent variable. For example, a modeler might want to relate the weights of
individuals to their heights using a linear regression model.
Before attempting to fit a linear model to observed data, a modeler should first determine whether
or not there is a relationship between the variables of interest. This does not necessarily imply that one
variable causes the other (for example, higher SAT scores do not cause higher college grades), but that
there is some significant association between the two variables. A scatter plot can be a helpful tool in
determining the strength of the relationship between two variables. If there appears to be no association
between the proposed explanatory and dependent variables (i.e., the scatter plot does not indicate any
increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide
a useful model. A valuable numerical measure of association between two variables is the correlation
coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed
data for the two variables.

A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable
and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x =
0).

Regression lines:
If we take two variables X and Y we have two regression lines:

i) Regression of X on Y, and
ii) Regression of Y on X

The regression line of X on Y gives the most probable value of X for any given value of Y. The
regression of Y on X gives the most probable value of Y for any given value of X. There are two regression
lines in the case of two variables.

Regression Equations: The algebraic expressions of the two regression lines are called regression
equations.

Regression Equation of X on Y:

Xc = a + bY

To determine the values of 'a' and 'b', the following two normal equations are to be solved
simultaneously.

∑X = Na + b∑Y

∑XY = a∑Y + b∑Y²

Regression Equation of Y on X:

Yc = a + bX

To determine the values of 'a' and 'b', the following two normal equations are to be solved
simultaneously.

∑Y = Na + b∑X

∑XY = a∑X + b∑X²

These equations are called normal equations.


Illustration 1: Determine the two regression equations of a straight line which best fits the data.

X   10   12   13   16   17   20   25
Y   10   22   24   27   29   33   37

Solution: Calculation of Regression

X      X²      Y      Y²      XY
10     100     10     100     100
12     144     22     484     264
13     169     24     576     312
16     256     27     729     432
17     289     29     841     493
20     400     33    1089     660
25     625     37    1369     925
∑X = 113   ∑X² = 1983   ∑Y = 182   ∑Y² = 5188   ∑XY = 3186

Regression Equation of Y on X:
The two normal equations are:

∑Y = Na + b∑X

∑XY = a∑X + b∑X²

Substituting the values,

N = 7; ∑X = 113; ∑X² = 1983; ∑Y = 182; ∑XY = 3186;

7a + 113b = 182 …(1)

113a + 1983b = 3186 …(2)

Multiplying (1) by 113,

791a + 12769b = 20566 …(3)

Multiplying (2) by 7,

791a + 13881b = 22302 …(4)

Subtracting (4) from (3),

-1112b = -1736, so b = 1736 / 1112 = 1.56

Putting b = 1.56 in (1), we get

7a + 113(1.56) = 182

7a + 176.28 = 182 => 7a = 5.72, so a = 0.82

The equation of the straight line is Yc = a + bX

Put a = 0.82, b = 1.56

∴ The equation of the required straight line is Yc = 0.82 + 1.56X

This is called the regression of Y on X.

Regression Equation of X on Y:
The two normal equations are:

∑X = Na + b∑Y

∑XY = a∑Y + b∑Y²

Substituting the values,

N = 7; ∑X = 113; ∑Y² = 5188; ∑Y = 182; ∑XY = 3186;

7a + 182b = 113 …(1)

182a + 5188b = 3186 …(2)

Multiplying (1) by 182,

1274a + 33124b = 20566 …(3)

Multiplying (2) by 7,

1274a + 36316b = 22302 …(4)

Subtracting (4) from (3),

-3192b = -1736, so b = 1736 / 3192 = 0.54

Putting b = 0.54 in (1), we get

7a + 182(0.54) = 113

7a + 98.28 = 113 => 7a = 14.72, so a = 2.1

The equation of the straight line is Xc = a + bY

Put a = 2.1, b = 0.54

∴ The equation of the required straight line is Xc = 2.1 + 0.54Y

This is called the regression of X on Y.
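
The elimination steps above amount to solving two linear equations in a and b, which can be sketched directly. `fit_line` is a hypothetical helper that solves the normal equations in closed form. Because it carries b without rounding, the intercepts it returns (about 0.80 and 2.00) differ slightly from the hand-worked 0.82 and 2.1, which were obtained after rounding b to two decimals.

```python
def fit_line(us, vs):
    """Fit v = a + b*u by solving the normal equations
       sum(v)  = N*a      + b*sum(u)
       sum(uv) = a*sum(u) + b*sum(u^2)
    """
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_u2 = sum(u * u for u in us)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    b = (n * sum_uv - sum_u * sum_v) / (n * sum_u2 - sum_u ** 2)
    a = (sum_v - b * sum_u) / n
    return a, b

# Data of Illustration 1
X = [10, 12, 13, 16, 17, 20, 25]
Y = [10, 22, 24, 27, 29, 33, 37]

a_yx, b_yx = fit_line(X, Y)   # regression of Y on X: Yc = a + bX
a_xy, b_xy = fit_line(Y, X)   # regression of X on Y: Xc = a + bY
print(round(b_yx, 2), round(b_xy, 2))  # 1.56 0.54
```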


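Deviation taken from Actual Means:
Regression equation of X on Y:

$$ X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y}) $$

Where, X = the value of X to be estimated for the given Y value; X̄ = mean value of the X variable;
Y = the value of Y given in the problem; Ȳ = mean value of the Y variable.

$$ r\frac{\sigma_x}{\sigma_y} = \frac{\sum xy}{\sum y^2} = b_{xy} $$

is the regression co-efficient of X on Y, with x = X - X̄ and y = Y - Ȳ.

Regression equation of Y on X:

$$ Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X}) $$

$$ r\frac{\sigma_y}{\sigma_x} = \frac{\sum xy}{\sum x^2} = b_{yx} $$

is the regression co-efficient of Y on X.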

Illustration 2: Find regression lines from the following data:

X    3    5    6    8    9   11
Y    2    3    4    6    5   10

And also estimate Y when X is 15.

Solution: Calculation of Regression Equations (by actual mean)

X     x = X - X̄    x²     Y     y = Y - Ȳ    y²     xy
3        -4        16     2        -3         9     12
5        -2         4     3        -2         4      4
6        -1         1     4        -1         1      1
8         1         1     6         1         1      1
9         2         4     5         0         0      0
11        4        16    10         5        25     20
∑X = 42   ∑x = 0   ∑x² = 42   ∑Y = 30   ∑y = 0   ∑y² = 40   ∑xy = 38

X̄ = 42 / 6 = 7;  Ȳ = 30 / 6 = 5

Regression equation of X on Y:

bxy = ∑xy / ∑y² = 38 / 40 = 0.95

X - 7 = 0.95(Y - 5)

X - 7 = 0.95Y - 4.75

X = 0.95Y + 2.25

Regression equation of Y on X:

byx = ∑xy / ∑x² = 38 / 42 = 0.90

Y - 5 = 0.90(X - 7)

Y - 5 = 0.90X - 6.30

Y = 0.90X - 1.30

When X is 15, Y will be, Y = 0.90 x 15 - 1.30

= 13.5 - 1.30

Y = 12.2
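
The deviation method can be sketched as below. The function name is hypothetical. The code keeps byx = 38/42 unrounded, so its estimate at X = 15 is about 12.24, slightly different from a hand calculation that rounds the slope to 0.90.

```python
def regression_by_deviations(xs, ys):
    """Return (mean_x, mean_y, b_xy, b_yx) using deviations from the actual means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    b_xy = sum_xy / sum_y2   # regression coefficient of X on Y
    b_yx = sum_xy / sum_x2   # regression coefficient of Y on X
    return mean_x, mean_y, b_xy, b_yx

# Data of Illustration 2
X = [3, 5, 6, 8, 9, 11]
Y = [2, 3, 4, 6, 5, 10]

mean_x, mean_y, b_xy, b_yx = regression_by_deviations(X, Y)
# Estimate Y at X = 15 from Y - mean_y = b_yx * (X - mean_x)
y_at_15 = mean_y + b_yx * (15 - mean_x)
print(round(b_xy, 2), round(b_yx, 2))  # 0.95 0.9
```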
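Deviation taken from the Assumed Mean:

Regression equation of X on Y:

$$ X - \bar{X} = r\frac{\sigma_x}{\sigma_y}(Y - \bar{Y}) $$

Where,

$$ r\frac{\sigma_x}{\sigma_y} = b_{xy} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dy^2 - (\sum dy)^2} $$

dx = X - A; dy = Y - A; (A = assumed mean of the respective series)

Regression equation of Y on X:

$$ Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X}) $$

$$ r\frac{\sigma_y}{\sigma_x} = b_{yx} = \frac{N\sum dx\,dy - \sum dx \sum dy}{N\sum dx^2 - (\sum dx)^2} $$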

Illustration: 3. Find regression lines from the following data:

X 40 38 35 42 30
Y 30 35 40 36 29

Also calculate Karl Pearson‘s coefficient of correlation.

Solution: Calculation of Regression Equations (by assumed mean)

The assumed means are A = 35 for X and A = 30 for Y.

X     dx = X - 35    dx²     Y     dy = Y - 30    dy²     dx.dy
40         5          25     30         0           0        0
38         3           9     35         5          25       15
35         0           0     40        10         100        0
42         7          49     36         6          36       42
30        -5          25     29        -1           1        5
∑dx = 10   ∑dx² = 108   ∑dy = 20   ∑dy² = 162   ∑dx.dy = 62

X̄ = 35 + 10/5 = 37;  Ȳ = 30 + 20/5 = 34

Regression equation of X on Y:

X - X̄ = bxy (Y - Ȳ)

bxy = [5(62) - (10)(20)] / [5(162) - (20)²] = 110 / 410 = 0.27

X - 37 = 0.27(Y - 34)

X - 37 = 0.27Y - 9.18

X = 0.27Y + 27.82

Regression equation of Y on X:

Y - Ȳ = byx (X - X̄)

byx = [5(62) - (10)(20)] / [5(108) - (10)²] = 110 / 440 = 0.25

Y - 34 = 0.25(X - 37)

Y - 34 = 0.25X - 9.25

Y = 0.25X + 24.75
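
The assumed-mean working can be cross-checked with a short script. This is a sketch only: the function name is hypothetical, and the assumed means 35 and 30 are the values for which dx and dy are zero in the table. The last line uses the standard identity that r is the square root of the product of the two regression coefficients, taking their common sign.

```python
from math import sqrt, copysign

def coefficients_by_assumed_mean(xs, ys, assumed_x, assumed_y):
    """Regression coefficients b_xy and b_yx from deviations about assumed means."""
    n = len(xs)
    dx = [x - assumed_x for x in xs]
    dy = [y - assumed_y for y in ys]
    sum_dx, sum_dy = sum(dx), sum(dy)
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    numerator = n * sum_dxdy - sum_dx * sum_dy
    b_xy = numerator / (n * sum(b * b for b in dy) - sum_dy ** 2)
    b_yx = numerator / (n * sum(a * a for a in dx) - sum_dx ** 2)
    return b_xy, b_yx

# Data of Illustration 3
X = [40, 38, 35, 42, 30]
Y = [30, 35, 40, 36, 29]

b_xy, b_yx = coefficients_by_assumed_mean(X, Y, assumed_x=35, assumed_y=30)
r = copysign(sqrt(b_xy * b_yx), b_xy)  # r carries the sign of the coefficients
print(round(b_xy, 2), round(b_yx, 2), round(r, 2))  # 0.27 0.25 0.26
```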

Karl Pearson's coefficient of correlation:

r = √(bxy x byx)

= √(0.27 x 0.25)

= √0.0675

r = + 0.26

Illustration 4: Given the following data, calculate the expected value of Y when X = 12.

                                  X       Y
Arithmetic Mean                  7.6    14.8
Standard Deviation (σ)           3.6     2.5
Coefficient of correlation (r) = 0.99

Solution:

Regression of Y on X

$$ Y - \bar{Y} = r\frac{\sigma_y}{\sigma_x}(X - \bar{X}) $$

$$ Y - 14.8 = 0.99 \times \frac{2.5}{3.6}(X - 7.6) $$

Y - 14.8 = 0.688 (X - 7.6)

Y - 14.8 = 0.688X - 5.23

Y = 0.688X - 5.23 + 14.8

Y = 0.688X + 9.57

When X = 12 => Y = 0.688 (12) + 9.57 = 17.826

Hence the expected value of Y is 17.83.
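
When only the means, standard deviations and r are known, the estimate follows in one step. A minimal sketch, with `predict_y` as a hypothetical helper name; it keeps the slope 0.99 x 2.5 / 3.6 unrounded, so it returns 17.825 rather than the 17.826 obtained with the rounded slope 0.688.

```python
def predict_y(x, mean_x, mean_y, sd_x, sd_y, r):
    """Estimate Y from Y - mean_y = r * (sd_y / sd_x) * (x - mean_x)."""
    b_yx = r * sd_y / sd_x          # regression coefficient of Y on X
    return mean_y + b_yx * (x - mean_x)

# Summary figures of Illustration 4
expected_y = predict_y(x=12, mean_x=7.6, mean_y=14.8, sd_x=3.6, sd_y=2.5, r=0.99)
print(round(expected_y, 2))
```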

Standard Error of Estimate:


We found it necessary to supplement an average for a series with a measure of dispersion or
variation to show how representative the average is. The regression equations help us to predict the values
of Y for values of X, or the values of X for values of Y. These are only estimates or predictions and cannot
be treated as precise values. If we have a wide scatter or variation of the dots about the regression line,
then it would be considered a poor representative of the relationship. The more closely the dots cluster
around the line, the more representative it is and the better the estimate based on the equation for this line. This
variation about the line of average relationship can be measured in a manner analogous to the measuring
of the variation of the items about an average. Thus, we use here a measure of variation similar to the
standard deviation: the standard error of estimate. It is computed as a standard deviation is, being also a square root
of the mean of squared deviations. But the deviations here are not the deviations of the items from the
arithmetic mean; they are rather the vertical distances of every dot from the line of average relationship.

It measures the scattering of the observations about the regression line. It is calculated as follows:

Standard Error of X values from Xc:

$$ S_{xy} = \sqrt{\frac{\sum (X - X_c)^2}{N}} $$

Standard Error of Y values from Yc:

$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} $$

Interpretation of Standard Error of Estimate :


1. The smaller the value, the better the precision of the estimate.
2. The larger the value, the less accurate the estimate.
3. If it is zero, there is no variation about the line; both regression lines coincide and the correlation
is perfect.

Illustration:

Given the regression equation of Y on X as Y = 3 + 9X for the following data series, calculate (i) Standard
error of estimate (ii) Explained variation in Y (iii) Unexplained variation in Y.

X     1     2     3     4     5
Y    10    20    30    50    40

Solution:

Ȳ = ∑Y / N = 150 / 5 = 30

X     Y     y = Y - Ȳ     y²     Yc = 3 + 9X     Y - Yc     (Y - Yc)²
1    10       -20        400         12            -2           4
2    20       -10        100         21            -1           1
3    30         0          0         30             0           0
4    50        20        400         39            11         121
5    40        10        100         48            -8          64
∑Y = 150   ∑y = 0   ∑y² = 1000   ∑(Y - Yc)² = 190

(i) Standard error of estimate

$$ S_{yx} = \sqrt{\frac{\sum (Y - Y_c)^2}{N}} = \sqrt{\frac{190}{5}} = \sqrt{38} = 6.16 $$

(ii) Unexplained variation in Y = ∑(Y - Yc)² = 190

(iii) Total variation = ∑y² = 1000
Explained variation = Total variation - Unexplained variation

= ∑y² - ∑(Y - Yc)²
= 1000 - 190 = 810.
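
The three quantities in this illustration come from the same two sums of squares, which the following sketch computes. The function name is illustrative; the divisor N (rather than N - 2) follows the formula given in the text.

```python
from math import sqrt

def variation_about_line(xs, ys, a, b):
    """Return (standard error, explained variation, unexplained variation)
    for the fitted line Yc = a + b*X, dividing by N as in the text."""
    n = len(ys)
    mean_y = sum(ys) / n
    fitted = [a + b * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)                  # sum of y^2
    unexplained = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # sum of (Y - Yc)^2
    standard_error = sqrt(unexplained / n)
    return standard_error, total - unexplained, unexplained

X = [1, 2, 3, 4, 5]
Y = [10, 20, 30, 50, 40]

s_yx, explained, unexplained = variation_about_line(X, Y, a=3, b=9)
print(round(s_yx, 2), explained, unexplained)  # 6.16 810.0 190.0
```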

UNIT V TESTING OF HYPOTHESIS
Hypothesis
A hypothesis is a precise, testable statement of what the researchers predict will be the outcome of the study.
A hypothesis usually involves proposing a relationship between two variables: the independent variable
(what the researchers change) and the dependent variable (what the researchers measure).

Hypothesis is usually considered as the principal instrument in research. The main goal in many research
studies is to check whether the data collected support certain statements or predictions. A statistical
hypothesis is an assertion or conjecture concerning one or more populations. Test of hypothesis is a process
of testing the significance of statements about the parameters of the population on the basis of a sample drawn from
it. Thus, it is also termed a "Test of Significance".

In short, hypothesis testing enables us to make probability statements about population parameter. The
hypothesis may not be proved absolutely, but in practice it is accepted if it has withstood a critical testing.

Points to be considered while formulating Hypothesis
• Hypothesis should be clear and precise.
• Hypothesis should be capable of being tested.
• Hypothesis should state relationship between variables.
• Hypothesis should be limited in scope and must be specific.
• Hypothesis should be stated as far as possible in most simple terms so that the same is easily
understandable by all concerned.
• Hypothesis should be amenable to testing within a reasonable time.
• Hypothesis must explain empirical reference.

Procedure of Testing a Hypothesis


After having completed collection, processing and analysis of data a test procedure has to be followed for
determining if the null hypothesis is to be accepted or rejected. The test procedure or the rule is based upon
a test statistic and a rejection region. The procedure of testing hypothesis is briefly described below:

▪ Setting up a hypothesis:

At the very outset, we take certain hypothesis with regard to related variables under the assumptions defined
for the study. Generally, there are two forms of hypotheses which must be constructed; and if one hypothesis
is accepted, the other one is rejected.

i. Null Hypothesis: It is a very useful tool to test the significance of a difference. Any hypothesis concerning
a population is called a statistical hypothesis. In the process of a statistical test, the rejection or acceptance
of the hypothesis depends on the sample drawn from the population. The statistician tests the hypothesis through
observation and gives a probability statement. The null hypothesis states that the statistical measures of the
sample and those of the population under study do not differ significantly. Similarly, it may assume no
relationship or association between two variables or attributes. In case of assessing the effectiveness of a
literacy campaign on the awareness of rural people, we assume "There is no effect of the campaign on
public awareness". It is denoted by Ho.
For example, if we want to find out whether extra coaching has benefitted the students or not, the null
hypothesis would be:

Ho (1): The extra coaching has not benefitted the students.

Similarly, if we want to find out whether a particular drug is effective in curing Malaria, we will take the
null hypothesis:
Ho (2): The drug used under experimentation is not effective in curing Malaria.

Similarly, for testing the significance of difference between two samples, the null hypothesis would be:
Ho (3): "There is no significant difference between the variation in data of two samples taken from the
same parent population." i.e. σ1 = σ2
The rejection of the null hypothesis indicates that the differences have statistical significance, and the
acceptance of the null hypothesis indicates that the differences are due to chance and arose because of sampling
fluctuation. Since many practical problems aim at establishment of statistical significance of differences,
rejection of the null hypothesis may thus indicate success in a statistical project.

ii. Alternative Hypothesis: As against the null hypothesis, the alternative hypothesis specifies those values
that the researcher believes to hold true, and, of course, he hopes that the sample data lead to acceptance of
this hypothesis as true.
Rejection of Null hypothesis 𝐻𝑜 leads to the acceptance of alternative hypothesis, which is denoted by 𝐻1.

With respect to the three null hypotheses as stated above, the researcher might establish the following alternative
hypotheses:

H1 (1): The extra coaching has benefitted the students.

H1 (2): The drug used under experimentation is effective in curing Malaria.

H1 (3): "There is significant difference between the variation in data of two samples taken from the same
parent population." i.e. σ1 ≠ σ2

The null and alternative hypotheses can also be written as:

Ho: σ1 - σ2 = 0;  H1: σ1 - σ2 ≠ 0

Ho: µ1 - µ2 = 0;  H1: µ1 - µ2 ≠ 0

Type I and Type II Errors

When two hypotheses are set up, the acceptance or rejection of a null hypothesis is based on a sample study.
When we make a decision on the basis of the data analysis and a test of the significance of a difference, it may
lead to wrong conclusions in two ways: (i) rejecting a true null hypothesis, and (ii) accepting a false null
hypothesis. This can be presented in the following table:

                      Decision
              Accepted Ho                 Rejected Ho
Ho true       Correct Decision            Type I error (α error)
Ho false      Type II error (β error)     Correct Decision

By rewriting:
Reject Ho when it is true (Type I error) = α

Accept Ho when it is false (Type II error) = β

Accept Ho when it is true (Correct decision)

Reject Ho when it is false (Correct decision)

▪ Setting up a Suitable Significance Level:

The maximum probability of committing a Type I error, which we specify in a test, is known as the
level of significance. Generally, a 5% level of significance is fixed in statistical tests. This implies that,
when the null hypothesis is true, we accept a 5% risk of wrongly rejecting it, and so can place 95% confidence
in the decision. The range of variation has two regions: the acceptance region and the critical region (or
rejection region). If the computed value of the sample statistic falls in the critical region, we reject the
null hypothesis and go with H1.

The critical region under a normal curve can be placed in two ways: (a) on both sides of
the curve (Two Tailed Test), or (b) on one side of the curve, either the right tail or the left tail (One Tailed Test).

[Figure: Acceptance and rejection regions in case of a two-tailed test (with 5% significance level)]

▪ Setting a Test Criterion:

The third step in hypothesis testing procedure is to construct a test criterion. This involves selecting an
appropriate probability distribution for the particular test, that is, a probability distribution which can
properly be applied. Some probability distributions that are commonly used in testing procedures are Z, t,
F and χ².

▪ Computation:

After completing first three steps we have completely designed a statistical test. We now proceed to the
fourth step- computing various measures from a random sample of size n, which are necessary for applying
the test. These calculations include the test statistic and the standard error of the test statistic.

▪ Making a decision or Conclusion:

Finally we come to a conclusion stage where either we accept or reject the null hypothesis. The decision is
based on computed value of test statistic, whether it lies in the acceptance region or rejection region. If the
computed value of the test statistic falls in the acceptance region (it means computed value is less than
critical value), the null hypothesis is accepted. On the contrary, if the computed value of the test statistic is
greater than the critical value, the computed value of the statistic falls in the rejection region and the null
hypothesis is rejected.
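
For a two-tailed Z test, the decision rule described above can be sketched with Python's standard library. The test-statistic values used below (2.5 and 1.2) are made-up numbers for illustration only, and `two_tailed_decision` is a hypothetical helper name.

```python
from statistics import NormalDist

def two_tailed_decision(z, alpha=0.05):
    """Compare |z| with the two-tailed critical value of the standard normal curve."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    if abs(z) > critical:
        return "reject H0"          # statistic falls in the rejection region
    return "do not reject H0"       # statistic falls in the acceptance region

print(two_tailed_decision(2.5))   # reject H0
print(two_tailed_decision(1.2))   # do not reject H0
```

Comparing the absolute value of the statistic with the critical value covers both tails at once.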
Tests of Hypotheses

Hypothesis testing determines the validity of the assumption (technically described as null hypothesis) with
a view to choose between two conflicting hypotheses about the value of a population parameter. Hypothesis
testing helps to decide on the basis of a sample data, whether a hypothesis about the population is likely to
be true or false. Statisticians have developed several tests of hypotheses (also known as the tests of
significance) for the purpose of testing of hypotheses which can be classified as:

a) Parametric tests or standard tests of hypotheses; and


b) Non-parametric tests or distribution-free test of hypotheses.

Parametric tests usually assume certain properties of the parent population from which we draw samples.
Assumptions like observations come from a normal population, sample size is large, assumptions about the
population parameters like mean, variance, etc., must hold good before parametric tests can be used. But
there are situations when the researcher cannot or does not want to make such assumptions. In such
situations we use statistical methods for testing hypotheses which are called non-parametric tests because
such tests do not depend on any assumption about the parameters of the parent population. Besides, most
non-parametric tests assume only nominal or ordinal data, whereas parametric tests require measurement
equivalent to at least an interval scale. As a result, non-parametric tests need more observations than
parametric tests to achieve the same size of Type I and Type II errors.
Non parametric Tests
Non parametric tests are used when the data isn't normal. Therefore, the key is to figure out if you have
normally distributed data. The only non-parametric test you are likely to come across in elementary stats
is the chi-square test. However, there are several others.

5.1 Chi-Square Test (χ²)

The chi-square test is an important test amongst the several tests of significance developed by statisticians.
Chi-square, symbolically written as χ² (pronounced "ki-square"), is a statistical measure used in the context
of sampling analysis for comparing a variance to a theoretical variance. As a non-parametric test, it “can be
used to determine if categorical data shows dependency or the two classifications are independent. It can
also be used to make comparisons between theoretical populations and actual data when categories are
used.” Thus, the chi-square test is applicable in a large number of problems. The test is, in fact, a technique
through the use of which it is possible for all researchers to
▪ test the goodness of fit;
▪ test the significance of association between two attributes; and
▪ test the homogeneity or the significance of population variance.

Chi-Square Test for Goodness of Fit

The Chi-Square Test for Goodness of Fit tests claims about population proportions. It is a nonparametric
test that is performed on categorical (nominal or ordinal) data.
Illustration: 1
In the 2000 US Census, the ages of individuals in a small town were found to be the following:

              Less than 18    18–35    Greater than 35
Percentage        20%          30%          50%

In 2010, ages of n = 500 individuals from the same small town were sampled. Below are the results:

              Less than 18    18–35    Greater than 35
Observed          121          288           91

Using 5% level of significance (alpha = 0.05), would you conclude that the population distribution of ages
has changed in the last 10 years?
Solution:
Using our sample size and expected percentages, we can calculate how many people we expected to fall
within each range. We can then make a table separating observed values versus expected values:

              Less than 18        18–35               Greater than 35
Observed      121                 288                 91
Expected      500*0.20 = 100      500*0.30 = 150      500*0.50 = 250

Let's perform a hypothesis test on this table to answer the original question.

Steps for Chi-Square Test for Goodness of Fit

1. Define Null and Alternative Hypotheses
2. State Alpha
3. Calculate Degrees of Freedom
4. State Decision Rule
5. Calculate Test Statistic
6. State Results
7. State Conclusion


1. Define Null and Alternative Hypotheses

H0: the data meet the expected distribution

H1: the data do not meet the expected distribution

2. State Alpha

Alpha = 0.05

3. Calculate Degrees of Freedom

df = k - 1, where k = your number of groups. df = 3 - 1 = 2

4. State Decision Rule

Using our alpha and our degrees of freedom, we look up a critical value in the Chi-Square Table. We find
our critical value to be 5.99.

If χ² is greater than 5.99, reject H0.

5. Calculate Test Statistic

The Chi-Square statistic is found using the following equation, where observed values are compared to
expected values:

              Less than 18    18–35    Greater than 35
Observed          121          288           91
Expected          100          150          250

χ² = ∑(O - E)² / E

χ² = (121 - 100)² / 100 + (288 - 150)² / 150 + (91 - 250)² / 250

χ² = 4.41 + 126.96 + 101.124 = 232.494

6. State Results

If χ² is greater than 5.99, reject H0.

χ² = 232.494
Reject the null hypothesis.

7. State Conclusion

The ages of the 2010 population are different from those expected based on the 2000 population.
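
The goodness-of-fit steps can be sketched in a few lines. `chi_square_gof` is a hypothetical helper name. The critical value is computed from the closed form -2 ln(alpha), which holds only for 2 degrees of freedom (the chi-square distribution with df = 2 is an exponential distribution); for other degrees of freedom a table would be needed.

```python
from math import log

def chi_square_gof(observed, proportions):
    """Chi-square goodness-of-fit statistic: sum((O - E)^2 / E)."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

observed = [121, 288, 91]            # 2010 sample
proportions = [0.20, 0.30, 0.50]     # 2000 census shares

statistic, expected = chi_square_gof(observed, proportions)
critical = -2 * log(0.05)            # valid for df = 2 only; about 5.99
reject_h0 = statistic > critical
print(round(statistic, 3), round(critical, 2), reject_h0)  # 232.494 5.99 True
```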

Chi-Square Test for Independence

The Chi-Square Test for Independence evaluates the relationship between two variables. It is a
nonparametric test that is performed on categorical (nominal or ordinal) data.
Illustration: 2
500 elementary school boys and girls are asked which is their favourite colour: blue, green, or pink. Results
are shown below:

          Blue    Green    Pink    Total
Boys      100     150       20     300
Girls      20      30      180     200
Total     120     180      200     N = 500

Using 5% level of significance (alpha = 0.05), would you conclude that there is a relationship between
gender and favourite colour?
Let’s perform a hypothesis test to answer this question.

Steps for Chi-Square Test for Independence

1. Define Null and Alternative Hypotheses

2. State Alpha

3. Calculate Degrees of Freedom

4. State Decision Rule

5. Calculate Test Statistic

6. State Results

7. State Conclusion

1. Define Null and Alternative Hypotheses

H0: For the population of elementary school students, gender and favourite colour are not related.
H1: For the population of elementary school students, gender and favourite colour are related.

2. State Alpha

Alpha = 0.05

3. Calculate Degrees of Freedom

df = (rows - 1)(columns - 1)
df = (2 - 1)(3 - 1)
df = (1)(2) = 2

4. State Decision Rule

Using our alpha and our degrees of freedom, we look up a critical value in the Chi-Square Table. We find
our critical value to be 5.99.

If χ² is greater than 5.99, reject H0.

5. Calculate Test Statistic

First, we need to calculate our expected values using the equation below. We find the expected values by
multiplying each row total by each column total, and then dividing by the total number of subjects. The
calculations are shown.

EXPECTED = ROW TOTAL * COLUMN TOTAL / GRAND TOTAL

E(Boys, Blue) = 300*120/500 = 72

E(Boys, Green) = 300*180/500 = 108

E(Boys, Pink) = 300*200/500 = 120

E(Girls, Blue) = 200*120/500 = 48

E(Girls, Green) = 200*180/500 = 72

E(Girls, Pink) = 200*200/500 = 80

Expected    Blue    Green    Pink    Total
Boys         72     108      120     300
Girls        48      72       80     200
Total       120     180      200     N = 500
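
The expected-value rule (row total x column total / grand total) can be applied to every cell at once. A minimal sketch using the marginal totals of the table above; `expected_counts` is an illustrative name.

```python
def expected_counts(row_totals, column_totals):
    """Expected cell counts under independence: row total * column total / grand total."""
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

row_totals = [300, 200]          # Boys, Girls
column_totals = [120, 180, 200]  # Blue, Green, Pink

expected = expected_counts(row_totals, column_totals)
print(expected)  # [[72.0, 108.0, 120.0], [48.0, 72.0, 80.0]]
```

Each expected row sums to its row total, which is a quick check on the arithmetic.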

6. State Results

If χ² is greater than 5.99, reject H0.

χ² = 266.389
Reject the null hypothesis.

7. State Conclusion

In the population, there is a relationship between gender and favourite colour.

Important parametric tests


The important parametric tests are: (1) t-test; and (2) F-test. All these tests are based on the assumption of
normality i.e., the source of data is considered to be normally distributed.
t- test: It is based on t-distribution and is considered an appropriate test for judging the significance of a
sample mean or for judging the significance of difference between the means of two samples in case of
small sample(s) when population variance is not known (in which case we use variance of the sample as
an estimate of the population variance). In case two samples are related, we use paired t-test (or what is
known as difference test) for judging the significance of the mean of difference between the two related
samples. It can also be used for judging the significance of the coefficients of simple and partial
correlations.
F-test: It is based on F-distribution and is used to compare the variances of two independent samples.
This test is also used in the context of analysis of variance (ANOVA) for judging the significance of more
than two sample means at one and the same time. It is also used for judging the significance of multiple
correlation coefficients.

5.2 ‘t’- test


The ‘t’ test was developed by William Sealy Gosset, who published his theoretical ideas about it in 1908 under the pen
name ‘Student’, so the test is also called Student’s t-test. It is a statistical method for comparing the
means of normally distributed sample(s).
It is used when:
• Population parameters (mean and standard deviation) are not known
• Sample size (number of observations) is small (n < 30)

Types of t-test
The t-test is mainly classified into three types:
• One sample
• Independent sample
• Paired sample

One Sample
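In the one-sample t-test, we compare the sample mean with a known or hypothesized population mean.

Formula: t = (x̄ − μ) / (s / √n), with n − 1 degrees of freedom, where x̄ is the sample mean, μ is the population
mean, s is the sample standard deviation and n is the sample size.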

Illustration: 3

Illustration: 4
The marks of ten students are 10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11 and 9.5. The population mean score is 12 and the
sample standard deviation is 1.80. Does the mean of the students' marks differ significantly from the population mean?

Solution:
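A minimal sketch of the calculation for these marks, assuming the usual one-sample statistic t = (x̄ − μ) / (s / √n) with the sample standard deviation s (divisor n − 1). The function name is illustrative.

```python
import math
import statistics

def one_sample_t(sample, mu):
    """Return (t, df) for a one-sample t-test against the mean mu."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
    t = (mean - mu) / (s / math.sqrt(n))
    return t, n - 1

marks = [10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11, 9.5]
t, df = one_sample_t(marks, mu=12)
print(f"mean = {statistics.mean(marks):.2f}, s = {statistics.stdev(marks):.2f}")
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Under these assumptions the sample mean is 8.95, s is about 1.80 and t is about -5.35 with 9 degrees of freedom. The standard two-tailed 5% table value for 9 d.f. is 2.262, so |t| exceeds it and the difference is significant.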

Independent (two-sample t-test):
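In this test, we compare the means of two different (independent) samples. The formula is

t = (x̄1 − x̄2) / √[s²(1/n1 + 1/n2)], where s² = [∑(x1 − x̄1)² + ∑(x2 − x̄2)²] / (n1 + n2 − 2) is the pooled variance.

Degrees of Freedom: The degrees of freedom are the number of values that are free to vary. For this test they are
given by: df = n1 + n2 − 2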

Let’s understand two-sample t-test by an example:

Illustration: 5
The marks of boys and girls are given below. Boys: 12, 14, 10, 8, 16, 5, 3, 9 and 11. Girls: 21, 18, 14, 20, 11, 19,
8, 12, 13 and 15. Is there any significant difference between the marks of boys and girls, i.e., are the population
means different?

Solution:
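A minimal sketch of the calculation for these marks. It assumes the equal-variance (pooled) form of the two-sample t-test with n1 + n2 − 2 degrees of freedom; the function name is illustrative.

```python
import math
import statistics

def pooled_two_sample_t(a, b):
    """Return (t, df) for a two-sample t-test with a pooled variance."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.mean(a), statistics.mean(b)
    # Sum of squared deviations within each sample.
    ss1 = sum((x - m1) ** 2 for x in a)
    ss2 = sum((x - m2) ** 2 for x in b)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

boys = [12, 14, 10, 8, 16, 5, 3, 9, 11]
girls = [21, 18, 14, 20, 11, 19, 8, 12, 13, 15]
t, df = pooled_two_sample_t(boys, girls)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Under these assumptions t is about -2.755 with 17 degrees of freedom. The standard two-tailed 5% table value for 17 d.f. is 2.110, so the difference between the two means is significant at the 5% level.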

Paired t-test:

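In this test, we compare the means of two related samples, or of the same group measured at two different times.
Formula: t = d̄ / (s_d / √n), where d̄ is the mean of the paired differences, s_d is their standard deviation and n is
the number of pairs.

Degree of freedom is n-1.

Let’s understand the paired t-test by an example: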


Illustration: 6
The blood pressures of 8 patients are recorded before and after. Before: 180, 200, 230, 240, 170, 190, 200
and 165. After: 140, 145, 150, 155, 120, 130, 140 and 130. Is there any significant difference between the BP
readings before and after?
Solution:
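A minimal sketch of the calculation for these readings, assuming the usual paired statistic: the mean of the differences divided by its standard error, with n − 1 degrees of freedom. The function name is illustrative.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired t-test on before/after measurements."""
    if len(before) != len(after):
        raise ValueError("both lists must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # divisor n - 1
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

before = [180, 200, 230, 240, 170, 190, 200, 165]
after = [140, 145, 150, 155, 120, 130, 140, 130]
t, df = paired_t(before, after)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Under these assumptions the mean difference is 58.125 and t is about 9.39 with 7 degrees of freedom. The standard two-tailed 5% table value for 7 d.f. is 2.365, so the before and after readings differ significantly.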

5.3 F Tests
F-tests are named after Sir Ronald Fisher. The F-statistic is simply a ratio of two variances.
Variance is the square of the standard deviation. Standard deviations are usually easier to
understand than variances because they are in the same units as the data rather than squared units. F-statistics
are based on the ratio of mean squares. The term "mean squares" may sound confusing, but it is simply an
estimate of population variance that accounts for the degrees of freedom used to calculate that estimate. For
carrying out the test of significance, we calculate the ratio F, which is defined as:

F = S1² / S2², where S1² = ∑(X1 − X̄1)² / (n1 − 1) and S2² = ∑(X2 − X̄2)² / (n2 − 1)

The larger of the two variance estimates is taken as the numerator, and v1 and v2 denote the degrees of freedom of
the numerator and the denominator respectively.

The calculated value of F is compared with the table value for v1 and v2 at the 5% or 1% level of significance.
If the calculated value of F is greater than the table value, the F ratio is considered significant and the null
hypothesis is rejected. On the other hand, if the calculated value of F is less than the table value, the null
hypothesis is accepted and it is inferred that both samples have come from populations having the same
variance.

Illustration: 7
Two random samples were drawn from two normal populations and their values are:
A 65 66 73 80 82 84 88 90 92
B 64 66 74 78 82 85 87 92 93 95 97
Test whether the two populations have the same variance at the 5% level of significance. (Given: F=3.36 at
5% level for v1=10 and v2=8.)
Solution: Let us take the null hypothesis that the two populations have the same variance.
Applying F-test:
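A minimal sketch of the variance-ratio calculation for samples A and B, assuming the convention that the larger sample variance (divisor n − 1) goes in the numerator. The function name is illustrative.

```python
import statistics

def variance_ratio_f(x, y):
    """Return (F, v1, v2) with the larger sample variance in the numerator."""
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    if var_x >= var_y:
        return var_x / var_y, len(x) - 1, len(y) - 1
    return var_y / var_x, len(y) - 1, len(x) - 1

a = [65, 66, 73, 80, 82, 84, 88, 90, 92]
b = [64, 66, 74, 78, 82, 85, 87, 92, 93, 95, 97]
f, v1, v2 = variance_ratio_f(a, b)
print(f"variance A = {statistics.variance(a):.2f}, variance B = {statistics.variance(b):.2f}")
print(f"F = {f:.3f} for v1 = {v1} and v2 = {v2}")
```

Under these assumptions the sample variances are 99.75 for A and 129.80 for B, giving F of about 1.301 for v1 = 10 and v2 = 8, which is well below the given table value of 3.36.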

At the 5 percent level of significance, for v1=10 and v2=8, the table value of F0.05 = 3.36. The calculated value of
F is less than the table value. The null hypothesis is accepted. Hence the two populations may be taken to have the
same variance.

5.4 Analysis of Variance (ANOVA)


Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the aggregate
variability observed in a data set into two parts: systematic factors and random factors. The systematic
factors have a statistical influence on the given data set, while the random factors do not. The method is
based upon an unusual result: the equality of several population means can be tested by comparing the
sample variances using the F distribution. With the t-test we test whether two population means are equal.
The analysis of variance is an extension of the t-test to the case of more than two means.

One-Way ANOVA Versus Two-Way ANOVA


There are two types of ANOVA: one-way (or unidirectional) and two-way. One-way or two-way refers to
the number of independent variables in the analysis of variance test. A one-way ANOVA evaluates the
impact of a single factor on a single response variable. It determines whether all the sample means are the same. The
one-way ANOVA is used to determine whether there are any statistically significant differences between
the means of three or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent
variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For example,
a two-way ANOVA allows a company to compare worker productivity based on two independent variables,
such as salary and skill set. It is used to observe the interaction between the two factors and to test the
effect of two factors at the same time.

Example of One Way ANOVA


Illustration: 8
The following table shows the retail prices (Rs. per kg.) of a commodity in some shops selected at random
in four cities:
A B C D
34 29 27 34
37 33 29 36
32 30 31 38
33 34 28 35
Carry out the analysis of variance to test the significance of the differences between prices of the commodity
in the four cities. [Given, F0.05 = 3.49 for (3, 12) degrees of freedom].
Solution:

Each observation is reduced by 39, as shown below.

Calculation for Analysis of Variance

                       A          B          C          D
                      -5        -10        -12         -5
                      -2         -6        -10         -3
                      -7         -9         -8         -1
                      -6         -5        -11         -4

Total              T1 = -20   T2 = -30   T3 = -41   T4 = -13   T = -104
Total of squares     114        242        429         51      ∑∑xij² = 836
Sample size        n1 = 4     n2 = 4     n3 = 4     n4 = 4     N = 16

Correction Factor (C.F.) = T²/N = (-104)²/16 = 10816/16 = 676

Total Sum of Squares (SS) = ∑∑xij² - C.F. = 836 - 676 = 160

Sum of Squares Between Groups (SSB) = ∑(Ti²/ni) - C.F.
= [(-20)²/4 + (-30)²/4 + (-41)²/4 + (-13)²/4] – 676
= 787.50 - 676
= 111.50
Sum of Squares due to Errors (SSE) = Total SS – SSB = 160 - 111.50 = 48.50
Analysis of Variance Table

Source of Variation      S.S      d.f                 Mean Squares (M.S)   Observed F Value               Tabulated F Value
Between Groups           111.50   (k-1) = 4-1 = 3     111.50/3 = 37.17     MSB/MSE = 37.17/4.04 = 9.196   F0.05 = 3.49 for (3,12) d.f
Within Groups (Errors)   48.50    (N-k) = 16-4 = 12   48.50/12 = 4.04
Total                    160      16-1 = 15

Since the observed value of F (i.e., 9.196) exceeds the 5% tabulated value (i.e., 3.49) for (3,12) d.f., we
reject the null hypothesis of equality of population means, and conclude that the retail prices of the
commodity in the four cities are not equal.
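The one-way calculation above can be sketched in a few lines. The sketch works on the original prices rather than the reduced values: subtracting a constant such as 39 only simplifies the hand arithmetic and leaves every sum of squares unchanged. The function name is illustrative.

```python
def one_way_anova(groups):
    """Return (ssb, sse, f) for a one-way analysis of variance."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    k = len(groups)
    correction = sum(all_values) ** 2 / n_total          # C.F. = T^2 / N
    total_ss = sum(x * x for x in all_values) - correction
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    sse = total_ss - ssb
    f = (ssb / (k - 1)) / (sse / (n_total - k))           # MSB / MSE
    return ssb, sse, f

prices = {
    "A": [34, 37, 32, 33],
    "B": [29, 33, 30, 34],
    "C": [27, 29, 31, 28],
    "D": [34, 36, 38, 35],
}
ssb, sse, f = one_way_anova(list(prices.values()))
print(f"SSB = {ssb:.2f}, SSE = {sse:.2f}, F = {f:.3f}")
```

This gives SSB = 111.50, SSE = 48.50 and F of about 9.196, matching the analysis of variance table.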
In order to test which of the cities differ in prices, we calculate the critical difference (C.D.).

C.D. = s × √(2n) × t0.025 (for 12 d.f.), where s² = MSE and n = 4 is the number of observations per city
s = √MSE = √4.04
C.D. = √4.04 × √(2 × 4) × 2.18 = 12.39
The sample totals (of the reduced observations) are T1 = -20, T2 = -30, T3 = -41, T4 = -13. We have
| T1 - T2 | = 10
| T1 - T3 | = 21
| T1 - T4 | = 7
| T2 - T3 | = 11
| T2 - T4 | = 17
| T3 - T4 | = 28
Comparing these figures with the C.D. (i.e., 12.39), we find that the cities A and C, B and D, and C and D differ
in prices. Cities A and B, A and D, and B and C may be taken to have the same prices.
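The pairwise comparison can be sketched as follows. The totals, the error sum of squares (48.50 on 12 d.f.) and the table value 2.18 are taken from the working above; variable names are illustrative.

```python
import math
from itertools import combinations

totals = {"A": -20, "B": -30, "C": -41, "D": -13}  # totals of reduced observations
mse = 48.50 / 12        # error mean square from the ANOVA table
n = 4                   # observations per city
t_table = 2.18          # t0.025 for 12 d.f., as used in the text

# Critical difference between two group totals.
cd = math.sqrt(mse) * math.sqrt(2 * n) * t_table

differing = [
    (a, b)
    for a, b in combinations(totals, 2)
    if abs(totals[a] - totals[b]) > cd
]
print(f"C.D. = {cd:.2f}")
print("Pairs that differ:", differing)
```

The pairs flagged are A and C, B and D, and C and D. The C.D. printed here is marginally larger than 12.39 because the sketch uses the unrounded mean square instead of 4.04.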

Example of Two Way ANOVA


Illustration: 9
The following table gives the estimates of the acreage of cultivable but uncultivated land out of 100 acres
of total land, as obtained by three investigators in each of three districts. Perform an analysis of variance to
test whether there are significant differences between investigators and between districts. [Given F0.05 = 6.94 for d.f.
(2, 4)]
Investigator District
I II III
A 23 28 26
B 24 25 27
C 24 22 26
Solution
Each observation is reduced by 24, as shown below.

Calculation for Analysis of Variance

Investigator            District             Total
                  I       II      III
A                -1        4        2          5
B                 0        1        3          4
C                 0       -2        2          0
Total (Ti)       -1        3        7        T = 9
Sample size    n1 = 3   n2 = 3   n3 = 3      N = 9
Total of the squares of all figures:
∑∑xij² = (-1)² + (4)² + (2)² + (0)² + (1)² + (3)² + (0)² + (-2)² + (2)² = 39
Correction Factor (C.F.) = T²/N = (9)²/9 = 81/9 = 9
Total Sum of Squares (SS) = ∑∑xij² - C.F. = 39 - 9 = 30
Sum of Squares (SS) between Investigators = (T1² + T2² + T3²)/3 – C.F., using the investigator (row) totals
= (5² + 4² + 0²)/3 – 9
= 41/3 – 9
= 4.67
Sum of Squares (SS) between Districts = (T1ʹ² + T2ʹ² + T3ʹ²)/3 – C.F., using the district (column) totals
= [(-1)² + (3)² + (7)²]/3 – 9
= 59/3 – 9
= 10.67
SS due to Error = Total SS – (SS between investigators) - (SS between districts)
= 30 – 4.67 – 10.67 = 14.66

Analysis of Variance Table


Source of Variation       S.S     d.f       Mean Squares (M.S)   Observed F Value   Tabulated F Value
Between Investigators     4.67    3-1 = 2   4.67/2 = 2.34        2.34/3.67 = 0.64   F0.05 = 6.94 for (2,4) d.f
Between Districts         10.67   3-1 = 2   10.67/2 = 5.34       5.34/3.67 = 1.46   F0.05 = 6.94 for (2,4) d.f
Within Groups (Errors)    14.66   4         14.66/4 = 3.67
Total                     30      9-1 = 8

Since the observed value of F for investigators (i.e., 0.64) is less than the corresponding tabulated value
(i.e., 6.94) for d.f. (2, 4), it is not significant at the 5% level. We conclude that the mean acreage of cultivable
land as determined by the three investigators may not be different from one another,
i.e., there are no significant differences between investigators.
Since the observed value of F for districts (i.e., 1.46) is less than the corresponding tabulated value (i.e.,
6.94) for d.f. (2, 4), it is not significant at the 5% level. We conclude that the estimates of acreage of cultivable
land in the three districts may not be different from one another, i.e., there are no significant differences
between districts.
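The two-way calculation can be sketched in the same way. The sketch assumes one observation per investigator and district (no replication), so the error degrees of freedom are (rows − 1)(columns − 1) = 4, and it works on the original acreages because subtracting 24 does not change any sum of squares. The function name is illustrative.

```python
def two_way_anova(table):
    """Two-way ANOVA without replication; rows and columns are the two factors.

    Returns (f_rows, f_cols, error_df).
    """
    r, c = len(table), len(table[0])
    values = [x for row in table for x in row]
    correction = sum(values) ** 2 / (r * c)               # C.F. = T^2 / N
    total_ss = sum(x * x for x in values) - correction
    ss_rows = sum(sum(row) ** 2 for row in table) / c - correction
    ss_cols = sum(sum(col) ** 2 for col in zip(*table)) / r - correction
    ss_error = total_ss - ss_rows - ss_cols
    error_df = (r - 1) * (c - 1)
    mse = ss_error / error_df
    f_rows = (ss_rows / (r - 1)) / mse
    f_cols = (ss_cols / (c - 1)) / mse
    return f_rows, f_cols, error_df

# Rows are investigators A, B, C; columns are districts I, II, III.
acreage = [
    [23, 28, 26],
    [24, 25, 27],
    [24, 22, 26],
]
f_investigators, f_districts, error_df = two_way_anova(acreage)
print(f"F (investigators) = {f_investigators:.2f}")
print(f"F (districts) = {f_districts:.2f}")
print(f"error d.f. = {error_df}")
```

The sketch gives F of about 0.64 for investigators and about 1.45 for districts. The table shows 1.46 for districts because it divides rounded intermediate values; both ratios remain far below the tabulated 6.94.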