Business Statistics Class
SYLLABUS
UNIT	Details
I	Introduction – Meaning and Definition of Statistics – Collection and Tabulation of Statistical Data – Presentation of Statistical Data – Graphs and Diagrams
II	Measures of Central Tendency – Arithmetic Mean, Median and Mode – Harmonic Mean and Geometric Mean
III	Measures of Variation – Quartile Deviation – Mean Deviation – Standard Deviation
IV	Simple Correlation – Scatter Diagram – Karl Pearson's Correlation – Rank Correlation – Regression
V	Testing of Hypothesis – Chi-Square Test, T Test, F Test, ANOVA
CONTENTS
2.5 Harmonic Mean
UNIT III – Measures of Variation
3.1 Quartile Deviation
3.2 Mean Deviation
3.3 Standard Deviation
UNIT IV – Simple Correlation
4.1 Scatter Diagram
4.2 Karl Pearson's Correlation
4.3 Rank Correlation
4.4 Regression
UNIT V – Testing of Hypothesis
5.1 Chi-Square Test
5.2 T Test
5.3 F Test
5.4 ANOVA
UNIT I
INTRODUCTION - MEANING AND DEFINITION OF STATISTICS
1.1 Introduction
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In other words, it is a mathematical discipline for collecting and summarizing data.
The word 'Statistics' is derived from the Latin term 'Status', the Italian term 'Statista', the German term 'Statistik' or the French term 'Statistique', each of which means a political state. The term statistics was applied to mean the facts and figures which were needed by the State in respect of its divisions, their respective populations, birth rates, incomes and the like.
Meaning of statistics
Statistics refers to numerical facts and figures collected in a systematic manner with a specific purpose in any field of study. In this (plural) sense, statistics means aggregates of facts expressed in numerical form.
In the singular sense, statistics refers to a science which comprises methods used in the collection, analysis, interpretation and presentation of numerical data. These methods are used to draw conclusions about population parameters.
Definition of statistics
• A.L. Bowley defines statistics as "numerical statements of facts in any department of enquiry placed in relation to each other".
• According to Croxton and Cowden, Statistics may be defined as the science of collection,
presentation, analysis and interpretation of numerical data.
• Gottfried Achenwall defined statistics as "Statistics are collection of noteworthy facts concerning
state both historical and descriptive".
• According to Yule and Kendall, “Statistics means quantitative data affected to a marked extent by
multiplicity of causes.”
• “Statistics” - as defined by the American Statistical Association (ASA) - “is the science of
learning from data, and of measuring, controlling and communicating uncertainty.”
Nature of statistics
• Statistics is both a science and an art
• As a science statistical methods are generally systematic and based on fundamental ideas and
processes
• It also works as a base for all other sciences.
• As an art it explores the merits and demerits, guides about the means to achieve the objective
Scope of statistics
• Statistics is a mathematical science pertaining to the collection, analysis, interpretation or
explanation and presentation of data.
• It provides tools for predicting and forecasting the economic activities.
• It is useful for an academician, government, business etc.
Characteristics of statistics
• Statistics are Aggregate of facts
• It is numerically expressed
• The statistical Data affected by multiplicity of causes
• It is enumerated according to reasonable standards of accuracy
• It is collected in a systematic manner
• It is collected for a pre-determined purpose
• It is placed in relation to each other
Uses of statistics
Statistics helps in
• Providing a better understanding
• Exact description
• efficient planning of a statistical inquiry in any field of study
• Collecting appropriate quantitative data
• Business forecasting
• Decision making
• Quality control
• Search of new ventures
• Study of market
• Study of business cycles
• Useful for planning
• Useful for finding averages
• Useful for bankers, brokers, insurance, etc.
Limitations of statistics
• It is not useful for individual cases
• It ignores qualitative aspects
• It deals with average only
• Improper use of statistics can be dangerous
• It is only a means, not an end
• It does not distinguish between cause and effect
• Its results are not always dependable
1.2 Collection of Data
The importance of the data collection process is that it helps to determine many important things about the company, particularly its performance. So, the data collection process plays an important role in all streams. Depending on the type of data, data collection methods are divided into two categories, namely,
a) Primary data collection methods
b) Secondary data collection methods
Schedule method
This method is similar to the questionnaire method, with a slight difference. Enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove misunderstandings, if any come up. Enumerators should be trained to perform their job with hard work and patience.
1.3 Presentation of Statistical Data
a) Textual presentation
When presenting data in this way, researchers use words to describe the relationship between pieces of information. Textual presentation enables researchers to share information that cannot be displayed on a graph; for example, the findings of a study may be presented textually. When researchers want to provide additional context or explanation, they may choose this format because, in text, information may appear clearer.
Textual presentation is common for sharing research and presenting new ideas. It only includes paragraphs and words, rather than tables or graphs, to show data.
b) Tabular presentation
Tabular presentation means using a table to share large amounts of information. When using this method, researchers organise and classify the data in rows and columns according to the characteristics of the data. Tabular presentation is useful for comparing data, and it helps to visualise information. Researchers use this type of presentation in analysis, for example to classify and tabulate data.
Classification of data
Classification is a process of arranging things or data in groups or classes according to the common
characteristics. It is based on
• Geographical (i.e. on the basis of area or region wise)
• Chronological (On the basis of Historical, i.e. with respect to time)
• Qualitative (on the basis of character / attributes)
• Numerical, quantitative (on the basis of magnitude)
1) Geographical Classification
In geographical classification, the classification is based on the geographical regions.
Ex: Sales of the company (In Million Rupees) (region – wise)
Region Sales
North 285
South 300
East 185
West 235
2) Chronological Classification
If the statistical data are classified according to the time of its occurrence, the type of classification is
called chronological classification.
Sales reported by a departmental store
Month Sales (Rs.) in
lakhs
January 22
February 26
March 32
April 25
May 27
June 30
3) Qualitative Classification
In qualitative classifications, the data are classified according to the presence or absence of attributes in
given units. Thus, the classification is based on some quality characteristics / attributes.
Ex: Sex, Literacy, Education, Class grade etc.
Further, it may be classified as
a) Simple classification b) Manifold classification
i) Simple classification:
If the classification is done into only two classes then classification is known as simple classification.
Ex: a) Population in to Male / Female
b) Population into Educated / Uneducated
ii) Manifold classification:
In this classification, the classification is based on more than one attribute at a time.
4) Quantitative Classification
In Quantitative classification, the classification is based on quantitative measurements of some
characteristics, such as age, marks, income, production, sales etc. The quantitative phenomenon under
study is known as variable and hence this classification is also called as classification by variable.
Ex:
Ex: For a 50-mark test, the marks obtained by students are classified as follows:
Marks	No. of students
0 – 10	5
10 – 20	7
20 – 30	10
30 – 40	25
40 – 50	3
Total students = 50
In this classification, the marks obtained by students are the variable and the number of students in each class represents the frequency.
Tabulation of data
Tabulation may be defined as the systematic arrangement of data in columns and rows. It is designed to simplify the presentation of data for the purpose of analysis and statistical inference.
Objectives of Tabulation
• To simplify the complex data
• To facilitate comparison
• To economise the space
• To draw valid inference / conclusions
• To help for further analysis
Parts of a Table
• Table Number: Each table should have a specific table number for ease of access and locating.
• Title: A table must contain a title that clearly tells the readers about the data.
• Head notes: A head note further aids the purpose of a title and displays more information about the table.
• Stubs: These are the titles of the rows in a table.
• Caption: A caption is the title of a column in the data table.
• Body or field: The body of a table is the content of a table in its entirety. Each item in a body is known as a 'cell'.
• Footnotes: Footnotes are rarely used. In effect, they supplement the title of a table if required.
Types of tabulation
In general, tabulation is classified into two parts, that is, simple tabulation and complex tabulation. Simple tabulation gives information regarding one or more independent questions; complex tabulation gives information regarding two or more mutually dependent questions.
Simple tabulation
Data are classified based on only one characteristic.
Distribution of marks
Class Marks No. of students
30 – 40 20
40 – 50 20
50 – 60 10
Total 50
Complex tabulation
Data are classified based on two or more characteristics. Two-way table: Classification is based on two
characteristics.
Class Marks	Number of students
	Boys	Girls	Total
30 – 40	10	10	20
40 – 50	15	5	20
50 – 60	3	7	10
Total	28	22	50
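As a rough sketch of how such a two-way table can be produced programmatically, the snippet below rebuilds the same classification with pandas; the column names (marks_class, gender) and the pandas dependency are illustrative assumptions, not part of the text.

```python
import pandas as pd

# Hypothetical student records: each row is one student with a marks class and gender.
records = pd.DataFrame({
    "marks_class": ["30-40"] * 20 + ["40-50"] * 20 + ["50-60"] * 10,
    "gender": ["Boy"] * 10 + ["Girl"] * 10      # 30-40
            + ["Boy"] * 15 + ["Girl"] * 5       # 40-50
            + ["Boy"] * 3 + ["Girl"] * 7,       # 50-60
})

# Two-way (complex) tabulation: rows = marks class, columns = gender, with totals.
two_way = pd.crosstab(records["marks_class"], records["gender"],
                      margins=True, margins_name="Total")
print(two_way)
```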
1.4 Graphs and diagrams
Graphical and diagrammatic representations of data are visual aids that can help people understand data
more easily. This method of displaying data uses diagrams, graphs and images. Graphs and diagrams are
both visual representations of data, but they have different purposes and uses.
Graphs
Data can also be effectively presented by means of graphs. A graph consists of curves or straight lines.
Graphs provide a very good method of showing fluctuations and trends in statistical data. Graphs can also
be used to make predictions and forecasts.
• Line graphs
• Histogram
• Frequency Polygon
• Frequency Curve
• Cumulative Frequency Polygon (Ogive)
Line graphs: Show changes and trends in data over time by connecting data points with lines.
Histograms: Similar to column charts, histograms are made up of vertical columns whose length is
proportional to the frequency of a variable.
Frequency polygon
A frequency polygon is a many sided closed figure. It can also be obtained by joining the mid-points of the
tops of rectangles in the histograms.
Frequency Curve
When the frequency polygon is smoothed out as a curve, it becomes a frequency curve; or, when the mid-points are plotted against the frequencies, the smooth curve passing through these points is called a frequency curve.
Cumulative Frequency Polygon (Ogive)
When a curve is based on cumulative frequencies then it is called a cumulative frequency polygon or ogive.
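The following short sketch (assuming matplotlib and numpy are available; the sample class data are invented for illustration) shows how a histogram and an ogive of a grouped frequency distribution might be drawn.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical grouped data: class boundaries and frequencies.
edges = np.array([0, 10, 20, 30, 40, 50])
freq = np.array([6, 5, 8, 15, 7])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

# Histogram: bar heights proportional to class frequencies.
ax1.bar(edges[:-1], freq, width=np.diff(edges), align="edge", edgecolor="black")
ax1.set_title("Histogram")

# Ogive: cumulative frequency plotted against upper class boundaries.
ax2.plot(edges[1:], np.cumsum(freq), marker="o")
ax2.set_title("Cumulative frequency polygon (ogive)")

plt.tight_layout()
plt.show()
```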
Diagrams
It is a technique of presenting numeric data through pictograms, cartograms, bar diagrams and pie diagrams. It is the most attractive and appealing way to represent statistical data. Diagrams help in visual comparison and give a bird's eye view of the data. Diagrams are classified as:
• Bar diagram/Charts
• Rectangle and Sub-divided Rectangle
• Pie diagram/Chart
• Scatter plots
• Cartograms
• Pictograms
Bar diagram/Charts
A component bar chart is an effective technique in which each bar is sub-divided into two or more parts.
The component parts are shaded or coloured differently to increase the overall effectiveness of the diagram.
Pie chart
A pie chart is a type of a chart that visually displays data in a circular graph. It is one of the most commonly
used graphs to represent data using the attributes of circles, spheres, and angular data to represent real-
world information.
Scatter plots
A scatter plot is also called a scatter graph, scatter chart, scattergram, or scatter diagram. It uses dots to represent values for two different numeric variables. Scatter plots are used to observe relationships between variables.
Cartograms
This includes any type of map that shares the location of a person, place or object. For example, cartograms
help navigate theme parks so you can find attractions, food and gift shops.
Pictograms: This diagram uses images to represent data. For example, the number of students and their favourite games can be shown using a pictogram.
1.5 Applications of Statistics in Business
The field of statistics has numerous applications in business. Because of technological advancements, large amounts of data are generated by businesses these days. These data are now being used to make decisions, and the better decisions we make help us improve the running of a department, a company, or the entire economy.
"Statistics is extensively used to enhance business performance through analytics."
❖ Marketing: As per Philip Kotler and Gary Armstrong, marketing "identifies customer needs and wants, determines which target markets the organisation can serve best, and designs appropriate products, services and programmes to serve these markets".
Marketing is all about creating and growing customers profitably. Statistics is used in almost every
aspect of creating and growing customers profitably. Statistics is extensively used in making decisions
regarding how to sell products to customers. Also, intelligent use of statistics helps managers to design
marketing campaigns targeted at the potential customers. Marketing research is the systematic and
objective gathering, recording and analysis of data about aspects related to marketing. IMRB
international, TNS India, RNB Research, The Nielson, Hansa Research and Ipsos Indica Research are
some of the popular market research companies in India. Web analytics is about the tracking of online
behaviour of potential customers and studying the behaviour of browsers to various websites. Use of
Statistics is indispensable in forecasting sales, market share and demand for various types of Industrial
products. Factor analysis, conjoint analysis and multidimensional scaling are invaluable tools which
are based on statistical concepts, for designing of products and services based on customer response.
❖ Finance: Uncertainty is the hallmark of the financial world. All financial decisions are based on "expectation", which is best analysed with the help of the theory of probability and statistical techniques. Probability and statistics are used extensively in designing new insurance policies and in fixing premiums for insurance policies. Statistical tools and techniques are used for analysing and quantifying risk, and are also used in the valuation of derivative instruments and in comparing the return on investment in two or more instruments or companies. The beta of a stock or equity is a statistical tool for comparing volatility, and is highly useful for selecting a portfolio of stocks. The most sophisticated traders in today's stock markets are those who trade in "derivatives", i.e. financial instruments whose underlying price depends on the price of some other asset.
❖ Economics: Statistical data and methods render valuable assistance in the proper understanding of the
economic problem and the formulation of economic policies. Most economic phenomena and indicators
can be quantified and dealt with statistically sound logic. In fact, Statistics got so much integrated with
Economics that it led to development of a new subject called Econometrics which basically deals with
economics issues involving use of Statistics.
❖ Operations: The field of operations is about transforming various resources into product and services
in the place, quantity, cost, quality and time as required by the customers. Statistics plays a very useful
role at the input stage through sampling inspection and inventory management, in the process stage
through statistical quality control and six sigma method, and in the output stage through sampling
inspection. The term Six Sigma quality refers to a situation where there are only 3.4 defects per million opportunities.
❖ Human Resource Management or Development: Human Resource departments are inter alia
entrusted with the responsibility of evaluating the performance, developing rating systems, evolving
compensatory reward and training system, etc. All these functions involve designing forms, collecting,
storing, retrieval and analysis of a mass of data. All these functions can be performed efficiently and
effectively with the help of statistics.
❖ Information Systems: Information Technology (IT) and statistics both have similar systematic
approach in problem solving. IT uses statistics in various areas like, optimisation of server time,
assessing performance of a program by finding time taken as well as resources used by the program. It
is also used in testing of the software.
❖ Data Mining: Data Mining is used in almost all fields of business. In Marketing, Data mining can be
used for market analysis and management, target marketing, CRM, market basket analysis, cross
selling, market segmentation, customer profiling and managing web based marketing, etc. In Risk
analysis and management, it is used for forecasting, customer retention, quality control, competitive
analysis and detection of unusual patterns.
In Finance, it is used in corporate planning and risk evaluation, financial planning and asset
evaluation, cash flow analysis and prediction, contingent claim analysis to evaluate assets, cross
sectional and time series analysis, customer credit rating, detecting of money laundering and other
financial crimes. In Operations, it is used for resource planning, for summarising and comparing the
resources and spending. In Retail industry, it is used to identify customer behaviours, patterns and
trends as also for designing more effective goods transportation and distribution policies, etc.
UNIT II
MEASURES OF CENTRAL TENDENCY
Measures of central tendency give a typical value of the entire group or data. They describe the characteristics of the entire mass of data, reduce the complexity of the data and make comparison possible. The human mind is incapable of remembering the entire mass of unwieldy data, so a simple figure is used to describe the series, and this figure must be a representative number. It is generally called "a measure of central tendency" or "the average".
A central tendency is a central or typical value for a probability distribution. It may also be called a centre or location of the distribution. Colloquially, measures of central tendency are often called averages. The term central tendency dates from the late 1920s. If a large volume of data is summarized and given in one simple term, it is called the 'central value' or an 'average'. In other words, an average is a single value that represents a group of values.
A measure of central tendency is a typical value around which other figures congregate.
Average condenses a frequency distribution in one figure. According to the statisticians, an average
will be termed good or efficient if possesses the following characteristics:
➢ It should be rigidly defined. It means that the definition should be so clear that the interpretation
of the definition does not differ from person to person.
➢ It should be easy to understand and simple to calculate.
➢ It should be such that it can be easily determined.
➢ The average of a variable should be based on all the values of the variable. This means that in
the formula for average all the values of the variable should be incorporated.
➢ The value of average should not change significantly along with the change in sample. This
means that the values of the averages of different samples of the same size drawn from the same
population should have small variations.
➢ It should be properly defined, preferably by a mathematical formula, so that different individuals
working with the same data should get the same answer unless there are mistakes in calculations.
➢ It should be based on all the observations so that if we change the value of any observation, the
value of the average should also be changed.
➢ It should not be unduly affected by extremely large or extremely small values.
➢ It should be capable of algebraic manipulation. By this we mean that if we are given the average
heights for different groups, then the average should be such that we can find the combined
average of all groups taken together.
➢ It should have quality of sampling stability. That is, it should not be affected by the fluctuations
of sampling. For example, if we take ten or twelve samples of twenty students each and find
the average height for each sample, we should get approximately the same average height for
each sample.
2.1 MEAN
Mean is one of the types of averages. Mean is further divided into three kinds, which are the
arithmetic mean, the geometric mean and the harmonic mean. These kinds are explained as follows;
The arithmetic mean is the most commonly used average. It is generally referred to as the average or simply the mean. The arithmetic mean is defined as the value obtained by dividing the sum of the values by their number, and it is denoted by X̄ (read as X-bar). Therefore, the mean of the values X1, X2, X3, …, Xn is denoted by X̄. For an individual series the formula is:
X̄ = ΣX / N
Where, X̄ = Arithmetic Mean; ΣX = sum of all the values of the variable, i.e. X1 + X2 + X3 + … + Xn;
N = Number of observations.
Illustration 1: Calculate the arithmetic mean of the marks obtained by 10 students.
Roll Numbers	1	2	3	4	5	6	7	8	9	10
Marks	40	50	55	78	58	60	73	35	43	48
Solution:
Roll No.	Marks (X)
1	40
2	50
3	55
4	78
5	58
6	60
7	73
8	35
9	43
10	48
N = 10	ΣX = 540
X̄ = ΣX / N = 540 / 10 = 54 marks.
The arithmetic mean can also be calculated by the short-cut method. This method reduces the amount of calculation. The formula is:
X̄ = A + (Σd / N)
Where, X̄ = Arithmetic Mean; A = Assumed mean; Σd = sum of the deviations d = X − A; N = Number of items.
Taking A = 50 for the data of Illustration 1:
Roll No.	Marks (X)	d = X − 50
1	40	−10
2	50	0
3	55	5
4	78	28
5	58	8
6	60	10
7	73	23
8	35	−15
9	43	−7
10	48	−2
N = 10	Σd = 40
X̄ = A + (Σd / N) = 50 + 40/10 = 50 + 4 = 54 marks.
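A quick way to check both calculations is to compute the mean directly and via an assumed mean in Python; this is only an illustrative sketch (the variable names and the choice A = 50 follow the worked example above).

```python
marks = [40, 50, 55, 78, 58, 60, 73, 35, 43, 48]

# Direct method: X-bar = sum(X) / N
direct_mean = sum(marks) / len(marks)

# Short-cut method: X-bar = A + sum(d) / N, with deviations d = X - A
A = 50
deviations = [x - A for x in marks]
shortcut_mean = A + sum(deviations) / len(marks)

print(direct_mean, shortcut_mean)   # both give 54.0
```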
B. Discrete Series: Direct Method
To find the total of the items in a discrete series, the frequency of each value is multiplied by the respective value. The products so obtained are totalled, and this total is divided by the total frequency to obtain the arithmetic mean. The formula is:
X̄ = Σfx / N
Where, X̄ = Arithmetic Mean; Σfx = the sum of the products of the values and their frequencies; N = Total frequency.
Illustration 3: Calculate mean from the following data:
Value 1 2 3 4 5 6 7 8 9 10
Frequency 21 30 28 40 26 34 40 9 15 57
x	f	fx
1	21	21
2	30	60
3	28	84
4	40	160
5	26	130
6	34	204
7	40	280
8	9	72
9	15	135
10	57	570
Σf = N = 300	Σfx = 1716
X̄ = Σfx / N = 1716 / 300 = 5.72
Short-cut Method: The formula is:
X̄ = A + (Σfd / N)
Where, X̄ = Arithmetic Mean; A = Assumed mean; Σfd = sum of the products of the deviations d = X − A and the frequencies; N = Total frequency.
x	f	d = x − A (A = 5)	fd
1	21	−4	−84
2	30	−3	−90
3	28	−2	−56
4	40	−1	−40
5	26	0	0
6	34	1	34
7	40	2	80
8	9	3	27
9	15	4	60
10	57	5	285
Σf = N = 300	Σfd = +216
X̄ = A + (Σfd / N) = 5 + 216/300 = 5 + 0.72 = 5.72
C. Continuous Series
In a continuous frequency distribution, the value of each individual item is unknown. Therefore, on the assumption that the frequency of each class interval is concentrated at its centre, the mid-point of each class interval has to be found out. In a continuous frequency distribution, the mean can be calculated by any of the following methods:
1. Direct Method
2. Short cut method
3. Step Deviation Method
1. Direct Method: The formula is:
X̄ = Σfm / N
Where, X̄ = Arithmetic Mean; Σfm = sum of the products of the frequencies and the mid-points; N = Total frequency.
Illustration 5: From the following find out the mean:
Class Interval 0 – 10 10 – 20 20 – 30 30 – 40 40 - 50
Frequency 6 5 8 15 7
Solution:
Class Interval	Mid-point (m)	f	fm
0 – 10	5	6	30
10 – 20	15	5	75
20 – 30	25	8	200
30 – 40	35	15	525
40 – 50	45	7	315
	N = 41	Σfm = 1145
X̄ = Σfm / N = 1145 / 41 = 27.93
2. Short-cut Method: The formula is:
X̄ = A + (Σfd / N)
Where, X̄ = Arithmetic Mean; A = Assumed mean; Σfd = sum of the products of the deviations d = m − A and the frequencies; N = Total frequency.
Class Interval	m	d = m − A (A = 25)	f	fd
0 – 10	5	−20	6	−120
10 – 20	15	−10	5	−50
20 – 30	25	0	8	0
30 – 40	35	10	15	150
40 – 50	45	20	7	140
	N = 41		Σfd = +120
X̄ = A + (Σfd / N) = 25 + 120/41 = 25 + 2.93 = 27.93
3. Step Deviation Method: The formula is:
X̄ = A + (Σfd′ / N) × C
Where, X̄ = Arithmetic Mean; A = Assumed mean; Σfd′ = sum of the products of the step deviations d′ = (m − A)/C and the frequencies; N = Total frequency; C = class width.
Here A = 25; C = 10; Σfd′ = 12
X̄ = A + (Σfd′ / N) × C = 25 + (12/41) × 10 = 25 + 2.93 = 27.93
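The grouped-data mean above can be checked with a short Python sketch using class mid-points (the lists below restate the worked example; nothing beyond the standard library is assumed).

```python
# Class intervals 0-10, 10-20, 20-30, 30-40, 40-50 with their frequencies.
midpoints = [5, 15, 25, 35, 45]
freq = [6, 5, 8, 15, 7]

N = sum(freq)
mean = sum(f * m for f, m in zip(freq, midpoints)) / N
print(round(mean, 2))   # 27.93
```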
2.2 MEDIAN
Median is the value of the item that divides the series into two equal parts. It may be defined as the value of that item which divides the series into two equal parts, one half containing values greater than it and the other half containing values less than it. Therefore, the series has to be arranged in ascending or descending order before finding the median. If the items of a series are arranged in ascending or descending order of magnitude, the item which falls in the middle of it is called the median.
Hence it is the "middle-most" or "most central" value of a set of numbers.
Median (Individual Series):
Median = Size of ((N + 1)/2)th item
Illustration 1: For a series of 5 items arranged in ascending order,
Median = Size of ((5 + 1)/2)th item = Size of 3rd item = 15.
Illustration 2: Find out the median of the following items. X: 8, 10, 5, 9, 12, 11.
S. No. X
1 5
2 8
3 9
4 10
5 11
6 12
Median = Size of ((N + 1)/2)th item
= Size of ((6 + 1)/2)th item
= Size of 3.5th item = (3rd item + 4th item)/2 = (9 + 10)/2 = 9.5
Size of shoes	f	c.f.
5 10 10
5.5 16 26
6 28 54
6.5 15 69
7 30 99
7.5 40 139
8 34 173
Median = Size of ((N + 1)/2)th item
= Size of ((173 + 1)/2)th item
= Size of 87th item
= 7 (the 87th item falls against the cumulative frequency 99, i.e. size 7)
Marks 10 – 25 25 - 40 40 – 55 55 - 70 70 – 85 85 - 100
Frequency 6 20 44 26 3 1
x F c.f
10 – 25 6 6
25 – 40 20 26
40 – 55 44 70
55 – 70 26 96
70 – 85 3 99
85 - 100 1 100
Median = L + ((N/2 − c.f.)/f) × i
N/2 = 100/2 = 50; L = 40; f = 44; c.f. = 26; i = 15
Median = 40 + ((50 − 26)/44) × 15
= 40 + 8.18
= 48.18 marks
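As a sketch (restating the grouped marks above), the interpolation formula for the median of a continuous series can be coded as follows.

```python
# Grouped marks: lower class boundaries, common class width and frequencies.
lower_bounds = [10, 25, 40, 55, 70, 85]
width = 15
freq = [6, 20, 44, 26, 3, 1]

N = sum(freq)
half = N / 2                      # 50

# Locate the median class: first class whose cumulative frequency reaches N/2.
cum = 0
for L, f in zip(lower_bounds, freq):
    if cum + f >= half:
        median = L + (half - cum) / f * width
        break
    cum += f

print(round(median, 2))           # 48.18
```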
Merits:
1. It is easy to compute and understand.
2. It eliminates the effect of extreme items.
3. The value of median can be located graphically.
4. It is amenable to further algebraic processes, as it is used in the measurement of dispersion.
5. It can be computed even if the items at the extremes are unknown.
Demerits:
1. For calculating median, it is necessary to arrange the data; other averages do not need any
arrangement.
2. Typical representative of the observations cannot be computed if the distribution of item is
irregular.
3. It is affected more by fluctuation of sampling than the arithmetic mean.
2.3 MODE
Mode is the value which occurs with the greatest frequency in a series. It is derived from the French word "la mode", meaning the fashion. It is the most fashionable or typical value of a distribution, because it is repeated the highest number of times in the series.
Mode or the modal value is defined as the value of the variable which occurs most frequently in a distribution.
Types of Mode:
i) Unimodal:
If there is only one mode in a series, it is called unimodal.
ii) Bi-modal:
If there are two modes in the series, it is called bi-modal.
Eg., 20, 25, 30, 30, 15, 10, 25 (Modes are 25, 30)
iii) Tri-modal:
If there are three modes in the series, it is called Tri - modal.
Eg., 60, 40, 85, 30, 85, 45, 80, 80, 55, 50, 60 (Modes are 60, 80, 85)
iv) Multi – modal:
If there are more than three modes in the series it is called multi-modal.
Merits:
1. It can be easily ascertained without much mathematical calculation.
2. It is not essential to know all the items in a series to compute mode.
3. Open – end classes do not disturb the position of the mode.
4. Its values can be ascertained graphically as well as empirically.
5. It may very well be applied to qualitative as well as quantitative data.
6. It is not affected by extreme values, unlike the arithmetic average.
Demerits:
1. The mode becomes less useful as an average when the distribution is bi-modal.
2. It is not suitable for further mathematical treatment.
3. It is stable only when the sample is large.
4. Mode is influenced by the magnitude of the class intervals.
Mode – Individual Series
Illustration 1: Calculate the mode from the following data of the marks obtained by 10 students.
Serial No.	1	2	3	4	5	6	7	8	9	10
Marks obtained 60 77 74 62 77 77 70 68 65 80
Solution:
Marks obtained by the 10 students: 60, 77, 74, 62, 77, 77, 70, 68, 65 and 80.
Since 77 occurs the greatest number of times (three times), the mode is 77 marks.
DISCRETE SERIES:
Column 1: The original frequencies.
Column 2: Frequencies are grouped in twos, adding frequencies of items 1 and 2; 3 and 4; 5 and 6; and so on.
Column 3: Leave the first frequency and then add the remaining in twos.
Column 4: Frequencies are grouped in threes.
Column 5: Leave the first frequency and then group the remaining in threes.
Column 6: Leave the first two frequencies and then group the remaining in threes.
The maximum frequencies in all six columns are marked with a circle and an analysis table is prepared
as follows:
Illustration: 2. Calculate the mode from the following:
Size 10 11 12 13 14 15 16 17 18
Frequency 10 12 15 19 20 8 4 3 2
Grouping Table
Size:      10  11  12  13  14  15  16  17  18
Frequency: 10  12  15  19  20   8   4   3   2
Column 2 (in twos): 10+12 = 22; 15+19 = 34; 20+8 = 28; 4+3 = 7
Column 3 (leaving the first, in twos): 12+15 = 27; 19+20 = 39; 8+4 = 12; 3+2 = 5
Column 4 (in threes): 10+12+15 = 37; 19+20+8 = 47; 4+3+2 = 9
Column 5 (leaving the first, in threes): 12+15+19 = 46; 20+8+4 = 32
Column 6 (leaving the first two, in threes): 15+19+20 = 54; 8+4+3 = 15
Analysis Table
Size:                                         11  12  13  14  15
Times in a maximum-frequency group:            1   3   5   4   1
The mode is 13, as this size is repeated 5 times in the analysis table. Through mere inspection one might say the mode is 14, because the size 14 occurs 20 times; but this wrong decision is corrected by the analysis table.
Mode (Continuous Series):
Z = L1 + ((f1 − f0) / (2f1 − f0 − f2)) × i
Where, Z = Mode; L1 = Lower limit of the modal class; f1 = Frequency of the modal class; f0 = Frequency of the class preceding the modal class; f2 = Frequency of the class succeeding the modal class; i = Class interval.
Solution:
Grouping Table
Class:      0–5  5–10  10–15  15–20  20–25  25–30  30–35  35–40  40–45
Frequency:   20    24     32     28     20     16     34     10      8
Column 2 (in twos): 20+24 = 44; 32+28 = 60; 20+16 = 36; 34+10 = 44
Column 3 (leaving the first, in twos): 24+32 = 56; 28+20 = 48; 16+34 = 50; 10+8 = 18
Column 4 (in threes): 20+24+32 = 76; 28+20+16 = 64; 34+10+8 = 52
Column 5 (leaving the first, in threes): 24+32+28 = 84; 20+16+34 = 70
Column 6 (leaving the first two, in threes): 32+28+20 = 80; 16+34+10 = 60
Analysis Table
Class:                                        0–5  5–10  10–15  15–20  20–25  30–35
Times in a maximum-frequency group:             1     3      5      3      1      1
The class 10 – 15 appears the maximum number of times (5), so the modal class is 10 – 15.
Z = L1 + ((f1 − f0) / (2f1 − f0 − f2)) × i = 10 + ((32 − 24) / (2 × 32 − 24 − 28)) × 5 = 10 + (8/12) × 5 = 10 + 3.33 = 13.33
The three averages (mean, median and mode) are identical, when the distribution is
symmetrical. In an asymmetrical distribution, the values of mean, median and mode are not
equal. In a moderately asymmetrical distribution the distance between the mean and the median is about one-third of the distance between the mean and the mode, i.e. approximately Mode = 3 Median − 2 Mean.
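A minimal sketch of the interpolation formula for the mode of a continuous series is given below; the class values used simply restate the modal class identified in the grouping-table example above.

```python
def grouped_mode(L1, f1, f0, f2, i):
    """Interpolated mode: Z = L1 + (f1 - f0) / (2*f1 - f0 - f2) * i."""
    return L1 + (f1 - f0) / (2 * f1 - f0 - f2) * i

# Modal class 10-15 (f1 = 32), preceding class frequency f0 = 24,
# succeeding class frequency f2 = 28, class interval i = 5.
print(round(grouped_mode(10, 32, 24, 28, 5), 2))   # 13.33
```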
2.4 GEOMATRIC
The geometric mean, G, of a set of n positive values X1, X2, ……,Xn is the Nth
root of the product of N items. Mathematically the formula for geometric mean will be as
follows;
G. M = 𝒏√𝑿𝟏,𝑿𝟐, … , 𝑿𝒏 = (𝑿𝟏, 𝑿𝟐, … … , 𝑿𝒏𝟏/𝒏
G.M = Geometric Mean; n = number of items; X1, X2, X3,….. = are various values
Illustration1: The geometric mean of the values 2, 4 and 8 is the cubic root of 2 x 4 x 8 or
In practice, it is difficult to extract higher roots. The geometric mean is, therefore, computed
using logarithms. Mathematically, it will be represented as follows;
Here we assume that all the values are positive, otherwise the logarithms will be not defined.
X log of X
50 1.6990
72 1.8573
54 1.7324
82 1.9138
93 1.9685
Σ log X = 9.1710
G.M. = ⁵√(50 × 72 × 54 × 82 × 93), or, using logarithms,
G.M. = Antilog(Σ log X / N)
= Antilog(9.1710 / 5)
= Antilog 1.8342
= 68.26
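A brief check of the logarithmic calculation in Python (standard library only; the values restate the example above):

```python
import math

values = [50, 72, 54, 82, 93]

# G.M. = antilog( sum(log10 X) / N )
log_sum = sum(math.log10(x) for x in values)
gm = 10 ** (log_sum / len(values))

print(round(gm, 2))   # about 68.26
```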
For a discrete (frequency) series: G.M. = Antilog(Σ f log X / N)
Illustration 3: The following table gives the weight of 31 persons in a sample survey.
Calculate geometric mean.
Weight (lbs) 130 135 140 145 146 148 149 150 157
No. of persons 3 4 6 6 3 5 2 1 1
Solution:
Weight (X)	f	log X	f log X
130	3	2.1139	6.3418
135	4	2.1303	8.5213
140	6	2.1461	12.8766
145	6	2.1614	12.9682
146	3	2.1644	6.4931
148	5	2.1703	10.8514
149	2	2.1732	4.3465
150	1	2.1761	2.1761
157	1	2.1959	2.1959
N = Σf = 31		Σ f log X = 66.7710
G.M. = Antilog(Σ f log X / N) = Antilog(66.7710 / 31) = Antilog 2.1539 ≈ 142.5 lbs
For a continuous series: G.M. = Antilog(Σ f log m / N)
Where, f = frequency; m = mid-value of the class; log m = logarithm of each mid-value; N = total frequency.
13.5 – 16.5 19
16.5 – 19.5 23
19.5 – 22.5 7
22.5 – 25.5 4
25.5 – 28.5 1
Solution (extract): for the class 19.5 – 22.5, f = 7, m = 21, log m = 1.3222 and f log m = 9.2554; proceeding similarly for the other classes,
G.M. = Antilog(Σ f log m / N)
= 16.02 maunds
Uses:
• Geometric mean is highly useful in averaging ratios, percentages and rate of increase between two
periods.
• In economic and social sciences, where we want to give more weight to smaller items and smaller
weight to large items, geometric mean is appropriate.
• It is the only useful average that can be employed to indicate rate of change.
Merits:
Demerits:
2.5 HARMONIC MEAN
Harmonic Mean, like the geometric mean, is a measure of central tendency used in solving special types of problems.
Harmonic Mean is the reciprocal of the arithmetic average of the reciprocals of the values of the various items in the variable. The reciprocal of a number is the value obtained by dividing one by the number. For example, the reciprocal of 5 is 1/5. Reciprocals can be obtained from reciprocal tables.
H.M. = N / Σ(1/X)
Where N = number of items and X1, X2, X3, …, Xn refer to the values of the various items in the observations.
Illustration 1: From the following data compute the value of the harmonic mean (family incomes in Rs.):
Family: 1 2 3 4 5 6 7 8 9 10
Income: 85 70 10 75 500 8 42 250 40 36
Family	Income (X)	Reciprocal (1/X)
1	85	1/85 = 0.0118
2	70	1/70 = 0.0143
3	10	1/10 = 0.1000
4	75	1/75 = 0.0133
5	500	1/500 = 0.0020
6	8	1/8 = 0.1250
7	42	1/42 = 0.0232
8	250	1/250 = 0.0040
9	40	1/40 = 0.0250
10	36	1/36 = 0.0278
N = 10		Σ(1/X) = 0.3464
H.M. = N / Σ(1/X) = 10 / 0.3464 = Rs. 28.87
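A direct check in Python (using the raw incomes rather than reciprocals rounded to four places, so the result differs slightly from the tabulated figure):

```python
incomes = [85, 70, 10, 75, 500, 8, 42, 250, 40, 36]

# Harmonic mean: N divided by the sum of reciprocals.
hm = len(incomes) / sum(1 / x for x in incomes)
print(round(hm, 2))   # about 28.82 (the worked example, with rounded reciprocals, gets 28.87)
```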
Size of items: 6 7 8 9 10 11
Frequency: 4 6 9 5 2 8
Size (x)	f	1/x	f(1/x)
6	4	0.1667	0.6668
7	6	0.1429	0.8574
8	9	0.1250	1.1250
9	5	0.1111	0.5555
10	2	0.1000	0.2000
11	8	0.0909	0.7272
N = Σf = 34		Σ f(1/x) = 4.1319
H.M. = N / Σ f(1/x) = 34 / 4.1319 = 8.23
Size: 0 – 10 10 – 20 20 – 30 30 – 40 40 – 50
Frequency: 5 8 12 6 4
UNIT III
MEASURES OF VARIATION
There are various methods of studying variation or dispersion. The important methods of studying dispersion are as follows:
1. Range
2. Inter - quartile range
3. Mean Deviation
4. Standard Deviation
5. Lorenz curve
1. Range
Range is the simplest and crudest measure of dispersion. It is a rough measure of dispersion.
It is the difference between the highest and the lowest value in the distribution.
Range = L – S (where L = the largest value and S = the smallest value)
Coefficient of Range = (L – S) / (L + S)
Illustration 1:
Find the range of weights of 7 students from the following.
Here L = 43; S = 27
Range = 43 – 27 = 16
Coefficient of Range = (L – S) / (L + S) = 16 / 70 = 0.23
Advantages
Disadvantages
1. Range is completely dependent on the two extreme values.
2. It is subject to fluctuations of considerable magnitude from sample to sample.
3. Range cannot tell us anything about the character of the distribution.
3.1 Quartile Deviation
In a series there are four quartiles. By eliminating the lowest 25% and the highest 25% of the items of a series we obtain a measure of dispersion as half the distance between the first and the third quartiles, i.e. half of (Q3 − Q1), where Q3 is the third quartile and Q1 the first quartile. The inter-quartile range (Q3 − Q1) is reduced to the semi-inter-quartile range, or quartile deviation, by dividing it by 2:
Q.D. = (Q3 − Q1) / 2
Quartile Deviation – Individual Series Illustration 2: Find out the value of Quartile Deviation
and its coefficient from the following data:
Roll No. 1 2 3 4 5 6 7
Marks 20 28 40 30 50 60 52
Solution: Arranging the marks in ascending order: 20, 28, 30, 40, 50, 52, 60 (N = 7).
Q1 = size of ((N + 1)/4)th item = size of ((7 + 1)/4)th item = size of 2nd item = 28
Q3 = size of (3(N + 1)/4)th item = size of 6th item = 52
Q.D. = (Q3 − Q1) / 2 = (52 − 28) / 2 = 12
Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1) = 24 / 80 = 0.3
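A small sketch verifying the quartile positions used above (it applies the (N + 1)/4 rule from the text rather than any library quantile convention):

```python
marks = [20, 28, 40, 30, 50, 60, 52]

data = sorted(marks)
n = len(data)

# Positions by the (N + 1)/4 rule; here both land exactly on whole items.
q1 = data[int((n + 1) / 4) - 1]          # 2nd item -> 28
q3 = data[int(3 * (n + 1) / 4) - 1]      # 6th item -> 52

qd = (q3 - q1) / 2
coeff = (q3 - q1) / (q3 + q1)
print(qd, round(coeff, 2))               # 12.0 0.3
```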
Quartile Deviation – Discrete Series Illustration 3: Find out the value of Quartile Deviation and its
coefficient from the following data:
Age in years 20 30 40 50 60 70 80
No. of members 3 61 132 153 140 51 3
Solution:
Calculation of Q.D.
x F c.f.
20 3 3
30 61 64
40 132 196
50 153 349
60 140 489
70 51 540
80 3 543
Q1 = value of ((N + 1)/4)th item = value of ((543 + 1)/4)th item = value of the 136th item = 40 years
Q3 = value of (3(N + 1)/4)th item = value of the 408th item = 60 years
Q.D. = (Q3 − Q1) / 2 = (60 − 40) / 2 = 10 years
Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1) = 20 / 100 = 0.2
Quartile Deviation – Continuous Series Illustration 4: Find out the value of Quartile Deviation and
its coefficient from the following data:
Wages (Rs.) 30 – 32 32 – 34 34 - 36 36 - 38 38 - 40 40 - 42 42 - 44
Labourers 12 18 16 14 12 8 6
Wages (x) Labourers (f) c.f.
30 – 32 12 12
32 – 34 18 30
34 – 36 16 46
36 – 38 14 60
38 – 40 12 72
40 – 42 8 80
42 – 44 6 86
Q1 = size of (N/4)th item = size of (86/4)th = 21.5th item, which lies in the class 32 – 34
Q1 = L + ((N/4 − c.f.) / f) × i = 32 + ((21.5 − 12) / 18) × 2 = 32 + 1.06 = 33.06
Q3 = size of (3N/4)th item = size of the 64.5th item, which lies in the class 38 – 40
Q3 = L + ((3N/4 − c.f.) / f) × i = 38 + ((64.5 − 60) / 12) × 2 = 38 + 0.75 = 38.75
Q.D. = (Q3 − Q1) / 2 = (38.75 − 33.06) / 2 = 2.85
Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1) = 5.69 / 71.81 = 0.08
Merits:
1. It is simple to calculate.
2. It is easy to understand.
3. The risk of extreme-item variation is eliminated, as it depends upon the central 50 per cent of the items.
Demerits
3.2 Mean Deviation
Mean deviation is the arithmetic average of the absolute deviations of the items from an average (the mean, median or mode):
M.D. = Σ|D| / N, where |D| = |X − average| (mean or median or mode)
Coefficient of M.D. = M.D. / (the average used)
Illustration 5: Calculate mean deviation from mean and median for the following data:
100	150	200	250	360	490	500	600	671
Solution: Mean X̄ = ΣX / N = 3321 / 9 = 369; Median = value of ((N + 1)/2)th item = value of the 5th item = 360.
X	|X − 369|	|X − 360|
100	269	260
150	219	210
200	169	160
250	119	110
360	9	0
490	121	130
500	131	140
600	231	240
671	302	311
	Σ|X − X̄| = 1570	Σ|X − Median| = 1561
M.D. from mean = 1570 / 9 = 174.44
M.D. from median = 1561 / 9 = 173.44
Illustration 6: Calculate mean deviation from mean from the following data:
X 2 4 6 8 10
F 1 4 6 4 1
x	f	fx	|D| = |x − 6|	f|D|
2	1	2	4	4
4	4	16	2	8
6	6	36	0	0
8	4	32	2	8
10	1	10	4	4
N = Σf = 16	Σfx = 96		Σf|D| = 24
Mean X̄ = Σfx / N = 96 / 16 = 6
M.D. (from mean) = Σf|D| / N = 24 / 16 = 1.5
Coefficient of M.D. = M.D. / Mean = 1.5 / 6 = 0.25
Illustration 7: Calculate mean deviation from mean from the following
data:
Class	2 – 4	4 – 6	6 – 8	8 – 10
Frequency	3	4	2	1
Solution:
x	m	f	fm	|D| = |m − X̄|	f|D|
2 – 4	3	3	9	2.2	6.6
4 – 6	5	4	20	0.2	0.8
6 – 8	7	2	14	1.8	3.6
8 – 10	9	1	9	3.8	3.8
	N = Σf = 10	Σfm = 52		Σf|D| = 14.8
Mean X̄ = Σfm / N = 52 / 10 = 5.2
M.D. (from mean) = Σf|D| / N = 14.8 / 10 = 1.48
Coefficient of M.D. = M.D. / Mean = 1.48 / 5.2 = 0.28
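A short Python check of the discrete-series calculation in Illustration 6 (standard library only):

```python
values = [2, 4, 6, 8, 10]
freq = [1, 4, 6, 4, 1]

N = sum(freq)
mean = sum(f * x for f, x in zip(freq, values)) / N            # 6.0

# Mean deviation from the mean and its coefficient.
md = sum(f * abs(x - mean) for f, x in zip(freq, values)) / N  # 1.5
print(mean, md, md / mean)                                     # 6.0 1.5 0.25
```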
Merits
1. It is clear and easy to understand.
2. It is based on each and every item of the data.
3. It can be calculated from any measure of central tendency and as such is flexible.
4. It is not disturbed by the values of extreme items, as in the case of the range.
Demerits:
1. It is not suitable for further mathematical processing.
2. It is rarely used in sociological studies.
3.3 Standard Deviation
Karl Pearson introduced the concept of standard deviation in 1893. Standard deviation is the square root of the mean of the squared deviations from the arithmetic mean, so it is also called the root-mean-square deviation (or mean error, or mean square error). The standard deviation is denoted by the small Greek letter σ (read as sigma).
For an individual series: σ = √( Σ(X − X̄)² / N )
Values (X)	X²	(X − X̄), i.e. (X − 15)	(X − X̄)²
14	196	−1	1
22	484	7	49
9	81	−6	36
15	225	0	0
20	400	5	25
17	289	2	4
12	144	−3	9
11	121	−4	16
ΣX = 120	ΣX² = 1940		Σ(X − X̄)² = 140
Here X̄ = ΣX / N = 120 / 8 = 15.
σ = √( Σ(X − X̄)² / N ) = √(140 / 8) = √17.5 = 4.18
Alternatively, the standard deviation can be found from the values directly, without taking deviations:
σ = √( ΣX²/N − (ΣX/N)² )
= √(1940/8 − 15²)
= √(242.5 − 225)
= √17.5
= 4.18
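Both routes can be checked with Python's statistics module; note that statistics.pstdev gives the population standard deviation (divisor N), which is what the formulas above use.

```python
import statistics

values = [14, 22, 9, 15, 20, 17, 12, 11]

# Population standard deviation: sigma = sqrt(sum((x - mean)^2) / N)
sigma = statistics.pstdev(values)

# The "no deviations" form: sqrt(mean of squares - square of mean)
n = len(values)
sigma_alt = (sum(x * x for x in values) / n - (sum(values) / n) ** 2) ** 0.5

print(round(sigma, 2), round(sigma_alt, 2))   # 4.18 4.18
```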
Assumed Mean (Short-cut) Method:
σ = √( Σd²/N − (Σd/N)² ), where d = X − A and A is the assumed mean.
Illustration 9: Calculate the standard deviation of the following values: 30, 43, 45, 55, 68, 69, 75.
Solution:
Calculation of standard deviation from assumed mean
X	d = X − A = X − 55	d²
30 -25 625
43 -12 144
45 -10 100
55 0 0
68 13 169
69 14 196
75 20 400
N = 7	Σd = 0	Σd² = 1634
σ = √( Σd²/N − (Σd/N)² ) = √(1634/7 − 0) = √233.43 = 15.28
Illustration 10:
Marks 10 20 30 40 50 60
No. of students 8 12 20 10 7 3
Solution:
Direct method (deviations from the actual mean X̄ = Σfx/N = 1850/60 = 30.8):
x	f	fx	(x − X̄)	(x − X̄)²	f(x − X̄)²
10	8	80	−20.8	432.64	3461.12
20	12	240	−10.8	116.64	1399.68
30	20	600	−0.8	0.64	12.80
40	10	400	9.2	84.64	846.40
50	7	350	19.2	368.64	2580.48
60	3	180	29.2	852.64	2557.92
N = 60	Σfx = 1850			Σf(x − X̄)² = 10858.40
Mean: X̄ = 1850 / 60 = 30.8
σ = √( Σf(x − X̄)²/N ) = √(10858.40/60) = √180.97 = 13.45
Standard deviation by the assumed-mean (short-cut) method:
σ = √( Σfd²/N − (Σfd/N)² ), where d = x − A (A = 30)
x	f	d = x − 30	d²	fd	fd²
10	8	−20	400	−160	3200
20	12	−10	100	−120	1200
30	20	0	0	0	0
40	10	10	100	100	1000
50	7	20	400	140	2800
60	3	30	900	90	2700
N = Σf = 60			Σfd = 50	Σfd² = 10900
σ = √( Σfd²/N − (Σfd/N)² )
= √(10900/60 − (50/60)²)
= √(181.67 − 0.69)
= √180.98
= 13.45
Step Deviation Method: σ = √( Σfd′²/N − (Σfd′/N)² ) × C, where d′ = (x − A)/C.
Illustration 12: Calculate the standard deviation of the data in Illustration 10 by the step deviation method.
Solution:
Calculation of standard deviation (from step deviation)
x	f	d′ = (x − 30)/10	d′²	fd′	fd′²
10	8	−2	4	−16	32
20	12	−1	1	−12	12
30	20	0	0	0	0
40	10	1	1	10	10
50	7	2	4	14	28
60	3	3	9	9	27
N = Σf = 60			Σfd′ = 5	Σfd′² = 109
σ = √( Σfd′²/N − (Σfd′/N)² ) × C
= √(109/60 − (5/60)²) × 10
= √(1.817 − 0.0069) × 10
= √1.81 × 10 = 1.345 × 10 = 13.45
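A compact Python check of the grouped standard deviation above (the direct frequency-weighted formula gives the same value as the step-deviation shortcut):

```python
x = [10, 20, 30, 40, 50, 60]
f = [8, 12, 20, 10, 7, 3]

N = sum(f)
mean = sum(fi * xi for fi, xi in zip(f, x)) / N

# sigma = sqrt( sum f*(x - mean)^2 / N )
variance = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x)) / N
print(round(mean, 1), round(variance ** 0.5, 2))   # 30.8 13.45
```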
Class 0 - 10 10 - 20 20 - 30 30 - 40 40-50
Frequency 5 8 15 16 6
Solution:
Computation of standard deviation
Class (x)	m	f	d = (m − 25)/10	d²	fd	fd²
0 - 10 5 5 -2 4 -10 20
10 - 20 15 8 -1 1 -8 8
20 - 30 25 15 0 0 0 0
30 - 40 35 16 1 1 16 16
40 - 50	45	6	2	4	12	24
N = Σf = 50			Σfd = 10	Σfd² = 68
σ = √( Σfd²/N − (Σfd/N)² ) × C = √(68/50 − (10/50)²) × 10 = √(1.36 − 0.04) × 10 = √1.32 × 10 = 11.49
Merits:
1. It is rigidly defined and its value is always definite.
2. It is based on all the observations of a series.
3. It is less affected by fluctuations of sampling and hence stable.
4. It is amenable to algebraic treatment and is less affected by fluctuations of sampling than most other measures of dispersion.
5. The standard deviation is more appropriate mathematically than the mean deviation, since the negative signs are removed by squaring the deviations rather than by ignoring them.
Demerits:
1. It lacks wide popularity, as it is often difficult to compute; when big numbers are involved, the process of squaring and extracting roots becomes tedious.
3. It is difficult to calculate accurately when a grouped frequency distribution has extreme groups with
no definite range.
Uses:
1. Standard deviation is the best measure of dispersion.
2. It is widely used in statistics because it possesses most of the characteristics of an ideal measure of
dispersion.
3. It is widely used in sampling theory and by biologists.
4. It is used in the coefficient of correlation and in the study of symmetrical frequency distributions.
Coefficient of Variation
The standard deviation is an absolute measure of dispersion. The corresponding relative measure is known as the coefficient of variation. It is used to compare the variability of two or more series. The series for which the coefficient of variation is greater is said to be more variable, or conversely less consistent, less uniform, less stable or less homogeneous.
Variance = σ²
Coefficient of Variation (C.V.) = (σ / X̄) × 100
Illustration 14: The following are the runs scored by two batsmen A and B in ten innings:
A 101 27 0 36 82 45 7 13 65 14
B 97 12 40 96 13 8 85 8 56 15
Solution:
Batsman A (X)	X − 39	(X − 39)²	Batsman B (Y)	Y − 43	(Y − 43)²
101	62	3844	97	54	2916
27	−12	144	12	−31	961
0	−39	1521	40	−3	9
36	−3	9	96	53	2809
82	43	1849	13	−30	900
45	6	36	8	−35	1225
7	−32	1024	85	42	1764
13	−26	676	8	−35	1225
65	26	676	56	13	169
14	−25	625	15	−28	784
ΣX = 390		Σ(X − X̄)² = 10404	ΣY = 430		Σ(Y − Ȳ)² = 12762
Batsman A: X̄ = 390/10 = 39; σ = √(10404/10) = 32.26; C.V. = (32.26/39) × 100 = 82.72%
Batsman B: Ȳ = 430/10 = 43; σ = √(12762/10) = 35.72; C.V. = (35.72/43) × 100 = 83.07%
Batsman A is more consistent in his batting, because the co-efficient of variation of runs is less for
him.
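The comparison can be reproduced in a few lines (population standard deviation, divisor N, as in the worked example):

```python
import statistics

a = [101, 27, 0, 36, 82, 45, 7, 13, 65, 14]
b = [97, 12, 40, 96, 13, 8, 85, 8, 56, 15]

def cv(runs):
    """Coefficient of variation = (sigma / mean) * 100."""
    return statistics.pstdev(runs) / statistics.mean(runs) * 100

print(round(cv(a), 2), round(cv(b), 2))   # 82.71 and 83.08 (the text rounds to 82.72% and 83.07%)
```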
UNIT IV
SIMPLE CORRELATION
Meaning:
Correlation refers to the relationship of two or more variables. For example, there exists some
relationship between the height of a mother and the height of a daughter, sales and cost and so on. Hence,
it should be noted that the detection and analysis of correlation between two statistical variables requires
relationship of some sort which associates the observation in pairs, one of each pair being a value of the
two variables. The word relationship is important and indicates that there is some connection between
the variables under observation. Thus, the association of any two variates is known as correlation.
Significance:
Correlation is useful in physical and social sciences. We can study the uses of correlation in business
and economics. The following are the significance of study of correlation:
➢ Correlation is very useful to economics to study the relationship between variables, like price and
quantity demanded. To the businessmen, it helps to estimate costs, sales, price and other related
variables.
➢ Some variables show some kind of relationship; correlation analysis helps in measuring the degree
of relationship between the variables like supply and demand, price and supply, income and
expenditure, etc.
➢ The relation between variables can be verified and tested for significance, with the help of the
correlation analysis. The effect of correlation is to reduce the range of uncertainty of our prediction.
➢ The coefficient of correlation is a relative measure and we can compare the relationship between
variables which are expressed in different units.
➢ Sampling error can also be calculated.
➢ Correlation is the basis for the concept of regression and ratio of variation.
Types of Correlation:
Correlation is classified into many types, but the important ones are:
The correlation is said to be positive when the values of two variables move in the same
direction, so that an increase in the value of one variable is accompanied by an increase in the value of
the other variable or a decrease in the value of one variable is followed by a decrease in the value of
the other variable. Example: Height and weight, rainfall and yield of crops, etc.,
The correlation is said to be negative when the values of two variables move in opposite
direction, so that an increase or decrease in the values of one variable is followed by a decrease or
increase in the value of the other. Example: Price and demand, yield of crops and price, etc.,
When more than two variables are studied simultaneously, the correlation is said to be
multiple correlation. Example: the relationship of price, demand and supply of a commodity.
Example: When we study the relationship between the yield of rice per acre and both the amount of rainfall and the amount of fertilizer used, and we limit our correlation analysis to yield and rainfall only, it becomes a problem of simple correlation.
The correlation is said to be linear, if the amount of change in one variable tends to bear a
constant ratio to the amount of change in the other variable.
The correlation is non-linear, if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other related variable.
Studying Correlation:
The following correlation methods are used to find out the relationship between two variables.
A. Graphic Method:
   i. Scatter diagram (or scattergram) method
   ii. Simple graph or correlogram method
B. Mathematical Method:
   i. Karl Pearson's coefficient of correlation
   ii. Spearman's rank correlation coefficient
   iii. Coefficient of concurrent deviations
   iv. Method of least squares
A. Graphic Method
i. Scatter Diagram Method
This is the simplest method of finding out whether there is any relationship present between two
variables by plotting the values on a chart, known as scatter diagram. In this method, the given data are
plotted on a graph paper in the form of dots. X variables are plotted on the horizontal axis and Y variables
on the vertical axis. Thus we have the dots and we can know the scatter or concentration of various points.
If the plotted dots fall in a narrow band and the dots are rising from the lower left hand corner to
the upper right-hand corner it is called high degree of positive correlation.
If the plotted dots fall in a narrow band from the upper left hand corner to the lower right
hand corner it is called a high degree of negative correlation.
If the plotted dots lie scattered all over the diagram, there is no correlation between the two
variables.
Merits:
1. It is easy to plot even by beginner.
2. It is simple to understand.
3. Abnormal values in a sample can be easily detected.
4. Values of some dependent variables can be found out.
Demerits:
1. Degree of correlation cannot be predicted.
2. It gives only a rough idea.
3. The method is useful only when the number of terms is small.
ii. Simple Graph Method of Correlation:
In this method separate curves are drawn for separate series on a graph paper. By examining the
direction and closeness of the two curves we can infer whether or not variables are related. If both the
curves are moving in the same direction correlation is said to be positive. On the other hand, if the curves
are moving in the opposite directions is said to be negative.
Merits:
1. It is easy to plot
2. It is simple to understand.
3. Abnormal values can easily be detected.
Demerits:
1. This method is useless when the number of terms is very big.
2. Degree of correlation cannot be predicted.
B. Mathematical Method:
Karl Pearson, a great biometrician and statistician, introduced a mathematical method for
measuring the magnitude of linear relationship between two variables. This method is most widely used in
practice. This method is known as the Pearsonian coefficient of correlation. It is denoted by the symbol r; the formula for calculating Pearsonian r is:
r = Σxy / √(Σx² × Σy²), where x = X − X̄ and y = Y − Ȳ
The value of the coefficient of correlation shall always lie between +1 and -1.
When r = +1, then there is perfect positive correlation between the variables.
When r = -1, then there is perfect negative correlation between the variables.
X	x = X − X̄ (= X − 99)	x²	Y	y = Y − Ȳ (= Y − 95)	y²	xy
100 1 1 98 3 9 3
101 2 4 99 4 16 8
102 3 9 99 4 16 12
102 3 9 97 2 4 6
100 1 1 95 0 0 0
99 0 0 92 -3 9 0
97 -2 4 95 0 0 0
98 -1 1 94 -1 1 1
96 -3 9 90 -5 25 15
95 -4 16 91 -4 16 16
ΣX = 990	Σx² = 54	ΣY = 950	Σy² = 96	Σxy = 61
X̄ = 990/10 = 99; Ȳ = 950/10 = 95
r = Σxy / √(Σx² × Σy²) = 61 / √(54 × 96) = 61 / 72 = +0.85
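The coefficient can be cross-checked with Python; the snippet below implements the deviation formula directly (numpy's corrcoef would give the same value).

```python
X = [100, 101, 102, 102, 100, 99, 97, 98, 96, 95]
Y = [98, 99, 99, 97, 95, 92, 95, 94, 90, 91]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(X, Y))
sxx = sum((a - mx) ** 2 for a in X)
syy = sum((b - my) ** 2 for b in Y)

r = sxy / (sxx * syy) ** 0.5
print(round(r, 2))   # 0.85
```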
Illustration 2: Calculate the coefficient of correlation for the following data.
X:	6	2	10	4	8
Y:	9	11	5	8	7
Solution:
X	X²	Y	Y²	XY
6 36 9 81 54
2 4 11 121 22
10 100 5 25 50
4 16 8 64 32
8 64 7 49 56
ΣX = 30	ΣX² = 220	ΣY = 40	ΣY² = 340	ΣXY = 214
r = (NΣXY − ΣX·ΣY) / √[(NΣX² − (ΣX)²)(NΣY² − (ΣY)²)]
= (5 × 214 − 30 × 40) / √[(5 × 220 − 900)(5 × 340 − 1600)]
= −130 / √(200 × 100) = −130 / 141.42 = −0.92
In 1904, a famous British psychologist Charles Edward Spearman found out the method of
ascertaining the coefficient of correlation by ranks. This method is based on rank. Rank correlation is
applicable only to individual observations. This measure is useful in dealing with qualitative characteristics
such as intelligence, beauty, morality, character, etc.,
Spearman's rank correlation coefficient is given by:
R = 1 − (6ΣD²) / (N³ − N), or equivalently R = 1 − (6ΣD²) / (N(N² − 1))
where D is the difference between the two ranks of each item and N is the number of pairs.
Illustration 3: Following are the rank obtained by 10 students in two subjects, Statistics and Mathematics.
To what extent the knowledge of the students in the two subjects is related?
Statistics 1 2 3 4 5 6 7 8 9 10
Mathematics 2 4 1 5 3 9 7 10 6 8
Solution:
Rank in Statistics (x)	Rank in Mathematics (y)	D = x − y	D²
1	2	−1	1
2	4	−2	4
3	1	+2	4
4	5	−1	1
5	3	+2	4
6	9	−3	9
7	7	0	0
8	10	−2	4
9	6	+3	9
10	8	+2	4
N = 10			ΣD² = 40
R = 1 − (6ΣD²) / (N³ − N) = 1 − (6 × 40) / (1000 − 10) = 1 − 240/990
= 1 − 0.24
= +0.76
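A minimal sketch of the rank-correlation formula in Python (no tied ranks here, so the simple formula applies):

```python
stats_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
maths_rank = [2, 4, 1, 5, 3, 9, 7, 10, 6, 8]

n = len(stats_rank)
d_sq = sum((a - b) ** 2 for a, b in zip(stats_rank, maths_rank))   # 40

R = 1 - 6 * d_sq / (n ** 3 - n)
print(round(R, 2))   # 0.76

# For raw scores (not ranks) or tied data, scipy.stats.spearmanr can be used instead.
```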
Illustration 4: From the following marks in Mathematics and Statistics, calculate the rank correlation coefficient.
Mathematics	85	60	73	40	90
Statistics 93 75 65 50 80
Solution:
Marks in Mathematics	Rank (x)	Marks in Statistics	Rank (y)	D = x − y	D²
85 2 93 1 +1 1
60 4 75 3 +1 1
73 3 65 4 -1 1
40 5 50 5 0 0
90	1	80	2	−1	1
N = 5, ΣD² = 4
R = 1 − (6ΣD²) / (N³ − N) = 1 − (6 × 4) / (125 − 5) = 1 − 0.2 = +0.8
Illustration 5: From the following data calculate the rank correlation coefficient after making adjustment
for tied ranks.
X 48 33 40 9 16 16 65 24 16 57
Y	13	13	24	6	15	4	20	9	6	19
Solution:
X	Rank (x)	Y	Rank (y)	D = R(x) − R(y)	D²
48	8	13	5.5	2.5	6.25
33	6	13	5.5	0.5	0.25
40	7	24	10	−3.0	9.00
9	1	6	2.5	−1.5	2.25
16	3	15	7	−4.0	16.00
16	3	4	1	2.0	4.00
65	10	20	9	1.0	1.00
24	5	9	4	1.0	1.00
16	3	6	2.5	0.5	0.25
57	9	19	8	1.0	1.00
					ΣD² = 41
In X the value 16 occurs three times (m = 3); in Y the value 13 occurs twice and the value 6 occurs twice (m = 2 each). Adding the correction factor (m³ − m)/12 for each tie:
R = 1 − 6[ΣD² + (3³ − 3)/12 + (2³ − 2)/12 + (2³ − 2)/12] / (N³ − N)
= 1 − 6[41 + 2 + 0.5 + 0.5] / 990 = 1 − 264/990
= +0.733
Merits:
1. It is simple to understand and easier to apply.
2. It can be used to any type of data, qualitative or quantitative.
3. It is the only method that can be used where we are given the ranks and not the actual data.
4. Even where actual data are given, the rank method can be applied for ascertaining correlation by assigning ranks to the data.
Demerits:
1. This method is not useful to find out correlation in a grouped frequency distribution.
2. For large samples it is not a convenient method. If the items exceed 30, the calculations become quite tedious and require a lot of time.
3. It is only an approximate measure, as the actual values are not used in the calculations.
4.4 Regression
The statistical method which helps us to estimate the unknown value of one variable from the known value of a related variable is called regression. The dictionary meaning of the word regression is "return or going back". In 1877, Sir Francis Galton first introduced the word 'regression'. The tendency to regress, or go back, was called by Galton the 'line of regression'.
The line describing the average relationship between two variables is known as the line of regression. The
regression analysis confined to the study of only two variables at a time is termed as simple regression.
The regression analysis for studying more than two variables at a time is known as multiple regressions.
Regression Vs Correlation:
Regression	Correlation
7. It has wider application, as it studies linear and non-linear relationships between the variables.	It has limited application, because it is confined only to the linear relationship between the variables.
8. It is widely used for further mathematical treatment.	It is not very useful for mathematical treatment.
9. It explains that the decrease in one variable is associated with the increase in the other variable.	If the coefficient of correlation is positive, then the two variables are positively correlated and vice versa.
10. There is a functional relationship between the two variables, so that we may distinguish between the independent and dependent variables.	It is immaterial whether X depends upon Y or Y depends upon X.
Linear Regression:
Linear regression attempts to model the relationship between two variables by fitting a linear
equation to observed data. One variable is considered to be an explanatory variable, and the other is
considered to be a dependent variable. For example, a modeler might want to relate the weights of
individuals to their heights using a linear regression model.
Before attempting to fit a linear model to observed data, a modeler should first determine whether
or not there is a relationship between the variables of interest. This does not necessarily imply that one
variable causes the other (for example, higher SAT scores do not cause higher college grades), but that
there is some significant association between the two variables. A scatter plot can be a helpful tool in
determining the strength of the relationship between two variables. If there appears to be no association
between the proposed explanatory and dependent variables (i.e., the scatter plot does not indicate any
increasing or decreasing trends), then fitting a linear regression model to the data probably will not provide
a useful model. A valuable numerical measure of association between two variables is the correlation
coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed
data for the two variables.
A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable
and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x =
0).
Regression lines:
If we take two variables X and Y we have two regression lines:
i) Regression of X on Y, and ii) Regression of Y on X.
The regression line of X on Y gives the most probable value of X for any given value of Y. The
regression of Y on X gives the most probable value of Y for any given value of X. There are two regression
lines in the case of two variables.
Regression Equations: The algebraic expressions of the two regression lines are called regression
equations.
Regression Equation of X on Y:
Xc = a + bY
To determine the values of 'a' and 'b', the following two normal equations are solved simultaneously:
ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
Regression Equation of Y on X:
Yc = a + bX
To determine the values of 'a' and 'b', the following two normal equations are solved simultaneously:
ΣY = Na + bΣX
ΣXY = aΣX + bΣX²
X 10 12 13 16 17 20 25
Y 10 22 24 27 29 33 37
Solution: Calculation of the regression equations
X	X²	Y	Y²	XY
10	100	10	100	100
12	144	22	484	264
13	169	24	576	312
16	256	27	729	432
17	289	29	841	493
20	400	33	1089	660
25	625	37	1369	925
ΣX = 113	ΣX² = 1983	ΣY = 182	ΣY² = 5188	ΣXY = 3186
Regression Equation of Y on X: Yc = a + bX
The two normal equations are:
ΣY = Na + bΣX  →  182 = 7a + 113b   …(1)
ΣXY = aΣX + bΣX²  →  3186 = 113a + 1983b   …(2)
Multiplying (1) by 113 and (2) by 7 and subtracting: −1112b = −1736, so b = 1.56
Substituting in (1): 7a + 113(1.56) = 182, so 7a = 5.72 and a = 0.82
Regression equation of Y on X: Y = 0.82 + 1.56X
The two normal equations are:
∑X = Na + b∑Y
79
N = 7; ∑X = 113; ∑ Y2 = 5188; ∑Y = 182;
80
7a + 182b = 113 …(1)
Multiplying (2), by 7,
3192 b = 1736
7a + 182(0.54) = 113
Regression equations using deviations from the actual means:
Regression equation of X on Y:
X − X̄ = r (σx/σy)(Y − Ȳ)
Where, X = the value of X to be estimated for the given Y value; X̄ = mean of the X variable; Y = the value of Y given in the problem; Ȳ = mean of the Y variable; r(σx/σy) = bxy, the regression coefficient of X on Y.
Regression equation of Y on X:
Y − Ȳ = r (σy/σx)(X − X̄)
where r(σy/σx) = byx, the regression coefficient of Y on X.
X 3 5 6 8 9 11
Y 2 3 4 6 5 10
X	x = X − X̄	x²	Y	y = Y − Ȳ	y²	xy
3 -4 16 2 -3 9 12
5 -2 4 3 -2 4 4
6 -1 1 4 -1 1 1
8 1 1 6 1 1 1
9 2 4 5 0 0 0
11 4 16 10 5 25 20
ΣX = 42	Σx = 0	Σx² = 42	ΣY = 30	Σy = 0	Σy² = 40	Σxy = 38
X̄ = 42/6 = 7; Ȳ = 30/6 = 5
Regression equation of X on Y:
X − X̄ = bxy (Y − Ȳ), where bxy = Σxy/Σy² = 38/40 = 0.95
X − 7 = 0.95(Y − 5), i.e. X = 0.95Y + 2.25
Regression equation of Y on X:
Y − Ȳ = byx (X − X̄), where byx = Σxy/Σx² = 38/42 = 0.90
Y − 5 = 0.90(X − 7), i.e. Y = 0.90X − 1.30
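These regression coefficients can be verified with a short Python sketch computing the deviation sums directly:

```python
X = [3, 5, 6, 8, 9, 11]
Y = [2, 3, 4, 6, 5, 10]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n                        # 7.0 and 5.0

sxy = sum((a - mx) * (b - my) for a, b in zip(X, Y))   # 38
sxx = sum((a - mx) ** 2 for a in X)                    # 42
syy = sum((b - my) ** 2 for b in Y)                    # 40

byx = sxy / sxx     # regression coefficient of Y on X -> about 0.90
bxy = sxy / syy     # regression coefficient of X on Y -> 0.95
print(round(byx, 2), round(bxy, 2))
```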
Deviations taken from an Assumed Mean:
Regression equation of X on Y:
X − X̄ = bxy (Y − Ȳ), where bxy = (NΣdxdy − Σdx·Σdy) / (NΣdy² − (Σdy)²)
dx = X − A; dy = Y − A (A = the assumed mean for each series)
Regression equation of Y on X:
Y − Ȳ = byx (X − X̄), where byx = (NΣdxdy − Σdx·Σdy) / (NΣdx² − (Σdx)²)
X 40 38 35 42 30
Y 30 35 40 36 29
X	dx = X − 35	dx²	Y	dy = Y − 30	dy²	dx·dy
40	5	25	30	0	0	0
38	3	9	35	5	25	15
35	0	0	40	10	100	0
42	7	49	36	6	36	42
30	−5	25	29	−1	1	5
Σdx = 10	Σdx² = 108	Σdy = 20	Σdy² = 162	Σdx·dy = 62
X̄ = 185/5 = 37; Ȳ = 170/5 = 34
bxy = (NΣdxdy − Σdx·Σdy) / (NΣdy² − (Σdy)²) = (5 × 62 − 10 × 20) / (5 × 162 − 20²) = 110/410 = 0.27
byx = (NΣdxdy − Σdx·Σdy) / (NΣdx² − (Σdx)²) = 110 / (5 × 108 − 10²) = 110/440 = 0.25
Regression equation of X on Y: X − 37 = 0.27(Y − 34)
Regression equation of Y on X: Y − 34 = 0.25(X − 37)
r = √(bxy × byx)
Illustration: You are given the following data. Estimate the value of Y when X = 12.
	X	Y
Arithmetic Mean (X̄)	7.6	14.8
Standard Deviation (σ) 3.6 2.5
Coefficient of correlation(r) = 0.99
Solution:
Regression equation of Y on X:
Y − Ȳ = r (σy/σx)(X − X̄)
Y − 14.8 = 0.99 × (2.5/3.6) × (X − 7.6)
Y − 14.8 = 0.688(X − 7.6)
Y = 0.688X + 9.57
When X = 12 => Y = 0.688 (12) + 9.57 = 17.826
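A quick numeric check of this estimate (plain Python, restating the given means, standard deviations and r):

```python
mean_x, mean_y = 7.6, 14.8
sd_x, sd_y = 3.6, 2.5
r = 0.99

# Regression of Y on X: Y - mean_y = r * (sd_y / sd_x) * (X - mean_x)
b_yx = r * sd_y / sd_x                  # about 0.688
a = mean_y - b_yx * mean_x              # about 9.57

x = 12
print(round(a + b_yx * x, 3))           # 17.825 (the text, using the rounded 0.688, gets 17.826)
```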
Standard Error of the Estimate:
It measures the scattering of the observations around the regression line. It is calculated as follows:
Standard error of Y values from Yc: Syx = √( Σ(Y − Yc)² / N )
Illustration:
Given the regression equation of Y on X as Y = 3 + 9X for the following data series, calculate (i) Standard
error of estimate (ii) Explained variation in Y (iii) unexplained variation in Y.
X 1 2 3 4 5
Y 10 20 30 50 40
Solution :
X	Y	y = Y − Ȳ	y²	Yc = 3 + 9X	(Y − Yc)	(Y − Yc)²
1 10 -20 400 12 -2 4
2 20 -10 100 21 -1 1
3 30 0 0 30 0 0
4 50 20 400 39 11 121
5 40 10 100 48 -8 64
Σy = 0	ΣY = 150	Σy² = 1000		Σ(Y − Yc)² = 190
(i) Standard error of estimate: Syx = √( Σ(Y − Yc)²/N ) = √(190/5) = √38 = 6.16
(ii) Explained variation in Y = Σy² − Σ(Y − Yc)² = 1000 − 190 = 810
(iii) Unexplained variation in Y = Σ(Y − Yc)² = 190
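The same decomposition can be verified with a brief Python sketch:

```python
X = [1, 2, 3, 4, 5]
Y = [10, 20, 30, 50, 40]

# Fitted values from the given regression line Y = 3 + 9X.
Yc = [3 + 9 * x for x in X]

n = len(Y)
mean_y = sum(Y) / n
total_var = sum((y - mean_y) ** 2 for y in Y)             # 1000
unexplained = sum((y - yc) ** 2 for y, yc in zip(Y, Yc))  # 190

syx = (unexplained / n) ** 0.5
print(round(syx, 2), total_var - unexplained, unexplained)   # 6.16 810.0 190
```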
UNIT V TESTING OF HYPOTHESIS
Hypothesis
A hypothesis is a precise, testable statement of what the researcher predicts will be the outcome of the study. A hypothesis usually involves proposing a relationship between two variables: the independent variable (what the researcher changes) and the dependent variable (what the researcher measures).
Hypothesis is usually considered as the principal instrument in research. The main goal in many research
studies is to check whether the data collected support certain statements or predictions. A statistical
hypothesis is an assertion or conjecture concerning one or more populations. Test of hypothesis is a process
of testing of the significance regarding the parameters of the population on the basis of sample drawn from
it. Thus, it is also termed a 'test of significance'.
In short, hypothesis testing enables us to make probability statements about population parameter. The
hypothesis may not be proved absolutely, but in practice it is accepted if it has withstood a critical testing.
Points to be considered while formulating Hypothesis
• Hypothesis should be clear and precise.
• Hypothesis should be capable of being tested.
• Hypothesis should state relationship between variables.
• Hypothesis should be limited in scope and must be specific.
• Hypothesis should be stated as far as possible in most simple terms so that the same is easily
understandable by all concerned.
• Hypothesis should be amenable to testing within a reasonable time.
• Hypothesis must explain empirical reference.
▪ Setting up a hypothesis:
At the very outset, we set up certain hypotheses with regard to the related variables under the assumptions defined for the study. Generally, two forms of hypotheses are constructed; if one hypothesis is accepted, the other is rejected.
i. Null Hypothesis: The null hypothesis is a very useful tool for testing the significance of a difference. Any hypothesis concerning a population is called a statistical hypothesis. In the process of a statistical test, the rejection or acceptance of the hypothesis depends on the sample drawn from the population. The statistician tests the hypothesis through observation and gives a probability statement. The null hypothesis states that the statistical measures of the sample and those of the population under study do not differ significantly. Similarly, it may assume no relationship or association between two variables or attributes. When assessing the effectiveness of a literacy campaign on the awareness of rural people, for example, we assume "There is no effect of the campaign on public awareness". It is denoted by Ho.
For example, if we want to find out whether extra coaching has benefitted the students or not, the null hypothesis would be:
Ho (1): The extra coaching has not benefitted the students.
Similarly, if we want to find out whether a particular drug is effective in curing malaria, we will take the null hypothesis:
Ho (2): The drug used under experimentation is not effective in curing malaria. Similarly, for testing the significance of the difference between two samples, the null hypothesis would be:
Ho (3): “There is no significant difference between the variation in data of two samples taken from the
same parent population.” i.e. σ1 = σ2
The rejection of the null hypothesis indicates that the differences have statistical significance, and the acceptance of the null hypothesis indicates that the differences are due to chance and arise because of sampling fluctuations. Since many practical problems aim at establishing the statistical significance of differences, rejection of the null hypothesis may thus indicate success in a statistical project.
ii. Alternative Hypothesis: As against the null hypothesis, the alternative hypothesis specifies those values
that the researcher believes to hold true, and, of course, he hopes that the sample data lead to acceptance of
this hypothesis as true.
Rejection of the null hypothesis Ho leads to the acceptance of the alternative hypothesis, which is denoted by H1.
With respect to the three null hypotheses as stated above, researcher might establish the following alternative
hypotheses:
Thus H1 (1): The extra coaching has benefitted the students.
The null and alternative hypotheses can also be written as:
Ho: σ1 = σ2 (i.e., σ1 - σ2 = 0);   H1: σ1 - σ2 ≠ 0
Ho: µ1 - µ2 = 0;   H1: µ1 - µ2 ≠ 0
When the two hypotheses are set up, the acceptance or rejection of the null hypothesis is based on a sample study. While we make a decision on the basis of the data analysis and the test of significance, it may lead to a wrong conclusion in two ways: (i) rejecting a true null hypothesis (Type I error), and (ii) accepting a false null hypothesis (Type II error). This can be presented in the following table:

                        Decision
                        Accept Ho            Reject Ho
Ho is true              Correct decision     Type I error
Ho is false             Type II error        Correct decision
The maximum probability of committing a Type I error, which we specify in a test, is known as the level of significance. Generally, a 5% level of significance is fixed in statistical tests. This implies that we can have 95% confidence in accepting a hypothesis, or that we run a 5% risk of taking a wrong decision. The range of variation of the test statistic is divided into two regions: the acceptance region and the critical (or rejection) region. If the sample statistic falls in the critical region, we reject the null hypothesis; that is, we go with H1 when the computed value of the sample statistic falls in the rejection region.
The critical region under a normal curve, as stated earlier, can be specified in two ways: (a) on both sides of the curve (two-tailed test), or (b) on one side of the curve, either the right tail or the left tail (one-tailed test).
Acceptance and rejection regions in the case of a two-tailed test (with 5% significance level)
The third step in hypothesis testing procedure is to construct a test criterion. This involves selecting an
appropriate probability distribution for the particular test, that is, a probability distribution which can
properly be applied. Some probability distributions that are commonly used in testing procedures are Z, t,
F and χ².
▪ Computation:
After completing the first three steps, we have completely designed a statistical test. We now proceed to the fourth step: computing the various measures from a random sample of size n which are necessary for applying the test. These calculations include the test statistic and the standard error of the test statistic.
Finally, we come to the conclusion stage, where we either accept or reject the null hypothesis. The decision is based on whether the computed value of the test statistic lies in the acceptance region or the rejection region. If the computed value of the test statistic falls in the acceptance region (that is, the computed value is less than the critical value), the null hypothesis is accepted. On the contrary, if the computed value of the test statistic is greater than the critical value, it falls in the rejection region and the null hypothesis is rejected.
Tests of Hypotheses
Hypothesis testing determines the validity of the assumption (technically described as null hypothesis) with
a view to choose between two conflicting hypotheses about the value of a population parameter. Hypothesis
testing helps to decide on the basis of a sample data, whether a hypothesis about the population is likely to
be true or false. Statisticians have developed several tests of hypotheses (also known as tests of significance), which can be classified as: (i) parametric tests and (ii) non-parametric or distribution-free tests.
Parametric tests usually assume certain properties of the parent population from which we draw samples.
Assumptions like observations come from a normal population, sample size is large, assumptions about the
population parameters like mean, variance, etc., must hold good before parametric tests can be used. But
there are situations when the researcher cannot or does not want to make such assumptions. In such
situations we use statistical methods for testing hypotheses which are called non-parametric tests because
such tests do not depend on any assumption about the parameters of the parent population. Besides, most
non-parametric tests assume only nominal or ordinal data, whereas parametric tests require measurement
equivalent to at least an interval scale. As a result, non-parametric tests need more observations than
parametric tests to achieve the same size of Type I and Type II errors.
Non-parametric Tests
Non-parametric tests are used when the data are not normally distributed; therefore, the key is to determine whether the data are normally distributed. The only non-parametric test you are likely to come across in elementary statistics is the chi-square test, but there are several others.
5.1 Chi-Square Test
The Chi-Square Test for Goodness of Fit tests claims about population proportions. It is a non-parametric test that is performed on categorical (nominal or ordinal) data.
Illustration: 1
In the 2000 US Census, the ages of individuals in a small town were found to be distributed as follows:

            Less than 18    18–35    Greater than 35
            20%             30%      50%

In 2010, the ages of a sample of 500 individuals from the same town were recorded:

            Less than 18    18–35    Greater than 35
Observed    121             288      91

Using a 5% level of significance (alpha = 0.05), would you conclude that the population distribution of ages has changed in the last 10 years?
Solution:
Using the sample size and the expected percentages, we can calculate how many people we would expect to fall within each range, and then make a table of observed versus expected values:

            Less than 18         18–35                Greater than 35
Observed    121                  288                  91
Expected    20% of 500 = 100     30% of 500 = 150     50% of 500 = 250
The test proceeds in seven steps:
1. State the Hypotheses
2. State Alpha
3. Calculate Degrees of Freedom
4. State Decision Rule
5. Calculate Test Statistic
6. State Results
7. State Conclusion

1. State the Hypotheses
Ho: The 2010 age distribution of the town is the same as the 2000 distribution.
H1: The 2010 age distribution of the town differs from the 2000 distribution.
2. State Alpha
Alpha = 0.05
3. Calculate Degrees of Freedom
d.f. = number of categories - 1 = 3 - 1 = 2
4. State Decision Rule
Using our alpha and our degrees of freedom, we look up the critical value in the Chi-Square table. We find our critical value to be 5.99. If the computed χ² exceeds 5.99, reject the null hypothesis.
5. Calculate Test Statistic
The Chi-Square statistic is found using the following equation, where observed values are compared to expected values:
χ² = ∑(O - E)²/E
χ² = (121 - 100)²/100 + (288 - 150)²/150 + (91 - 250)²/250
   = 4.41 + 126.96 + 101.12 = 232.49
6. State Results
Since χ² = 232.49 is greater than the critical value 5.99, we reject the null hypothesis.
7. State Conclusion
The age distribution of the 2010 population is different from that expected on the basis of the 2000 population.
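The whole test can be reproduced with SciPy's chisquare function (a sketch added for illustration, not part of the text):

```python
from scipy.stats import chisquare, chi2

# Chi-square goodness-of-fit test for the age distribution example.
observed = [121, 288, 91]
expected = [0.20 * 500, 0.30 * 500, 0.50 * 500]    # 100, 150, 250

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(0.95, df=len(observed) - 1)    # ≈ 5.99

print(round(stat, 2), round(critical, 2), p_value < 0.05)   # 232.49, 5.99, True
```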
The Chi-Square Test for Independence evaluates the relationship between two variables. It is a
nonparametric test that is performed on categorical (nominal or ordinal) data.
Illustration: 2
500 elementary school boys and girls are asked what their favourite colour is: blue, green, or pink. The results are summarised in a 2 × 3 table of observed counts (rows: boys and girls; columns: colours):

            Blue    Green    Pink
Using a 5% level of significance (alpha = 0.05), would you conclude that there is a relationship between gender and favourite colour?
Let's perform a hypothesis test to answer this question, following the same seven steps:
1. State the Hypotheses
2. State Alpha
3. Calculate Degrees of Freedom
4. State Decision Rule
5. Calculate Test Statistic
6. State Results
7. State Conclusion

1. State the Hypotheses
Ho: For the population of elementary school students, gender and favourite colour are not related.
H1: For the population of elementary school students, gender and favourite colour are related.
2. State Alpha
Alpha = 0.05
3. Calculate Degrees of Freedom
d.f. = (rows - 1)(columns - 1) = (2 - 1)(3 - 1) = 2
4. State Decision Rule
Using our alpha and our degrees of freedom, we look up the critical value in the Chi-Square table. We find our critical value to be 5.99. If the computed χ² exceeds 5.99, reject the null hypothesis.
5. Calculate Test Statistic
First, we need to calculate the expected values. We find each expected value by multiplying its row total by its column total and then dividing by the total number of subjects. For the girls' row (row total 200; column totals 120, 180 and 200; N = 500) the calculations are:

Girls:   120×200/500 = 48    180×200/500 = 72    200×200/500 = 80    (row total 200)

The corresponding expected values for the boys' row (row total 300) are 72, 108 and 120. The test statistic is then χ² = ∑(O - E)²/E, summed over all six cells.
6. State Results
χ² = 266.389
Since 266.389 is greater than the critical value 5.99, we reject the null hypothesis.
7. State Conclusion
For the population of elementary school students, gender and favourite colour are related.
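For completeness, a sketch of how such a test can be run with SciPy's chi2_contingency. The observed counts used here are hypothetical placeholders and are not the illustration's data; they only show the mechanics:

```python
from scipy.stats import chi2_contingency

# Chi-square test for independence on a 2 x 3 contingency table.
observed = [
    [100, 150,  50],   # boys  (hypothetical counts for blue, green, pink)
    [ 20,  30, 150],   # girls (hypothetical counts for blue, green, pink)
]

stat, p_value, dof, expected = chi2_contingency(observed)
print(round(stat, 2), dof)       # test statistic and degrees of freedom
print(expected)                  # expected counts under independence
print(p_value < 0.05)            # True -> reject the null hypothesis
```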
5.2 T Test
Types of t-test
The t-test is mainly classified into three types:
• One sample
• Independent sample
• Paired sample
One Sample
In a one-sample t-test, we compare the sample mean with the population mean.
Formula:
t = (x̄ - µ) / (s / √n), with n - 1 degrees of freedom,
where x̄ is the sample mean, µ is the population mean, s is the sample standard deviation and n is the sample size.
Illustration: 3
Illustration: 4
The marks of a student are 10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11 and 9.5. The mean population score is 12 and the standard deviation is 1.80. Does the student's mean score differ significantly from the population mean?
Solution:
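A Python sketch of the calculation (added for illustration; scipy computes the sample standard deviation itself, which here works out to about 1.80):

```python
from scipy import stats

# One-sample t-test: does the student's mean score differ from 12?
marks = [10.5, 9, 7, 12, 8.5, 7.5, 6.5, 8, 11, 9.5]
mu = 12

t_stat, p_value = stats.ttest_1samp(marks, popmean=mu)
print(round(t_stat, 2), round(p_value, 4))
# |t| is well above the 5% critical value t(9) ≈ 2.262, so the student's
# mean score differs significantly from the population mean of 12.
```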
Independent (two-sample t-test):
In this test, we compare the means of two different samples. The formula is:
t = (x̄1 - x̄2) / √( S²(1/n1 + 1/n2) ),  where S² = [∑(X1 - x̄1)² + ∑(X2 - x̄2)²] / (n1 + n2 - 2)
Degrees of Freedom: the degrees of freedom are the number of observations that are free to vary; for this test they are given by d.f. = n1 + n2 - 2.
Illustration: 5
The marks of boys and girls are given below. Boys: 12, 14, 10, 8, 16, 5, 3, 9 and 11. Girls: 21, 18, 14, 20, 11, 19, 8, 12, 13 and 15. Is there any significant difference between the marks of boys and girls, i.e., are the population means different?
Solution:
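A Python sketch of the two-sample test (added for illustration, assuming equal population variances, i.e., the pooled form):

```python
from scipy import stats

# Independent two-sample t-test on the boys' and girls' marks.
boys  = [12, 14, 10, 8, 16, 5, 3, 9, 11]
girls = [21, 18, 14, 20, 11, 19, 8, 12, 13, 15]

t_stat, p_value = stats.ttest_ind(boys, girls, equal_var=True)
df = len(boys) + len(girls) - 2      # 17 degrees of freedom

print(round(t_stat, 2), df, round(p_value, 4))
# If p_value < 0.05, the mean marks of boys and girls differ significantly.
```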
Paired t-test:
In this test, we compare the means of two related samples, or of the same group measured at two different times. Formula:
t = d̄ / (sd / √n), with n - 1 degrees of freedom,
where d̄ is the mean of the differences between the paired observations, sd is the standard deviation of those differences and n is the number of pairs.
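A minimal sketch of a paired t-test in Python; the before/after scores below are purely hypothetical and only illustrate the mechanics:

```python
from scipy import stats

# Paired t-test: same group measured at two different times.
before = [56, 62, 48, 70, 65, 59, 61, 53]   # hypothetical scores before coaching
after  = [60, 64, 55, 71, 68, 63, 66, 57]   # hypothetical scores after coaching

t_stat, p_value = stats.ttest_rel(before, after)
print(round(t_stat, 2), round(p_value, 4))
# If p_value < 0.05, the change between the two occasions is significant.
```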
5.3 F Tests
F-tests are named after Sir Ronald Fisher. The F-statistic is simply a ratio of two variances. Variance is the square of the standard deviation. For most people, standard deviations are easier to understand than variances because they are in the same units as the data rather than in squared units. F-statistics are based on the ratio of mean squares. The term "mean squares" may sound confusing, but it is simply an estimate of population variance that accounts for the degrees of freedom used to calculate that estimate. For carrying out the test of significance, we calculate the ratio F, which is defined as:
F = S1² / S2², where S1² is the larger and S2² the smaller of the two estimates of population variance obtained from the samples, each computed as S² = ∑(X - X̄)²/(n - 1).
The calculated value of F is compared with the table value for v1 and v2 degrees of freedom at the 5% or 1% level of significance. If the calculated value of F is greater than the table value, the F-ratio is considered significant and the null hypothesis is rejected. On the other hand, if the calculated value of F is less than the table value, the null hypothesis is accepted and it is inferred that both samples have come from populations having the same variance.
Illustration: 7
Two random samples were drawn from two normal populations and their values are:
A 65 66 73 80 82 84 88 90 92
B 64 66 74 78 82 85 87 92 93 95 97
Test whether the two populations have the same variance at the 5% level of significance. (Given: F=3.36 at
5% level for v1=10 and v2=8.)
Solution: Let us take the null hypothesis that the two populations have the same variance.
Applying the F-test:
For sample A (n1 = 9): X̄A = 720/9 = 80, ∑(X - X̄A)² = 798, SA² = 798/8 = 99.75
For sample B (n2 = 11): X̄B = 913/11 = 83, ∑(X - X̄B)² = 1298, SB² = 1298/10 = 129.8
F = 129.8/99.75 = 1.30 (the larger estimate in the numerator), with v1 = 10 and v2 = 8 degrees of freedom.
At the 5 percent level of significance, for v1 = 10 and v2 = 8, the table value of F0.05 = 3.36. The calculated value of F (1.30) is less than the table value, so the null hypothesis is accepted. Hence the two populations may be taken to have the same variance.
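The variance ratio can be verified with a short Python sketch (added for illustration, not part of the text):

```python
import statistics
from scipy.stats import f

# F-test for equality of variances (Illustration 7).
A = [65, 66, 73, 80, 82, 84, 88, 90, 92]
B = [64, 66, 74, 78, 82, 85, 87, 92, 93, 95, 97]

var_a = statistics.variance(A)     # unbiased estimate, divisor n - 1 -> 99.75
var_b = statistics.variance(B)     # -> 129.8

F = max(var_a, var_b) / min(var_a, var_b)          # ≈ 1.30
v1, v2 = len(B) - 1, len(A) - 1                    # 10 and 8 d.f. (B has the larger variance)
critical = f.ppf(0.95, dfn=v1, dfd=v2)             # ≈ 3.35 (the text's table gives 3.36)

print(round(F, 2), round(critical, 2), F > critical)
```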
5.4 ANOVA
Illustration:
The following data relate to the retail prices of a commodity in four cities A, B, C and D, with four observations in each city. Test whether the mean prices in the four cities differ significantly.
Solution:
Each observation is reduced by 39, and the reduced values are shown below.

Calculation for Analysis of Variance
                     A        B        C        D
                     -5       -10      -12      -5
                     -2       -6       -10      -3
                     -7       -9       -8       -1
                     -6       -5       -11      -4
Total                T1 = -20 T2 = -30 T3 = -41 T4 = -13   T = -104
Total of squares     114      242      429      51         ∑∑xij² = 836
Sample size          n1 = 4   n2 = 4   n3 = 4   n4 = 4     N = 16

Correction Factor (C.F.) = T²/N = (-104)²/16 = 10816/16 = 676
Total Sum of Squares = ∑∑xij² - C.F. = 836 - 676 = 160
SS between cities = (T1² + T2² + T3² + T4²)/4 - C.F. = (400 + 900 + 1681 + 169)/4 - 676 = 787.5 - 676 = 111.5
SS within cities (error) = 160 - 111.5 = 48.5

Source of variation     SS       d.f.          MS                 F-ratio
Between cities          111.5    4 - 1 = 3     111.5/3 = 37.17    37.17/4.04 = 9.196
Within cities (error)   48.5     16 - 4 = 12   48.5/12 = 4.04
Total                   160.0    15

Since the observed value of F (i.e., 9.196) exceeds the 5% tabulated value (i.e., 3.49) for (3, 12) d.f., we reject the null hypothesis of equality of population means, and conclude that the retail prices of the commodity in the four cities are not equal.
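The F-ratio can be verified with SciPy's one-way ANOVA routine (a sketch using the reduced observations; subtracting 39 from every value does not change the F-ratio):

```python
from scipy.stats import f_oneway

# One-way ANOVA on the reduced retail-price data for the four cities.
A = [-5, -2, -7, -6]
B = [-10, -6, -9, -5]
C = [-12, -10, -8, -11]
D = [-5, -3, -1, -4]

F, p_value = f_oneway(A, B, C, D)
print(round(F, 3), round(p_value, 4))   # F ≈ 9.196; p < 0.05, so reject Ho
```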
In order to test which of the cities differ in prices, we calculate the critical difference (C.D.).
C.D. = s × √(2n) × t0.025 (for 12 d.f.), where s² = MSE. (The factor √(2n) is used here because the comparison is made between the sample totals.)
s = √MSE = √4.04 = 2.01
C.D. = 2.01 × √(2 × 4) × 2.18 = 12.39
The sample totals (of the reduced observations) are T1 = -20, T2 = -30, T3 = -41 and T4 = -13. We have:
| T1- T2| = 10
| T1- T3| = 21
| T1- T4| = 7
| T2- T3| = 11
| T2- T4| = 17
| T3- T4| = 28
Comparing these figures with the C.D. (i.e., 12.39), we find that cities A and C, B and D, and C and D differ in prices. Cities A and B, A and D, and B and C may be taken to have the same prices.
Illustration:
Three investigators estimate the acreage of cultivable land in each of three districts. The observations xij (in coded form; rows: investigators, columns: districts), together with the row totals Ti and column totals Tj′, are:

                  District 1   District 2   District 3   Total
Investigator 1    -1           4            2            T1 = 5
Investigator 2    0            1            3            T2 = 4
Investigator 3    0            -2           2            T3 = 0
Total             T1′ = -1     T2′ = 3      T3′ = 7      T = 9   (N = 9)

Solution (two-way analysis of variance):
∑∑xij² = (-1)² + (4)² + (2)² + (0)² + (1)² + (3)² + (0)² + (-2)² + (2)² = 39
Correction Factor (C.F.) = T²/N = (9)²/9 = 81/9 = 9
Total Sum of Squares (SS) = ∑∑xij² - C.F. = 39 - 9 = 30
Sum of Squares (SS) between Investigators = (T1² + T2² + T3²)/3 - C.F. = (5² + 4² + 0²)/3 - 9 = 41/3 - 9 = 4.67
Sum of Squares (SS) between Districts = (T1′² + T2′² + T3′²)/3 - C.F. = [(-1)² + (3)² + (7)²]/3 - 9 = 59/3 - 9 = 10.67
SS due to Error = Total SS - (SS between Investigators) - (SS between Districts) = 30 - 4.67 - 10.67 = 14.66
Source of variation       SS       d.f.         MS                F-ratio              Table value
Between Investigators     4.67     3 - 1 = 2    4.67/2 = 2.34     2.34/3.67 = 0.64     F0.05 = 6.94 for (2,4) d.f.
Between Districts         10.67    3 - 1 = 2    10.67/2 = 5.34    5.34/3.67 = 1.46     F0.05 = 6.94 for (2,4) d.f.
Within Groups (Error)     14.66    4            14.66/4 = 3.67
Total                     30       9 - 1 = 8
Since the observed value of F for investigators (i.e., 0.64) is less than the corresponding tabulated value (i.e., 6.94) for (2, 4) d.f., it is not significant at the 5% level. We conclude that the mean acreage of cultivable land as determined by the three investigators does not differ significantly, i.e., there are no significant differences between investigators.
Since the observed value of F for districts (i.e., 1.46) is less than the corresponding tabulated value (i.e., 6.94) for (2, 4) d.f., it is not significant at the 5% level. We conclude that the mean acreage of cultivable land in the three districts does not differ significantly, i.e., there are no significant differences between districts.
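A short NumPy sketch (added for illustration, not part of the text) reproducing the two-way sums of squares and F-ratios:

```python
import numpy as np

# Two-way ANOVA (one observation per cell): rows are the three investigators,
# columns are the three districts.
x = np.array([[-1,  4, 2],
              [ 0,  1, 3],
              [ 0, -2, 2]], dtype=float)

N = x.size
cf = x.sum() ** 2 / N                                    # correction factor = 9
total_ss = (x ** 2).sum() - cf                           # 30
ss_rows = (x.sum(axis=1) ** 2).sum() / x.shape[1] - cf   # between investigators ≈ 4.67
ss_cols = (x.sum(axis=0) ** 2).sum() / x.shape[0] - cf   # between districts ≈ 10.67
ss_error = total_ss - ss_rows - ss_cols                  # ≈ 14.67

ms_rows, ms_cols, ms_error = ss_rows / 2, ss_cols / 2, ss_error / 4
print(round(ms_rows / ms_error, 2),     # ≈ 0.64
      round(ms_cols / ms_error, 2))     # ≈ 1.45 (1.46 with the rounded figures above)
```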