Mco 03 PDF
UNIT 1 INTRODUCTION TO BUSINESS RESEARCH
STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Meaning of Research
1.3 Meaning of Science
1.4 Knowledge and Science
1.5 Inductive and Deductive Logic
1.6 Significance of Research in Business
1.7 Types of Research
1.8 Methods of Research
1.8.1 Survey Method
1.8.2 Observation Method
1.8.3 Case Method
1.8.4 Experimental Method
1.8.5 Historical Method
1.8.6 Comparative Method
1.9 Difficulties in Business Research
1.10 Business Research Process
1.11 Let Us Sum Up
1.12 Key Words
1.13 Answers to Self Assessment Exercises
1.14 Terminal Questions
1.15 Further Reading
1.0 OBJECTIVES
After studying this unit, you should be able to:
• explain the meaning of research,
• differentiate between Science and Knowledge,
• distinguish between inductive and deductive logic,
• discuss the need for research in business,
• classify research into different types,
• narrate different methods of research,
• list the difficulties in business research, and
• explain the business research process and its role in decision making.
1.1 INTRODUCTION
Research is a part of any systematic knowledge. It has occupied the realm of
human understanding in some form or other from time immemorial. The thirst
for new knowledge and the human urge for solutions to problems have
developed in us a faculty for search, research and re-search. Research has now
become an integral part of all areas of human activity.
Research and Data Collection
Research in common parlance refers to a search for knowledge. It is an
endeavour to discover answers to problems (of intellectual and practical nature)
through the application of scientific methods. Research, thus, is essentially a
systematic inquiry seeking facts (truths) through objective, verifiable methods in
order to discover the relationship among them and to deduce from them broad
conclusions. It is thus a method of critical thinking. Any type of organisation in
the globalised environment needs a systematic supply of information, coupled
with tools of analysis, for making sound decisions that involve minimum risk. In
this Unit, we will discuss at length the need and
significance of research, types and methods of research, and the research
process.
L.V. Redman and A.V.H. Mory, in their book The Romance of Research,
defined research as “a systematized effort to gain new knowledge”.
The Advanced Learner’s Dictionary of Current English defines it as “a careful
investigation or inquiry specially through search for new facts in any branch of
knowledge”.
Knowing has an external reference, which may be called a fact. A fact is
anything that exists or can be conceived of. A fact is neither true nor false. It is
what it is. What we claim to know is belief or judgement. Not every belief,
however, can be equated with knowledge, because some of our beliefs, even
those we hold to be true, may turn out to be false on verification. Knowledge, therefore, is
a matter of degree. However, knowledge need not always be private or
individual. Private knowledge may be transformed into public knowledge by the
application of certain scientific and common sense procedures.
4) What is a fact?
..................................................................................................................
..................................................................................................................
..................................................................................................................
Empirical studies have a great potential, for they lead to inductions and
deductions. Research enables one to develop theories and principles, on the one
hand, and to arrive at generalizations on the other. Both are aids to acquisition
of knowledge.
i) Industrial and economic activities have assumed huge dimensions. The size of
modern business organizations indicates that managerial and administrative
decisions can affect vast quantities of capital and a large number of people.
Trial and error methods are not appreciated, as mistakes can be tremendously
costly. Decisions must be quick but accurate and timely, and should be
objective, i.e., based on facts and realities. Against this backdrop, business
decisions nowadays are mostly influenced by research and research findings.
Thus, research helps in quick and objective decisions.
ii) Research, being a fact-finding process, significantly influences business
decisions. The business management is interested in choosing that course of
action which is most effective in attaining the goals of the organization.
Research not only provides facts and figures to support business decisions but
also enables the business to choose the course which is best.
iii) A considerable number of business problems are now given quantitative
treatment with some degree of success with the help of operations research.
Research into management problems may result in certain conclusions by
means of logical analysis which the decision maker may use for his action or
solution.
iv) Research plays a significant role in the identification of a new project, project
feasibility and project implementation.
v) Research helps the management to discharge its managerial functions of
planning, forecasting, coordinating, motivating, controlling and evaluation
effectively.
vi) Research facilitates the process of thinking, analysing, evaluating and
interpreting the business environment, various business situations and business
alternatives, so as to be helpful in the formulation of business policy and
strategy.
vii) Research and Development (R & D) helps discovery and invention.
Developing new products or modifying existing products, discovering new
uses, new markets etc., is a continuous process in business.
viii) The role of research in functional areas like production, finance, human
resource management and marketing cannot be overemphasized. Research not
only establishes relationships between different variables in each of these
functional areas, but also between these various functional areas.
ix) Research is a must in the production area. Product development, new and
better ways of producing goods, invention of new technologies, cost reduction,
improving product quality, work simplification, performance improvement,
process improvement etc., are some of the prominent areas of research in the
production area.
x) The purchase/material department uses research to frame alternative suitable
policies regarding where to buy, when to buy, how much to buy, and at what
price to buy.
xi) Closely linked with the production function is the marketing function. Market
research and marketing research provide a major part of the marketing
information which influences inventory and production levels. Marketing
research studies include problems and opportunities in the market, product
preference, sales forecasting, advertising effectiveness, product distribution,
after-sales service, etc.
xii) In the area of financial management, maintaining liquidity and profitability
through proper funds management and assets management is essential.
Optimum capital mix, matching of funds inflows and outflows, cash flow
forecasting, cost control, pricing, etc., require some sort of research and
analysis. Financial institutions (banking and non-banking) have also found it
essential to set up research divisions for the purpose of collecting and analysing
data, both for their internal purposes and for making in-depth studies on the
economic conditions of business and people.
xiii) In the area of human resource management, personnel policies have to be
guided by research. An individual’s motivation to work is associated with his
needs and their satisfaction. An effective Human Resource Manager is one
who can identify the needs of his work force and formulate personnel policies
to satisfy the same so that they can be motivated to contribute their best to the
attainment of organizational goals. Job design, job analysis, job assignment,
scheduling work breaks etc., have to be based on investigation and analysis.
xiv) Finally, research in business is a must to continuously update its attitudes,
approaches, products, goals, methods, and machinery in accordance with the
changing environment in which it operates.
a) Life and physical sciences such as Botany, Zoology, Physics and Chemistry.
b) Social Sciences such as Political Science, Public Administration, Economics,
Sociology, Commerce and Management.
Research in these fields is also broadly referred to as life and physical science
research and social science research. Business education covers both
Commerce and Management, which are part of Social sciences. Business
research is a broad term which covers many areas.
Business Research
a) One-time or single-period research, e.g., covering one year or a point of
time. Most sample studies and diagnostic studies are of this type.
b) Longitudinal research, e.g., covering several years or several time periods (a
time series analysis), such as industrial development during the five year
plans in India.
viii) According to the purpose of the Study
What is the purpose/aim/objective of the study? Is it to describe, analyze,
evaluate or explore? Accordingly, the studies are known as descriptive,
analytical, evaluative or exploratory studies.
1) Survey Method
2) Observation Method
3) Case Method
4) Experimental Method
5) Historical Method
6) Comparative Method
i) It is not only seeing and viewing but also hearing and perceiving.
ii) It is both a physical and a mental activity. The observing eye catches many
things which are sighted, but attention is also focused on data that are relevant
to the problem under study.
iii) It captures the natural social context in which the person’s behaviour occurs.
iv) Observation is selective: The investigator does not observe everything but
selects the range of things to be observed depending upon the nature, scope and
objectives of the study.
v) Observation is not casual but with a purpose. It is made for the purpose of
noting things relevant to the study.
vi) The investigator first of all observes the phenomenon and then gathers and
accumulates data.
Case Study is one of the popular research methods. A case study aims at
studying everything about something rather than something about everything. It
examines the complex factors involved in a given situation so as to identify the
causal factors operating in it. The case study describes a case in terms of its
peculiarities and its typical or extreme features. It also helps to secure a fund of
information about the unit under study. It is a most valuable method of study
for diagnostic and therapeutic purposes.
The contrast between the field experiment and laboratory experiment is not
sharp, the difference is a matter of degree. The laboratory experiment has a
maximum of control, where as the field experiment must operate with less
control.
For historical data, only authentic sources should be depended upon, and their
authenticity should be tested by checking and cross-checking the data from as
many sources as possible. Many a time it is of considerable interest to use
time series data for assessing progress or for evaluating the impact of policies
and initiatives. This can be meaningfully done with the help of historical data.
The origin and the development of human beings, their customs, their
institutions, their innovations and the stages of their evolution have to be traced
and established. The scientific method by which such developments are traced
is known as the Genetic method and also as the Evolutionary method. The
science which appears to have been the first to employ the Evolutionary
method is comparative philology. It is employed to “compare” the different
languages in existence, to trace the history of their evolution in the light of such
similarities and differences as the comparisons disclosed. Darwin’s famous work
“Origin of Species” is the classic application of the Evolutionary method in
comparative anatomy.
xii) Many researchers in our country also face the difficulty of inadequate
computing and secretarial assistance, because of which they have to take
more time to complete their studies.
xiv) Social research, especially managerial research, relates to human beings and
their behaviour. The observations, the data collection, the conclusions, etc.,
must be valid. There is the problem of conceptualization of these aspects.
xv) Another difficulty in the research arena is that there is no code of conduct for
the researchers. There is need for developing a code of conduct for
researchers to educate them about ethical aspects of research, maintaining
confidentiality of information etc.
In spite of all these difficulties and problems, a business enterprise cannot avoid
research, especially in the fast changing world. To survive in the market, an
enterprise has to continuously update itself; it has to change its attitudes,
approaches, products, technology, etc., through continuous research.
5) List out five important difficulties faced by business researchers in India.
..................................................................................................................
..................................................................................................................
..................................................................................................................
Specifically, aspects (i) to (iv) are covered in unit-2, aspects (v) to (viii)
are covered in units 3,4 and 5, processing and presentation aspects of
(ix) are discussed in units 6 & 7, and analytical tools and techniques of
data analysis of (ix) are elaborated in units 8 to 17, interpretation
aspects of (x) are discussed in unit 18 and reporting aspects in unit 19.
Therefore, the above aspects are not elaborated in this unit.
Empirical studies have a great potential, for they lead to inductions and
deductions. Induction is the process of reasoning to arrive at generalizations
from particular facts. Deduction is a way of making a particular inference from
a generalization.
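As a toy illustration (using hypothetical data, not taken from the text), the two modes of reasoning just described can be sketched in a few lines of Python: induction moves from particular observations to a generalization, and deduction applies that generalization back to a particular case.

```python
# Induction: from particular observations (hypothetical delivery times),
# arrive at a generalization about the average.
observed_delivery_days = [2, 3, 2, 4, 3, 2, 3]
average = sum(observed_delivery_days) / len(observed_delivery_days)
print(f"Generalization: deliveries take about {average:.1f} days on average")

# Deduction: from the generalization (the premise), draw a particular
# inference about the next individual order.
def expected_delivery(avg_days):
    return f"Expect the next order in roughly {round(avg_days)} days"

print(expected_delivery(average))
```

The induction here is only as strong as the sample it rests on, which is exactly why the text stresses verification of generalizations against facts.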
Research can be classified into different types for the sake of better
understanding. Several bases can be used for this classification such as branch
of knowledge, nature of data, coverage, application, place of research, research
methods used, time frame etc., and the research may be known as that type.
The research has to provide answers to the research questions raised. For this
the problem has to be investigated and relevant data has to be gathered. The
procedures adopted for obtaining the data and information are described as
methods of research. There are six methods viz., Survey, Observation, Case,
Experimental, Historical and Comparative methods.
The business researcher in India has to face certain difficulties, such as lack of
scientific research training, paucity of competent researchers and research
supervisors, non-encouragement of research by business organizations, the
inability of small business organizations to afford R & D departments, lack of
scientific orientation in business management, insufficient interaction between
industry and university, funding problems, poor library facilities, delayed
availability of published data, etc.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 2 RESEARCH PLAN
STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Research Problem
2.2.1 Sources of Research Problems
2.2.2 Points to be Considered While Selecting a Problem
2.2.3 Specification of the Problem
2.3 Formulation of Objectives
2.4 Hypothesis
2.4.1 Meaning of Hypothesis
2.4.2 Types of Hypothesis
2.4.3 Criteria for Workable Hypothesis
2.4.4 Stages in Hypothesis
2.4.5 Testing of Hypothesis
2.4.6 Uses of Hypothesis
2.5 Research Design
2.5.1 Functions of Research Design
2.5.2 Components of a Research Design
2.6 Pilot Study and Pre-testing
2.7 Let Us Sum Up
2.8 Key Words
2.9 Answers to Self Assessment Exercises
2.10 Terminal Questions
2.11 Further Reading
2.0 OBJECTIVES
After studying this unit, you should be able to:
2.1 INTRODUCTION
In unit 1, we have discussed the meaning and significance of business research,
types of research, methods of conducting research, and the business research
process. There we have shown that the research process begins with the
raising of a problem, leading to the gathering of data, their analysis and
interpretation, and finally ends with the writing of the report. In this unit, we
propose to give complete coverage of the selection and specification of the
research problem, the formulation of research objectives/hypotheses, and the
design of the action plan of research. We will dwell in detail on these aspects,
along with the associated features which are interwoven with the research
problem and hypothesis formulation and testing.
Let us now discuss some considerations for the selection of a research problem.
2) Day-to-Day Problems: A research problem can come from the day-to-day
experience of the researcher. Everyday problems constantly present something
new and worthy of investigation, and it depends on the keenness of observation
and sharpness of intellect of the researcher to knit his daily experience into a
research problem. For example, a person who travels in city buses every day
finds it a problem to get in or out of the bus. But a queue system (that is the
answer to the problem) facilitates boarding and alighting comfortably.
The topic or problem which the researcher selects among the many possibilities
should meet certain requirements. Every problem selected for research must
satisfy the following criteria.
1) The topic selected should be original or at least less explored. The purpose
of research is to fill the gaps in existing knowledge or to discover new facts
and not to repeat already known facts. Therefore, a preliminary survey of the
existing literature in the proposed area of research should be carried out to find
out the possibility of making an original contribution. Knowledge about previous
research will serve at least three purposes.
a) It will enable the researcher to identify his specific problem for research.
b) It will eliminate the possibility of unnecessary duplication of effort, and
c) It will give him valuable information on the merits and limitations of various
research techniques which have been used in the past.
2) It should be of significance and socially relevant and useful.
3) It should be interesting to the researcher and should fit into his aptitude.
4) It should be from an area of the researcher’s specialization.
5) It should correspond to the researcher’s abilities - both acquired and acquirable.
6) It should be big enough to be researchable and small enough to be handled - the
topic should be amenable for research with existing and acquirable skills.
7) It should have a clear focus or objective.
8) The feasibility of carrying out research on the selected problem should be
checked against the following considerations.
a) Whether adequate and suitable data are available?
b) Whether there is access to the organization and respondents?
c) Whether cooperation will be forthcoming from the organization and
respondents?
d) What are the resources required and how are they available?
e) Whether the topic is within the resources (money and man power) position
of the researcher?
9) It should be completed within the time limits permissible.
The research problem should define the goal of the researcher in clear terms.
It means that, along with the problem, the objective of the proposal should be
adequately spelled out. Without a clear-cut idea of the goal to be reached,
research activities would be meaningless.
It should be remembered that there must be at least two means available to the
research consumer. If he/she has no choice of means, he/she cannot have a
problem.
4) Doubt in Regard to Selection of Alternatives: The existence of
alternative courses of action is not enough. To experience a problem, the
research consumer must have some doubt as to which alternative to select.
Without such a doubt there can be no problem. All problems then get reduced
ultimately to the evaluation of efficiency of the alternative means for a given
set of objectives.
The selection of a topic for research is only half a step forward. This general
topic does not help a researcher to see what data are relevant to his/her
purpose. What methods would he/she employ in securing them? And how
should these be organized? Before he/she can consider all these aspects,
he/she has to formulate a specific problem by making its various components
(as explained above) explicit.
1) What do you want to know? (What is the problem / what are the questions to
be answered?)
3) How do you want to answer or solve it? (What methodology do you want to
adopt to solve it?)
2.4 HYPOTHESIS
We know that research begins with a problem or a felt need or difficulty. The
purpose of research is to find a solution to the difficulty. It is desirable that the
researcher should propose a set of suggested solutions or explanations of the
difficulty which the research proposes to solve. Such tentative solutions
formulated as a proposition are called hypotheses. The suggested solutions
formulated as hypotheses may or may not be the real solutions to the problem.
Whether they are or not is the task of research to test and establish.
2.4.1 Meaning of Hypothesis
“It is a proposition which can be put to test to determine its validity” (Goode
and Hatt).
A hypothesis controls and directs the research study. When a problem is felt,
we require a hypothesis to explain it. Generally, there is more than one
hypothesis aiming to explain the same fact, but all of them cannot be equally
good. How, then, can we judge a hypothesis to be true or false, good or bad?
Agreement with facts is the sole and sufficient test of a true hypothesis.
Certain conditions can therefore be laid down for distinguishing a good
hypothesis from bad ones. The formal conditions laid down by thinkers provide
the criteria for judging a hypothesis as good or valid. These conditions are as
follows:
There are four stages. The first stage is feeling of a problem. The observation
and analysis of the researcher reveals certain facts. These facts pose a
problem. The second stage is formulation of a hypothesis or hypotheses. A
tentative supposition/ guess is made to explain the facts which call for an
explanation. At this stage some past experience is necessary to pick up the
significant aspects of the observed facts. Without previous knowledge, the
investigation becomes difficult, if not impossible. The third stage is deductive
development of hypothesis using deductive reasoning. The researcher uses the
hypothesis as a premise and draws a conclusion from it. And the last stage is
the verification or testing of hypothesis. This consists in finding whether the
conclusion drawn at the third stage is really true. Verification consists in finding
whether the hypothesis agrees with the facts. If the hypothesis stands the test
of verification, it is accepted as an explanation of the problem. But if the
hypothesis does not stand the test of verification, the researcher has to search
for further solutions.
To explain the above stages let us consider a simple example. Suppose, you
have started from your home for college on your scooter. A little while later
the engine of your scooter suddenly stops. What can be the reason? Why has
it stopped? From your past experience, you start guessing that such problems
generally arise due to either petrol or spark plug. Then start deducing that the
cause could be: (i) that the petrol knob is not on. (ii) that there is no petrol in
the tank. (iii) that the spark plug has to be cleaned. Then start verifying them
one after another to solve the problem. First see whether the petrol knob is on.
If it is not, switch it on and start the scooter. If it is already on, then see
whether there is petrol or not by opening the lid of the petrol tank. If the tank
is empty, go to the near by petrol bunk to fill the tank with petrol. If there is
petrol in the tank, this is not the reason, then you verify the spark plug. You
clean the plug and fit it. The scooter starts. That means the problem is with the
spark plug. You have identified it. So you got the answer. That means your
problem is solved.
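The scooter diagnosis above amounts to sequential hypothesis verification: each tentative explanation is checked in turn until one agrees with the facts. A minimal sketch of that procedure (all function and parameter names here are hypothetical illustrations, not part of the text):

```python
def diagnose(petrol_knob_on, tank_has_petrol, plug_clean):
    # Hypothesis (i): the petrol knob is not on.
    if not petrol_knob_on:
        return "turn the petrol knob on"
    # Hypothesis (ii): there is no petrol in the tank.
    if not tank_has_petrol:
        return "fill the tank with petrol"
    # Hypothesis (iii): the spark plug has to be cleaned.
    if not plug_clean:
        return "clean the spark plug"
    # No hypothesis stood the test: search for further solutions.
    return "look for further explanations"

# The case described in the text: knob on, petrol present, dirty plug.
print(diagnose(petrol_knob_on=True, tank_has_petrol=True, plug_clean=False))
```

Note that the order of the checks matters: cheap, likely explanations are tested before costlier ones, mirroring how the rider verifies the knob before draining effort on the plug.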
When the hypothesis has been framed in the research study, it must be verified
as true or false. Verifiability is one of the important conditions of a good
hypothesis. Verification of hypothesis means testing of the truth of the
hypothesis in the light of facts. If the hypothesis agrees with the facts, it is said
to be true and may be accepted as the explanation of the facts. But if it does
not agree it is said to be false. Such a false hypothesis is either totally rejected
or modified. Verification is of two types viz., Direct verification and Indirect
verification.
If a clear scientific hypothesis has been formulated, half of the research work
is already done. The advantages/utility of having a hypothesis are summarized
below:
..................................................................................................................
..................................................................................................................
..................................................................................................................
The research has to be geared to the available time, energy, money and to the
availability of data. There is no such thing as a single or correct design.
Research design represents a compromise dictated by many practical
considerations that go into research.
i) It provides the researcher with a blue print for studying research questions.
ii) It dictates boundaries of research activity and enables the investigator to
channel his energies in a specific direction.
iii) It enables the investigator to anticipate potential problems in the implementation
of the study.
iv) The common function of designs is to assist the investigator in providing
answers to various kinds of research questions.
A study design includes a number of component parts which are interdependent
and which demand a series of decisions regarding the definitions, methods,
techniques, procedures, time, cost and administration aspects.
1) Need for the Study: Explain the need for and importance of this study and its
relevance.
2) Review of Previous Studies: Review the previous works done on this topic,
understand what they did, identify gaps and make a case for this study and justify it.
3) Statement of Problem: State the research problem in clear terms and give
a title to the study.
4) Objectives of Study: What is the purpose of this study? What are the
objectives you want to achieve by this study? The statement of objectives should
not be vague; the objectives must be specific and focussed.
5) Formulation of Hypothesis: Conceive the possible outcomes or answers to
the research questions and formulate them as hypotheses so that they can be
tested.
6) Operational Definitions: If the study uses uncommon concepts or
unfamiliar tools, or uses even familiar tools and concepts in a specific sense,
they must be specified and defined.
7) Scope of the Study: It is important to define the scope of the study,
because the scope decides what is within its purview and what is outside.
Research designs provide guidelines for investigative activity, not necessarily
hard and fast rules that must remain unbroken. As the study progresses, new
aspects, new conditions and new connecting links come to light, and it may be
necessary to change the plan/design as circumstances demand. A universal
characteristic of any research plan is its flexibility.
Depending upon the method of research, the designs are also known as survey
design, case study design, observation design and experimental design.
2.6 PILOT STUDY AND PRE-TESTING
A pilot study is a small-scale replica of the main study. When a problem is
selected for research, a plan of action has to be designed to proceed further.
But if we do not have adequate knowledge about the subject matter, the nature
of the population (the word ‘population’ as used in statistics denotes the
aggregate from which the sample is to be taken), the various issues involved,
or the tools and techniques to be used for operationalizing the research
problem, we have to familiarize ourselves with these first and acquire a good
deal of knowledge about the subject matter of the study and its dimensions.
For this purpose, a small study is conducted before the main study; this is
called a pilot study. A pilot study provides better knowledge of the problem
and its dimensions. It helps us to understand the nature of the population to be
surveyed and the field problems to be encountered. It also helps in developing
better approaches and better instruments. It covers the entire process of
research, but on a small scale. It is also useful for preparing the research
design clearly and specifically.
The difference between a pilot study and a pre-test is that the former is a
full-fledged miniature study of a research problem, whereas the latter is a trial
test of a specific aspect of the study, such as a questionnaire.
..................................................................................................................
..................................................................................................................
..................................................................................................................
Having specified the problem, the next step is to formulate the objectives of
research so as to give direction to the study. The researcher should also
propose a set of suggested solutions to the problem under study. Such tentative
solutions formulated are called hypotheses. The hypotheses are of various types
such as explanatory hypothesis, descriptive hypothesis, analogical hypothesis,
working hypothesis, null hypothesis and statistical hypothesis. A good hypothesis
must be empirically verifiable, should be relevant, must have explanatory power,
must be as far as possible within the established knowledge, must be simple,
clear and definite. There are four stages in a hypothesis: (a) feeling a problem,
(b) formulating a hypothesis, (c) deductive development of the hypothesis, and
(d) verification/testing of the hypothesis. Verification can be done directly,
indirectly or through logical methods. Testing is done by using statistical
methods.
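As a sketch of what testing a hypothesis by statistical methods can look like in practice, the following one-sample z-test checks a hypothetical null hypothesis (average daily sales of 50 units, with an assumed known standard deviation of 8) against made-up sample data; the figures are illustrative assumptions, and only the Python standard library is used.

```python
import math

sample = [54, 49, 57, 52, 58, 51, 55, 53]  # hypothetical observations
mu0 = 50.0    # null hypothesis H0: the population mean is 50 units
sigma = 8.0   # assumed known population standard deviation

# Test statistic: how many standard errors the sample mean lies from mu0.
mean = sum(sample) / len(sample)
z = (mean - mu0) / (sigma / math.sqrt(len(sample)))

# Two-sided p-value from the standard normal CDF (via the error function).
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

alpha = 0.05
verdict = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"mean = {mean:.3f}, z = {z:.2f}, p = {p_value:.3f}: {verdict}")
```

With this particular sample the evidence is not strong enough at the 5% level, so H0 is retained; this illustrates the point made above that verification may confirm a hypothesis as well as overturn it.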
Having selected the problem and formulated the objectives and hypothesis, the
researcher has to prepare a blueprint or plan of action, usually called a
research design. The design/study plan includes a number of components which
are interdependent and which demand a series of decisions regarding definitions,
scope, methods, techniques, procedures, instruments, time, place, expenditure and
administration aspects.
If the problem selected for research is not a familiar one, a pilot study may be
conducted to acquire knowledge about the subject matter and the various issues
involved. Then, for the collection of data, instruments and/or scales have to be
constructed, which have to be pre-tested before finally being accepted for use.
2.8 KEY WORDS
Hypothesis : A hypothesis is a tentative answer / solution to the research
problem, whose validity remains to be tested.
Pilot Study : A study conducted to familiarize oneself first with the research
problem so that it can be operationalised with a good deal of knowledge about
the problem.
Pre-Test : A trial administration of an instrument such as a questionnaire or
scale to identify its weaknesses is called a pre-test.
Research Design : It is a systematic plan (planning) to direct a piece of
research work.
Research Problem : A research problem is a felt need, which needs an
answer/solution.
Testing of Hypothesis : It means verification of a hypothesis as true or false
in the light of facts.
5) What is meant by hypothesis? Explain the criteria for a workable
hypothesis.
6) What are the different stages in a hypothesis? How do you verify /
test a hypothesis?
7) What is a research design? Explain the functions of a research design.
8) Define a research design and explain its contents.
9) What are the various components of a research design?
10) Distinguish between pilot study and pre-test. Also explain the need for
pilot study and pre-testing.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 3 COLLECTION OF DATA
STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Meaning and Need for Data
3.3 Primary and Secondary Data
3.4 Sources of Secondary Data
3.4.1 Documentary Sources of Data
3.4.2 Electronic Sources
3.4.3 Precautions in Using Secondary Data
3.4.4 Merits and Limitations of Secondary Data
3.5 Methods of Collecting Primary Data
3.5.1 Observation Method
3.5.2 Interview Method
3.5.3 Through Local Reporters and Correspondents
3.5.4 Questionnaire and Schedule Methods
3.6 Choice of Suitable Method
3.7 Let Us Sum Up
3.8 Key Words
3.9 Answers to Self Assessment Exercises
3.10 Terminal Questions
3.11 Further Reading
3.0 OBJECTIVES
On the completion of this unit, you should be able to:
l discuss the necessity and usefulness of data collection,
l explain and distinguish between primary data and secondary data,
l explain the sources of secondary data and its merits and demerits,
l describe different methods of collecting primary data and their merits and
demerits,
l examine the choice of a suitable method, and
l examine the reliability, suitability and adequacy of secondary data.
3.1 INTRODUCTION
In Unit 2, we discussed the selection of a research problem and the
formulation of a research design. A research design is a blue print which
directs the plan of action to complete the research work. As we have
mentioned earlier, the collection of data is an important part of the research
process. The quality and credibility of the results derived from the application
of research methodology depend upon relevant, accurate and adequate data.
In this unit, we shall study the various sources of data, the methods of
collecting primary and secondary data with their merits and limitations, and
the choice of a suitable method for data collection.
The first and foremost task is to collect relevant information to analyse the
problem under study. Information collected from various sources, which can be
expressed in quantitative form for a specific purpose, is called data. The
rational decision maker seeks to evaluate information in order to select the
course of action that maximizes objectives. For decision making, the input data
must be appropriate. This depends on the appropriateness of the method chosen
for data collection. The application of a statistical technique is possible when
the questions are answerable in quantitative terms, for instance, the cost of
production and profit of the company measured in rupees, or the age of the
workers in the company measured in years. Therefore, the first step in
statistical activities is to gather data. The data may be classified as primary
and secondary data. Let us now discuss these two kinds of data in detail.
With the above discussion, we can understand that the difference between
primary and secondary data is only one of degree: data which is primary in
the hands of one becomes secondary in the hands of another.
This category of secondary data source may also be termed as Paper Source.
The main sources of documentary data can be broadly classified into two
categories:
a) Published Sources
There are various national and international institutions, semi-official reports of
various committees and commissions and private publications which collect and
publish statistical data relating to industry, trade, commerce, health etc. These
publications of various organisations are useful sources of secondary data.
These are as follows:
The secondary data is also available through electronic media (through Internet).
You can download data from such sources by entering web sites like
google.com; yahoo.com; msn.com; etc., and typing your subject for which the
information is needed.
You can also find secondary data on electronic sources like CDs, and the
following online journals:
With the above discussion, we can understand that there are many published
and unpublished sources from which a researcher can get secondary data.
However, the researcher must be cautious in using this type of data, because it
may be full of errors due to bias, inadequate sample size, errors of definition,
etc. Bowley observed that it is never safe to take published or unpublished
statistics at their face value without knowing their meaning and limitations.
Hence, before using secondary data, you must examine the following points.
Merits
1) Secondary data is much more economical and quicker to collect than primary
data, as we need not spend time and money on designing and printing data
collection forms (questionnaire/schedule), appointing enumerators, editing and
tabulating data etc.
2) It is impossible for an individual or a small institution to collect primary data
on some subjects, such as population census, imports and exports of different
countries, national income data, etc., but such data can be obtained from
secondary sources.
Limitations
1) Using secondary data is risky because it may not be suitable, reliable or
adequate, and it is difficult to find data which exactly fit the need of the
present investigation.
2) It is difficult to judge whether the secondary data is sufficiently accurate or not
for our investigation.
3) Secondary data may not be available for some investigations. For example,
bargaining strategies in live products marketing, impact of T.V. advertisements
on viewers, opinion polls on a specific subject, etc. In such situations we have to
collect primary data.
Self Assessment Exercise B
1) Write names of five web sources of secondary data which have not been
included in the above table.
....................................................................................................................
....................................................................................................................
....................................................................................................................
2) Explain the merits and limitations of using secondary data.
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) What precautions must a researcher take before using the secondary data?
....................................................................................................................
....................................................................................................................
....................................................................................................................
The Concise Oxford Dictionary defines observation as, ‘accurate watching and
noting of phenomena as they occur in nature with regard to cause and effect
or mutual relations’. Thus observation is not only a systematic watching but it
also involves listening and reading, coupled with consideration of the seen
phenomena. It involves three processes. They are: sensation, attention or
concentration and perception.
Merits
1) This is the most suitable method when the informants are unable or reluctant to
provide information.
2) This method provides deeper insights into the problem and generally the data is
accurate and quicker to process. Therefore, this is useful for intensive study
rather than extensive study.
Limitations
Despite the above merits, this method suffers from the following limitations:
1) In many situations, the researcher cannot predict when the events will occur. So
when an event occurs there may not be a ready observer to observe the event.
2) Participants may be aware of the observer and as a result may alter their
behaviour.
3) The observer, because of personal biases and lack of training, may not record
precisely what he/she observes.
4) This method cannot be used extensively if the inquiry is large and spread over a
wide area.
3.5.2 Interview Method
Interview is one of the most powerful tools and most widely used method for
primary data collection in business research. In our daily routine we see
interviews on T.V. channels on various topics related to social, business, sports,
budget etc. In the words of C. William Emory, ‘personal interviewing is a two-
way purposeful conversation initiated by an interviewer to obtain information
that is relevant to some research purpose’. Thus an interview is basically a
meeting between two persons to obtain information related to the proposed
study. The person who interviews is called the interviewer and the person
who is being interviewed is called the informant. It is to be noted that the
research data/information collected through this method is not just a simple
conversation between the investigator and the informant; the glances,
gestures, facial expressions, level of speech, etc. are all part of the process.
Through this method, the researcher can collect varied types of data intensively
and extensively.
Interviews may also be classified as structured or unstructured. In a structured
interview, set questions are asked and the responses are recorded in a
standardised form. This is useful in large-scale interviews where a number of
investigators are assigned the job of interviewing, and it helps minimise the
bias of the interviewer. This technique is also called a formal interview. In an
unstructured interview, the investigator may not have a set of questions but
only a number of key points around which to build the interview. Such
interviews are normally conducted in exploratory surveys where the researcher
is not completely sure about the type of data he/she needs to collect. It is also
called an informal interview. Generally, this method is used as a supplementary
method of data collection in business research.
Limitations
The major limitations of this method are as follows:
1) Subjective factors or the views of the investigator may come in, either
consciously or unconsciously.
2) The interviewers must be properly trained, otherwise the entire work may be
spoiled.
3) It is a relatively expensive and time-consuming method of data collection
especially when the number of persons to be interviewed is large and they are
spread over a wide area.
4) It cannot be used when the field of enquiry is large (large sample).
Precautions : While using this method, the following precautions should be
taken:
3.5.3 Through Local Reporters and Correspondents
This method is suitable where information is required at regular intervals
and a high degree of accuracy is not of much importance.
Merits
1) This method is cheap and economical for extensive investigations.
2) It gives results easily and promptly.
3) It can cover a wide area under investigation.
Limitations
1) The data obtained may not be reliable.
2) It gives approximate and rough results.
3) It is unsuited where a high degree of accuracy is desired.
4) As the agent/reporter or correspondent uses his own judgement, his personal
bias may affect the accuracy of the information sent.
3.5.4 Questionnaire and Schedule Methods
Questionnaire and schedule methods are the popular and common methods for
collecting primary data in business research. Both the methods comprise a list
of questions arranged in a sequence pertaining to the investigation. Let us study
these methods in detail one after another.
i) Questionnaire Method
Merits
1) You can use this method in cases where informants are spread over a vast
geographical area.
2) Respondents can take their own time to answer the questions. So the researcher
can obtain original data by this method.
3) This is a cheap method because its mailing cost is less than the cost of personal
visits.
4) This method is free from bias of the investigator as the information is given by
the respondents themselves.
5) Large samples can be covered and thus the results can be more reliable and
dependable.
Limitations
1) Respondents may not return the filled-in questionnaires, or they may delay in
replying to the questionnaires.
2) This method is useful only when the respondents are educated and co-operative.
3) Once the questionnaire has been despatched, the investigator cannot modify the
questionnaire.
4) It cannot be ensured whether the respondents are truly representative.
ii) Schedule Method
Merits
1) It is a useful method in case the informants are illiterate.
2) The researcher can overcome the problem of non-response as the enumerators
go personally to obtain the information.
3) It is very useful in extensive studies and can obtain more reliable data.
Limitations
1) It is a very expensive and time-consuming method as enumerators are paid
persons and also have to be trained.
2) Since the enumerator is present, the respondents may not respond to some
personal questions.
3) Reliability depends upon the sincerity and commitment of the enumerators in
data collection.
The success of data collection through the questionnaire method or schedule
method depends on how the questionnaire has been designed.
Specimen Questionnaire
The following specimen questionnaire incorporates most of the qualities which
we have discussed above. It relates to ‘Computer User Survey’.
Computer User Survey
1. What brand of Computer do you primarily use?
(i) IBM (ii) Compaq
(iii) HCL (iv) Dell
(v) Siemens (vi) Any other
_____________
(please specify)
2. Where was the computer purchased?
(i) Computer store (ii) Mail order
(iii) Manufacturer (iv) Company Dealer
(v) Any other _____________
3. How long have you been using computers? ______years
_____months.
4. In a week, about how many hours do you spend on the computer?
____ hours
5. Which database management package do you use most often?
(i) Dbase-II (ii) Dbase-III
(iii) Lotus 1,2,3 (iv) MS-Excel
(v) Oracle (vi) Any other
_____________
(please specify)
6. Does the computer that you primarily use have a hard disk?
Yes No
7. Where did you obtain the software that you use?
(i) Computer user group (ii) Regular dealer
(iii) Mail order (iv) Directly from software dealer
(v) Any other _____________
8. On the following 9-point scale, rate the degree of difficulty that you have
encountered in using the computer.
Extremely difficult 1 2 3 4 5 6 7 8 9 Not difficult
9. If you have to purchase a personal computer today, which one would
you be most likely to purchase?
(i) IBM (ii) Compaq
(iii) HCL (iv) Dell
(v) Siemens (vi) Any other
_____________
(please specify)
10. What is your sex? Male Female
11. Please state your date of birth .............. ............... ..............
Month Day Year
12. Your Qualifications
(i) Secondary (ii) Sr. Secondary
(iii) Graduate (iv) Post-graduate
(v) Doctorate (vi) Any other
_____________
(please specify)
13. Which of the following best describes your primary field of
employment?
(i) Medical (ii) Education
(iii) Business (iv) Government
(v) Technical (vi) Any other ____________
(please specify)
14. What is your current salary?
3.6 CHOICE OF SUITABLE METHOD
You have noticed that there are various methods and techniques for the
collection of primary data. You should be careful to select a method that is
appropriate and effective. The selection of a method depends upon various
factors such as the scope and objectives of the inquiry, time, availability of
funds, the subject matter of the research, the kind of information required, the
degree of accuracy needed, etc. As discussed, every method has its own merits
and demerits. For example, the observation method is suitable for field surveys
where the incident is actually happening; the interview method is suitable where
direct observation is not possible; the local reporter/correspondent method is
suitable when information is required at regular intervals; the questionnaire
method is appropriate in extensive enquiries where the sample is large and
scattered over wide geographical areas and the respondents are able to express
their responses in writing; and the schedule method is suitable where the
respondents are illiterate.
Several methods are used for the collection of primary data: observation,
interview, questionnaire and schedule methods. Every method has its own merits
and demerits; hence, no method is suitable in all situations. A suitable method
should be selected as per the needs of the investigation, which depend on the
objectives, nature and scope of the enquiry, and the availability of funds and time.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
3.11 FURTHER READING
The following text books may be used for more in-depth study of the topics
dealt with in this unit.
Kothari, C.R., 2004. Research Methodology: Methods and Techniques, New
Age International (P) Limited: New Delhi.
Rao, K.V., 1993. Research Methodology in Commerce and Management,
Sterling Publishers Private Limited: New Delhi.
Sadhu, A.N. and A. Singh, 1980. Research Methodology in Social Sciences,
Sterling Publishers Private Limited: New Delhi.
Sampling
UNIT 4 SAMPLING
STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Census and Sample
4.3 Why Sampling?
4.4 Essentials of a Good Sample
4.5 Methods of Sampling
4.5.1 Random Sampling Methods
4.5.2 Non-Random Sampling Methods
4.6 Sample Size
4.7 Sampling and Non-Sampling Errors
4.7.1 Sampling Errors
4.7.2 Non-Sampling Errors
4.7.3 Control of Errors
4.8 Let Us Sum Up
4.9 Key Words
4.10 Answers to Self Assessment Exercises
4.11 Terminal Questions
4.12 Further Reading
4.0 OBJECTIVES
After studying this Unit, you should be able to:
4.1 INTRODUCTION
In the previous Unit 3, we studied the types of data (primary and secondary)
and the various methods and techniques of collecting primary data. The
desired data may be collected by selecting either the census method or the
sampling method.
In this Unit, we shall discuss the basics of sampling, particularly how to get a
sample that is representative of a population. It covers different methods of
drawing samples which can save a lot of time, money and manpower in a
variety of situations. These include random sampling methods, such as simple
random sampling, stratified sampling, systematic sampling, multistage sampling
and cluster sampling, and non-random sampling methods, viz., convenience
sampling, judgement sampling and quota sampling. The advantages and
disadvantages of sampling and census are covered. How to determine the
sample size for a given population is also discussed.
The disadvantages of sampling are few, but the researcher must be cautious.
These are risk, lack of representativeness and insufficient sample size, each of
which can cause errors. If the researcher does not pay attention to these flaws,
they may invalidate the results.
1) Risk: Using a sample from a population and drawing inferences about the
entire population involves risk. In other words the risk results from dealing with
a part of a population. If the risk is not acceptable in seeking a solution to a
problem then a census must be conducted.
2) Lack of representativeness: Determining the representativeness of the
sample is the researcher’s greatest problem. By definition, ‘sample’ means a
representative part of an entire population. It is necessary to obtain a sample
that meets the requirement of representativeness otherwise the sample will be
biased. The inferences drawn from non-representative samples will be misleading
and potentially dangerous.
3) Insufficient sample size: The other significant problem in sampling is to
determine the size of the sample. The size of the sample for a valid sample
depends on several factors such as extent of risk that the researcher is willing to
accept and the characteristics of the population itself.
1) A sample must represent a true picture of the population from which it is drawn.
2) A sample must be unbiased by the sampling procedure.
3) A sample must be taken at random so that every member of the population of
data has an equal chance of selection.
4) A sample must be sufficiently large but as economical as possible.
5) A sample must be accurate and complete. It should not leave any information
incomplete and should include all the respondents, units or items included in the
sample.
6) Adequate sample size must be taken considering the degree of precision
required in the results of inquiry.
As we increase the sample size, the sample characteristics come closer to the
characteristics of the population. Ultimately, if we cover each and every unit of
the population, the characteristics of the sample will equal the characteristics
of the population, which is why in a census there is no sampling error. Thus,
generally speaking, the larger the sample size, the smaller the sampling error.
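This rule of thumb can be illustrated with a small simulation. The population values, sample sizes, number of trials and random seed below are all hypothetical, chosen only for illustration:

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population of 10,000 values with mean near 50
population = [random.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

def avg_error(n, trials=200):
    """Average absolute gap between the sample mean and the population
    mean over repeated simple random samples of size n."""
    gaps = [abs(statistics.mean(random.sample(population, n)) - true_mean)
            for _ in range(trials)]
    return statistics.mean(gaps)

small_n_error = avg_error(30)
large_n_error = avg_error(1000)
print(f"n=30: {small_n_error:.2f}  n=1000: {large_n_error:.2f}")
```

With these illustrative numbers, the average error at n = 1000 comes out several times smaller than at n = 30, in line with "the larger the sample size, the smaller the sampling error".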
The statistical meaning of bias is error. The sample must be error free to make
it an unbiased sample. In practice, it is impossible to achieve an error free
sample even using unbiased sampling methods. However, we can minimize the
error by employing appropriate sampling methods.
The various sampling methods can be classified into two categories. These are
random sampling methods and non-random sampling methods. Let us discuss
them in detail.
4.5.1 Random Sampling Methods
1. Simple Random Sampling: The best way to choose a simple random
sample is to use a random number table. A random sampling method should
meet the following criteria:
a) Every member of the population must have an equal chance of inclusion in the
sample.
b) The selection of one member is not affected by the selection of previous
members.
The random numbers are a collection of digits generated through a probabilistic
mechanism. The random numbers have the following properties:
i) The probability that each digit (0, 1, 2, 3, 4, 5, 6, 7, 8 or 9) will appear at
any place is the same, that is, 1/10.
ii) The occurrence of any two digits in any two places is independent of each
other.
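Property (i) can be checked empirically. This sketch draws 100,000 random digits (an arbitrary count, with a fixed seed for reproducibility) and shows that each digit appears with a relative frequency close to 1/10:

```python
import random
from collections import Counter

random.seed(1)  # arbitrary fixed seed for reproducibility
draws = 100_000
counts = Counter(random.randint(0, 9) for _ in range(draws))

# Each digit 0..9 should appear with relative frequency close to 1/10
for digit in range(10):
    freq = counts[digit] / draws
    print(digit, round(freq, 3))
```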
Each member of a population is assigned a unique number. The members of
the population chosen for the sample will be those whose numbers are identical
to the ones extracted from the random number table in succession until the
desired sample size is reached. An example of a random number table is given
below.
Table 1: Table of Random Numbers
1 2 3 4 5 6 7 8 9 10
1 96268 11860 83699 38631 90045 69696 48572 05917 51905 10052
2 03550 59144 59468 37984 77892 89766 86489 46619 50236 91136
3 22188 81205 99699 84260 19693 36701 43233 62719 53117 71153
4 63759 61429 14043 44095 84746 22018 19014 76781 61086 90216
5 55006 17765 15013 77707 54317 48862 53823 52905 70754 68212
6 81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
7 06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
8 92363 99784 94169 03652 80824 33407 40837 97749 18361 72666
9 96083 16943 89916 55159 62184 86206 09764 20244 88388 98675
10 92993 10747 08985 44999 35785 65036 05933 77378 92339 96151
11 95083 70292 50394 61947 65591 09774 16216 63561 59751 78771
12 77308 60721 96057 86031 83148 34970 30892 53489 44999 18021
13 11913 49624 28519 27311 61586 28576 43092 69971 44220 80410
14 70648 47484 05095 92335 55299 27161 64486 71307 85883 69610
15 92771 99203 37786 81142 44271 36433 31726 74879 89384 76886
16 78816 20975 13043 55921 82774 62745 48338 88348 61211 88074
17 79934 35392 56097 87613 94627 63622 08110 16611 88599 02890
18 64698 83376 87527 36897 17215 74339 69856 43622 22567 11518
19 44212 12995 03581 37618 94851 63020 65348 55857 91742 79508
20 89292 00204 00579 70630 37136 50922 83387 15014 51838 81760
21 08692 87237 87879 01629 72184 33853 95144 67943 19345 03469
22 67927 76855 50702 78555 97442 78809 40575 79714 06201 34576
23 62167 94213 52971 85794 68067 78814 40103 70759 92129 46716
24 45828 45441 74220 84157 23241 49332 23646 09390 13031 51569
25 01164 35307 26526 80335 58090 85871 07205 31749 40571 51755
26 29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
27 19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
28 14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
29 77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
30 36580 06921 35675 81645 60479 71035 99380 59759 42161 93440
31 07780 18093 31258 78156 07871 20369 53977 08534 39433 57216
32 07548 08454 36674 46255 80541 42903 37366 21164 97516 66181
33 22023 60448 69344 44260 90570 01632 21002 24413 04671 05665
34 20827 37210 57797 34660 32510 71558 78228 42304 77197 79168
35 47802 79270 48805 59480 88092 11441 96016 76091 51823 94442
36 76730 86591 18978 25479 77684 88439 34112 26052 57112 91653
37 26439 02903 20935 76297 15290 84688 74002 09467 41111 19194
38 32927 83426 07848 59372 44422 53372 27823 25417 27150 21750
39 51484 05286 77103 47284 00578 88774 15293 50740 07932 87633
40 45142 96804 92834 26886 70002 96643 36008 02239 93563 66429
To select a random sample using simple random sampling method we should
follow the steps given below:
v) Choose the direction in which you want to read the numbers (from left to
right, or right to left, or down or up).
vi) Select the first ‘n’ numbers whose last X digits lie between 0 and N−1.
If N = 100 (members numbered 00 to 99) then X would be 2; if N = 1000
then X would be 3, and so on.
viii) If you reach the end of the table before obtaining ‘n’ numbers, pick
another starting point, read in a different direction, use the first X digits
instead of the last X digits, and continue until the desired sample is
selected.
Example: Suppose you have a list of 80 students and want to select a sample
of 20 students using the simple random sampling method. First assign each
student a number from 00 to 79. To draw a sample of 20 students using the
random number table, you need to find 20 two-digit numbers in the range 00
to 79. You can begin anywhere and go in any direction. For example, start
from the 6th row and 1st column of the random number table given in this
Unit and read the last two digits of each number. If the number is within the
range (00 to 79), include it in the sample; otherwise skip it and read the next
number in the chosen direction. If a number has already been selected, omit it.
Starting from the 6th row and 1st column and moving from left to right, the
following numbers are considered in selecting the 20 numbers for the sample.
81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
92363 99784 94169 03652 80824 33407 40837 97749 18361 72666
The last two digits (the units and tens places) of each qualifying entry give
the selected numbers. Therefore, the following are the 20 numbers chosen as
the sample.
72 44 00 51 66 55 18 28
36 22 23 37 65 67 07 63
69 52 24 49
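The selection above can be reproduced programmatically. This sketch hard-codes rows 6 to 8 of Table 1 and applies the rules described in the example (read the last two digits, keep values in the range 00 to 79, skip repeats, stop at 20):

```python
# Rows 6 to 8 of Table 1 (hard-coded for the illustration)
rows = [
    "81972 45644 12600 01951 72166 52682 37598 11955 73018 23528",
    "06344 50136 33122 31794 86723 58037 36065 32190 31367 96007",
    "92363 99784 94169 03652 80824 33407 40837 97749 18361 72666",
]

N, n = 80, 20          # population size and desired sample size
sample = []
for row in rows:
    for entry in row.split():
        num = int(entry[-2:])              # last two digits of the entry
        if num < N and num not in sample:  # in range 00-79, not a repeat
            sample.append(num)
        if len(sample) == n:
            break
    if len(sample) == n:
        break

print(sample)  # same 20 numbers as listed in the text (00 printed as 0)
```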
Advantages
i) The simple random sample requires less knowledge about the characteristics of
the population.
ii) Since the sample is selected at random, giving each member of the population
an equal chance of being selected, it can be called an unbiased sample. Bias
due to human preferences and influences is eliminated.
iii) Assessment of the accuracy of the results is possible through sampling error
estimation.
iv) It is a simple and practical sampling method provided population size is not large.
Limitations
i) If the population size is large, a great deal of time must be spent listing and
numbering the members of the population.
ii) A simple random sample will not adequately represent many population
characteristics unless the sample is very large. That is, if the researcher is
interested in choosing a sample on the basis of the distribution in the population
of gender, age, social status, a simple random sample needs to be very large to
ensure all these distributions are representative of the population. To obtain a
representative sample across multiple population attributes we should use
stratified random sampling.
2. Systematic Sampling: In systematic sampling the sample units are selected
from the population at equal intervals in terms of time, space or order. The
selection of a sample using systematic sampling method is very simple. From a
population of ‘N’ units, a sample of ‘n’ units may be selected by following the
steps given below:
i) Arrange all the units in the population in an order by giving serial numbers
from 1 to N.
ii) Determine the sampling interval by dividing the population size by the sample
size. That is, K = N/n.
iii) Select the first sample unit at random from the first sampling interval (1 to
K).
iv) Select the subsequent sample units at equal regular intervals.
For example, we want to have a sample of 100 units from a population of 1000
units. First arrange the population units in some serial order by giving numbers
from 1 to 1000. The sample interval size is K=1000/100=10. Select the first
sample unit at random from the first 10 units ( i.e. from 1 to 10). Suppose the
first sample unit selected is 5, then the subsequent sample units are 15, 25,
35,.........995. Thus, in the systematic sampling the first sample unit is selected
at random and this sample unit in turn determines the subsequent sample units
that are to be selected.
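The steps above can be sketched in code (the function name is illustrative; the example reproduces the N = 1000, n = 100 case from the text with the first unit chosen as 5):

```python
import random

def systematic_sample(N, n, start=None):
    """Select n units from a population numbered 1..N at equal
    intervals K = N // n (systematic sampling sketch)."""
    K = N // n                        # step (ii): sampling interval
    if start is None:
        start = random.randint(1, K)  # step (iii): random start in 1..K
    return [start + i * K for i in range(n)]  # step (iv): every Kth unit

units = systematic_sample(1000, 100, start=5)
print(units[:4], "...", units[-1])  # [5, 15, 25, 35] ... 995
```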
Advantages
i) The main advantage of systematic sampling is that it is more expeditious: the
time taken and the work involved are less than in simple random sampling. For
example, it is frequently used in exit polls and in surveys of store customers.
ii) This method can be used even when no formal list of the population units is
available. For example, suppose if we are interested in knowing the opinion of
consumers on improving the services offered by a store we may simply choose
every Kth (here, every 10th) consumer visiting a store, provided that we know
how many consumers visit the store daily (say 1000 consumers visit and we
want a sample of 100 consumers, so K = 10).
Limitations
i) If there is periodicity in the occurrence of elements of a population, selecting a
sample by systematic sampling could give a highly un-representative sample.
For example, suppose the sales of a consumer store are arranged
chronologically and, using systematic sampling, we select the sample for the 1st
of every month. The 1st day of a month cannot be representative of the whole
month. Thus in systematic sampling there is a danger of order bias.
ii) Every unit of the population does not have an equal chance of being selected,
and the selection of units for the sample depends on the initial selection.
Regardless of how we select the first unit of the sample, the subsequent units
are automatically determined, so complete randomness is lacking.
3. Stratified Random Sampling: The stratified sampling method is used when
the population is heterogeneous rather than homogeneous. A heterogeneous
population is composed of unlike elements such as male/female, rural/urban,
literate/illiterate, high income/low income groups, etc. In such cases, use of
simple random sampling may not always provide a representative sample of the
population. In stratified sampling, we divide the population into relatively
homogeneous groups called strata. Then we select a sample using simple
random sampling from each stratum. There are two approaches to decide the
sample size from each stratum, namely, proportional stratified sample and
disproportional stratified sample. With either approach, the stratified sampling
guarantees that every unit in the population has a chance of being selected. We
will now discuss these two approaches of selecting samples.
i) Proportional Stratified Sample: If the number of sampling units drawn
from each stratum is in proportion to the corresponding stratum population size,
we say the sample is a proportional stratified sample. For example, let us say
we want to draw a stratified random sample from a heterogeneous population
(on some characteristics) consisting of rural/urban and male/female respondents.
We therefore have to create four homogeneous sub-groups, called strata:
urban male, urban female, rural male and rural female.
To ensure that each stratum in the sample represents the corresponding stratum
in the population, each stratum must appear in the sample in the same
proportion as it does in the population. Let us assume that we know (or can
estimate) the population distribution as follows: 65% male, 35% female; 30%
urban and 70% rural. Now we can determine the approximate proportions of
our four strata in the population as shown below.

                  Urban                                  Rural
        Male                Female             Male                Female
  0.30 × 0.65 = 0.195  0.30 × 0.35 = 0.105  0.70 × 0.65 = 0.455  0.70 × 0.35 = 0.245

                                                         Total: 1.000
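Under the assumed distribution above, the proportional allocation for a desired total sample can be computed as follows (the total sample size of 1,000 is our illustrative assumption, not a figure from the unit):

```python
# Proportional allocation: each stratum's sample size = n x stratum proportion.
# Proportions follow the unit's example: 65% male / 35% female, 30% urban / 70% rural.
n = 1000  # assumed total sample size (illustrative)

strata = {
    ("urban", "male"):   0.30 * 0.65,   # 0.195
    ("urban", "female"): 0.30 * 0.35,   # 0.105
    ("rural", "male"):   0.70 * 0.65,   # 0.455
    ("rural", "female"): 0.70 * 0.35,   # 0.245
}

allocation = {stratum: round(n * p) for stratum, p in strata.items()}
print(allocation)
# {('urban', 'male'): 195, ('urban', 'female'): 105,
#  ('rural', 'male'): 455, ('rural', 'female'): 245}
```

A simple random sample of the allocated size would then be drawn independently within each stratum.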
Advantages
a) Since samples are drawn from each of the strata of the population,
stratified sampling is more representative and thus more accurately reflects
the characteristics of the population from which it is chosen.
b) It is more precise and to a great extent avoids bias.
c) Since the sample size can be smaller in this method, it saves a lot of time, money and
other resources for data collection.
Limitations
a) Stratified sampling requires a detailed knowledge of the distribution of attributes
or characteristics of interest in the population to determine the homogeneous
groups that lie within it. If we cannot accurately identify the homogeneous
groups, it is better to use simple random sample since improper stratification can
lead to serious errors.
b) Preparing a stratified list is a difficult task as the lists may not be readily
available.
4. Cluster Sampling: In cluster sampling we divide the population into groups
having heterogeneous characteristics, called clusters, and then select a sample of
clusters using simple random sampling. We assume that each of the clusters is
representative of the population as a whole. This sampling is widely used for
geographical studies of many issues. For example, if we are interested in finding
the attitudes of consumers residing in Delhi towards a new product of a
company, the whole city of Delhi can be divided into 20 blocks. Assuming that
each of these blocks represents the attitudes of consumers of Delhi as a
whole, we might use cluster sampling, treating each block as a cluster. We would
then select a sample of 2 or 3 clusters and obtain the information from all the
consumers in them. The principles that are basic to cluster sampling are as
follows:
i) The differences or variability within a cluster should be as large as possible.
As far as possible the variability within each cluster should be the same as
that of the population.
ii) The variability between clusters should be as small as possible. Once the
clusters are selected, all the units in the selected clusters are covered for
obtaining data.
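The Delhi example above can be sketched in Python as follows (block and household identifiers are invented for illustration; every unit in a selected cluster is covered, as the principle states):

```python
import random

def cluster_sample(clusters, n_clusters):
    """Select whole clusters at random; survey every unit in each chosen cluster."""
    chosen = random.sample(list(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

# 20 hypothetical city blocks, each with 50 household IDs (invented data)
blocks = {f"block-{i}": [f"hh-{i}-{j}" for j in range(50)] for i in range(1, 21)}

sample = cluster_sample(blocks, 3)
print(len(sample))   # 3 clusters x 50 households = 150 units
```

Note that randomness enters only in choosing which clusters to survey, which is why travel costs fall but precision can suffer if clusters are internally homogeneous.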
Advantages
a) The cluster sampling provides significant gains in data collection costs, since
traveling costs are smaller.
b) Since the researcher need not cover all the clusters and only a sample of
clusters are covered, it becomes a more practical method which facilitates
fieldwork.
Limitations
a) The cluster sampling method is less precise than sampling of units from the
whole population since the latter is expected to provide a better cross-section of
the population than the former, due to the usual tendency of units in a cluster to
be homogeneous.
b) The sampling efficiency of cluster sampling is likely to decrease with the
decrease in cluster size or increase in number of clusters.
The above advantages or limitations of cluster sampling suggest that, in practical
situations where sampling efficiency is less important but the cost is of greater
significance, the cluster sampling method is extensively used. If the division of
clusters is based on the geographic sub-divisions, it is known as area sampling.
In cluster sampling, instead of covering all the units in each cluster, we can
resort to sub-sampling; this is known as two-stage sampling. Here, the clusters are
termed primary units and the units within the selected clusters are taken as
secondary units.
5. Multistage Sampling: We have already covered two-stage sampling. Multistage
sampling is a generalisation of two-stage sampling. As the name suggests,
multistage sampling is carried out in different stages. In each stage
progressively smaller (population) geographic areas will be randomly selected.
A political pollster interested in assembly elections in Uttar Pradesh may first
divide the state into different assembly units, and a sample of assembly
constituencies may be selected in the first stage. In the second stage, each of
the sampled assembly constituencies is divided into a number of segments, and a
second-stage sample of assembly segments may be selected. In the third stage,
within each sampled assembly segment, either all the households or a random
sample of households would be interviewed. In this sampling method, it is
possible to take as many stages as are necessary to achieve a representative
sample. Each stage results in a reduction of sample size.
In multistage sampling, a suitable method of sampling is used at each stage, and
as many stages are used as are needed to arrive at a sample of the desired
sampling units.
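The three-stage pollster example can be sketched as follows (the numbers of constituencies, segments and households are invented toy data, not figures from the unit):

```python
import random

def multistage_sample(state, n_const, n_seg, n_hh):
    """Three-stage sample: constituencies -> segments -> households."""
    sampled = []
    for const in random.sample(list(state), n_const):        # stage 1
        segments = state[const]
        for seg in random.sample(list(segments), n_seg):     # stage 2
            households = segments[seg]
            sampled += random.sample(households, n_hh)       # stage 3
    return sampled

# toy data: 10 constituencies x 5 segments x 100 households each
state = {c: {s: list(range(100)) for s in range(5)} for c in range(10)}

print(len(multistage_sample(state, n_const=4, n_seg=2, n_hh=10)))  # 4*2*10 = 80
```

Each stage here happens to use simple random sampling, but as the text notes, a different suitable method could be used at each stage.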
Advantages
a) Multistage sampling provides cost gains by reducing data collection costs.
b) Multistage sampling is more flexible and allows us to use different sampling
procedures in different stages of sampling.
c) If the population is spread over a very wide geographical area, multistage
sampling is the only sampling method available in a number of practical
situations.
Limitations
a) If the sampling units selected at the different stages are not representative,
multistage sampling becomes less precise and efficient.
4.5.2 Non-Random Sampling Methods
The non-random sampling methods are also often called non-probability sampling
methods. In a non-random sampling method the probability of any particular unit
of the population being chosen is unknown. Here the method of selection of
sampling units is quite arbitrary as the researchers rely heavily on personal
judgment. Non-random sampling methods usually do not produce samples that
are representative of the general population from which they are drawn. The
greatest error occurs when the researcher attempts to generalise the results on
the basis of a sample to the entire population. Such an error is insidious
because it is not at all obvious from merely looking at the data, or even from
looking at the sample. The easiest way to recognise whether a sample is
representative or not is to determine whether the sample is selected randomly
or not. Nevertheless, there are occasions where non-random samples are best
suited for the researcher’s purpose. The various non-random sampling methods
commonly used are:
1) Convenience Sampling;
2) Judgement Sampling; and
3) Quota Sampling.
Let us discuss these methods in detail.
1) Convenience Sampling: Convenience sampling refers to the method of
obtaining a sample that is most conveniently available to the researcher. For
example, if we are interested in finding the overtime wage paid to employees
working in call centres, it may be convenient and economical to sample
employees of call centres in a nearby area. Also, on various issues of public
interest like the budget, elections, price rise etc., the television channels often present
on-the-street interviews with people to reflect public opinion. It may be
cautioned that the generalisation of results based on convenience sampling
beyond that particular sample may not be appropriate. Convenience samples are
best used for exploratory research when additional research will be
subsequently conducted with a random sample. Convenience sampling is also
useful in testing the questionnaires designed on a pilot basis. Convenience
sampling is extensively used in marketing studies.
2) Judgement Sampling: Judgement sampling method is also known as purposive
sampling. In this method of sampling the selection of sample is based on the
researcher’s judgment about some appropriate characteristic required of the
sample units. For example, the calculation of consumer price index is based on a
judgment sample of a basket of consumer items, and other related commodities
and services which are expected to reflect a representative sample of items
consumed by the people. The prices of these items are collected from selected
cities which are viewed as typical cities with demographic profiles matching the
national profile. In business, judgment sampling is often used to measure the
performance of salesmen/saleswomen. The salesmen/saleswomen are grouped
into high, medium or low performers based on certain specified qualities, and
the sales manager classifies each salesman/saleswoman working under him/her
into the group in which, in his/her opinion, he/she falls. Judgment sampling
is also often used in forecasting election results. We may often wonder how a
pollster can predict an election based on only 2% to 3% of votes covered.
Needless to say, the method is biased and does not have any scientific basis.
However, in the absence of any representative data, one may resort to this kind
of non-random sampling.
3) Quota Sampling: The quota sampling method is commonly used in marketing
research studies. The samples are selected on the basis of some parameters
such as age, sex, geographical region, education, income, occupation etc., in
order to make them representative samples. The investigators, then, are
assigned fixed quotas of the sample meeting these population characteristics.
The purpose of quota sampling is to ensure that various sub-groups of the
population are represented on pertinent sample characteristics to the extent that
the investigator desires. The stratified random sampling also has this objective
but should not be confused with quota sampling. In the stratified sampling
method the researcher selects a random sample from each group of the
population, where as, in quota sampling, the interviewer has a quota fixed for
him/her to achieve. For example, if a city has 10 market centres, a soft drink
company may decide to interview 50 consumers from each of these 10 market
centres to elicit information on their products. It is entirely left to the
investigator whom he/she will interview at each of the market centres and the
time of interview. The interview may take place in the morning, mid day, or
evening or it may be in the winter or summer.
Quota sampling has the advantage that the sample conforms to the selected
characteristics of the population that the researcher desires. The cost and
time involved in collecting the data are also greatly reduced. However, quota
sampling has many limitations, as given below:
a) In quota sampling the respondents are selected according to the convenience of
the field investigator rather than on a random basis. This kind of selection of
sample may be biased. Suppose, in our example of soft drinks, it is found after
the sample is taken that most of the respondents belong to the lower income
group; then the purpose of conducting the survey is defeated and the results
may not reflect the actual situation.
b) If the number of parameters on which the quotas are fixed is large, then it
becomes difficult for the researcher to fix the quota for each sub-group.
c) The field workers have the tendency to cover the quota by going to those places
where the respondents may be willing to provide information and avoid those
with unwilling respondents. For example, the investigators may avoid places
where high income group respondents stay and cover only low income group
areas.
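A quota plan can be sketched as follows. The parameters, group names and proportions below are invented for illustration; the key point, as the text explains, is that the quotas are fixed but the choice of individual respondents within each quota is left to the investigator:

```python
# Hypothetical quota plan for 500 interviews, split by sex and age group.
# Proportions are assumed for illustration only.
total_interviews = 500

quota_params = {
    ("male", "18-35"):   0.30,
    ("male", "36+"):     0.25,
    ("female", "18-35"): 0.25,
    ("female", "36+"):   0.20,
}

quotas = {group: round(total_interviews * p) for group, p in quota_params.items()}
print(quotas)
# Whom to interview within each quota is the investigator's choice --
# which is precisely the source of the bias discussed above.
```

Contrast this with stratified sampling, where the units within each group are themselves chosen at random.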
Self Assessment Exercise B
1) Suppose there are 900 families residing in a colony. You are asked to select a
sample of families using simple random sampling for knowing the average
income. The families are identified with serial numbers 001 to 900.
i) Select a random sample using the following random table.
29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
36580 06921 35675 81645 60479 71035 99380 59759 42161 93440
ii) While selecting the random sample in the above example, what are the
random numbers you have rejected and why?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
4) State true or false.
a) A systematic sampling can be used even if all the units of the population
are not available.
b) A budget has been announced by the government. A TV journalist
recorded the views of the people residing near his house. The sampling
method that the TV journalist used is quota sampling.
....................................................................................................................
Once the researcher determines the desired degree of precision and confidence
level, there are several formulas he/she can use to determine the sample size,
depending on the plan of the study and the intended interpretation of results.
Here we will discuss three of them.
3) If the researcher plans to analyse the results in a variety of ways, or if he/she
has difficulty in estimating the proportion or standard deviation of the attribute of
interest, the following formula may be more useful.
n = (N × Z² × 0.25) / [d² × (N − 1) + Z² × 0.25]

With N = 1000, Z = 1.96 and d = 0.05:

n = (1000 × 1.96² × 0.25) / (0.05² × 999 + 1.96² × 0.25) = 277.7, or say 280
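The formula above can be checked numerically with a short sketch (the function name and defaults are ours; 0.25 is the worst-case value of p(1 − p) when the proportion p is unknown):

```python
def sample_size(N, z=1.96, d=0.05):
    """n = N*Z^2*0.25 / [d^2*(N-1) + Z^2*0.25], with p(1-p) at its maximum 0.25."""
    return (N * z**2 * 0.25) / (d**2 * (N - 1) + z**2 * 0.25)

print(round(sample_size(1000), 1))   # 277.7, rounded up to about 280 in the text
```

The same function reproduces the self-assessment answer further below: for N = 10000 at Z = 1.96 and d = 0.05 it gives about 370.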
The principal sources of sampling errors are the sampling method applied, and
the sample size. This is due to the fact that only a part of the population is
covered in the sample. The magnitude of the sampling error varies from one
sampling method to the other, even for the same sample size. For example, the
sampling error associated with simple random sampling will be greater than
stratified random sampling if the population is heterogeneous in nature.
Intuitively, we know that the larger the sample the more accurate the research.
In fact, the sampling error varies with samples of different sizes. Increasing the
sample size decreases the sampling error.
The following figure gives an approximate relationship between sample size and
sampling error. Study it carefully.
Fig. 4.1: Relationship between sample size and sampling error (the curve shows
sampling error falling from large to small as sample size grows from small to large).
In the above two sections we have identified the most significant sources of
errors. It is not possible to eliminate completely the sources of errors.
However, the researcher’s objective and effort should be to minimise these
sources of errors as much as possible. There are ways of reducing the errors.
Some of these are:
(a) designing and executing a good questionnaire; (b) selection of appropriate
sampling method; (c) adequate sample size; (d) employing trained investigators
to collect the data; and (e) care in editing, coding and entering the data into the
computer. You have already learned the above ways of controlling the errors
in Unit 3 and in this Unit.
Self Assessment Exercise C
1) The size of a population is 10000. You wish to have a 95% confidence
level and ±5% precision level. What is the sample size required?
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) As the sample size increases, the sampling error:
a) Increases b) Decreases c) Remains constant
....................................................................................................................
....................................................................................................................
There are two broad categories of sampling methods. These are: (a) random
sampling methods, and (b) non-random sampling methods. The random sampling
methods are based on the chance of including the units of population in a
sample.
Some of the sampling methods covered in this Unit are: (a) simple random
sampling, (b) systematic random sampling, (c) stratified random sampling,
(d) cluster sampling, and (e) multistage sampling. With an appropriate sampling
plan and selection of random sampling method the sampling error can be
minimised. The non-random sampling methods include: (a) convenience sampling,
(b) judgment sampling, and (c) quota sampling. These methods may be
convenient for the researcher to apply, but they may not provide a
representative sample of the population and there are no scientific ways to
check the sampling errors.
There are two major sources of errors in survey research. These are:
(a) sampling errors, and (b) non-sampling errors. The sampling errors arise
because the sample may not be representative of the
population. Two major sources of non-sampling errors are due to: (a) non-
response on the part of respondent and/or respondent’s bias in providing correct
information, and (b) administrative errors like design and implementation of
questionnaire, investigators’ bias, and data processing errors.
2) Cluster sampling
3) Stratified sampling
4) a) true
b) false, it is convenience sampling
C. 1) The required sample size is 370
2) Decreases
3) Sampling method applied
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
4.12 FURTHER READING
The following text books may be used for more indepth study on the topics
dealt with in this unit.
Gupta, C.B. and Vijay Gupta, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.
Kothari, C.R. (2004) Research Methodology: Methods and Techniques, New Age
International (P) Ltd., New Delhi.
Levin, R.I. and D.S. Rubin (1999) Statistics for Management, Prentice-Hall of
India, New Delhi.
Mustafi, C.K. (1981) Statistical Methods in Managerial Decisions, Macmillan,
New Delhi.
UNIT 5 MEASUREMENT AND SCALING TECHNIQUES
STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Measurement and Scaling
5.3 Issues in Attitude Measurement
5.4 Levels of Measurement Scales
5.5 Types of Scaling Techniques
5.5.1 Comparative Scales
5.5.2 Non-comparative Scales
5.6 Selection of an Appropriate Scaling Technique
5.7 Let Us Sum Up
5.8 Key Words
5.9 Answers to Self Assessment Exercises
5.10 Terminal Questions
5.11 Further Reading
5.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the concepts of measurement and scaling,
l discuss four levels of measurement scales,
l classify and discuss different scaling techniques, and
l select an appropriate attitude measurement scale for your research problem.
5.1 INTRODUCTION
As we discussed earlier, the data consists of quantitative variables like price,
income, sales etc., and qualitative variables like knowledge, performance,
character etc. The qualitative information must be converted into numerical
form for further analysis. This is possible through measurement and scaling
techniques. A common feature of survey based research is to have
respondent’s feelings, attitudes, opinions, etc. in some measurable form. For
example, a bank manager may be interested in knowing the opinion of the
customers about the services provided by the bank. Similarly, a fast food
company having a network in a city may be interested in assessing the quality
and service provided by them. As a researcher you may be interested in
knowing the attitude of the people towards the government announcement of a
metro rail in Delhi. In this unit we will discuss the issues related to
measurement, different levels of measurement scales, various types of scaling
techniques and also selection of an appropriate scaling technique.
In measuring attitudes, opinions or feelings, the researcher must consider three issues:
a) What is to be measured?
b) Who is to be measured?
c) The choices available in data collection techniques
The first issue that the researcher must consider is ‘what is to be measured’?
The definition of the problem, based on our judgments or prior research,
indicates the concept to be investigated. For example, we may be interested in
measuring the performance of a fast food company. We may require a precise
definition of the concept on how it will be measured. Also, there may be more
than one way that we can measure a particular concept. For example, in
measuring the performance of a fast food company we may use a number of
measures to indicate the performance of the company. We may use sales
volume in terms of value of sales or number of customers or spread of
network of the company as measures of performance. Further, the
measurement of concepts requires assigning numbers to the attitudes, feelings or
opinions. The key question here is that on what basis do we assign the
numbers to the concept. For example, if the task is to measure customers'
agreement with the opinion that the food served by a fast food company is
tasty, we create five categories: (1) strongly agree, (2) agree, (3) undecided,
(4) disagree, (5) strongly disagree. We then record each respondent's answer.
If a respondent states ‘disagree’ with the statement that ‘the food is tasty’,
the measurement is 4.
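Such a coding of responses into numbers can be represented as a simple lookup (a hypothetical mapping following the five categories just listed):

```python
# Hypothetical coding of agreement categories, as in the 'food is tasty' example.
scale = {
    "strongly agree": 1,
    "agree": 2,
    "undecided": 3,
    "disagree": 4,
    "strongly disagree": 5,
}

response = "disagree"
print(scale[response])   # 4
```

The numbers themselves carry no meaning beyond the coding rule; what analyses they support depends on the level of measurement, discussed next.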
The third issue in measurement is the choice of the data collection techniques.
In Unit 3, you have already learnt various methods of data collection. Normally,
questionnaires are used for measuring attitudes, opinions or feelings.
a) Nominal Scale is the crudest among all measurement scales but it is also the
simplest scale. In this scale the different scores on a measurement simply
indicate different categories. The nominal scale does not express any values or
relationships between variables. For example, labelling men as ‘1’ and women
as ‘2’ which is the most common way of labelling gender for data recording
purpose does not mean women are ‘twice something or other’ than men. Nor
does it suggest that men are somehow ‘better’ than women. Another example of
a nominal scale is to classify the respondents' income into three groups: the
highest income as group 1, the middle income as group 2, and the low income
as group 3. The nominal scale is often referred to as a categorical scale. The
assigned numbers have no arithmetic properties and act only as labels. The only
statistical operation that can be performed on nominal scales is a frequency
count; no average other than the mode can be determined.
In designing and developing a questionnaire, it is important that the response
categories must include all possible responses. In order to have an exhaustive
number of responses, you might have to include a category such as ‘others’,
‘uncertain’, ‘don’t know’, or ‘can’t remember’ so that the respondents will not
distort their information by forcing their responses in one of the categories
provided. Also, you should be careful and be sure that the categories provided
are mutually exclusive so that they do not overlap or get duplicated in any way.
b) Ordinal Scale involves the ranking of items along the continuum of the
characteristic being scaled. In this scale, the items are classified according to
whether they have more or less of a characteristic. For example, you may wish
to ask the TV viewers to rank the TV channels according to their preference,
and the responses may look like the ranking given below:
The main characteristic of the ordinal scale is that the categories have a logical
or ordered relationship. This type of scale permits the measurement of degrees
of difference, (that is, ‘more’ or ‘less’) but not the specific amount of
differences (that is, how much ‘more’ or ‘less’). This scale is very common
in marketing, satisfaction and attitudinal research.
Another example is that a fast food home delivery shop may wish to ask its
customers:
How would you rate the service of our staff?
(1) Excellent • (2) Very Good • (3) Good • (4) Poor • (5) Worst •
c) Interval Scale is a scale in which the numbers are used to rank attributes such
that numerically equal distances on the scale represent equal distances in the
characteristic being measured. An interval scale contains all the information of
an ordinal scale, but it also allows one to compare the difference/distance
between attributes. For example, the difference between ‘1’ and ‘2’ is equal to
the difference between ‘3’ and ‘4’, and the difference between ‘2’ and ‘4’
is twice the difference between ‘1’ and ‘2’. However, in an interval scale the
zero point is arbitrary and is not a true zero. This, of course, has implications for
the type of data manipulation and analysis we can carry out on data collected in
this form.
this form. It is possible to add or subtract a constant to all of the scale values
without affecting the form of the scale but one cannot multiply or divide the
values. Measuring temperature is an example of an interval scale. We cannot say
40°C is twice as hot as 20°C. The reason for this is that 0°C does not mean that
there is no temperature, but is a relative point on the Centigrade scale. Due to
the lack of an absolute zero point, the interval scale does not allow the conclusion
that 40°C is twice as hot as 20°C.
Interval scales may be either in numeric or semantic formats. The following are
two more examples of interval scales, one in numeric format and another in
semantic format.
i) Example of Interval Scale in Numeric Format

Indicate your score on the concerned blank and circle the appropriate number
on each line.

Food supplied is:
  Fresh                  1 2 3 4 5
  Tastes good            1 2 3 4 5
  Value for money        1 2 3 4 5
  Attractive packaging   1 2 3 4 5
  Prompt time delivery   1 2 3 4 5
The interval scales allow the calculation of averages like the mean, median and
mode, and measures of dispersion like the range and standard deviation.
d) Ratio Scale is the highest level of measurement scales. This has the properties
of an interval scale together with a fixed (absolute) zero point. The absolute zero
point allows us to construct a meaningful ratio. Examples of ratio scales include
weights, lengths and times. In marketing research, most counts are ratio
scales. For example, the number of customers using a bank’s ATM in the last
three months is a ratio scale, because it can be compared directly with the count
for the previous three months. Ratio scales permit the researcher to compare both
differences in scores and relative magnitude of scores. For example, the
difference between 10 and 15 minutes is the same as the difference between 25
and 30 minutes and 30 minutes is twice as long as 15 minutes. Most financial
research that deals with rupee values utilizes ratio scales. However, for most
behavioural research, interval scales are typically the highest form of
measurement. Most statistical data analysis procedures do not distinguish
between the interval and ratio properties of the measurement scales and it is
sufficient to say that all the statistical operations that can be performed on
interval scale can also be performed on ratio scales.
Now you must be wondering why you should know the level of measurement.
Knowing the level of measurement helps you to decide on how to interpret the
data. For example, when you know that a measure is nominal then you know
that the numerical values are just short codes for longer textual names. Also,
knowing the level of measurement helps you to decide what statistical analysis is
appropriate on the values that were assigned. For example, if you know that a
measure is nominal, then you would not need to find mean of the data values or
perform a t-test on the data. (t-test will be discussed in Unit-16 in the course).
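As a rough illustration of which summary statistics each level permits (the data values and codings below are invented; the point is that the mode suits nominal data, the median suits ordinal data, and the mean becomes meaningful only from interval data upward):

```python
from statistics import mean, median, mode

gender = [1, 2, 1, 1, 2]     # nominal: 1 = male, 2 = female (codes are labels only)
ranks = [1, 3, 2, 2, 1]      # ordinal: preference ranks
temps_c = [20, 40, 25, 35]   # interval: temperatures in degrees Celsius

print(mode(gender))    # 1   (most frequent category -- the only valid 'average')
print(median(ranks))   # 2   (middle rank; degrees of difference, not amounts)
print(mean(temps_c))   # 30  (differences are meaningful, ratios are not)
```

Software will happily compute a mean of the gender codes too; it is the researcher's knowledge of the measurement level that rules it out.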
It is important to recognise that there is a hierarchy implied in the levels of
measurement. At lower levels of measurement, assumptions tend to be less
restrictive and data analyses tend to be less sensitive. At each level up the
hierarchy, the current level includes all the qualities of the one below it and adds
something new. In general, it is desirable to have a higher level of measurement
(that is, interval or ratio) rather than a lower one (that is, nominal or ordinal).
1) The main difference between interval scale and the ratio scale in terms of their
properties is:
...................................................................................................................
....................................................................................................................
....................................................................................................................
2) Why should the researcher know the level of measurement?
....................................................................................................................
....................................................................................................................
...................................................................................................................
Figure 5.1: Scaling Techniques
(The figure classifies scaling techniques into comparative and non-comparative
scales.)
The comparative scales can further be divided into the following four types of
scaling techniques: (a) Paired Comparison Scale, (b) Rank Order Scale, (c)
Constant Sum Scale, and (d) Q-sort Scale.
The following table gives paired comparison of data (assumed) for four brands
of cold drinks.
Table 5.2
Paired comparison is useful when the number of brands is limited, since it
requires direct comparison and overt choice. One of the disadvantages of the
paired comparison scale is that a violation of the assumption of transitivity may
occur. For example, in our example (Table 5.1) the respondent preferred Coke 2
times, Pepsi 3 times, Sprite 1 time, and Limca 0 times. That means, preference-wise,
Pepsi > Coke, Coke > Sprite, and Sprite > Limca. Transitivity requires that if
A > B and B > C, then A > C; a violation occurs when the pairwise choices
instead imply C > A. Also, the order in which the objects
are presented may bias the results. The number of items/brands for comparison
should not be too many. As the number of items increases, the number of
comparisons increases geometrically. If the number of comparisons is too large,
the respondents may become fatigued and no longer be able to carefully
discriminate among them. The other limitation of paired comparison is that this
scale has little resemblance to the market situation, which involves selection
from multiple alternatives. Also, respondents may prefer one item over certain
others, but they may not like it in an absolute sense.
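The geometric growth in comparisons follows from the combination formula: n brands require n(n − 1)/2 pairs. The following is a minimal Python sketch of this count and of the transitivity idea; the brand names and win counts are assumed from the example above.

```python
from itertools import combinations

def num_pairs(n):
    """Number of paired comparisons needed for n items: nC2 = n(n - 1) / 2."""
    return n * (n - 1) // 2

# Hypothetical win counts mirroring the cold-drink example in the text:
# the number of times each brand was preferred across all pairs.
wins = {"Pepsi": 3, "Coke": 2, "Sprite": 1, "Limca": 0}

pairs = list(combinations(wins, 2))  # the pairs actually shown to a respondent
print(len(pairs), num_pairs(7))      # 6 21 -- 4 brands need 6 pairs; 7 need 21

# A transitive (consistent) respondent produces win counts that strictly
# order the brands, as in the example above.
ranking = sorted(wins, key=wins.get, reverse=True)
print(ranking)                       # ['Pepsi', 'Coke', 'Sprite', 'Limca']
```

Note how quickly the pair count grows: 10 brands would already require 45 comparisons, which illustrates the fatigue problem mentioned above.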
b) Rank Order Scale:

Table 5.3: Preference of cold drink brands using rank order scaling
Instructions: Rank the following brands of cold drinks in order of preference. Begin by picking out the one brand you like most and assign it the number 1. Then find the second most preferred brand and assign it the number 2. Continue this procedure until you have ranked all the brands of cold drinks in order of preference. The least preferred brand should be assigned a rank of 4. Remember that no two brands should receive the same rank.
Format:
Brand Rank
(a) Coke 3
(b) Pepsi 1
(c) Limca 2
(d) Sprite 4
Like paired comparison, the rank order scale is also comparative in nature. The resultant data are ordinal. This method is more realistic in obtaining responses and yields better results when direct comparisons are required between the given objects. The major disadvantage of this technique is that only ordinal data can be generated.
c) Constant Sum Scale: In this scale, the respondents are asked to allocate a
constant sum of units such as points, rupees, or chips among a set of stimulus
objects with respect to some criterion. For example, you may wish to determine
how important the attributes of price, fragrance, packaging, cleaning power, and
lather of a detergent are to consumers. Respondents might be asked to divide a
constant sum to indicate the relative importance of the attributes using the
following format.
Table 5.4: Importance of detergent attributes using a constant sum scale
If an attribute is assigned a higher number of points, it indicates that the attribute is more important. From the above table, the price of the detergent is the most important attribute for the consumers, followed by cleaning power and packaging. Fragrance and lather are the two attributes that the consumers cared about least, but preferred equally. The advantage of this technique is that it saves time. However, there are two main disadvantages. First, the respondents may allocate more or fewer points than the number specified. Second, there is a rounding-off error if too few attributes are used, while the use of a large number of attributes may be too taxing on the respondent and cause confusion and fatigue.
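The scoring logic described above can be sketched in a few lines of Python. The attribute names and point allocations below are hypothetical, chosen only to illustrate validation of the constant sum and averaging across respondents.

```python
# A minimal sketch (hypothetical data): checking constant-sum allocations
# of 100 points over detergent attributes and averaging over respondents.
CONSTANT = 100
responses = [
    {"price": 40, "cleaning power": 30, "packaging": 15, "fragrance": 10, "lather": 5},
    {"price": 50, "cleaning power": 25, "packaging": 10, "fragrance": 10, "lather": 5},
    {"price": 45, "cleaning power": 30, "packaging": 10, "fragrance": 5, "lather": 10},
]

# First disadvantage noted above: respondents may allocate more or fewer
# points than specified, so invalid forms are filtered out.
valid = [r for r in responses if sum(r.values()) == CONSTANT]

# Average importance per attribute across the valid respondents.
avg = {a: sum(r[a] for r in valid) / len(valid) for a in valid[0]}
most_important = max(avg, key=avg.get)
print(most_important)  # price
```

Because the points sum to a fixed constant, the averages are directly comparable across attributes, which is what makes the scale's output interval-like.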
d) Q-Sort Scale: This is a comparative scale that uses a rank order procedure to sort objects on the basis of similarity with respect to some criterion. The important characteristic of this methodology is that comparisons among the different responses of one respondent matter more than comparisons between different respondents. Therefore, it is a comparative method of scaling rather than an absolute rating scale. In this method the respondent is given a large number of statements describing the characteristics of a product, or a large number of brands of a product. For example, you may wish to determine preferences among a large number of magazines. The format shown in Table 5.5 may be given to a respondent to obtain the preferences.
Table 5.5: Preference of Magazines Using Q-Sort Scale Procedure
Question: How would you rate the TV advertisement as a guide for buying?
Scale Type A:
Strongly agree ———————————————— Strongly disagree

Scale Type B:
Strongly disagree ———————————————— Strongly agree

Scale Type C:
Strongly agree 10 9 8 7 6 5 4 3 2 1 0 Strongly disagree

Scale Type D:
Strongly disagree 0 1 2 3 4 5 6 7 8 9 10 Strongly agree
When scale types A and B are used, the respondent's score is determined either by dividing the line into as many categories as desired and assigning the respondent a score based on the category into which his/her mark falls, or by measuring the distance, in millimetres, centimetres, or inches, from either end of the scale. Whichever of the above continuous scales is used, the results are normally analysed as interval scaled.
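Both scoring procedures can be sketched in code. The 100 mm line length, the number of categories, and the orientation (disagreement at the left end, as in Scale Type B) are all assumptions for illustration.

```python
# Sketch (assumed setup): converting a respondent's mark on a 100 mm
# continuous line into a score, either by raw distance from one end
# or by dividing the line into equal-width categories.
LINE_MM = 100

def distance_score(mark_mm):
    """Score = distance in mm from the left ('Strongly disagree') end."""
    return mark_mm

def category_score(mark_mm, n_categories=10):
    """Assign the mark to one of n equal-width categories, numbered 1..n."""
    width = LINE_MM / n_categories
    return min(int(mark_mm // width) + 1, n_categories)

print(distance_score(73))   # 73
print(category_score(73))   # 8 -- the mark falls in the 8th of 10 bins
```

Either way the resulting numbers are treated as interval data, which is why continuous rating scales support means and related statistics.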
The itemised rating scales can be in the form of: (a) graphic, (b) verbal, or (c) numeric, as shown below:
Some rating scales may have only two response categories, such as agree and disagree. Inclusion of more response categories provides the respondent more flexibility in the rating task. Consider the following questions:
1. How often do you visit the supermarket located in your area of residence?
2. In your case, how important is the price of brand X shoes when you buy them?
Each of the above category scales is a more sensitive measure than a scale with only two responses, since it provides more information.
Table 5.6: Some common words for categories used in itemised rating scales

Quality:            Excellent / Good / Not decided / Poor / Worst
                    Very good / Good / Neither good nor bad / Fair / Poor
Importance:         Very important / Fairly important / Neutral / Not so important / Not at all important
Interest:           Very interested / Somewhat interested / Neither interested nor disinterested / Somewhat uninterested / Not very interested
Satisfaction:       Completely satisfied / Somewhat satisfied / Neither satisfied nor dissatisfied / Somewhat dissatisfied / Completely dissatisfied
Frequency:          All of the time / Very often / Often / Sometimes / Hardly ever
                    Very often / Often / Sometimes / Rarely / Never
Truth:              Very true / Somewhat true / Not very true / Not at all true
Purchase Interest:  Definitely will buy / Probably will buy / Probably will not buy / Definitely will not buy
Level of Agreement: Strongly agree / Somewhat agree / Neither agree nor disagree / Somewhat disagree / Strongly disagree
Dependability:      Completely dependable / Somewhat dependable / Not very dependable / Not at all dependable
Style:              Very stylish / Somewhat stylish / Not very stylish / Completely unstylish
Cost:               Extremely expensive / Expensive / Neither expensive nor inexpensive / Slightly inexpensive / Very inexpensive
Ease of use:        Very easy to use / Somewhat easy to use / Not very easy to use / Difficult to use
Modernity:          Very modern / Somewhat modern / Neither modern nor old-fashioned / Somewhat old-fashioned / Very old-fashioned
Alert:              Very alert / Alert / Not alert / Not at all alert
In this section we will discuss three itemised rating scales, namely: (a) Likert Scale, (b) Semantic Differential Scale, and (c) Stapel Scale.

b) Semantic Differential Scale: The following are typical bipolar adjective pairs used in this scale:
Modern — — — — — — — Old-fashioned
Good — — — — — — — Bad
Clean — — — — — — — Dirty
Important — — — — — — — Unimportant
Expensive — — — — — — — Inexpensive
Useful — — — — — — — Useless
Strong — — — — — — — Weak
Quick — — — — — — — Slow
In the Semantic Differential scale only the extremes have names. The extreme points represent the bipolar adjectives, with the central category representing the neutral position. The in-between categories have blank spaces. A weight is assigned to each position on the scale; the weights can be, for example, +3, +2, +1, 0, –1, –2, –3 or 7, 6, 5, 4, 3, 2, 1. The following is an example of a Semantic Differential scale to study the experience of using a particular brand of body lotion.
In my experience, the use of body lotion of Brand-X was:
+3 +2 +1 0 –1 –2 –3
Useful Useless
Attractive Unattractive
Passive Active
Beneficial Harmful
Interesting Boring
Dull Sharp
Pleasant Unpleasant
Cold Hot
Good Bad
Likable Unlikable
In the Semantic Differential scale, the phrases used to describe the object form a basis for attitude measurement in the form of positive and negative phrases. The negative phrase is sometimes put on the left side of the scale and sometimes on the right side. This is done to prevent a respondent with a positive attitude from simply checking the left side, and a respondent with a negative attitude from simply checking the right side, without reading the descriptions.
The respondents are asked to check the individual cells depending on the
attitude. Then one could arrive at the average scores for comparisons of
different objects. The following Figure shows the experiences of 100
consumers on 3 brands of body lotion.
+3 +2 +1 0 –1 –2 –3
Useful Useless
Attractive Unattractive
Passive Active
Beneficial Harmful
Interesting Boring
Dull Sharp
Pleasant Unpleasant
Cold Hot
Good Bad
Likable Unlikable
Brand-X Brand-Y Brand-Z
In the above example, first the individual respondents' scores for each dimension are obtained, and then the average scores of all 100 respondents for each dimension and for each brand are plotted graphically. The maximum score possible for each brand is +30 and the minimum is –30. Brand-X has a score of +14, Brand-Y a score of +7, and Brand-Z a score of –11. From the scale we can identify which attribute needs improvement for each brand. For example, Brand-X needs to improve on benefits, and Brand-Y on pleasantness, coldness and likeability. Brand-Z needs to improve on all the attributes.
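The scoring described above can be sketched as follows. The individual ratings and the three respondent totals are assumed values, chosen so that the average works out to +14, matching the Brand-X figure in the text.

```python
# Sketch (assumed ratings): scoring a semantic differential profile.
# Ten bipolar items rated +3 .. -3, so each respondent's total for a
# brand ranges from -30 to +30, as stated in the text.
items = ["Useful", "Attractive", "Passive", "Beneficial", "Interesting",
         "Dull", "Pleasant", "Cold", "Good", "Likable"]

# One hypothetical respondent's ratings for Brand-X (already oriented so
# that +3 is the favourable pole; reversed items must be flipped first).
ratings = {"Useful": 3, "Attractive": 2, "Passive": 1, "Beneficial": -1,
           "Interesting": 2, "Dull": 2, "Pleasant": 3, "Cold": 1,
           "Good": 2, "Likable": 2}

total = sum(ratings[i] for i in items)
print(total)  # 17

# The brand's overall score is the mean of such totals over all respondents.
totals = [17, 12, 13]  # assumed totals from three respondents
print(sum(totals) / len(totals))  # 14.0 -- cf. Brand-X's score of +14
```

The per-dimension averages plotted in the figure are computed the same way, except that the mean is taken item by item rather than over the whole profile.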
c) Stapel Scale: The Stapel scale was originally developed to measure the direction and intensity of an attitude simultaneously. Modern versions of the Stapel scale place a single adjective as a substitute for the Semantic Differential when it is difficult to create pairs of bipolar adjectives. The modified Stapel scale places a single adjective in the centre of an even number of numerical values (say, +3, +2, +1, –1, –2, –3). This scale measures how close to or how distant from the adjective a given stimulus is perceived to be. The following is an example of a Stapel scale.
Instructions: Select a plus number for words that you think describe
personnel banking of a bank accurately. The more accurately you think the
word describes the bank, the larger the plus number you should choose.
Select a minus number for words you think do not describe the bank
accurately. The less accurately you think the word describes the bank, the
larger the minus number you should choose.
Format:

Friendly Personnel:     +5 +4 +3 +2 +1 –1 –2 –3 –4 –5
Competitive Loan Rates: +5 +4 +3 +2 +1 –1 –2 –3 –4 –5
+4 +3 +2 +1 –1 –2 –3 –4
Fast Service
Friendly
Honest
Convenient Location
Convenient Hours
Dull
Good Services
High Saving Rates
Each respondent is asked to circle his/her opinion on a score against each phrase that describes the object. The final score of a respondent on the scale is the sum of his/her ratings for all the items. The average score for each phrase is obtained by totalling the scores of all the respondents for that phrase and dividing by the number of respondents. The following figure shows the opinions of 100 respondents on two banks.
+4 +3 +2 +1 -1 -2 -3 -4
Fast Service
Friendly
Honest
Convenient Location
Convenient Hours
Dull
Good Services
High Saving Rates
Bank-X Bank-Y
In the above example, first the individual respondent's scores for each phrase that describes the selected bank are obtained, and then the average scores of all 100 respondents for each phrase are plotted graphically. The maximum score possible for each bank is +32 and the minimum is –32. In the example, Bank-X has a score of +24 and Bank-Y a score of +3. From the scale we can identify which phrase needs improvement for each bank.
The advantages and disadvantages of the Stapel scale are very similar to those of the Semantic Differential scale. However, the Stapel scale tends to be easier to construct and administer, especially over the telephone, since it does not call for bipolar adjectives as the Semantic Differential scale does. Research comparing the Stapel scale with the Semantic Differential scale suggests that the results of the two scales are largely the same.
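The Stapel scoring works the same way as the Semantic Differential sum, except that each rating is a non-zero value on an even-numbered scale. The ratings below are assumed, chosen so the total matches the Bank-X figure of +24 in the text.

```python
# Sketch (assumed ratings): scoring the modified Stapel scale shown above.
# Eight phrases, each rated +4 .. -4 with no zero point, so a respondent's
# total per bank ranges from -32 to +32.
bank_x = {"Fast Service": 4, "Friendly": 4, "Honest": 3,
          "Convenient Location": 3, "Convenient Hours": 4,
          "Dull": -2, "Good Services": 4, "High Saving Rates": 4}

# Every rating must be a non-zero value in [-4, +4].
assert all(v != 0 and -4 <= v <= 4 for v in bank_x.values())

total = sum(bank_x.values())
print(total)  # 24 -- cf. Bank-X's score of +24 in the text
```

Averaging such totals (or the per-phrase scores) over all 100 respondents gives the profile plotted in the figure above.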
This is a non-comparative scale since it deals with a single concept (the brand of a detergent). On the other hand, a comparative scale asks a respondent to rate one concept in comparison with another. For example, you may ask:
Which one of the following brands of detergent do you prefer?
Brand-X    Brand-Y
In this example you are comparing one brand of detergent with another. Therefore, in many situations, comparative scaling presents 'the ideal situation' as a reference for comparison with the actual situation.
1) In paired comparison, the order in which the objects are presented may
____________ results.
2) A researcher wants to measure consumer preference among 7 brands of bath soap and has decided to use the paired comparison scaling technique. How many pairs of brands will the researcher present to the respondents? ________________
3) In a semantic differential scale there are 20 scale items. Should all the positive adjectives be on the left side and all the negative adjectives on the right side? Explain.
................................................................................................................
................................................................................................................
................................................................................................................
5.7 LET US SUM UP
2) 21
3) No. Some of the positive adjectives may be placed on the left side and
some on the right side. This prevents the respondent with positive
(negative) attitude from simply checking the left (right) side without
reading the description of the words.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 6 PROCESSING OF DATA
STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Editing of Data
6.3 Coding of Data
6.4 Classification of Data
6.4.1 Types of Classification
6.4.1.1 Classification According to External Characteristics
6.4.1.2 Classification According to Internal Characteristics
6.4.1.3 Preparation of Frequency Distribution
6.5 Tabulation of Data
6.5.1 Types of Tables
6.5.2 Parts of a Statistical Table
6.5.3 Requisites of a Good Statistical Table
6.6 Let Us Sum Up
6.7 Key Words
6.8 Answers to Self Assessment Exercises
6.9 Terminal Questions/Exercises
6.10 Further Reading
6.0 OBJECTIVES
After studying this unit, you should be able to:
l evaluate the steps involved in processing of data,
l check for obvious mistakes in data and improve the quality of data,
l describe various approaches to classify data,
l construct frequency distribution of discrete and continuous data, and
l develop an appropriate data tabulation device.
6.1 INTRODUCTION
In Unit 3 we have discussed various methods of collection of data. Once the
collection of data is over, the next step is to organize data so that meaningful
conclusions may be drawn. The information content of the observations has to
be reduced to a relatively few concepts and aggregates. The data collected
from the field has to be processed as laid down in the research plan. This is
possible only through systematic processing of data. Data processing involves
editing, coding, classification and tabulation of the data collected so that they
are amenable to analysis. This is an intermediary stage between the collection
of data and their analysis and interpretation. In this unit, therefore, we will learn
about different stages of processing of data in detail.
1) The editor should have a copy of the instructions given to the interviewers.
2) The editor should not destroy or erase the original entry. The original entry should be crossed out in such a manner that it is still legible.
3) All answers which are modified or filled in afresh by the editor have to be indicated.
4) All completed schedules should have the signature of the editor and the date.
For checking the quality of data collected, it is advisable to take a small sample of the questionnaires and examine them thoroughly. This helps in understanding the following types of problems: (1) whether all the questions are answered, (2) whether the answers are properly recorded, (3) whether there is any bias, (4) whether there is any interviewer dishonesty, and (5) whether there are inconsistencies. At times, it may be worthwhile to group the questionnaires according to the investigators (to see whether any particular investigator has specific problems), according to geographical regions (whether any particular region has specific problems), or according to the sex or background of the investigators, and corrective action may be taken if any problem is observed.
A careful study of the answers is the starting point of coding. Next, a coding frame is to be developed by listing the answers and assigning codes to them. A coding manual is to be prepared with the details of variable names, codes and instructions. Normally, the coding manual should be prepared before the collection of data, except for open-ended and partially coded questions; these two categories are to be taken care of after the data collection. The following are the broad general rules for coding:
2) Each qualitative question should have codes. Quantitative variables may or may
not be coded depending on the purpose. Monthly income should not be coded if
one of the objectives is to compute average monthly income. But if it is used as
a classificatory variable it may be coded to indicate poor, middle or upper
income group.
3) All responses, including “don’t know”, “no opinion”, “no response”, etc., are to be coded.
Sometimes it is not possible to anticipate all the responses and some questions
are not coded before collection of data. Responses of all the questions are to
be studied carefully and codes are to be decided by examining the essence of
the answers. In partially coded questions, usually there is an option “Any Other
(specify)”. Depending on the purpose, responses to this question may be
examined and additional codes may be assigned.
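A coding frame of the kind described above can be sketched as a simple lookup table. The variables and the numeric codes below are assumed for illustration; an actual coding manual would fix its own codes.

```python
# Sketch: a coding frame (codes assumed for illustration) that assigns
# numeric codes to responses, including "don't know" and "no response",
# as the coding rules above require.
opinion_codes = {"agree": 1, "disagree": 2, "don't know": 8, "no response": 9}
income_codes = {"poor": 1, "middle": 2, "upper": 3}  # classificatory use only

def code(response, frame, missing=9):
    """Look up the code for a cleaned response; unmatched answers fall
    back to the 'no response' code so that every answer gets a code."""
    return frame.get(response.strip().lower(), missing)

print(code("Agree", opinion_codes))       # 1
print(code("Don't know", opinion_codes))  # 8
print(code("", opinion_codes))            # 9
```

For an "Any Other (specify)" option, the frame would simply be extended with additional codes after the free-text answers have been examined, as described above.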
Population

Employed                 Unemployed
0      12      1,000-2,000      6
1      25      2,000-3,000     10
2      20      3,000-4,000     15
3       7      4,000-5,000     25
4       3      5,000-6,000      9
5       1      6,000-7,000      4
Total  68      Total           69
3 2 2 1 4 1 0 1 1 2 4 1 3 3 2 1 3 4 3 2 0 1 3 4 3
1 4 3 2 2 1 3 1 2 3 2 3 4 4 2 4 3 4 2 3 3 2 0 4 3
To obtain a discrete frequency table, we may take the help of 'tally' marks, as indicated below.

Value    Frequency
0             5
1             8
2            12
3            15
4            10
Total        50
From the above frequency table it is clear that more than half the students (27
out of 50) go to the theatre twice or thrice a week and very few do not go
even once a week. These were not so obvious from the raw data.
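The tally-mark procedure is exactly what a frequency counter does in code. The 20 observations below are assumed for illustration, not the 50 values listed above.

```python
from collections import Counter

# Sketch: the 'tally mark' procedure in code -- a discrete frequency
# table built with collections.Counter from assumed weekly-visit data.
visits = [3, 2, 2, 1, 4, 1, 0, 1, 1, 2, 4, 1, 3, 3, 2, 1, 3, 4, 3, 2]

freq = Counter(visits)
for value in sorted(freq):        # print the table in ascending order
    print(value, freq[value])
print("Total", sum(freq.values()))  # Total 20
```

As in the illustration, the table makes patterns visible that are not obvious from the raw list of observations.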
Illustration 2
1) The highest and the lowest values of the observations are to be identified, and the lower limit of the first class and the upper limit of the last class may be decided.
2) The number of classes is to be decided. There is no hard and fast rule: there should not be too few classes (fewer than 5, say), to avoid high information loss, and not too many (more than 12, say), so that the table does not become unmanageable.
3) The lower and the upper limits should be convenient numerals, like 0-5, 0-10, 100-200, etc.
4) The class intervals should also be numerically convenient, like 5, 10, 20, etc.; values like 3, 9, 17, etc. should be avoided.
5) As far as possible, the class width should be uniform, for ease in subsequent calculation.
It is often quite useful to present the frequency distribution in two different ways. One way is the relative or percentage relative frequency distribution. Relative frequencies are computed by dividing the frequency of each class by the sum of the frequencies; multiplying the relative frequencies by 100 gives percentages. Another way is the cumulative frequency distribution, in which frequencies are cumulated to give the total of all previous frequencies including the present class. Cumulation may be done either from the lowest class (from below) or from the highest class (from above). The following table illustrates this concept.
Illustration 3
Column (5) in the above table gives the cumulative frequency of a particular class, which is obtained as discussed earlier. The cumulative frequency of the second class is obtained by adding its class frequency (23) to the previous class frequency (2). The cumulative frequency of the next class is obtained by adding its class frequency (19) to the cumulative frequency of the previous class (25). Cumulative frequencies may be interpreted as the number of observations below the upper class limit of a class. For example, a cumulative frequency of 44 in the third class (25-30) indicates that 44 labourers received a daily wage of less than Rs. 30. Cumulation from the highest class may also be done, as shown in column (6); it has a similar interpretation.
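The relative and cumulative computations can be sketched as follows. The first three class frequencies (2, 23, 19) follow the worked example above; the class labels and the last frequency are assumed.

```python
# Sketch: relative and cumulative frequencies for a grouped distribution.
classes = ["15-20", "20-25", "25-30", "30-35"]
freq = [2, 23, 19, 6]

total = sum(freq)
relative = [f / total for f in freq]   # divide each frequency by the total
percent = [100 * r for r in relative]  # multiply by 100 for percentages

# Cumulative frequency from below: running total up to and including
# each class, exactly as described in the text.
cumulative = []
running = 0
for f in freq:
    running += f
    cumulative.append(running)

print(cumulative)  # [2, 25, 44, 50]
```

The value 44 against the third class reproduces the interpretation above: 44 observations lie below that class's upper limit.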
Illustration 4

Sl.  Sales (Rs.      Profit (Rs. in thousands)
No.  in lakhs)    Upto 10  10-20  20-50  50-100  100-200  200 and  Total
                                                          more
1    Up to 1        10       3                                      13
2    1-2            12      12     19                               43
3    2-5            11      15     20     10       8                64
4    5-10            2       8     15      5      10                40
5    10-20                   2     12      4       9       6        33
6    20 and more             2      1              2       2         7
7    Total          35      40     68     20      29       8       200
The above bivariate frequency table is prepared on the basis of the sales and profit data of 200 companies. As discussed earlier, class limits for both Sales and Profit are decided first. Tally marks are placed in the appropriate row and column (not shown here). Suppose a company's sales and profit figures are Rs. 2.5 lakhs and Rs. 49,000 respectively; it is placed in class 3 of Sales (2 to 5 lakhs) and column (5), the profit class interval of 20 to 50 thousand. The last column (column (9)) gives the total over all class intervals of Profit; hence it gives the frequency distribution of Sales. The frequency distribution in this column is known as the Marginal Frequency Distribution of Sales. Similarly, the figures in Serial No. 7 (row 7) are obtained by summing over all the class intervals of Sales; this is the frequency distribution of Profit, or the Marginal Frequencies of Profit. The entire table is also known as the Joint Frequency Distribution of Sales and Profit.
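The construction of such a joint frequency table can be sketched in code. The class limits below mirror Table 6.10; the four company data points are hypothetical, the first being the Rs. 2.5 lakhs / Rs. 49,000 example from the text.

```python
# Sketch: building a bivariate (joint) frequency table of Sales and Profit.
sales_classes = [(0, 1), (1, 2), (2, 5), (5, 10), (10, 20), (20, float("inf"))]
profit_classes = [(0, 10), (10, 20), (20, 50), (50, 100), (100, 200),
                  (200, float("inf"))]

def class_index(value, classes):
    """Index of the class interval [lo, hi) containing value."""
    for i, (lo, hi) in enumerate(classes):
        if lo <= value < hi:
            return i
    raise ValueError(value)

# (sales in Rs. lakhs, profit in Rs. thousands) -- hypothetical companies.
companies = [(2.5, 49), (0.8, 5), (12.0, 150), (3.0, 30)]

table = [[0] * len(profit_classes) for _ in sales_classes]
for s, p in companies:
    table[class_index(s, sales_classes)][class_index(p, profit_classes)] += 1

# Marginal frequency distributions: row totals (Sales), column totals (Profit).
sales_marginal = [sum(row) for row in table]
profit_marginal = [sum(col) for col in zip(*table)]
print(sales_marginal)
print(profit_marginal)
```

Summing either marginal distribution recovers the total number of companies, just as row 7 and column (9) both total 200 in Table 6.10.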
Self Assessment Exercise B
The following table gives the values of production and the values of raw materials used in 60 industrial units. (i) Prepare individual frequency distributions for the two variables. (ii) Prepare a bivariate frequency table. For the value of production you may take 8000-9000, 9000-10000, ..., 13000-14000 as classes, and for the value of raw materials you may take 2500-3000, 3000-3500, ..., 5000-5500 as class intervals.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
Tables may be classified, depending upon the use and objectives of the data to be presented, into simple tables and complex tables. Let us discuss them along with illustrations.

Simple Table: In this case data are presented for only one variable or characteristic. Therefore, this type of table is also known as a one-way table. A table showing the data relating to the sales of a company in different years is an example of a simple table.
Look at the following tables for examples of this type of table.

Illustration 5
Table 6.5 : Population of India During 1961–2001 (In thousands)
Census Year Population
1961 439235
1971 548160
1981 683329
1991 846303
2001 1027015
Source: Census of India, various documents.
Class    Frequency
20-30        2
30-40        5
40-50       21
50-60       19
60-70       11
70-80        5
80-90        2
Total       65
A simple table may be prepared for descriptive or qualitative data also. The following example illustrates it.

Education level               Number of persons
Illiterate                          22
Literate but below primary          10
Primary                              5
High School                          2
College and above                    1
All                                 40
Complex Table: A complex table may contain data pertaining to more than one characteristic. The population data given below is an example.
Illustration 6
Table 6.8 : Rural and Urban Population of India During 1961–2001
(In thousands)
Population
Census Year Rural Urban Total
1961 360298 78937 439235
1971 439046 109114 548160
1981 523867 159463 683329
1991 628691 217611 846303
2001 741660 285355 1027015
Note: The total may not add up exactly due to rounding off error.
Source: Census of India, various documents.
In the above example, rural and urban population may be subdivided into males
and females as indicated below.
Table 6.9 : Rural and Urban Population of India During 1961–2001 (sex-wise)
(In thousands)
Population
Census Year Rural Urban Total
Male Female Male Female Male Female
(1) (2) (3) (4) (5) (6) (7)
In each of the above categories, the persons could be grouped into child and
adult, worker and non-worker, or according to different age groups and so on.
A particular type of complex table that is of great use in research is a cross-table, where the table is prepared based on the values of two or more variables. The bivariate frequency table used earlier (Illustration 4) is reproduced here for illustration.
Illustration 7
Table 6.10 : Sales and Profit of 200 Companies

Sl.  Sales (Rs.      Profit (Rs. in thousands)
No.  in lakhs)    Upto 10  10-20  20-50  50-100  100-200  200 and  Total
                                                          more
(1)  (2)            (3)     (4)    (5)    (6)     (7)      (8)      (9)
1    Up to 1        10       3                                      13
2    1-2            12      12     19                               43
3    2-5            11      15     20     10       8                64
4    5-10            2       8     15      5      10                40
5    10-20                   2     12      4       9       6        33
6    20 and more             2      1              2       2         7
7    Total          35      40     68     20      29       8       200
From a bivariate table, one may get some idea about the interrelationship between the two variables. Suppose all the frequencies are concentrated in the diagonal cells; then there is likely to be a strong relationship: a positive relationship if the concentration runs from the top-left corner to the bottom-right corner, and a negative relationship if it runs from the bottom-left corner to the top-right corner. If the frequencies are more or less equally distributed over all the cells, then probably there is no strong relationship.
Multivariate tables may also be constructed but interpretation becomes difficult
once we go beyond two variables.
So far we have discussed and learnt about the types of tables and their
usefulness in presentation of data. Now, let us proceed to learn about the
different parts of a table, which enable us to have a clear understanding of the
rules and practices followed in the construction of a table.
A table should have the following four essential parts - title, caption or box
head (column), stub (row heading) and main data. At times it may also contain
an end note and source note below the table. The table should have a title,
which is usually placed above the statistical table. The title should be clearly
worded to give some idea of the table’s contents. Usually a report has many
tables. Hence the tables should be numbered to facilitate reference.
Caption refers to the title of the columns. It is also termed the "box head". There may be sub-captions under the main caption. Stub refers to the titles given to the rows.
Some of these features are illustrated below with reference to the table on Rural and Urban Population during 1961-2001, which was presented in Illustration 6, Table 6.8. The annotated specimen labels, among other parts: the title of the table ("Rural and Urban Population of India during 1961-2001 (in thousands)"), the stub ("Census Year", with rows numbered 1 to 5), and the end note ("The total may not add up exactly due to rounding-off error.").
Geographical: It can be used when the reader is familiar with the usual
geographical classification.
One point may be noted. The above arrangements are not exclusive. In a big
table, it is always possible and sometimes convenient to arrange the items
following two or three methods together. For example, it is possible to construct
a table in chronological order and within it in geographical order. Sometimes
information of the same table may be rearranged to produce another table to
highlight certain aspects. This will be clear from the following specimen tables.
Table A
Table B
Tables are prepared to make data easy for the reader to understand. A table should not be very large, as the focus may be lost; a large table may be logically broken into two or more small tables.
1) A good table must present the data in as clear and simple a manner as possible.
2) The title should be brief and self-explanatory. It should describe the contents of the table.
3) Rows and columns may be numbered to facilitate easy reference.
4) The table should not be too narrow or too wide. The space of columns and rows should be carefully planned, so as to avoid unnecessary gaps.
5) Columns and rows which are directly comparable with one another should be placed side by side.
6) Units of measurement should be clearly shown.
7) All the column figures should be properly aligned. Decimal points and plus or minus signs should also be in perfect alignment.
8) Abbreviations should be avoided in a table. If their use is unavoidable, their meanings must be clearly explained in a footnote.
9) If necessary, derived data (percentages, indices, ratios, etc.) may also be incorporated in the table.
10) The sources of the data should be clearly stated so that the reliability of the data can be verified, if needed.
3 OC PRIMARY RURAL
4 BC ILLITERATE RURAL
6 SC PRIMARY RURAL
7 ST ILLITERATE RURAL
10 ST ILLITERATE RURAL
12 OC PRIMARY RURAL
14 BC ILLITERATE RURAL
15 SC PRIMARY RURAL
16 SC PRIMARY RURAL
20 ST PRIMARY URBAN
21 SC PRIMARY RURAL
23 ST PRIMARY RURAL
25 OC PRIMARY RURAL
27 OC PRIMARY URBAN
28 OC PRIMARY URBAN
29 BC ILLITERATE RURAL
30 SC ILLITERATE RURAL
31 SC ILLITERATE URBAN
32 ST ILLITERATE URBAN
33 OC PRIMARY URBAN
34 OC PRIMARY RURAL
35 BC ILLITERATE URBAN
36 OC HIGH SCHOOL URBAN
37 OC HIGH SCHOOL URBAN
38 OC ILLITERATE RURAL
39 BC ILLITERATE URBAN
40 OC PRIMARY RURAL
41 BC LITERATE BUT BELOW PRIMARY RURAL
42 BC ILLITERATE URBAN
43 OC ILLITERATE RURAL
44 BC PRIMARY RURAL
45 BC LITERATE BUT BELOW PRIMARY RURAL
46 SC PRIMARY RURAL
47 SC ILLITERATE URBAN
48 ST PRIMARY RURAL
49 OC PRIMARY RURAL
50 OC LITERATE BUT BELOW PRIMARY RURAL
6.6 LET US SUM UP
Once data collection is over, the next important steps are editing and coding.
Editing is the first stage in data processing: it is the process of examining the
collected data to detect errors and omissions and to correct them for further
analysis. It helps in maintaining consistency in the quality of data. Coding is
the process of assigning symbols to the answers; it makes further computation
easier and is necessary for efficient analysis of data. A coding frame is
developed by listing the answers and assigning codes to them.
The next stage is classification. Classification is the process of arranging data in
groups or classes on the basis of some characteristics. It helps in making
comparisons and drawing meaningful conclusions. The classified data may be
summarized by means of tabulations and frequency distributions. Cross
tabulation is particularly useful as it provides some clue about the relationship
between two variables and its direction. Frequency distributions and their
extensions provide a simple means to summarize data and to compare two
sets of data.
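The editing–coding–tabulation sequence summarized above can be sketched in a few lines of Python; the coding frame and the answers below are made up for illustration, not taken from any survey in this unit:

```python
from collections import Counter

# Hypothetical coding frame for an "educational level" question
# (the codes and labels are illustrative only).
coding_frame = {"ILLITERATE": 1, "LITERATE BUT BELOW PRIMARY": 2,
                "PRIMARY": 3, "HIGH SCHOOL": 4}

answers = ["PRIMARY", "ILLITERATE", "PRIMARY", "HIGH SCHOOL", "ILLITERATE"]

# Coding: replace each verbal answer by its numeric code.
coded = [coding_frame[a] for a in answers]

# Classification/tabulation: count how often each code occurs,
# giving a simple frequency distribution.
frequency = Counter(coded)
print(sorted(frequency.items()))   # [(1, 2), (3, 2), (4, 1)]
```

The same Counter, applied to the coded answers of a real questionnaire, yields the frequency table directly.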
6.7 KEY WORDS

Class Interval : The difference between the upper and lower limits of a class.
Class Limits : The lowest and the highest values that can be included in the
class.
Coding : A method to categorize data into groups and assign numerical values
or symbols to represent them.
6.8 ANSWERS TO SELF ASSESSMENT EXERCISES
B. Frequency Distribution for Value of Production (Rupees in lakh)

S.No.  Class Interval  Frequency
1      80–90                   8
2      90–100                 10
3      100–110                27
4      110–120                10
5      120–130                 4
6      130–140                 1
7      Total                  60

S.No.  Class Interval  Frequency
1      25–30                   2
2      30–35                  11
3      35–40                  36
4      40–45                   9
5      45–50                   1
6      50–55                   1
7      Total                  60
Simple Table
Table : Distribution of 50 Unskilled Workers According to Place of Origin
1 Rural 34
2 Urban 16
3 All 50
Complex Table
Table : Distribution of 50 Unskilled Workers

Education              Place of Origin
Level            Rural                     Urban                    Total
                 SC  ST  BC  OC  Total     SC  ST  BC  OC  Total
Illiterate        1   2   3   2      8      2   1   3   0      6     14
Below Primary     1   0   2   3      6      1   0   0   2      3      9
Primary           5   2   1   6     14      0   1   0   3      4     18
High School       2   0   0   4      6      0   0   0   3      3      9
Total             9   4   6  15     34      3   2   3   8     16     50
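The kind of cross-classification shown in the complex table can also be produced mechanically. A minimal sketch using Python's collections.Counter, on a small made-up sample of (education, place) records:

```python
from collections import Counter

# A few illustrative records (education level, place of origin);
# the categories mirror the unit's, but the sample itself is made up.
records = [
    ("ILLITERATE", "RURAL"), ("PRIMARY", "RURAL"), ("PRIMARY", "URBAN"),
    ("ILLITERATE", "RURAL"), ("HIGH SCHOOL", "URBAN"), ("PRIMARY", "RURAL"),
]

# Cross tabulation: count each (education, place) combination.
cross = Counter(records)
print(cross[("PRIMARY", "RURAL")])      # 2
print(cross[("ILLITERATE", "RURAL")])   # 2

# Marginal total for one row, as in the table's "Total" column:
primary_total = sum(n for (edu, _), n in cross.items() if edu == "PRIMARY")
print(primary_total)                    # 3
```

Summing the cell counts row-wise and column-wise gives the marginal totals that appear in the last row and column of the table.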
6) Draw a “less than” and “more than” cumulative frequency distribution for the
following data.
Income (Rs.)      500-600  600-700  700-800  800-900  900-1000
No. of families        25       40       65       35        15
7) What is tabulation? Draw the format of a statistical table and indicate its
various parts.
8) Describe the requisites of a good statistical table.
9) Prepare a blank table showing the age, sex and literacy of the population in a
city, according to five age groups from 0 to 100 years.
10) The following figures relate to the number of crimes (nearest-hundred) in four
metropolitan cities in India. In 1961, Bombay recorded the highest number of
crimes i.e. 19,400 followed by Calcutta with 14,200, Delhi 10,000 and Madras
5,700. In the year 1971, there was an increase of 5,700 in Bombay over its
1961 figure. The corresponding increase was 6,400 in Delhi and 1,500 in
Madras. However, the number of these crimes fell to 10,900 in the case of
Calcutta for the corresponding period. In 1981, Bombay recorded a total of
36,300 crimes. In that year, the number of crimes was 7,000 less in Delhi as
compared to Bombay. In Calcutta the number of crimes increased by 3,100 in
1981 as compared to 1971. In the case of Madras the increase in crimes was
by 8,500 in 1981 as compared to 1971. Present this data in tabular form.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 7 DIAGRAMMATIC AND GRAPHIC PRESENTATION
STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Importance of Visual Presentation of Data
7.3 Diagrammatic Presentation
7.3.1 Rules for Preparing Diagrams
7.4 Types of Diagrams
7.5 One Dimensional Bar Diagrams
7.5.1 Simple Bar Diagram
7.5.2 Multiple Bar Diagram
7.5.3 Sub-divided Bar Diagram
7.6 Pie Diagram
7.7 Structure Diagrams
7.7.1 Organisational Charts
7.7.2 Flow Charts
7.8 Graphic Presentation
7.9 Graphs of Time Series
7.9.1 Graphs of One Dependent Variable
7.9.2 Graphs of More Than One Dependent Variable
7.10 Graphs of Frequency Distribution
7.10.1 Histograms and Frequency Polygon
7.10.2 Cumulative Frequency Curves
7.11 Let Us Sum Up
7.12 Key Words
7.13 Answers to Self Assessment Exercises
7.14 Terminal Questions/Exercises
7.15 Further Reading
7.0 OBJECTIVES
After studying this Unit, you should be able to:
7.1 INTRODUCTION
In the previous Unit 6, you have studied the importance and techniques of
editing, coding, classification and tabulation that help to arrange the mass of
data (collected data) in a logical and precise manner. Tabulation is one of the
techniques for presentation of collected data which makes it easier to establish
trend, pattern, comparison etc. However, you might have noticed, it is a difficult
and cumbersome task for a researcher to interpret a table having a large mass
of numerical information. Sometimes it may fail to convey the message
meaningfully to the readers for whom it is meant. To overcome this
inconvenience, diagrammatic and graphic presentation of data has been invented
to supplement and explain the tables. Practically every day we can find the
presentation of cricket score, stock market index, cost of living index etc., in
news papers, television, magazines, reports etc. in the form of diagrams and
graphs. This kind of presentation is also termed as 'visual presentation' or
‘charting’.
In this unit, you will learn about the importance of visual presentation of
research data and some of the reasons why diagrammatic and graphic
presentation of data is so widely used. You will also study the different kinds of
diagrams and graphs, which are more popularly used for presenting the data in
research work, and the principles of presenting a frequency distribution in the
form of diagrams and graphs. As you are already familiar with graphs
and diagrams, we will proceed with further discussions.
7.2 IMPORTANCE OF VISUAL PRESENTATION OF DATA

1) They relieve the dullness of the numerical data: Any list of figures
becomes less comprehensible and difficult to draw conclusions from as its
length increases. Scanning the figures from tables causes undue strain on the
mind. The data, when presented in the form of diagrams and graphs, gives a
bird's eye-view of the entire data, creates interest, and leaves an impression
on the mind of readers for a long period.
2) They make comparison easy: This is one of the prime objectives of visual
presentation of data. Diagrams and graphs make quick comparison between two
or more sets of data simpler, and the direction of curves bring out hidden facts
and associations of the statistical data.
3) They save time and effort: The characteristics of statistical data, through
tables, can be grasped only after a great strain on the mind. Diagrams and
graphs reduce the strain and save a lot of time in understanding the basic
characteristics of the data.
7.3.1 Rules for Preparing Diagrams

1) You must have noted that diagrams must be geometrically accurate.
Therefore, they should be drawn on the graphic axes, i.e., the ‘X’ axis (horizontal
line) and the ‘Y’ axis (vertical line). However, diagrams are generally drawn on
plain paper after considering the scale.
2) While taking the scale on ‘X’ axis and ‘Y’ axis, you must ensure that the scale
showing the values should be in multiples of 2, 5, 10, 20, 50, etc.
3) The scale should be clearly set up, e.g., millions of tons, persons in Lakhs, value
in thousands etc. On ‘Y’ axis the scale starts from zero, as the vertical scale is
not broken.
4) Every diagram must have a concise and self explanatory title, which may be
written at the top or bottom of the diagram.
5) In order to draw the readers' attention, diagrams must be attractive and well
proportioned.
6) Different colours or shades should be used to exhibit various components of
diagrams and also an index must be provided for identification.
7) It is essential to choose a suitable type of diagram. The selection depends
upon the number of variables, the minimum and maximum values, and the
object of the presentation.
Self Assessment Exercise A
List out the importance of visual presentation of statistical data.
7.5 ONE DIMENSIONAL BAR DIAGRAMS

A large number of one dimensional diagrams are available for presenting data,
such as the line diagram, simple bar diagram, multiple bar diagram, sub-divided
bar diagram, percentage bar diagram, deviation bar diagram, etc. We shall, however,
study only the simple bar diagram, multiple bar diagram, and sub-divided bar
diagram. Let us study these three kinds of diagrams with the support of
relevant illustrations.
Years               1995-96  1996-97  1997-98  1998-99  1999-00  2000-01  2001-02
Exports
(In Million kgs.)       167      209      410      316      192      215      160
Solution: The quantity of tea exported is given in million kgs. for different
years. A simple bar diagram will be constructed with 7 bars corresponding to
the 7 years. Now study the following vertical construction of the bar diagram,
referring to the guidelines for the construction of simple bars explained in
section 7.5.1.
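For readers who want to experiment, a bar diagram of this kind can be roughed out even in plain text. A Python sketch using the tea-export figures above (the one-character-per-20-million-kgs scale is an arbitrary choice):

```python
# Tea exports (in million kgs.) from the illustration above; a rough
# text rendering of a simple bar diagram, one '#' per roughly
# 20 million kgs (the character scale is an arbitrary choice).
data = {"1995-96": 167, "1996-97": 209, "1997-98": 410, "1998-99": 316,
        "1999-00": 192, "2000-01": 215, "2001-02": 160}

for year, value in data.items():
    bar = "#" * round(value / 20)
    print(f"{year:8}| {bar} {value}")
```

The tallest bar (1997-98) immediately stands out, which is exactly the "bird's eye-view" advantage discussed in section 7.2.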
[Vertical bar diagram: X axis – years 1995-96 to 2001-02; Y axis – tea exports in million kgs. (scale 0 to 450); bar heights 167, 209, 410, 316, 192, 215, 160.]
Figure 7.1: Simple Bar Diagram Showing the Tea Exports in Different Years.
Illustration-2
The following data relates to the Profit and Loss of different industries in 1999-
2002. Present the data through simple bar diagram.
Solution : The given data represents positive and negative values, i.e., profit
and loss. Let us draw the bars horizontally. Observe Fig. 7.2 carefully and try
to understand the construction of simple bars horizontally.
[Figure 7.2: Horizontal simple bar diagram showing profit (+) and loss (–) of industries: Cement 48, Oil 25, Sugar 14, Textiles –12, Garments –24.]
7.5.2 Multiple Bar Diagram

In this type of diagram, two or more bars are constructed side by
side for a period or related phenomenon. This type of diagram is
also called a Compound Bar or Cluster Bar Diagram. The technique of preparing
such a diagram is the same as that of simple bar diagram. This diagram, on
the one hand, facilitates comparison of the values of different variables in a set
and on the other, it facilitates the comparison of the values of the same variable
over a period of time or phenomenon. To facilitate easy comparison, the
different bars of a set may be coloured or shaded differently to distinguish
between them. But the colour or shade for the bars representing the same
variable in different sets should be the same.
Let us consider the following illustration and learn the method of presentation
of the data in the form of a multiple bar diagram.
Illustration-3
[Multiple bar diagram: title ‘Foreign Investment – Industrywise Inflow’; Y axis – Rs. in crores (0 to 2155); X axis – years 1997-98, 1998-99 and 1999-2000; bar values include 2155, 1580, 1550, 1423, 1194, 956 and 78.]
Figure 7.3: Multiple Bar Diagram Showing the Inflow of Foreign Investment in Selected
Sectors During 1997-2000
Self Assessment Exercise C
The following table relates to the Indian textile exports to different countries.
Countries               1997-98   1998-99   1999-2000
USA 746.13 759.36 882.41
Germany 366.01 300.46 338.88
UK 403.07 337.94 341.42
Italy 241.64 233.14 215.48
Korea (Republic) 127.00 88.30 185.13
7.5.3 Sub-divided Bar Diagram

In this diagram one bar is constructed for the total value of the different
components of the same variable. Further, it is sub-divided in proportion to the
values of the various components of that variable. This diagram shows the total of
the variables as well as the total of its various components in a single bar.
Hence, it is clear that the sub-divided bar serves the same purpose as multiple
bars. The only difference is that, in the case of the multiple bar diagram each
component of a variable is shown side by side, whereas in the sub-divided bar
diagram each component of a variable is shown one upon the other.
It is also called a component bar diagram. This method is suitable if the total
values of the variables are small, otherwise the scale becomes very narrow to
depict the data. To study the relative changes, all components may be
converted into percentages and drawn as sub-divided bars. Such a bar
construction is called a sub-divided percentage bar. The limitation is that all
the parts do not have a common base to enable us to compare accurately the
various components of a set.
Illustration-4
The following data relates to India's exports of electronic goods to different
countries during 1994-99. Represent the data by a sub-divided bar diagram.
(Rs. in Crores)
Years      USA   Hong Kong   Malaysia   Singapore   Germany   Total
1994-95    210          86         56         275        91     718
1995-96    378         105        159         467       118    1227
1996-97    789         189        221         349        93    1641
1997-98    880         248        175         327        90    1720
1998-99    900         220        200         350       130    1800
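The conversion of components into percentages, the first step towards a sub-divided percentage bar, can be checked with a short Python sketch; the two rows below are taken from the table above:

```python
# Converting each year's components into percentages: the first step in
# drawing a sub-divided percentage bar. The rows are taken from the
# electronics-exports table above (Rs. in crores).
rows = {
    "1994-95": [210, 86, 56, 275, 91],     # USA, Hong Kong, Malaysia,
    "1998-99": [900, 220, 200, 350, 130],  # Singapore, Germany
}

for year, values in rows.items():
    total = sum(values)
    pct = [round(v * 100 / total, 2) for v in values]
    print(year, pct)
```

Each percentage row sums to 100 (up to rounding), which is what gives all the percentage bars a common height and makes relative changes comparable across years.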
[Sub-divided bar diagram: one bar per year (1994-95 to 1998-99); each bar divided into segments for USA, Hong Kong, Malaysia, Singapore and Germany; Y axis – India's exports of electronic goods (Rs. in crores, 0 to 1800).]
Figure 7.4: Sub-divided Bar Diagram Showing India's Exports of Electronic Goods to
Different Countries During 1994-99.
Self Assessment Exercise D
Draw sub-divided bar diagram for the following table. Do you agree that
this diagram is more effective for comparison of figures rather than the
Multiple bar diagram? Justify your opinion.
7.6 PIE DIAGRAM

In constructing a pie diagram the first step is to convert the various values of
components of the variable into percentages, and then the percentages are
transposed into corresponding degrees. The total percentage of the various
components, i.e., 100, is taken as 360° (degrees around the centre of a circle),
and the degrees of the various components are calculated in proportion to the
percentage values of the different components. It is expressed as:

Degrees of a component = (360° / 100) × component's percentage
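The percentage-to-degrees conversion can be verified quickly in Python; the percentages used here are those shown in the pie diagram of Illustration 5:

```python
# Degrees of a component = (360 / 100) * component's percentage.
# The percentage shares are the ones shown in the pie diagram of
# Illustration 5 (Radio, Daily wages, etc.).
percentages = [56.4, 18.2, 10.9, 9.1, 3.6, 1.8]

degrees = [round(360 / 100 * p, 2) for p in percentages]
print(degrees)                   # [203.04, 65.52, 39.24, 32.76, 12.96, 6.48]
print(round(sum(degrees), 2))    # 360.0
```

The component degrees always add back up to 360°, which is a convenient arithmetic check before drawing the sectors.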
It should be noted that in case the data comprises more than one variable, to
show the two dimensional effect for making comparisons among the variables,
we have to obtain the square root of the total of each variable. These square
roots would represent the radii of the circles, which are then sub-divided.
A pie diagram helps us in emphasizing the area and in ascertaining the
relationship between the various components as well as among the variables.
However, compared to a bar diagram, a pie diagram is less effective for
accurate interpretation when the components are in large numbers. Let us draw
the pie diagram with the help of the data contained in the following table.
Illustration 5
[Figure 7.5: Pie diagram with components of 56.4%, 18.2%, 10.9%, 9.1%, 3.6% and 1.8%; legend: Radio, Daily wages, Local traders, Co-farmers, Personal visits, Market office.]
What features of this distribution does your pie diagram mainly illustrate?
7.7 STRUCTURE DIAGRAMS
There are several important diagram formats that are used to display
structural (qualitative) information in the form of charts. The format depends
upon the nature of the information. Under this type of diagram we will discuss
two different charts, i.e., (1) Organisational Charts and (2) Flow Charts.
7.7.1 Organisational Charts

These charts are most commonly used to represent the internal
structure of organisations. There is no standard format for this kind of
diagram, as the design depends on the nature of the
organisation. A special format is used in the following illustration, which relates
to the organisational structure of IGNOU. Study Fig. 7.6 and try to
understand the preparation of this kind of diagram relating to other
organisations.
[Figure 7.6: Organisational chart of IGNOU, showing the hierarchy from the Visitor and the Board of Management down through the Vice Chancellor and the Pro-Vice Chancellors.]
7.7.2 Flow Charts

Flow charts are most commonly used in any situation where we wish to
represent information which flows through different stages to its ultimate
point. These charts can also be used to indicate the flow of information about
various aspects, i.e., material flow, product flow (distribution channels), funds
flow, etc.
The following Figure 7.7 relates to the marketing channels for fruits, which will
give you an understanding of flow charts.
[Figure 7.7: Flow chart of the marketing channels for fruits — from Growers, through Pre-Harvest Contractors, Processors, the Commission Agent in the Wholesale Market, Wholesalers and Retailers, to Consumers and Exports.]
7.8 GRAPHIC PRESENTATION
The shape of a graph offers easy and appropriate answers to several questions,
such as:
l The direction of curves on the graph makes it very easy to draw comparisons.
l The presentation of time series data on a graph makes it possible to interpolate
or extrapolate the values, which helps in forecasting.
l The graph of frequency distribution helps us to determine the values of Mode,
Median, Quartiles, percentiles, etc.
l The shape of the graph helps in demonstrating the degree of inequality and
the direction of correlation.
For all such advantages it is necessary for a researcher to have an
understanding of different types of graphic presentation of data. In practice,
there are a variety of graphs which can be used to depict the data. However,
here we will discuss only a few graphs which are more frequently used in
business research.
Broadly, the graphs of statistical data may be classified into two types, one is
graphs of time series, another is graphs of frequency distribution. We will
discuss both these types, after studying the parts of a graph.
Parts of a Graph
The foremost requirement for a researcher is to be aware of the basic
principles for using the graph paper for presentation of statistical data
graphically.
[Chart 7.1 shows the four quadrants formed by the X and Y axes: Quadrant I – X positive, Y positive; Quadrant II – X negative, Y positive; Quadrant III – X negative, Y negative; Quadrant IV – X positive, Y negative.]
Chart 7.1 : Parts of a Graph
After understanding the above parts of a graph, let us study the different types
of graphs.
7.9 GRAPHS OF TIME SERIES

1) On the X-axis we take time as the independent variable and on the Y-axis the
values of the data as the dependent variable. Plot the points corresponding to the
given data; the points are then joined by straight lines in the order of time.
2) Equal magnitude of scale must be maintained on X-axis as well as on Y-axis.
3) The Y-axis normally starts with zero. In case there is a wide difference
between the lowest value of the data and zero (origin point), the Y-axis can be
broken and a false base line may be drawn. This is explained under
the related problem in this section.
4) If the variables are in different units, double scales can be taken on the
Y-axis.
5) The scales adopted should be clearly indicated and the graph must have a
self-explanatory title.
6) Unfortunately, graphs lend themselves to considerable misuse. The same
data can give different graphical shapes depending on the relative size of
two axes. In order to avoid such misrepresentations the convention in
research is to construct graphs, wherever possible, such that the vertical axis
is around 2/3 to 3/4 the length of the horizontal.
After having learnt the principles for the construction of historigrams, we
move on to discuss their types. Among the various types that have been
developed, the frequently used graphs are one dependent variable graphs and
more than one dependent variable graphs. We will now look at the
construction of these graphs.
7.9.1 Graphs of One Dependent Variable

When there is only one dependent variable, the values of the dependent variable
are taken on the Y-axis, while time is taken on the X-axis. Study the following
illustration carefully and try to understand the method of construction of one
dependent variable historigrams.
Illustration 6
The following data relates to India's exports to USA during the period of 1994-
2000. Represent the data graphically.
[Historigram of India's exports to USA (in million $) — 1994: 5310, 1995: 5726, 1996: 6130, 1997: 7322, 1998: 8237, 1999: 9071, 2000: 10687; Y-axis scale 0 to 11000.]
Figure 7.8: Historigram Showing India's Exports to USA During 1994-2000.
False base line
In the above graph (Figure 7.8), the scale on the Y-axis has been taken as
1 cm = 1,000 million, starting from the origin point, i.e., zero. Since there is a
wide difference between zero and the lowest value of the data (5310), the
portion of the graph paper between zero and the smallest value remains
unused, and only the upper half of the graph paper depicts the data.
Consequently, the curve drawn is not helpful in understanding the fluctuations.
In such a situation, in order to use the space of the graph paper effectively, it
is necessary to draw a false base line. By using the false base line, minor
fluctuations are magnified and become clearly visible on the graph. The false
base line breaks the continuity of the scale of the Y-axis from the origin point,
i.e., zero, by drawing a horizontal wave line between zero and the first
centimetre on the scale of the Y-axis.
[Figure 7.9: The same data redrawn with a false base line — the Y-axis is broken between 0 and 5000, so the fluctuations in exports become clearly visible.]
7.9.2 Graphs of More Than One Dependent Variable

When the data of a time series relate to more than one dependent variable,
a curve must be drawn for each variable separately. These graphs are prepared
in the same manner as one dependent variable historigrams. Let us
consider the following data to construct historigrams. Study Figure 7.10 carefully
and understand the procedure for the preparation of this type of graph.
Illustration-7
Years                         1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
Sales (Rs. in lakh)             31   58   42   65   75   80   72   96   83   98
Cost of Sales (Rs. in lakh)     42   50   48   55   82   75   62   80   67   73
Profit/Loss (Rs. in lakh)      –11   +8   –6  +10   –7   +5  +10  +20  +16  +25
Solution : The given data comprises three variables, so we have to draw
a separate curve for each variable. In this graph, it is not necessary to draw a
false base line because the minimum value is close to the point of origin (zero).
For easy identification, each curve is marked differently.
[Three curves plotted on one graph; Y-axis in Rs. lakh (from –20 upwards), X-axis – years 1991 to 2000.]
Figure 7.10 : Historigram Showing Sales, Cost of Sales and Profit/Loss of a Company
During 1991-2000
The above graph clearly reveals that with passage of time the profits are rising
after 1996, even though the sales are fluctuating slightly.
Self Assessment Exercise G
7.10 GRAPHS OF FREQUENCY DISTRIBUTION

Let us study the procedure involved in the preparation of these types of graphs.
The value of mode can be determined from the histogram. The procedure for
locating the mode is to draw a straight line from the top right corner of the
highest rectangle (Modal Class) to the top right corner of the preceding
rectangle (Pre Modal Class). Similarly, draw a straight line from the top left
corner of the highest rectangle to top left corner of the succeeding rectangle
(Post Modal Class). Draw a perpendicular from the point of intersection of
these two straight lines to X-axis. The point where it meets the X-axis gives
the value of mode. This is shown in Figure 7.11. However, graphic location of
mode is not possible in a multi-modal distribution.
Let us, now, take up an illustration to learn how to draw a histogram, and
frequency polygon practically and also determine the mode. The data relates to
the sales of computers by different companies.
Illustration-8
Sales (Rs. in crores)   0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of Companies           8     20     35     50     90     70     30     15
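The graphical location of the mode described above has an algebraic counterpart, the standard grouped-data mode formula. A Python sketch using the data of this illustration (the formula itself is not derived in this unit):

```python
# Grouped-data mode formula: Z = L + (f1 - f0) / (2*f1 - f0 - f2) * h,
# where L is the lower limit of the modal class, h its width, f1 its
# frequency, and f0, f2 the frequencies just before and after it.
classes = [(0, 8), (10, 20), (20, 35), (30, 50), (40, 90),
           (50, 70), (60, 30), (70, 15)]   # (lower limit, frequency)
h = 10

i = max(range(len(classes)), key=lambda k: classes[k][1])  # modal class
L, f1 = classes[i]
f0 = classes[i - 1][1]
f2 = classes[i + 1][1]

mode = L + (f1 - f0) / (2 * f1 - f0 - f2) * h
print(round(mode, 2))   # 46.67 -- agrees with Z = 46.67 in Figure 7.11
```

The numerical result coincides with the value read off the intersecting lines in the histogram, which is a useful cross-check on the graphical construction.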
[Histogram with superimposed frequency polygon: X-axis – sale of computers (Rs. in crores, 0 to 80), Y-axis – number of companies (0 to 100); the mode is located graphically at Z = 46.67.]
Figure 7.11: Histogram and Frequency Polygon for Computer Sales of Various
Companies
7.10.2 Cumulative Frequency Curves
Sometimes we are interested in knowing how many families there are in a
city whose earnings are less than Rs. 5,000 p.m., or whose earnings are more
than Rs. 20,000 p.m. In order to obtain this information, we have first of all
to convert the ordinary frequency table into a cumulative frequency table. When
the frequencies are added they are called cumulative frequencies. The curves
obtained from the cumulative frequencies are called ‘cumulative frequency
curves’, popularly known as “ogives”. There are two types of ogives, namely
the less than ogive and the more than ogive. Let us learn the procedure
involved in drawing these two ogives.
In the less than ogive, we start with the upper limit of each class, and the cumulation (addition) starts from the top. When these cumulative frequencies are plotted against the upper class limits, we get the less than ogive. In the case of the more than ogive, we start with the lower limit of each class, and the cumulation starts from the bottom; plotting these cumulative frequencies against the lower class limits gives the more than ogive. You should bear in mind that, while drawing ogives, the classes must be in exclusive form.
Ogives are useful for determining the number of items above or below a given value. They are also useful for comparing two or more frequency distributions and for reading off certain positional values such as the median, quartiles, percentiles, etc. Let us take up an illustration to understand how to draw ogives in practice. Observe carefully the procedure involved.
Note: Mode and Median are explained in Unit 8. Similarly, quartiles are in
Unit 9. This illustration can be better understood only after studying those units.
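The cumulation steps described above can be sketched in a few lines of Python. The class limits and frequencies below are illustrative values, not the textbook's table:

```python
# Build 'less than' and 'more than' cumulative frequencies from a
# grouped frequency table given in exclusive form.
from itertools import accumulate

classes = [(0, 20), (20, 40), (40, 60), (60, 80)]  # illustrative class intervals
freq = [7, 15, 20, 8]                              # illustrative frequencies

# 'Less than' ogive: cumulate from the top; read against upper class limits.
less_than = list(accumulate(freq))

# 'More than' ogive: cumulate from the bottom; read against lower class limits.
more_than = list(accumulate(freq[::-1]))[::-1]

for (lo, hi), lt, mt in zip(classes, less_than, more_than):
    print(f"less than {hi}: {lt}    {lo} or more: {mt}")
```

Plotting `less_than` against the upper limits and `more_than` against the lower limits produces the two ogives.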
Illustration-9

The cumulative frequencies presented in the above table have the following interpretation. The 'less than' cumulative frequencies are to be read against the upper class limits; in contrast, the 'more than' cumulative frequencies are to be read against the lower class limits. For instance, there are 7 units with operating expenses of less than Rs. 20,000, and 160 units with operating expenses of less than Rs. 1,20,000. On the other hand, there are 153 units with operating expenses of more than Rs. 20,000, and no units with operating expenses of more than or equal to Rs. 2,00,000.
Fig. 7.12: 'Less than' and 'More than' Cumulative Frequency Curves Showing the Operating Expenses (Rs. in '000) of Small Scale Industrial Units (the positional values Q1 = 60.18, Me = 80.77 and Q3 = 112.31 are located on the X-axis)
Now, look at Figure 7.12, which shows both cumulative curves on the same graph. Study it carefully and understand the procedure for drawing ogives. From these ogives, the median can be located by drawing a perpendicular from the intersection of the two curves to the X-axis; the point where the perpendicular touches the X-axis is the median of the distribution. Similarly, a perpendicular drawn from the intersection of the two curves to the Y-axis divides the sum of the frequencies into two equal parts. The values of other positional averages, such as Q1, D6, P50, etc., can also be located with the help of the less than ogive. In the figure, the determination of Q1 and Q3 is shown as an illustration.
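Reading a positional value off the less than ogive is, numerically, linear interpolation within the class that contains the required cumulative frequency. A minimal sketch, using the sales distribution of 80 companies from the terminal exercises of this unit as sample data:

```python
def positional_value(classes, freq, q):
    """Locate the q-th fractional point (0.5 = median, 0.25 = Q1, 0.75 = Q3)
    by linear interpolation within the class containing it, as one would
    read it off a 'less than' ogive."""
    target = q * sum(freq)
    cum = 0
    for (lo, hi), f in zip(classes, freq):
        if cum + f >= target:
            return lo + (target - cum) / f * (hi - lo)
        cum += f
    return classes[-1][1]

# Sales (Rs. lakhs) of 80 companies (terminal exercise 11 of this unit)
classes = [(5, 15), (15, 25), (25, 35), (35, 45),
           (45, 55), (55, 65), (65, 75), (75, 85)]
freq = [8, 13, 19, 14, 10, 7, 6, 3]

print(round(positional_value(classes, freq, 0.50), 2))  # median: 35.0
print(round(positional_value(classes, freq, 0.25), 2))  # Q1: 24.23
```

The graphical reading from the ogive approximates exactly this interpolation.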
c) Approximately how many sample families are spending less than Rs. 3,800 on food?
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
7.11 LET US SUM UP
Statistical data require not only careful analysis but also an attractive and communicative display. The researcher's task is both to understand the facts of the data himself/herself and to present them in such a form that their significance is understandable to a common reader. To achieve this objective, we have, in this unit, discussed the techniques of diagrammatic and graphic presentation of statistical data. Besides being presented in the form of tables, data can also be presented in the form of diagrams and graphs. Such visual presentation of data allows relations between numbers to be exhibited clearly and attractively, makes quick comparison between two or more data sets easier, brings out hidden facts and the nature of relationships, saves time and effort, facilitates the determination of various statistical measures such as the mean, mode, median, quartiles, standard deviation, etc., and establishes trends of past performance. Hence, with the help of diagrams and graphs the researcher can effectively communicate to readers the information contained in a large mass of numerical data.
We have discussed the method for constructing simple bar diagram, multiple bar
diagram, sub-divided bar diagram, pie diagram and structure diagrams.
Continuous Data : Data that may progress from one class to the next without a break and may be expressed by either fractions or whole numbers.

Discrete Data : Data that do not progress from one class to the next without a break, i.e., where classes represent distinct categories or counts, and may be represented by whole numbers only.

False Base Line : A line drawn by breaking the Y-axis between the origin (zero) and the lowest plotted value, used in historigrams; hence the scale of the Y-axis does not start at zero.
Flow Chart : Presents the information which flows through various situations
to the ultimate point.
E. Steps : 1) Find out the percentage of each reason for buying face cream. 2) Convert the percentages into degrees of angle. 3) Then depict the percentages in a circle with the help of their respective degrees of angle.
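The percentage-to-angle conversion in step 2 is simply percentage × 3.6, since the full circle of 360° represents 100%. A small sketch with hypothetical reasons and percentages:

```python
# Convert component percentages into degrees of angle for a pie diagram.
# The reasons and their percentages below are hypothetical.
reasons = {"Fragrance": 45, "Skin care": 25, "Price": 20, "Brand image": 10}

angles = {r: p * 360 / 100 for r, p in reasons.items()}  # 1% = 3.6 degrees
for r, a in angles.items():
    print(f"{r}: {reasons[r]}% -> {a} degrees")

# The sector angles of a pie diagram must add up to a full circle.
assert abs(sum(angles.values()) - 360) < 1e-9
```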
7) Draw a multiple bar diagram and a sub-divided bar diagram to represent the following data relating to the enrolment in various programmes of an open university over a period of four years, and comment on it.

Programme    No. of Candidates Enrolled
             1998-99   1999-2000   2000-01   2001-02
MBA            1,565       2,356     1,924     3,208
M.Com            872       1,208     1,118     1,097
B.A.           1,600       1,220     1,090       987
B.Com            726         948     1,458     1,220
8) Construct a pie diagram to describe the following data which relates to the
amount spent on various heads under Rural development programme.
What features of this distribution does your pie diagram mainly illustrate?
9) The following table gives the index numbers of wholesale prices (average) of cereals, pulses and oilseeds over a period of 7 years. Compare these prices through a suitable graph.
10) Draw a histogram and frequency polygon of the following distribution. Locate the approximate mode with the help of the histogram.
11) The following data relating to the sales of 80 companies are given below:
Sales (Rs.Lakhs) No. of Companies
5-15 8
15-25 13
25-35 19
35-45 14
45-55 10
55-65 7
65-75 6
75-85 3
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 8 STATISTICAL DERIVATIVES AND MEASURES OF CENTRAL TENDENCY
STRUCTURE
8.0 Objectives
8.1 Introduction
8.2 Statistical Derivatives
8.2.1 Percentage
8.2.2 Ratio
8.2.3 Rate
8.3 Measures of Central Tendency
8.3.1 Properties of an Ideal Measure of Central Tendency
8.3.2 Mean and Weighted Mean
8.3.3 Median
8.3.4 Mode
8.3.5 Choice of a Suitable Average
8.3.6 Some Other Measures of Central Tendency
8.4 Let Us Sum Up
8.5 Key Words
8.6 Answers to Self Assessment Exercises
8.7 Terminal Questions/Exercises
8.8 Further Reading
8.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the meaning and use of percentages, ratios and rates for data
analysis,
l discuss the computational aspects involved in working out the statistical
derivatives,
l describe the concept and significance of various measures of central
tendency, and
l compute various measures of central tendency, such as the arithmetic mean, weighted mean, median, mode, geometric mean, and harmonic mean.
8.1 INTRODUCTION
In Unit 6 we discussed the method of classifying and tabulating of data.
Diagrammatic and graphic presentations are covered in the previous unit
(Unit-7). They give some idea about the existing pattern of data. So far no big
numerical computation was involved. Quantitative data has to be condensed in a
meaningful manner, so that it can be easily understood and interpreted. One of
the common methods for condensing the quantitative data is to compute
statistical derivatives, such as Percentages, Ratios, Rates, etc. These are simple
derivatives. Further, it is necessary to summarise and analyse the data. The first
step in that direction is the computation of Central Tendency or Average, which
gives a bird's-eye view of the entire data. In this Unit, we will discuss computation
of statistical derivatives based on simple calculations. Further, numerical methods
for summarizing and describing data – measures of Central Tendency – are
discussed. The purpose is to identify one value, which can be obtained from the
data, to represent the entire data set.
8.2 STATISTICAL DERIVATIVES
Statistical derivatives are quantities obtained by simple computation from the given data. Though very easy to compute, they often give meaningful insight into the data. Here we discuss three often-used measures: percentage, ratio and rate. These measures point out existing relationships among factors and thereby help in better interpretation.

8.2.1 Percentage

As we have noted earlier, the frequency distribution may be regarded as a simple count of how many cases fall in each group or class. The relative frequency distribution gives the proportion of cases in individual classes; on multiplication by 100, the percentage frequencies are obtained. Converting to percentages has some advantages: the data become more easily understood, and comparison becomes simpler because percentages standardize the data. Percentages are quite useful in other tables also, and are particularly important in the case of bivariate tables. One application of percentages is shown below. Let us try to understand the following illustration.
Illustration 1
The following table gives the total number of workers and their categories for
all India and major states. Compute meaningful percentages.
Table: Total Workers and Their Categories-India and Major States : 2001
(In thousands)
Sl. State/ Cultivators Agricultural Household Other Total
No. India Labourers Industry Workers Workers
Workers
1. Jammu & Kashmir
2. Himachal Pradesh
3. Punjab
4. Haryana
5. Rajasthan
6. Uttar Pradesh
7. Bihar
8. Assam
9. West Bengal
10. Orissa
11. Madhya Pradesh
12. Gujarat
13. Maharashtra
14. Andhra Pradesh
15. Karnataka
16. Kerala
17. Tamil Nadu
INDIA 100.00 100.00 100.00 100.00 100.00
8.2.2 Ratio
There are several types of ratios used in statistical work. Let us discuss them.
Time ratio: This ratio is a measure which expresses the change in a series of values arranged in a time sequence, and it is typically shown as a percentage. Mainly, there are two types of time ratios:

i) Those employing a fixed base period: Under this method, for instance, if you are interested in studying the sales of a product in the current year, you would select a particular past year, say 1990, as the base year and compare the current year's sales with the sales of 1990.

ii) Those employing a moving base: For example, for computation of the current year's sales ratio, the previous year's sales would be taken as the base (for 1991, 1990 is the base; for 1992, 1991 is the base; and so on).
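Both kinds of time ratio can be sketched as below; the yearly sales figures are hypothetical:

```python
# Fixed-base and moving-base time ratios, expressed as percentages.
years = [1990, 1991, 1992, 1993]
sales = [200, 220, 210, 252]          # hypothetical sales figures

# Fixed base: every year is compared with the base year 1990.
fixed_base = [round(s * 100 / sales[0], 1) for s in sales]

# Moving base: every year is compared with the year before it.
moving_base = [None] + [round(sales[i] * 100 / sales[i - 1], 1)
                        for i in range(1, len(sales))]

print(fixed_base)    # [100.0, 110.0, 105.0, 126.0]
print(moving_base)   # [None, 110.0, 95.5, 120.0]
```

Note how 1992 shows 105.0 on the fixed base (above the 1990 level) but 95.5 on the moving base (below the 1991 level); the two ratios answer different questions.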
Ratios are more often used in financial economics to indicate the financial
status of an organization. Look at the following illustration:
Illustration 2
The following table gives the balance sheet of XYZ Company for the year
2002–03. Compute useful financial ratios.
Solution: Three common ratios may be computed from the above balance sheet: the current ratio, cash ratio, and debt-equity ratio. These ratios are discussed in detail in MCO-05: Accounting for Managerial Decisions, under Unit-5: Techniques of Financial Analysis.

Current ratio = (Current assets, loans and advances + current investments) / (Current liabilities and provisions + short-term debt) = (330 + 10) / (150 + 50 + 60) = 1.31
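The current ratio computation amounts to a single division; a sketch using the figures quoted from the balance sheet:

```python
# Current ratio = (current assets, loans & advances + current investments)
#               / (current liabilities & provisions + short-term debt)
current_assets = 330 + 10             # figures from the balance sheet above
current_liabilities = 150 + 50 + 60

current_ratio = current_assets / current_liabilities
print(round(current_ratio, 2))        # 1.31
```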
8.2.3 Rate

The concept of ratio may be extended to the rate. The rate is also a comparison of two figures, but not of the same variable, and it is usually expressed as a percentage. It is a measure of the number of times a value occurs in relation to the number of times the value could occur, i.e., the number of actual occurrences divided by the number of possible occurrences. The unemployment rate in a country, for example, is given by the total number of unemployed persons divided by the total number of employable persons. It is clear now that a rate is different from a ratio. For example, we may say that in a town the ratio of the number of unemployed persons to that of all employable persons is 0.05 : 1. The same message would be conveyed if we say that the unemployment rate in the town is 0.05 or, more commonly, 5 per cent. Sometimes a rate is defined as the number of units of one variable corresponding to a single unit of another variable; the two variables could be in different units. For example, the seed rate refers to the amount of seed required per unit area of land. The following table gives some examples of rates.
2) What is a rate?
..................................................................................................................
..................................................................................................................
..................................................................................................................
To start with, we list the properties that an ideal measure of central tendency should possess. Some of the measures are discussed in detail later.

8.3.2 Mean and Weighted Mean

Most of the time, when we refer to the average of something, we are talking about the arithmetic mean. This is the most important measure of central tendency, and it is commonly called the mean.
Mean of ungrouped data: The mean or the arithmetic mean of a set of data is given by:

X̄ = (X1 + X2 + … + Xn) / N

This formula can be written more compactly as:

Arithmetic mean (x̄) = Σx / N

where Σx is the sum of the values of all observations and N is the number of observations. The Greek letter sigma, Σ, indicates "the sum of".
Illustration 3

Suppose that the wages (in Rs.) earned by a labourer for 7 days are 22, 25, 29, 58, 30, 24 and 23. The mean wage of the labourer is given by:

x̄ = (22 + 25 + 29 + 58 + 30 + 24 + 23) / 7 = 211 / 7 = Rs. 30.14
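The computation is a straightforward sum-and-divide; in Python:

```python
# Arithmetic mean of ungrouped data: sum of observations / number of observations.
wages = [22, 25, 29, 58, 30, 24, 23]   # daily wages (Rs.) from Illustration 3

mean = sum(wages) / len(wages)
print(round(mean, 2))                  # 211 / 7 = 30.14
```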
Mean of grouped data: We have seen how to obtain the mean from ungrouped data. In Unit-6, we learnt the preparation of frequency distributions (grouped data). Let us consider what modifications are required for calculating the mean of grouped data.

When we have grouped data, whether discrete or continuous, the expression for the mean is:

x̄ = Σfx / N

where Σfx is the sum of the products (f × x) of each value and its frequency, and N = Σf is the sum of the frequencies.

Let us consider an illustration to understand the application of the formula.
Illustration 4

Now, to compute the mean wage, multiply each variable by its corresponding frequency (f × x) and obtain the total (Σfx).
Divide this total by the number of observations (Σf or N). Practically, we compute the mean as follows:

x̄ = Σfx / N = 1024 / 35 = Rs. 29.26
Illustration-5

The following table gives the daily wages of 70 labourers on a particular day.

Daily Wages (Rs.) : 15-20 20-25 25-30 30-35 35-40 40-45 45-50
No. of labourers  :   2    23    19    14     5     4     3

Solution: To obtain the value of the mean we follow the procedure explained above, as elaborated below.
X̄ = Σfx / N = 2030 / 70 = Rs. 29

Hence, the mean daily wage is Rs. 29.

Short-cut (assumed mean) method: Assume A = 32.5. Then

X̄ = A + (Σfd / N) × i = 32.5 + (−49 / 70) × 5 = 32.5 − 3.5 = Rs. 29

Hence the mean daily wage is Rs. 29, as obtained earlier.
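Both the direct method and the assumed-mean short-cut can be sketched as follows, using the wage distribution of Illustration 5; the two methods agree:

```python
# Mean of grouped (continuous) data by the direct and assumed-mean methods.
classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
freq = [2, 23, 19, 14, 5, 4, 3]                # N = 70 labourers (Illustration 5)

mids = [(lo + hi) / 2 for lo, hi in classes]   # class mid-points x
n = sum(freq)

# Direct method: mean = sum(f * x) / N
direct = sum(f * x for f, x in zip(freq, mids)) / n

# Assumed-mean method: mean = A + (sum(f * d) / N) * i, where d = (x - A) / i
A, i = 32.5, 5
sum_fd = sum(f * (x - A) / i for f, x in zip(freq, mids))   # = -49
shortcut = A + sum_fd / n * i

print(round(direct, 2), round(shortcut, 2))    # 29.0 29.0
```

The short-cut only rescales the arithmetic; it was valuable for hand computation and gives exactly the same answer.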
An important property of the arithmetic mean is that the means of several sets of data may be combined into a single mean for the combined set. The combined mean may be defined as:

X̄12…n = (N1 X̄1 + N2 X̄2 + … + Nn X̄n) / (N1 + N2 + … + Nn)

If we have to combine the means of four sets of data, the above formula takes the form:

X̄1234 = (N1 X̄1 + N2 X̄2 + N3 X̄3 + N4 X̄4) / (N1 + N2 + N3 + N4)
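The combined-mean formula translates directly into code; a sketch using the worker-wage figures that appear in Self Assessment Exercise C(3) below:

```python
# Combined mean of several data sets from their sizes and individual means.
def combined_mean(sizes, means):
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# 200 male workers (mean Rs. 150), 100 female workers (Rs. 90), 50 children (Rs. 35)
print(round(combined_mean([200, 100, 50], [150, 90, 35]), 2))   # 116.43
```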
Advantages and disadvantages of mean
The concept of the mean is familiar to most people and easily understood, and it possesses almost all the properties of a good measure of central tendency. However, the mean has disadvantages of which we must be aware. First, the value of the mean may be distorted by the presence of extreme values in a data set, and in the case of a U-shaped distribution this measure is not likely to serve a useful purpose. Second, we are unable to compute the mean for open-ended classes, since it is difficult to assign a mid-point to an open-ended class. Third, it cannot be used for qualitative variables.
Weighted Mean

The arithmetic mean, as discussed above, gives equal importance (weight) to all observations. But in some cases all observations do not carry the same weight; in such cases we must compute the weighted mean. The term 'weight', in the statistical sense, stands for the relative importance of the different variables. The weighted mean is defined as:

x̄w = ΣWx / ΣW

where x̄w is the weighted mean and W are the weights assigned to the variables (x).
The weighted mean is extensively used in index numbers; this will be discussed in detail in Unit 12: Index Numbers, of this course. For example, to compute a cost of living index, we need the price indices of different items and their weightages (percentages of consumption). The important issue that arises is the selection of weightages. If actual weightages are not available, then estimated or arbitrary weightages may be used; this is better than no weightages at all. However, keeping the phenomenon in mind, the weightages are to be assigned logically. To understand this concept, let us take an illustration.
Illustration 6
Given below are Price index numbers and weightages for different group of
items of consumption for an average industrial worker. Compute the cost of
living index.
Solution: The cost of living index is obtained by taking the weighted average
as explained in the table below:
Group Item        Group Price Index (Pi)    Weight (Wi)    Wi × Pi
Food                       150                  55           8250
Clothing                   186                  15           2790
House rent                 125                  17           2125
Fuel and Light             137                   8           1096
Others                     184                   5            920
Total                                     ΣW = 100       ΣWx = 15181

x̄w = ΣWx / ΣW = 15181 / 100 = 151.81

Therefore, the cost of living index is 151.81.
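The weighted-mean computation of Illustration 6 can be sketched as:

```python
# Cost of living index as a weighted mean of group price indices (Illustration 6).
indices = [150, 186, 125, 137, 184]   # Food, Clothing, House rent, Fuel & light, Others
weights = [55, 15, 17, 8, 5]

weighted_mean = sum(w * x for w, x in zip(weights, indices)) / sum(weights)
print(round(weighted_mean, 2))        # 15181 / 100 = 151.81
```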
Self Assessment Exercise C
1) A student's marks in a Computer course are 69, 75 and 80 respectively in
the papers on Theory, Practical and Project Work.
What are the mean marks if the weights are 1, 2 and 4 respectively?
What would be the mean marks if all the papers have equal importance?
Use the following table
Since the class width is 150 for all the classes, the method of assumed
mean is useful. The following table may be helpful.
So, the average monthly sales =
3) The mean wage of 200 male workers in a factory was Rs. 150 per day, and the mean wages of 100 female workers and 50 children in the same factory were Rs. 90 and Rs. 35 respectively. What would be the combined mean wage of all the workers? Comment on the result.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
8.3.3 Median

The median is another measure of central tendency. It represents the middle value of the data: half of the items lie above the median, and the other half lie below it.

Median of Ungrouped Data: To find the median of ungrouped data, first array the data in either ascending or descending order. If the number of observations (N) is odd, the median is the middle value; if it is even, the median is the mean of the two middle values. In formal language, the median is the ((N + 1)/2)th item in a data array, where N is the number of items. Let us consider the earlier Illustration 3 to locate the median value in two different sets of data.
Illustration-7

On arranging the daily wage data of the labourers (as given in Illustration 3) in ascending order, we get:

Rs. 22, 23, 24, 25, 29, 30, 58

There are seven observations, so the median is the ((7 + 1)/2)th, i.e., the 4th item. Hence the median wage is Rs. 25.

Had there been one more observation, say Rs. 6, the order would have been:

Rs. 6, 22, 23, 24, 25, 29, 30, 58

There are now eight observations, and the median is given by the mean of the fourth and the fifth observations (the ((8 + 1)/2)th = 4.5th item). So, median wage = (24 + 25)/2 = Rs. 24.5.
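The odd/even rule for the median of ungrouped data can be sketched as:

```python
# Median of ungrouped data: middle value of the sorted array when N is odd,
# mean of the two middle values when N is even.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

print(median([22, 25, 29, 58, 30, 24, 23]))      # 7 items -> 25
print(median([6, 22, 25, 29, 58, 30, 24, 23]))   # 8 items -> 24.5
```

Note that the extreme value 58 does not disturb the median, unlike the mean.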
Median of Grouped Data: Now let us calculate the median of grouped data. When the data are in the form of a discrete series, the median can be computed by examining the cumulative frequency distribution, as shown below.
Illustration-8

To compute the median wage from the data given in Illustration 4, we add one more row, of cumulative frequencies (the formation of cumulative frequencies was discussed in Unit 6 of this course: Processing of Data).

According to the formula, the median is the ((N + 1)/2)th item, and the number of observations here is 35. Therefore, the ((35 + 1)/2)th item, i.e., the 18th item, is the median. By inspection it is clear that the median wage is Rs. 29.
For a continuous frequency distribution (the data of Illustration 5), we need the value of the (N/2)th, i.e., the (70/2)th = 35th item. This item lies within the cumulative frequency 44. Hence, it is clear from Column (3) that the median lies in the third class interval, i.e., Rs. 25 to Rs. 30. So we have to locate the position of the 35th observation in the class 25-30.

Here, N/2 = 35, L = 25, cf = 25, f = 19 and i = 30 − 25 = 5.

Thus the median is: L + ((N/2 − cf)/f) × i = 25 + ((35 − 25)/19) × 5 = Rs. 27.63
It is to be noted that the median value may also be located graphically, by drawing ogives or a less than cumulative frequency curve. This method was discussed in detail in Unit 7: Diagrammatic and Graphic Presentation, of this block.
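The interpolation formula for the median of a continuous distribution can be sketched as below, reproducing the Rs. 27.63 result of the illustration:

```python
# Median of a continuous frequency distribution:
# Median = L + (N/2 - cf) / f * i, interpolating within the median class.
def grouped_median(classes, freq):
    n = sum(freq)
    cum = 0
    for (lo, hi), f in zip(classes, freq):
        if cum + f >= n / 2:
            return lo + (n / 2 - cum) / f * (hi - lo)
        cum += f

classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
freq = [2, 23, 19, 14, 5, 4, 3]        # wage data of Illustration 5

print(round(grouped_median(classes, freq), 2))   # 25 + (35 - 25)/19 * 5 = 27.63
```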
..................................................................................................................
..................................................................................................................
8.3.4 Mode
The mode is also a measure of central tendency. This measure differs from the arithmetic mean and, to some extent, resembles the median, in that it is not calculated by the ordinary processes of arithmetic. The mode of the data is the value that appears the maximum number of times. For ungrouped data, suppose the foot sizes (in inches) of ten persons are: 5, 8, 6, 9, 11, 10, 9, 8, 10, 9. Here the number 9 appears thrice, so the modal foot size is 9 inches. For grouped data, the method of calculating the mode differs between discrete and continuous distributions.
For discrete data, consider the earlier Illustration 4: the modal wage is Rs. 29, as it is the wage earned on the maximum number of days, i.e., six days. For continuous data, we usually refer to the modal class, the class with the maximum frequency (the inspection approach). The mode of a continuous distribution may then be computed using the expression:
Mode = L + (∆1 / (∆1 + ∆2)) × i

where L = lower limit of the modal class, i = width of the modal class, ∆1 = excess of the frequency of the modal class (f1) over the frequency of the preceding class (f0), and ∆2 = excess of the frequency of the modal class (f1) over the frequency of the succeeding class (f2). The letter ∆ is read as "delta".
It is to be noted that while using the formula for mode, you must arrange the
class intervals uniformly throughout, otherwise you will get misleading results.
To illustrate the computation of the mode, let us consider the grouped wage data of the earlier Illustration 5.

Illustration 10

Daily Wages (Rs.) : 15-20 20-25 25-30 30-35 35-40 40-45 45-50
No. of labourers  :   2    23    19    14     5     4     3
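A sketch of the mode formula, with the modal class picked by inspection (the class with the maximum frequency):

```python
# Mode of a continuous distribution: Mode = L + d1 / (d1 + d2) * i,
# with d1 = f1 - f0 and d2 = f1 - f2 around the modal class.
def grouped_mode(classes, freq):
    k = freq.index(max(freq))                      # modal class index
    lo, hi = classes[k]
    f0 = freq[k - 1] if k > 0 else 0               # preceding class frequency
    f2 = freq[k + 1] if k < len(freq) - 1 else 0   # succeeding class frequency
    d1, d2 = freq[k] - f0, freq[k] - f2
    return lo + d1 / (d1 + d2) * (hi - lo)

classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
freq = [2, 23, 19, 14, 5, 4, 3]                    # wage data of Illustration 5

print(round(grouped_mode(classes, freq), 2))       # 20 + 21/(21 + 4) * 5 = 24.2
```

The sketch assumes uniform class widths, as the formula itself requires.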
The related values are as follows:
The mode may not be unique: there may be more than one mode, or no mode at all (no value that occurs more than once). In such cases it is difficult to interpret and compare distributions. The mode is also not amenable to arithmetic and algebraic manipulation; for example, we cannot obtain the mode of a combined data set from the modes of the constituent data sets.
L = , f0 = , f1 = , f2 = and i= .
For a moderately skewed distribution, it has been empirically observed that the difference between the mean and the mode is approximately three times the difference between the mean and the median. This was illustrated in Fig. 8.1 (b) and (c). The expression is:

Mean − Mode = 3 (Mean − Median), i.e., Mode = 3 Median − 2 Mean

Sometimes this expression is used to calculate the value of one measure when the values of the other two are known.
8.3.5 Choice of a Suitable Average
From the above discussion, it is clear that for nominal data only mode can be
used, for ordinal data both mode and median can be used whereas for ratio and
interval levels of data all three measures can be calculated.
Figure 8.1: (a) A symmetrical distribution, where Mode = Median = Mean; (b) and (c) skewed distributions, where the mode, median and mean take different values.
Stability: Quite often a researcher studies a sample in order to draw inferences about the entire population. The mean is generally more stable than the median or the mode: if we calculate the means, medians and modes of different samples from a population, the means will generally be in closer agreement than the medians or the modes. Thus, the mean is a more reliable measure of central tendency. Normally, the choice of a suitable measure of central tendency depends on the common practice in a particular industry, and each case must be judged on its own requirements. For example, the mean sales of different products may be useful for many business decisions. The median price of a product is more useful to middle-class families buying a new product. The mode may be more useful to the garment industry, which needs the modal height of the population to decide the quantities of garments to be produced in different sizes.
Hence, the choice of the measure of central tendency depends on (1) type of
data (2) shape of the distribution (3) purpose of the study. Whenever possible,
all the three measures can be computed. This will indicate the type of
distribution.
Geometric Mean (GM): The geometric mean is defined as the Nth root of the product of all the N observations:

G.M. = (product of all N values)^(1/N)

Harmonic Mean (HM): The harmonic mean is the number of observations divided by the sum of the reciprocals of the observations:

H.M. = N / Σ(1/x)

For example, the harmonic mean of 4 and 6 is 2 / (1/4 + 1/6) = 2 / (5/12) = 4.8. Suppose a car covers half of a distance at a speed of 60 km/hr and the other half at 80 km/hr. Then the average speed of the car is 68.57 km/hr, which is the harmonic mean of 60 and 80. The harmonic mean is useful for averaging rates.

For any set of data for which the computations are possible, the following inequality holds (with equality only when all the observations are equal):

x̄ ≥ GM ≥ HM
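The geometric and harmonic means, and the AM ≥ GM ≥ HM inequality, can be checked in a few lines:

```python
# Geometric mean: Nth root of the product of N values.
# Harmonic mean: N divided by the sum of the reciprocals of the values.
import math

def gm(values):
    return math.prod(values) ** (1 / len(values))

def hm(values):
    return len(values) / sum(1 / v for v in values)

print(round(hm([4, 6]), 2))     # 2 / (1/4 + 1/6) = 4.8
print(round(hm([60, 80]), 2))   # average speed over two equal half-distances: 68.57

am = sum([4, 6]) / 2
assert am >= gm([4, 6]) >= hm([4, 6])   # AM >= GM >= HM
```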
Illustration-11

Rate : Amount of one variable per unit amount of some other variable.

Ratio : Relative value of one value with respect to another value.
Weighted Mean : An average in which each observation value is weighted by
some index of its importance.
The interpretation is obvious: out of all the workers in India, 13.46 per cent are in Uttar Pradesh and 10.45 per cent in Maharashtra. Andhra Pradesh has the highest share of agricultural labourers (12.86%), followed by Uttar Pradesh (12.66%) and Bihar (12.59%). The lowest share of household industry workers is in Himachal Pradesh, etc.
C: 1) The weighted mean is 539/7 = 77.

If all the papers have equal importance, i.e., equal weightage, then the simple mean = 224/3 = 74.67.
2) Since the class width is 150 for all the classes, the method of assumed mean is useful. On observation, the assumed mean is taken as 375.

x̄ = A + (Σfd / N) × i ; the mean sales of the 125 firms is Rs. 361.8 thousand.

3) x̄123 = (N1 x̄1 + N2 x̄2 + N3 x̄3) / (N1 + N2 + N3) = 116.43

D: Median = L + ((N/2 − c.f.)/f) × c ; Me = 359.76

E: Mode = L + (∆1 / (∆1 + ∆2)) × i ; the modal sales value is Rs. 363.64 thousand.
5) The monthly salaries (in rupees) of 11 staff members of an office are:

2000, 2500, 2100, 2400, 10000, 2100, 2300, 2450, 2600, 2550 and 2700.
Find mean, median and mode of the monthly salaries.
Which one among the above do you consider the best measure of central
tendency for the above data set and why?
6) Consider the data set given in problem 2 above.
Find mean deviation of the data set from (i) median (ii) 2400 and (iii) 2500.
Find mean squared deviation of the data set from (i) mean (ii) 3000 and (iii)
3100.
7) Mean examination marks in Mathematics in three sections are 68, 75 and 72, the
number of students being 32, 43 and 45 respectively in these sections. Find the
mean examination marks in Mathematics for all the three sections taken
together.
8) The followings are the volume of sales (in Rupees) achieved in a month by 25
marketing trainees of a firm:
1220 1280 1700 1400 400 350 1200 1550 1300 1400
1450 300 1800 200 1150 1225 1300 1100 450 1200
1800 475 1200 600 1200
The firm has decided to give the trainees a performance bonus as per the following rule: Rs. 100 if the volume of sales is below Rs. 500; Rs. 250 if the volume of sales is between Rs. 500 and Rs. 1,000; Rs. 400 if the volume of sales is between Rs. 1,000 and Rs. 1,500; and Rs. 600 if the volume of sales is above Rs. 1,500.
Find the average value of performance bonus of the trainees.
9) In an urban cooperative bank, the minimum deposit in a savings bank account is Rs. 500. The deposit balances at the end of a working day are given in the table below:
Table: Average Deposit Balance in ABC Urban Cooperative Bank
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 9 MEASURES OF VARIATION AND SKEWNESS
STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Variation – Why is it Important?
9.3 Significance of Variation
9.4 Measures of Variation
9.4.1 Range
9.4.2 Quartile Deviation
9.4.3 Mean Deviation
9.4.4 Standard Deviation
9.4.5 Coefficient of Variation
9.5 Skewness
9.6 Relative Skewness
9.7 Let Us Sum Up
9.8 Key Words
9.9 Answers to Self Assessment Exercises
9.10 Terminal Questions/Exercises
9.11 Further Reading
9.0 OBJECTIVES
After studying this Unit, you should be able to:
l describe the concept and significance of measuring variability for data analysis,
l compute various measures of variation and its application for analysing the
data,
l choose an appropriate measure of variation under different situations,
l describe the importance of Skewness in data analysis,
l explain and differentiate the symmetrical, positively skewed and negatively
skewed data, and
l ascertain the value of the coefficient of skewness and comment on the nature
of distribution.
9.1 INTRODUCTION
In Unit 8, we have learnt about the measures of central tendency. They give us
only a single figure that represents the entire data. However, central tendency
alone cannot adequately describe a set of data, unless all the values of the
variables in the collected data are the same. Obviously, no average can
sufficiently analyse the data, if all the values in a distribution are widely spread.
Hence, the measures of central tendency must be supported and supplemented
with other measures for analysing the data more meaningfully. Generally, there
are three other characteristics of data which provide useful information for data
analysis, i.e., Variation, Skewness, and Kurtosis. The third characteristic,
Kurtosis, is not within the scope of this course. In this unit, therefore, we shall
discuss the importance of measuring variation and skewness for describing
distribution of data and their computation. We shall also discuss the role of
normal curves in characterizing the data.
9.2 VARIATION – WHY IS IT IMPORTANT?

Measures of variation are statistics that indicate the degree to which numerical
data tend to spread about an average value. Variation is also called dispersion,
scatter, spread, etc., and is related to the homogeneity of the data. In the simple
words of Simpson and Kafka, “the measurement of the scatterness of the mass of
figures (data) in a series about an average is called measure of variation”. Therefore,
we can say, variation measures the extent to which the items scatter from
average. To be more specific, an average is more meaningful when the data
are examined in the light of variation. In fact, in the absence of a measure of
dispersion, it will not be possible to say which one of two or more sets of
data is represented more closely and adequately by its arithmetic mean value.
Here, the following illustration helps you to understand the necessity of
measuring variability of data for effective analysis.
Illustration-1
The data given below relate to the marks secured by three students (A, B and C)
in different subjects.

Subjects                     Marks
                          A     B     C
Research Methodology     50    50    10
Accounting for Managers  50    70   100
Financial Management     50    40    80
Marketing Management     50    40    30
Managerial Economics     50    50    30
Total                   250   250   250
Mean (x̄)                 50    50    50
In the above illustration, you may notice that the marks of the three students
have the same mean, i.e., the average marks of A, B and C are each 50 marks,
and we might conclude that the three distributions are similar. But you should
note that, by observing the distributions (subject-wise), there is a wide difference
in the marks of these three students. In case of 'A' the marks in each subject
are 50; hence we can say each and every item of the data is perfectly
represented by the mean or, in other words, there is no
variation. In case of 'B' there is slight variation as compared to 'C', whereas in
case of 'C' not a single item is perfectly represented by the mean and the
items vary widely from one another. Thus, different sets of data may have the
same value of average but may differ greatly in terms of spread or scatter of
items. The study of variability, therefore, is necessary to gauge the degree of
scatter of the items from the average in the collected data.
A family intends to cross a lake. They come to know that the average depth of
the lake is 4 feet. The average height of the family members is 5.5 feet. They
therefore decide that the lake can be crossed safely. While crossing the lake, at
a particular place all the members of the family get drowned, where the level of
water is more than 6.5 feet deep. The reason for drowning is that they relied on
the average depth of the lake and their average height, but not on the
variability of the lake's depth and of their heights. In the light of the above
example, we may understand the reason for measuring the variability of a given
set of data.
Keeping in view the above purposes, the variation of data must be taken into
account while taking business decisions.
9.4.1 Range
The Range is the simplest measure of variation. It is defined simply as the
difference between the highest value and the lowest value of observation in a
set of data. In equation form, the absolute measure of range for ungrouped
data is:

Range = H − L

where H is the highest and L is the lowest value of observation. In grouped
data, the absolute range is the difference between the upper limit of the highest
class and the lower limit of the lowest class. The relative measure of range for
ungrouped and grouped data, called the coefficient of range, is as follows:

Coefficient of Range = (H − L) / (H + L)
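As a minimal sketch, the absolute and relative range can be computed in a few lines of code; the fare figures below are hypothetical, not taken from the unit:

```python
# Absolute range (H - L) and coefficient of range (H - L)/(H + L)
# for ungrouped data. The fare figures are hypothetical.
def absolute_range(data):
    return max(data) - min(data)

def coefficient_of_range(data):
    h, l = max(data), min(data)
    return (h - l) / (h + l)

fares = [40, 42, 45, 48, 60]   # hypothetical daily fares (Rs.)
print(absolute_range(fares))                  # 20
print(round(coefficient_of_range(fares), 2))  # 0.2
```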
Illustration-2
The following data relates to the total fares collected on Monday from three
different transport agencies.
The interpretation of the above result is simple. In the above illustration, the
variation is nil in case of the taxi fare of agency 'B', while the variation is small
in agency 'A' and high in transport agency 'C'. The coefficient of range for
transport agencies 'A' and 'C' is as follows:
The following data relates to the record of time (in minutes) of trucks waiting
to unload material.
Calculate the absolute and relative range and comment on whether you think it
is a useful measure of variation.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.4.2 Quartile Deviation

The quartile deviation (Q.D.) is half the difference between the upper and lower
quartiles:

Q.D. = (Q3 − Q1) / 2
The relative measure of Q.D., called the coefficient of quartile deviation, is
calculated as:

Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
It is to be noted that the above formulae (absolute and relative) are applicable
to ungrouped data and grouped data as well. Let us take up an illustration to
ascertain the value of quartile deviation and coefficient of Q.D.
Illustration-3
The following data relate to the daily expenditure of students in Delhi
University. Calculate the quartile deviation and its co-efficient.

Daily expenditure (Rs.): 50-100 100-150 150-200 200-250 250-300 300-350 350-400 400-450 450-500
No. of Students:           18     14      21      15      12      13       8       5       2

Q1 has the (N/4)th observation, i.e., the (108/4)th = 27th observation, which
lies in the 100-150 class. For grouped data,

Q1 = L1 + ((N/4 − c.f.) / f) × i

Where 'L1' is the lower limit of the Q1 class, 'c.f.' is the cumulative frequency
of the class preceding the Q1 class, 'f' is the frequency of the Q1 class, and 'i'
is the class interval. Now we substitute these values to obtain the result of Q1:

Q1 = 100 + ((27 − 18) / 14) × 50 = Rs. 132.14
Q3 has the 3(N/4)th observation, i.e., the 3(108/4)th = 81st observation. This
observation lies against the cumulative frequency 93, so Q3 lies in the 300-350
class.

Q3 = L1 + ((3N/4 − c.f.) / f) × i

Here, as explained above, L1, c.f., f and i relate to the Q3 class. Therefore,

Q3 = 300 + ((81 − 80) / 13) × 50 = Rs. 303.85
Q.D. = (Q3 − Q1) / 2 = (303.85 − 132.14) / 2 = Rs. 85.85

Relative measure of Q.D., i.e., Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)
= (303.85 − 132.14) / (303.85 + 132.14) = 0.39
From the above data it may be concluded that the variation of daily expenditure
among the sample students of DU is Rs. 85.85. The coefficient of Q.D. is 0.39;
this relative value of variation may be compared with other variables related to
expenditure, such as family income of the students, pocket money, spending
habits, etc.
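The interpolation steps above can be reproduced in a short program. Below is a sketch using the expenditure table of Illustration-3; the helper name `grouped_quartile` is our own, not from the unit:

```python
# Quartiles from a grouped frequency table:
# Q = L1 + ((k*N/4 - c.f.) / f) * i, with k = 1 for Q1 and k = 3 for Q3.
def grouped_quartile(lowers, freqs, width, k):
    n = sum(freqs)
    target = k * n / 4
    cum = 0
    for low, f in zip(lowers, freqs):
        if cum + f >= target:          # quartile class found
            return low + (target - cum) / f * width
        cum += f

lowers = [50, 100, 150, 200, 250, 300, 350, 400, 450]
freqs  = [18, 14, 21, 15, 12, 13, 8, 5, 2]
q1 = grouped_quartile(lowers, freqs, 50, 1)   # ~132.14
q3 = grouped_quartile(lowers, freqs, 50, 3)   # ~303.85
qd = (q3 - q1) / 2                            # ~85.85
coeff = (q3 - q1) / (q3 + q1)                 # ~0.39
```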
Compute the quartile deviation and its co-efficient. Do you think this is an
appropriate measure for measuring variability? Justify your opinion.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.4.3 Mean Deviation

The mean deviation (M.D.) is the arithmetic mean of the absolute deviations of
the items from an average (the mean or the median). Its relative measure is:

Co-efficient of M.D. = M.D. / (the average used, x̄ or Me)
As an illustration, let us consider the following data, which relate to the sales
of Company A and Company B during 1995-2001.
Illustration-4
Compute the mean deviation and its co-efficient of the sales of the two
companies, A and B, and comment on the result.
Table: Company A: year-wise sales X (Rs. in '000), Mean = 547, with deviations
|x − x̄|; Company B: year-wise sales X (Rs. in '000), Mean = 4894, with
deviations |x − x̄|.
Mean Sales of Company 'A' = ΣX_A / N = 3829 / 7 = Rs. 547 thousand

Mean Sales of Company 'B' = ΣX_B / N = 34258 / 7 = Rs. 4894 thousand

Formula for Mean Deviation from the Mean: M.D. = Σ|x − x̄| / N

M.D. of Company 'A' = 1294 / 7 = Rs. 184.9 thousand

M.D. of Company 'B' = 8456 / 7 = Rs. 1208 thousand
Co-efficient of M.D.:

Company 'A' = M.D._A / Mean_A = 184.9 / 547 = 0.34

Company 'B' = M.D._B / Mean_B = 1208 / 4894 = 0.25
The coefficient of mean deviation of company 'A's sales is more than that of
company 'B'. Hence we can conclude that there is greater variability in the
sales of company 'A'.
The drawback of this method, it may be observed, is that the algebraic signs (+
or –) of the deviations are ignored. From the mathematical point of view this is
unjustifiable, as it makes the measure unsuitable for further algebraic treatment.
That is the reason mean deviation is not frequently used in business research.
The accuracy of the result of mean deviation depends upon the degree to which
the average represents the data. Despite these few drawbacks, it is a most
useful measure in case of: i) small samples where no elaborate analysis is
required, ii) reports presented to the general public not familiar with statistical
methods, and iii) inventory control, where it has some specific utility.
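As a rough sketch, the mean deviation and its coefficient can be computed as below; the sales series is hypothetical, since the unit's year-wise figures are not reproduced here:

```python
# Mean deviation from the mean: M.D. = sum(|x - mean|) / N,
# and coefficient of M.D. = M.D. / mean. The series is hypothetical.
def mean_deviation(data):
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)

sales = [500, 520, 480, 600, 400, 560, 440]   # Rs. '000 (hypothetical)
mean = sum(sales) / len(sales)                # 500.0
md = mean_deviation(sales)                    # about 51.43
coeff_md = md / mean                          # about 0.10
```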
Self Assessment Exercise C

Calculate the mean deviation and its co-efficient from the following data, which
relate to the weekly earnings of families in an area. What light does it throw
on the economic condition of that community? Do you consider this a scientific
measure of variability? Give your opinion.
..........................................................................................................................
..........................................................................................................................
9.4.4 Standard Deviation

The Standard deviation is the most familiar, important and widely used measure
of variation. It is a significant measure for making comparisons of variability
between two or more sets of data in terms of their distance from the mean.
In practice, the mean deviation has been replaced by the standard deviation.
As discussed earlier, while calculating mean deviation the algebraic signs (–/+)
are ignored, and it can be computed from any of the averages. In the
computation of standard deviation, however, the signs are taken into account:
the deviations of items, always taken from the mean, are squared (instead of
ignoring signs) and averaged, and finally the square root of this value is
extracted. Thus, standard deviation may be defined as “the square root of the
arithmetic mean of the squares of deviations from the arithmetic mean of a
given distribution.” This measure is also known as root mean square deviation.
If the values in a given data set are dispersed more widely from the mean, the
standard deviation becomes greater. It is usually denoted by σ (read as sigma).
The square of the standard deviation (σ²) is called the “variance”.
σ = √( Σ(x − x̄)² / N ), or simply σ = √( Σx² / N )

where Σx² = sum of the squares of the deviations (x − x̄) and N = number of
observations.
If the collected data are very large, then taking an assumed mean is more
convenient for computing the standard deviation. In such a case, the formula is
slightly modified as:

σ = √( Σfdx² / N − (Σfdx / N)² ) × C

Where x_A = assumed mean, dx = (X − x_A) / C, C = common factor, and
N = Σf.
The above formula is applicable only when the class intervals are equal.
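A small sketch of the step-deviation (assumed mean) computation, using hypothetical class mid-values and frequencies rather than the illustration's figures:

```python
import math

# Step-deviation method: sigma = sqrt(sum(f*dx^2)/N - (sum(f*dx)/N)^2) * C,
# where dx = (mid-value - assumed mean)/C. Data below are hypothetical.
mids = [10, 20, 30, 40, 50]        # class mid-values
f    = [2, 5, 8, 4, 1]             # frequencies
A, C = 30, 10                      # assumed mean and common factor

dx       = [(m - A) // C for m in mids]              # [-2, -1, 0, 1, 2]
n        = sum(f)                                    # 20
sum_fdx  = sum(fi * d for fi, d in zip(f, dx))       # -3
sum_fdx2 = sum(fi * d * d for fi, d in zip(f, dx))   # 21
sigma = math.sqrt(sum_fdx2 / n - (sum_fdx / n) ** 2) * C   # about 10.14
```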
Illustration-5
Compute the standard deviation of the following profit data by the step-deviation
(assumed mean) method.

Profit (Rs. in lakh)    f    M.V. (x)   dx = (x − xA)/C   fdx    fdx²
6-10                    9       8           –2            –18     36
10-14                  11      12           –1            –11     22
14-18                  20      16            0              0      0
18-22                  16      20            1             16     16
22-26                   9      24            2             18     36
26-30                   5      28            3             15     45
                     N = 70                          Σfdx = 20  Σfdx² = 155

In the above computation, we have taken the mid value 16 as the assumed mean
(xA); the common factor (C) is 4.
σ = √( Σfdx² / N − (Σfdx / N)² ) × C

= √( 155/70 − (20/70)² ) × 4 = √(2.21 − 0.08) × 4 = √2.13 × 4 = 1.46 × 4 = 5.84
Among all the measures of variation, standard deviation is the only measure
possessing the necessary mathematical properties which enhance its utility in
advanced statistical work. It is least affected by the fluctuations of sampling. In
a normal distribution, x̄ ± σ covers 68% of the values, whereas x̄ ± Q.D. covers
50% of the values and x̄ ± M.D. covers 57% of the values. This is the reason
that standard deviation is called a “Standard Measure”.
9.4.5 Coefficient of Variation

The mean production of paddy and its standard deviation in four states are
given below:

State    Mean production     σ
I              83           9.93
II             40           5.24
III            70           8.12
IV             59          10.89

You may notice that the mean production of paddy in the four states is not
equal. In such a situation, to determine which state is more consistent in terms
of production, we compute the coefficient of variation:

C.V. = (σ / x̄) × 100
C.V. of State I = (9.93 / 83) × 100 = 11.96%;  C.V. of State II = (5.24 / 40) × 100 = 13.10%

C.V. of State III = (8.12 / 70) × 100 = 11.60%;  C.V. of State IV = (10.89 / 59) × 100 = 18.46%
It is seen that the standard deviation is low in State II when compared with
the other states. However, since the C.V. is the least in State III, it is more
consistent in the production of paddy than the other three states. Among the
four states, State IV is extremely inconsistent in the production of paddy.
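The four C.V. figures can be verified with a few lines of code, using the means and standard deviations given above:

```python
# C.V. = (sigma / mean) * 100 for the four states' paddy production.
means  = [83, 40, 70, 59]          # mean production, States I-IV
sigmas = [9.93, 5.24, 8.12, 10.89]

cv = [round(s / m * 100, 2) for s, m in zip(sigmas, means)]
# cv -> [11.96, 13.1, 11.6, 18.46]; the smallest C.V. (State III) marks
# the most consistent state, the largest (State IV) the least consistent.
```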
A prospective buyer tested the bursting pressure of a sample of 120 carry bags
received from manufacturers A and B. The results are tabulated below:

No. of bags of A:  3  14  30  56  12   5
No. of bags of B:  8  16  23  34  24  15
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.5 SKEWNESS
The measure of skewness tells us the direction of dispersion about the centre
of the distribution. Measures of central tendency indicate only a single
representative figure of the distribution, while measures of variation indicate only
the spread of the individual values around the mean. They do not give any
idea of the direction of spread. Two distributions may have the same mean and
variation but may differ widely in the shape of their distribution. A distribution is
often found skewed on either side of its average, which is termed an
asymmetrical distribution. Thus, skewness refers to the lack of symmetry in a
distribution. Symmetry signifies that the values of the variable are equidistant
from the average on both sides. In other words, a balanced pattern of a
distribution is called a symmetrical distribution, whereas an unbalanced pattern
of distribution is called an asymmetrical distribution.
Fig. 9.1: (a) Symmetrical, (b) positively skewed and (c) negatively skewed distributions
Carefully observe the figures presented above and try to understand the
following rules governing them.
It is clear from Figure 9.1 (a) that the data are symmetrical when the spread
of the frequencies is the same on both sides of the middle point of the
frequency polygon. In this case the values of mean, median, and mode coincide,
i.e., Mean = Median = Mode.

In Figure 9.1 (b), when there is a longer tail towards the right hand side of the
centre, the skewness is said to be 'Positively Skewed'. In such a case,
Mean > Median > Mode.

In Figure 9.1 (c), when there is a longer tail towards the left hand side of the
centre, the skewness is said to be 'Negatively Skewed'. In such a case,
Mean < Median < Mode.
Tests of Skewness
In the light of the above discussion, we can summarise the following facts
regarding the presence of skewness in a given distribution.
Karl Pearson's coefficient of skewness is:

SKp = (x̄ − Mo) / σ

This method computes the co-efficient of skewness by considering all the items
of the data set. Its value usually varies between the limits ±3.
If the mode is ill-defined and cannot be easily located, then, using the
approximate empirical relationship between mean, median, and mode stated in
Unit 8, Section 8.3.5 (Mode = 3 Median − 2 Mean), the coefficient of skewness
can be determined by removing the mode and substituting the median in its
place. Thus the changed formula is:

SKp = 3 (Mean − Median) / σ
Let us consider the following data to understand the application of Karl
Pearson’s formula for measuring the co-efficient of skewness.
Illustration-6
The following measures are obtained from the profits of 100 shops in two
different regions. Calculate Karl Pearson's co-efficient of skewness and
comment on the results.
Coefficient of skewness for Region I = (16.62 − 18.47) / 3.04 = −0.61

Coefficient of skewness for Region II = (45.36 − 36.94) / 17.71 = 0.48
Based on the results we can comment on the distributions of the two regions
as follows: the coefficient of skewness for Region I is negative, while that of
Region II is positive, and the distribution of profits in Region I is more
skewed. Since the result for Region I indicates that the distribution is negatively
skewed, there is a greater concentration towards higher profits. In case of
Region II, the value of the coefficient of skewness indicates that the distribution
is positively skewed; therefore, there is a greater concentration towards lower
profits.
Illustration-7
The following statistical measures are given from the data of a factory before
and after the settlement of a wage dispute. Calculate Pearson's co-efficient
of skewness and comment.
Karl Pearson's Co-efficient of Skewness (SKp) = 3 (Mean − Median) / σ
a) Before settlement of the wage dispute: SKp = 3 (22.8 − 24.0) / 5.9 = −3.6 / 5.9 = −0.61

b) After settlement of the wage dispute: SKp = 3 (24 − 23) / 4.95 = 3 / 4.95 = 0.61
From the above calculated values of coefficient of skewness, under different
situations, we may comment upon the nature of distribution as follows:
Before the settlement of dispute the distribution was negatively skewed and
hence there is a greater concentration of wages towards the higher wages.
Whereas it was positively skewed after the settlement of the dispute. This
reveals that, even though the mean wage of workers increased after the
settlement (before settlement, total wages were 1,200 × 22.8 = Rs. 27,360;
after settlement, total wages were 1,175 × 24 = Rs. 28,200), the workers who
were getting low wages received considerably higher wages after the settlement
of their dispute, while the wages of the workers getting high wages before
settlement had fallen.
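The median form of Karl Pearson's coefficient used in Illustration-7 can be checked in code:

```python
# Karl Pearson's coefficient, median form: SKp = 3 * (mean - median) / sigma.
def pearson_skewness(mean, median, sigma):
    return 3 * (mean - median) / sigma

before = pearson_skewness(22.8, 24.0, 5.9)   # about -0.61: negatively skewed
after  = pearson_skewness(24.0, 23.0, 4.95)  # about +0.61: positively skewed
```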
C.V. = (σ / x̄) × 100

a) Before settlement, the coefficient of variation = (5.9 / 22.8) × 100 = 25.88%

b) After settlement, the coefficient of variation = (4.95 / 24.0) × 100 = 20.62%
Based on the computed values of variation, it may be concluded that there is
sufficient evidence of lesser inequality in the distribution of wages after
settlement of the dispute. It means that there was greater scatter in wage
payment before the dispute was settled.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
Bowley's measure of skewness, based on quartiles, is:

SKB = ((Q3 − Q2) − (Q2 − Q1)) / ((Q3 − Q2) + (Q2 − Q1))

Alternatively,

SKB = (Q3 + Q1 − 2 Median) / (Q3 − Q1)
Illustration-8
Given Q1 = 62, Q2 (Median) = 141 and Q3 = 190,

Bowley's coefficient of skewness (SKB) = (Q3 + Q1 − 2 Median) / (Q3 − Q1)
= (190 + 62 − 2 × 141) / (190 − 62) = −30 / 128 = −0.23
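The quartile values of Illustration-8 plug directly into Bowley's formula:

```python
# Bowley's coefficient: SKB = (Q3 + Q1 - 2*Q2) / (Q3 - Q1),
# using the quartiles given in Illustration-8.
q1, q2, q3 = 62, 141, 190
skb = (q3 + q1 - 2 * q2) / (q3 - q1)   # -30/128, about -0.23
```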
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
In such a distribution, mean, median, and mode are equal and they lie at the
centre of the distribution. In contrast, an asymmetrical distribution means an
unbalanced pattern of frequency distribution, called a 'skewed' distribution.
A skewed distribution may be positively skewed or negatively skewed. In a
positively skewed distribution the mean is greater than the median and mode
(x̄ > Me > Mo) and the data curve has a long tail on the right hand side.
On the other hand, in a negatively skewed distribution the mode is greater than
the median and mean (Mo > Me > x̄) and the data curve has a long tail on
the left hand side. In a skewed distribution, the relationship between the mean
and the median is such that the interval between them is approximately
one-third of the interval between the mean and the mode. Based on this
relationship, the degree of skewness is measured. There are two formulae used
for measuring the coefficient of skewness, called relative measures of skewness,
proposed by Karl Pearson and Bowley. Bowley's formula is normally applied
when the data are of an open-end type or/and the classes are unequal.
Range : It is the difference between the highest value and the lowest value of
observations.
The Median earnings of the 1600 families is Rs. 1381. It reveals that 50% of
the families are earning between Rs. 1,000 to Rs. 2,000. It is to be noted that
very few (44 families out of 1,600 families) fall in the last three classes of
higher-earning groups.
In fact, this measure of variability gives the best results when deviations are
taken from the median; but the median is not a satisfactory measure when the
dispersion in a distribution is very high. It is also not appropriate for large
samples.
Since the mean bursting pressure of manufacturer B's bags is higher, these
bags may be regarded as more standard. However, the bags of manufacturer
A may be suggested for purchase, as they are more consistent: their C.V. is
significantly lower than that of the bags of manufacturer B.

In case the buyer would not like to buy bags having more than 16 kgs
bursting pressure, then the average bursting pressure of manufacturer A's
bags is higher than that of manufacturer B. The co-efficient of variation is also
much lower in case of manufacturer A than manufacturer B. Hence, in this
case, we may suggest buying from manufacturer A.
7) A transport agency had tested the tyres of two brands, A and B. The results are
given in the table below.
Life (thousand units) Brand A Brand B
15-20 6 8
20-25 15 8
25-30 10 22
30-35 16 17
35-40 13 12
40-45 9 6
45-50 11 0
i) Which brand of tyres do you suggest the transport agency use on their
fleet of trucks?
8) In a manufacturing firm, four employees on the same job show the following
results over a period of time.
A B C D
Mean time of completing the Job 61 70 83 80.5
(minutes)
Variance (σ²) 64 81 121 100
10) The following Table gives the No. of defects per product and its frequency.
11) The following information was obtained from the records of a factory relating to
the wages, before and after settlement of a wage dispute.
Regular M.Com 20 24 18 22 26 25 21 28 23 29
Distance M.Com 24 29 40 46 34 27 31 28 38 23
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 10 CORRELATION AND SIMPLE REGRESSION
STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Correlation
10.2.1 Scatter Diagram
10.3 The Correlation Coefficient
10.3.1 Karl Pearson’s Correlation Coefficient
10.3.2 Testing for the Significance of the Correlation Coefficient
10.3.3 Spearman’s Rank Correlation
10.4 Simple Linear Regression
10.5 Estimating the Linear Regression
10.5.1 Standard Error of Estimate
10.5.2 Coefficient of Determination
10.6 Difference Between Correlation and Regression
10.7 Let Us Sum Up
10.8 Key Words
10.9 Answers to Self Assessment Exercises
10.10 Terminal Questions/Exercises
10.11 Further Reading
Appendix Tables
10.0 OBJECTIVES
After studying this unit, you should be able to:
10.1 INTRODUCTION
In the previous units, so far, we have discussed the statistical treatment of data
relating to one variable only. In many situations, researchers and decision-makers
need to consider the relationship between two or more variables.
example, the sales manager of a company may observe that the sales are not
the same for each month. He/she also knows that the company’s advertising
expenditure varies from year to year. This manager would be interested in
knowing whether a relationship exists between sales and advertising
expenditure. If the manager could successfully define the relationship, he/she
might use this result to do a better job of planning and to improve predictions of
yearly sales for his/her company with the help of the regression technique.
Similarly, a researcher may be interested in studying the effect of research and
development expenditure on annual profits of a firm, the relationship that exists
between price index and purchasing power etc. The variables are said to be
closely related if a relationship exists between them.
This unit, therefore, introduces the concepts of correlation and regression and
some statistical techniques of simple correlation and regression analysis. The methods
used are important to the researcher(s) and the decision-maker(s) who need to
determine the relationship between two variables for drawing conclusions and
decision-making.
10.2 CORRELATION
If two variables, say x and y vary or move together in the same or in the
opposite directions they are said to be correlated or associated. Thus,
correlation refers to the relationship between the variables. Generally, we find
the relationship in certain types of variables. For example, a relationship exists
between income and expenditure, absenteeism and production, advertisement
expenses and sales, etc. The type of relationship that exists may differ from
one set of variables to another. Let us discuss some of these relationships with
the help of scatter diagrams.
Fig. 10.1: Scatter diagrams: (a) perfect positive correlation (r = 1), (b) perfect
negative correlation (r = –1), (c) positive correlation (r > 0), (d) negative
correlation (r < 0), (e) non-linear correlation, (f) no correlation (r = 0).
If X and Y variables move in the same direction (i.e., either both of them
increase or both decrease) the relationship between them is said to be positive
correlation [Fig. 10.1 (a) and (c)]. On the other hand, if X and Y variables
move in the opposite directions (i.e., if variable X increases and variable Y
decreases or vice-versa) the relationship between them is said to be negative
correlation [Fig. 10.1 (b) and (d)]. If Y is unaffected by any change in X
variable, then the relationship between them is said to be un-correlated [Fig.
10.1 (f)]. If the amount of variations in variable X bears a constant ratio to the
corresponding amount of variations in Y, then the relationship between them is
said to be linear-correlation [Fig. 10.1 (a) to (d)], otherwise it is non-linear
or curvilinear correlation [Fig. 10.1 (e)]. Since measuring non-linear
correlation for data analysis is far more complicated, we generally make the
assumption that the association between two variables is of the linear type.
Illustration 1

Table 10.1: A Company’s Advertising Expenses and Sales Data (Rs. in crore)

Years:     1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
Sales (Y):  60   55   50   40   35   30   20   15   11   10
The company’s sales manager claims the sales variability occurs because the
marketing department constantly changes its advertising expenditure. He/she is
quite certain that there is a relationship between sales and advertising, but does
not know what the relationship is.
The different situations shown in Figure 10.1 are all possibilities for describing
the relationships between sales and advertising expenditure for the company. To
determine the appropriate relationship, we have to construct a scatter diagram
shown in Figure 10.2, considering the values shown in Table 10.1.
[Scatter plot: Sales (Rs. Crore) on the vertical axis (0 to 60) against
Advertising Expenditure (Rs. Crore) on the horizontal axis (1 to 6).]
Figure 10.2 : Scatter Diagram of Sales and Advertising Expenditure for a Company.
Figure 10.2 indicates that advertising expenditure and sales seem to be linearly
(positively) related. However, the strength of this relationship is not known; that
is, how close the points come to falling on a straight line is yet to be
determined. The quantitative measure of the strength of the linear relationship
between two variables (here, sales and advertising expenditure) is called the
correlation coefficient. In the next section, therefore, we shall study the
methods for determining the coefficient of correlation.
. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................
2) How does a scatter diagram approach help in studying the correlation between
two variables?
. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................
The simplified formulae (which are algebraically equivalent to the above formula)
are:

1) r = Σxy / √(Σx² · Σy²) , where x = X − X̄ and y = Y − Ȳ

2) r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
i) ‘r’ is a dimensionless number whose numerical value lies between –1 and +1. The
value +1 represents a perfect positive correlation, while the value –1 represents
a perfect negative correlation. The value 0 (zero) represents lack of correlation.
Figure 10.1 shows a number of scatter plots with corresponding values for
correlation coefficient.
ii) The coefficient of correlation is a pure number and is independent of the units of
measurement of the variables.
iii) The correlation coefficient is independent of any change in the origin and scale
of X and Y values.
Remark: Care should be taken when interpreting the correlation results.
Although a change in advertising may, in fact, cause sales to change, the fact
that the two variables are correlated does not guarantee a cause and effect
relationship. Two seemingly unconnected variables may often be highly
correlated. For example, we may observe a high degree of correlation: (i)
between the height and the income of individuals or (ii) between the size of the
shoes and the marks secured by a group of persons, even though it is not
possible to conceive them to be causally related. When correlation exists
between two such seemingly unrelated variables, it is called spurious or non-
sense correlation. Therefore, we must avoid basing conclusions on spurious
correlation.
Illustration 2
We know that
r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
You may notice that the manual calculations will be cumbersome for real life
research work. Therefore, statistical packages such as Minitab, SPSS, SAS, etc.,
may be used to calculate ‘r’ and other measures as well.
Once the coefficient of correlation has been obtained from sample data one is
normally interested in asking the questions: Is there an association between the
two variables? Or with what confidence can we make a statement about the
association between the two variables? Such questions are best answered
statistically by using the following procedure.
Testing of the null hypothesis (testing hypothesis and t-test are discussed in
detail in Units 15 and 16 of this course) that population correlation coefficient
equals zero (variables in the population are uncorrelated) versus alternative
hypothesis that it does not equal zero, is carried out by using the t-statistic
formula:

t = r √[(n − 2) / (1 − r²)] , where r is the correlation coefficient obtained from the sample.
Referring to the table of the t-distribution for (n − 2) degrees of freedom, we can
find the critical value of t at any desired level of significance (the 5% level of
significance is commonly used). If the calculated value of t (as obtained by the
above formula) is less than or equal to the table value of t, we accept the null
hypothesis (H0), meaning that the correlation between the two variables is not
significantly different from zero.
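The whole procedure, computing r by the simplified formula and then the t-statistic, can be sketched in Python (the function names are our own; this is an illustrative sketch, not part of the unit):

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient by the simplified formula:
    r = (ΣXY − ΣXΣY/n) / √[(ΣX² − (ΣX)²/n)(ΣY² − (ΣY)²/n)]"""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(p * q for p, q in zip(x, y))
    return (sxy - sx * sy / n) / math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))

def t_for_r(r, n):
    """t = r·√((n − 2)/(1 − r²)), compared with the table value at n − 2 d.f."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# With r = 0.55 from a sample of n = 12 (the figures used in Illustration 3):
print(round(t_for_r(0.55, 12), 2))   # 2.08, below the 5% table value of 2.228
```

If the printed t falls below the table value for n − 2 degrees of freedom, H0 is accepted, exactly as in the illustrations that follow.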
Illustration 3
Solution: Let us take the null hypothesis (H0) that the variables in the
population are uncorrelated.
Applying the t-test,

t = r √[(n − 2)/(1 − r²)] = 0.55 √[(12 − 2)/(1 − 0.55²)] = 0.55 √(10/0.6975) = 2.08
From the t-distribution table (given at the end of this unit) with 10
degrees of freedom, the critical value of t at a 5% level of significance is
t(0.025, 10) = 2.228. The calculated value of t (2.08) is less than the table value.
Therefore, we can conclude that this r of 0.55 for n = 12 is not significantly
different from zero. Hence our null hypothesis (H0) holds true, i.e., the
variables in the population are uncorrelated.
Let us take another illustration to test the significance.
Illustration 4
Solution: Let us take the hypothesis that the variables in the population are
uncorrelated. Applying the t-test:

t = r √[(n − 2)/(1 − r²)] = 0.55 √[(100 − 2)/(1 − 0.55²)] = 6.52
Referring to the table of the t-distribution for n − 2 = 98 degrees of freedom, the
critical value of t at a 5% level of significance is t(0.025, 98) = 1.99
(approximately). Since the calculated value of t (6.52) exceeds the table value
of t (1.99), we can conclude that there is statistically significant association
between the variables. Hence, our hypothesis does not hold true.
R = 1 − 6Σd² / (N³ − N)

where, N = number of pairs of ranks, and Σd² = sum of squared differences between
the ranks of the two variables.
Illustration 5
Salesmen employed by a company were given one month's training. At the end
of the training, a test was conducted on a sample of 10 salesmen, who were
ranked on the basis of their performance in the test. They were then
posted to their respective areas. After six months, they were ranked again in terms of
their sales performance. Find the degree of association between the two sets of ranks.
Salesmen:                        1   2   3   4   5   6   7   8   9   10
Ranks in training (X):           7   1  10   5   6   8   9   2   3    4
Ranks in sales performance (Y):  6   3   9   4   8  10   7   2   1    5
Solution: Table 10.3: Calculation of Coefficient of Rank Correlation

Salesmen   Ranks Secured    Ranks Secured     Difference in Ranks
           in Training (X)  on Sales (Y)      D = (X–Y)        D²
1          7                6                  1               1
2          1                3                 –2               4
3          10               9                  1               1
4          5                4                  1               1
5          6                8                 –2               4
6          8                10                –2               4
7          9                7                  2               4
8          2                2                  0               0
9          3                1                  2               4
10         4                5                 –1               1
                                                         ΣD² = 24
R = 1 − 6ΣD² / (N³ − N) = 1 − (6 × 24) / (10³ − 10)
  = 1 − 144/990 = 0.855
We can say that there is a high degree of positive correlation between the
training and the sales performance of the salesmen.
Now we proceed to test the significance of the results obtained. We are
interested in testing the null hypothesis (H0) that the two sets of ranks are not
associated in the population and that the observed value of R differs from zero
only by chance. The test used is the t-statistic.
t = R √[(n − 2)/(1 − R²)] = 0.855 √[(10 − 2)/(1 − 0.855²)] = 4.66
Referring to the t-distribution table for 8 d.f. (n − 2), the critical value of t at a
5% level of significance is t(0.025, 8) = 2.306. The calculated value of t is
greater than the table value. Hence, we reject the null hypothesis concluding that
the performance in training and on sales are closely associated.
Tied Ranks
Sometimes there is a tie between two or more ranks in the first and/or second
series. For example, if two items share the 4th rank, then instead of awarding
the 4th rank to the respective two observations, we award 4.5 ((4+5)/2) to each
of the two observations, so that the mean of the ranks is unaffected. In
such cases, an adjustment in Spearman’s formula is made. For this, Σd² is
increased by (t³ − t)/12 for each tie, where t stands for the number of observations
in each tie. The formula can thus be expressed as:

R = 1 − 6[Σd² + (t³ − t)/12 + (t³ − t)/12 + …] / (N³ − N)
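The tie adjustment can be sketched as follows (an illustrative Python fragment; the function names and the tied-rank data are our own, not from the unit):

```python
from collections import Counter

def tie_adjustment(ranks):
    """Σ(t³ − t)/12 over every group of t tied ranks in one series."""
    return sum((t ** 3 - t) / 12 for t in Counter(ranks).values() if t > 1)

def spearman_with_ties(rx, ry):
    """Spearman's R with Σd² increased by (t³ − t)/12 for each tie."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    d2 += tie_adjustment(rx) + tie_adjustment(ry)
    return 1 - 6 * d2 / (n ** 3 - n)

# Hypothetical ranks where two items tie for 4th place in the first series,
# each receiving the average rank 4.5:
rx = [1, 2, 3, 4.5, 4.5, 6]
ry = [2, 1, 3, 5, 4, 6]
print(round(spearman_with_ties(rx, ry), 3))   # 0.914
```

When there are no ties, the adjustment is zero and the plain formula is recovered.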
1) Compute the degree of relationship between the price of shares (X) and the price
of debentures (Y) over a period of 8 years by using Karl Pearson’s formula, and test
the significance (5% level) of the association. Comment on the result.
Price of shares:     42  43  41  53  54  49  41  55
Price of debentures: 98  99  98 102  97  93  95  94
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) Consider the above exercise and assign the ranks to price of shares and price
of debentures. Find the degree of association by applying Spearman’s formula
and test its significance.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
10.4 SIMPLE LINEAR REGRESSION
When we have identified that correlation exists between two variables, we
can develop an estimating equation, known as the regression equation or estimating
line, i.e., a mathematical formula which helps us to estimate or predict the
unknown value of one variable from the known value of another variable. In the
words of Ya-Lun-Chou, “regression analysis attempts to establish the nature of
the relationship between variables, that is, to study the functional relationship
between the variables and thereby provide a mechanism for prediction, or
forecasting.” For example, if we have confirmed that advertisement expenditure
(independent variable) and sales (dependent variable) are correlated, we can
predict the required amount of advertising expenses for a given amount of sales
or vice-versa. Thus, the statistical method used for prediction is called
regression analysis. And, when the relationship between the variables is linear,
the technique is called simple linear regression.
Hence, the technique of regression goes one step further than correlation: it
uses relationships that have held in the past as a guide to what may
happen in the future. To do this, we need the regression equation and the
correlation coefficient. The latter is used to determine whether the variables are
really moving together.
Yi = β0 + β1 Xi + ei

wherein
β0 = Y-intercept,
β1 = slope of the regression line, and
ei = error term (i.e., the difference between the actual Y value and the value
of Y predicted by the model).
i) Regression of Y on X
ii) Regression of X on Y.
When we draw the regression lines with the help of a scatter diagram as
shown earlier in Fig. 10.1, we may get an infinite number of possible regression
lines for a set of data points. We must, therefore, establish a criterion for
selecting the best line. The criterion used is the Least Squares Method.
According to the least squares criterion, the best regression line is the one that
minimizes the sum of squared vertical distances between the observed (X, Y)
points and the regression line, i.e., Σ(Y − Ŷ)² is the least, and the sum of
the positive and negative deviations is zero, i.e., Σ(Y − Ŷ) = 0. It is important
to note that the distance between (X, Y) points and the regression line is called
the ‘error’.
Regression Equations
As we discussed above, there are two regression equations, also called
estimating equations, for the two regression lines (Y on X, and X on Y). These
equations are, algebraic expressions of the regression lines, expressed as
follows:
Regression Equation of Y on X

Ŷ = a + bX

In deviation form: Ŷ − Ȳ = byx (X − X̄)

byx = r (σy/σx) = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]

Regression Equation of X on Y

X̂ = a + bY

In deviation form: X̂ − X̄ = bxy (Y − Ȳ)

bxy = r (σx/σy) = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]
It is worthwhile to note that the estimated simple regression line always passes
through X̄ and Ȳ (as shown in Figure 10.3). The following illustration
shows how the estimated regression equations are obtained, and how
they are used to estimate the value of Y for a given X value.
Illustration 6

(Rs. in lakh)
Advertisement
Expenditure (X): 0.8  1.0  1.6  2.0  2.2  2.6  3.0  3.0  4.0  4.0  4.0  4.6
Sales (Y):        22   28   22   26   34   18   30   38   30   40   50   46
Solution:
(Rs. in lakh)
Advertising   Sales
(X)           (Y)        X²           Y²           XY
0.8           22         0.64         484          17.6
1.0           28         1.00         784          28.0
1.6           22         2.56         484          35.2
2.0           26         4.00         676          52.0
2.2           34         4.84         1156         74.8
2.6           18         6.76         324          46.8
3.0           30         9.00         900          90.0
3.0           38         9.00         1444         114.0
4.0           30         16.00        900          120.0
4.0           40         16.00        1600         160.0
4.0           50         16.00        2500         200.0
4.6           46         21.16        2116         211.6
ΣX = 32.8     ΣY = 384   ΣX² = 106.96 ΣY² = 13368  ΣXY = 1150
Now we establish the best regression line (estimated by the least squares
method).

Ŷ − Ȳ = byx (X − X̄)

Ȳ = 384/12 = 32 ;  X̄ = 32.8/12 = 2.733

byx = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]
    = [1,150 − (32.8)(384)/12] / [106.96 − (32.8)²/12]
    = 100.4 / 17.307 = 5.801

Ŷ − 32 = 5.801 (X − 2.733)
Ŷ = 5.801X − 15.854 + 32
or Ŷ = 16.146 + 5.801X
which is shown in Figure 10.3. Note that, as said earlier, this line passes
through X̄ (2.733) and Ȳ (32).

[Plot: the estimating line Ŷ = 16.146 + 5.801X drawn through the observed points,
with positive and negative errors measured vertically from the line;
X-axis: Advertising (Rs. lakh), 0 to 5; Y-axis: Sales (Rs. lakh), 0 to 50.]
Figure 10.3: Least Squares Regression Line of a Company’s Advertising Expenditure
and Sales.
Ŷ = 16.146 + 5.801X
wherein Ŷ = estimated sales for given value of X, and
X = level of advertising expenditure.
To find Ŷ, the estimate of expected sales, we substitute the specified
advertising level into the regression model. For example, if we know that the
company’s marketing department has decided to spend Rs. 2,50,000 (X = 2.5)
on advertisement during the next quarter, the most likely estimate of sales (Ŷ)
is:

Ŷ = 16.146 + 5.801 × 2.5 = 30.6485, i.e., Rs. 30,64,850
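Illustration 6 can be verified with a short Python sketch (the function name is our own):

```python
def fit_y_on_x(x, y):
    """Least-squares line Ŷ = a + bX, with
    b = (ΣXY − ΣXΣY/N) / (ΣX² − (ΣX)²/N) and a = Ȳ − b·X̄."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    b = (sum(p * q for p, q in zip(x, y)) - sx * sy / n) / (sum(v * v for v in x) - sx * sx / n)
    a = sy / n - b * sx / n
    return a, b

adv   = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
sales = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
a, b = fit_y_on_x(adv, sales)
print(round(a, 2), round(b, 2))   # 16.14 5.8
print(round(a + b * 2.5, 2))      # estimated sales at X = 2.5: 30.65 (Rs. lakh)
```

The printed figures agree with the hand-worked values (byx = 5.801); the small difference from the intercept 16.146 in the text arises from rounding byx before computing a.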
Regression Equation of X on Y

X̂ − X̄ = bxy (Y − Ȳ)

bxy = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]
    = [1,150 − (32.8)(384)/12] / [13,368 − (384)²/12]
    = 100.4 / 1,080 = 0.093

X̂ − 2.733 = 0.093 (Y − 32)
X̂ = – 0.243 + 0.093Y
The following points about the regression should be noted:
1) The geometric mean of the two regression coefficients (byx and bxy) gives the
coefficient of correlation, i.e., r = √(byx × bxy).
2) Both the regression coefficients will always have the same sign (+ or –).
10.5.1 Standard Error of Estimate
Once the line of best fit is drawn, the next process in the study of regression
analysis is how to measure the reliability of the estimated regression equation.
Statisticians have developed a technique to measure the reliability of the
estimated regression equation called “Standard Error of Estimate (Se).” This Se
is similar to the standard deviation which we discussed in Unit-9 of this course.
We will recall that the standard deviation is used to measure the variability of a
distribution about its mean. Similarly, the standard error of estimate
measures the variability, or spread, of the observed values around the
regression line. We would say that both are measures of variability. The
larger the value of Se, the greater the spread of data points around the
regression line. If Se is zero, then all data points would lie exactly on the
regression line. In that case the estimated equation is said to be a perfect
estimator. The formula to measure Se is expressed as:

Se = √[ Σ(Y − Ŷ)² / n ]
Illustration 7
R&D (Rs. lakh): 2.5 3.0 4.2 3.0 5.0 7.8 6.5
Solution: To calculate Se for this problem, we must first obtain the value of
Σ(Y − Ŷ)². We have done this in Table 10.5.

Table 10.5: Calculation of Σ(Y − Ŷ)²                    (Rs. in lakh)

Σ(Y − Ŷ)² = 24.62

We can now find the standard error of estimate as follows:

Se = √[ Σ(Y − Ŷ)² / n ] = √(24.62 / 7) = 1.875
R² = Explained variation / Total variation , or  R² = 1 − Σ(Y − Ŷ)² / Σ(Y − Ȳ)²
Note that Σ(Y − Ȳ)² = ΣY² − (ΣY)²/N, and that

R² = r²
Refer to Illustration 6, where we computed ‘r’ with the help of the regression
coefficients (byx and bxy). As an example for R²:

r = 0.734
R² = r² = 0.734² = 0.5388
This means that 53.88 per cent of the variation in the sales (Y) can be
explained by the level of advertising expenditure (X) for the company.
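As a companion check on the Illustration 6 data (note that the Se of 1.875 above belongs to Illustration 7's different, partly omitted data set), R² and the standard error of estimate can be computed directly; this is our own sketch:

```python
import math

adv   = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
sales = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
n = len(adv)
a, b = 16.143, 5.801                      # least-squares line from Illustration 6
y_hat = [a + b * x for x in adv]          # estimated values Ŷ
y_bar = sum(sales) / n
sse = sum((y - f) ** 2 for y, f in zip(sales, y_hat))   # unexplained variation Σ(Y − Ŷ)²
sst = sum((y - y_bar) ** 2 for y in sales)              # total variation Σ(Y − Ȳ)²
r2 = 1 - sse / sst                        # coefficient of determination
se = math.sqrt(sse / n)                   # standard error of estimate (unit's formula)
print(round(r2, 3), round(se, 2))         # 0.539 6.44
```

The printed R² matches the 53.88 per cent figure in the text up to rounding.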
You are given the following data relating to the age of autos and their maintenance
costs. Obtain the two regression equations by the method of least squares,
estimate the likely maintenance cost when the age of an auto is 5 years, and also
compute the standard error of estimate.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
Least Squares Criterion: The criterion for determining a regression line that
minimizes the sum of squared errors.
2. R = – 0.185
t = – 1.149
C) Y on X : Ŷ = 5 + 3.25x
X on Y : X̂ = – 3 + 0.297y
Students:           A  B  C  D  E  F  G  H  I   J
Rank by 1st judge:  5  2  4  1  8  9  7  6  3  10
Rank by 2nd judge:  1  9  7  8 10  2  4  5  3   6
Find out whether the judges are in agreement with each other or not and apply
the t-test for significance at 5% level.
9) A sales manager of a soft drink company is studying the effect of its latest
advertising campaign. People chosen at random were called and asked how
many bottles they had bought in the past week and how many advertisements
of this product they had seen in the past week.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
10.11 FURTHER READING
A number of good text books are available for the topics dealt with in this unit.
The following books may be used for more in-depth study.

Richard I. Levin and David S. Rubin, 1996, Statistics for Management,
Prentice Hall of India Pvt. Ltd., New Delhi.
Peters, W.S. and G.W. Summers, 1968, Statistical Analysis for Business
Decisions, Prentice Hall, Englewood Cliffs.
Hooda, R.P., 2000, Statistics for Business and Economics, Macmillan India
Ltd., New Delhi.
Gupta, S.P., 1989, Elementary Statistical Methods, Sultan Chand & Sons,
New Delhi.
Chandan, J.S., Statistics for Business and Economics, Vikas Publishing
House Pvt. Ltd., New Delhi.
APPENDIX : TABLE OF t-DISTRIBUTION AREA
The table gives points of the t-distribution corresponding to degrees of freedom and the upper
tail area (suitable for use in one-tail tests).

Values of t(α, ν)
UNIT 11 TIME SERIES ANALYSIS
STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Definition and Utility of Time Series Analysis
11.3 Components of Time Series
11.4 Decomposition of Time Series
11.5 Preliminary Adjustments
11.6 Methods of Measurement of Trend
11.6.1 Freehand Method
11.6.2 Least Square Method
11.7 Let Us Sum Up
11.8 Key Words
11.9 Answers to Self Assessment Questions
11.10 Terminal Questions/Exercises
11.11 Further Reading
11.0 OBJECTIVES
After studying this unit, you should be able to:
11.1 INTRODUCTION
In the previous units, you have learnt statistical treatment of data collected for
research work. The nature of data varied from case to case. You have come
across quantitative data for a group of respondents collected with a view to
understanding one or more parameters of that group, such as investment, profit,
consumption, weight etc. But when a nation, a state, an institution or a business
unit etc., intends to study the behaviour of some element, such as the price of a
product, exports of a product, investment, sales, profit etc., as it has
behaved over a period of time, the information shall have to be collected for a
fairly long period, usually at equal time intervals. Thus, a set of any quantitative
data collected and arranged on the basis of time is called ‘Time Series’.
Depending on the research objective, the unit of time may be a decade, a year,
a month, or a week etc. Typical time series are the sales of a firm in
successive years, monthly production figures of a cement mill, daily closing
price of shares in Bombay stock market, hourly temperature of a patient.
Usually, the quantitative data of the variable under study are denoted by y1, y2,
...yn and the corresponding time units are denoted by t1, t2, ...tn. The variable
‘y’ shall have variations, as you will see ups and downs in the values. These
changes account for the behaviour of that variable.
Instantly it comes to our mind that ‘time’ is responsible for these changes, but
this is not true. Because, the time (t) is not the cause and the changes in the
variable (y) are not the effect. The only fact, therefore, which we must
understand is that there are a number of causes which affect the variable and
have operated on it during a given time period. Hence, time becomes only the
basis for data analysis.
Forecasting any event helps in the process of decision making. Forecasting is
possible if we are able to understand the past behaviour of that particular
activity. For understanding the past behaviour, a researcher needs not only the
past data but also a detailed analysis of the same. Thus, in this unit we will
discuss the need for analysis of time series, fluctuations of time series which
account for changes in the series over a period of time, and measurement of
trend for forecasting.
Another question: Shall sunflower oil be sold again in future for Rs. 60 per
kg? No doubt, your answer would be ‘Yes’. Have you ever thought about how
you answered the above two questions? Probably you have not! The analysis of
these answers shall lead us to arrive at the following observations:
– There are several causes which affect the variable gradually and permanently.
Therefore we are prompted to answer ‘No’ for the first question.
– There are several causes which affect the variable for the time being only. For
this reason we are prompted to answer ‘Yes’ for the second question.
The causes which affect the variable gradually and permanently are termed as
“Long-Term Causes”. The examples of such causes are: increase in the rate of
capital formation, technological innovations, the introduction of automation,
changes in productivity, improved marketing etc. The effect of long term causes
is reflected in the tendency of a behaviour, to move in an upward or downward
direction, termed as ‘Trend’ or ‘Secular Trend’. It reveals how the time
series has behaved over the period under study.
The causes which affect the variables for the time being only are labelled as
“Short-Term Causes”. The short term causes are further divided into two parts,
they are ‘Regular’ and ‘Irregular’. Regular causes are further divided into two
parts, namely ‘cyclical causes’ and ‘seasonal causes’. The cyclical variations
are also termed as business cycle fluctuations, as they influence the variable. A
business cycle is composed of prosperity, recession, depression and recovery.
The periodic movements from prosperity to recovery and back again to
prosperity vary both in time and intensity. The seasonal causes, like weather
conditions, business climate and even local customs and ceremonies together
play an important role in giving rise to seasonal movements to almost all the
business activities. For instance, the yearly weather conditions directly affect
agricultural production and marketing.
It is worthwhile to say that the seasonal variations analysis will be possible only
if the season-wise data are available. This fact must be checked first. For
analysing the seasonal effects various methods are available. Among them
seasonal index by ‘Ratio to Moving Average Method’ is the most widely used.
However, if collected data provides only yearly values, there is no possibility of
obtaining seasonal variations. Therefore, the residual amount after eliminating
trend will be the effect of irregular or random causes.
The components may be combined in two ways:

Additive model:        Y = T + S + C + I
Multiplicative model:  Y = T × S × C × I

In the multiplicative model, the values of all the components, except the trend
values, are expressed as percentages.
In business research, normally, the multiplicative model is more suited and used
more frequently for the purpose of analysis of time series. Because, the data
related to business and economic time series is the result of interaction of a
number of factors which individually cannot be held responsible for generating
any specific type of variations.
                            Components
Year   Quarter   Series   Trend   Seasonal   Cyclical-erratic
                 (O)      (T)     (100 S)    (100 CI)
1        1         79       80      120          82
         2         58       85       80          85
         3         84       90       92         102
         4        107       95      108         105
According to the multiplicative model,

Y = T × S × C × I

Thus, 79 (year 1, quarter 1) = 80 × (120/100) × (82/100)

and 130 (year 2, quarter 1) = 100 × (120/100) × (108/100)
Thus each quarterly figure (Y) is the product of the T, S, and CI. Such a
synthetic composition looks like an actual time series and has encouraged use
of the model as the basis for the analysis of time series data.
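This multiplicative reconstruction can be checked with a few lines of Python (our own sketch; the small differences, e.g. 78.7 against the observed 79, come from rounding in the tabled percentages):

```python
# Each quarterly figure O should reproduce as T × (S/100) × (CI/100).
# Rows: (observed O, trend T, seasonal 100S, cyclical-erratic 100CI)
rows = [(79, 80, 120, 82), (58, 85, 80, 85), (84, 90, 92, 102), (107, 95, 108, 105)]
for o, t, s, ci in rows:
    print(o, round(t * (s / 100) * (ci / 100), 1))
```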
a) Time is the cause for the ups and downs in the values of the variable under
study.
b) The variable under study in time series analysis is denoted by ‘y’.
c) ‘Trend’ values are a major component of the time series.
d) Analysis of time series helps in knowing current accomplishment.
e) Weather conditions, customs, habits etc., are causes for cyclical variations.
f) The analysis of time series is done to know the expected quantity
change in the variable under study.
..................................................................................................................
..................................................................................................................
..................................................................................................................
11.6 METHODS OF MEASUREMENT OF TREND
The effect of long-term causes is seen in the trend values we compute. A
trend is also known as ‘secular trend’ or ‘long-term trend’ as well. There are
several methods of isolating the trend of which we shall discuss only two
methods which are most frequently used in the business and economic time
series data analysis. They are: Free Hand Method, and Method of Least
Square.
Though this method is very simple, it does not have common acceptance
because it is highly subjective: it gives varying trend values for the same data
when drawn by different persons, or even by the same person at different
times. Hence, it is not advisable to use it as a basis for forecasting,
particularly when the time series is subject to very irregular movements. Let us
consider an illustration to draw a trend line by the free-hand method.
Illustration 1
From the following data, find the trend line by using the Free-hand (graphic)
Method.

Years:                              1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Foodgrain production (lakh tonnes):  35   55   40   85  135  110  130  150  130  120
[Plot: the original data for 1994–2003 with a free-hand trend line drawn through
them; Y-axis: Production (lakh tonnes), 0 to 180.]
Fig. 1: Food Grain Production (lakh tonnes)
11.6.2 Least Square Method
This is also known as the straight line method. It is the method most commonly used
in research to estimate the trend of time series data, as it is mathematically
designed to satisfy two conditions, namely: (i) Σ(y − yc) = 0, and (ii) Σ(y − yc)²
is the least. The straight line method gives a line of best fit to the given data.
The straight line which satisfies the above conditions, making use of the
regression equation, is given by:
Yc = a + bx
where, ‘Yc’ represents the trend value of the time series variable y, ‘a’ and ‘b’
are constant values of which ‘a’ is the trend value at the point of origin
and ‘b’ is the amount by which the trend value changes per unit of
time, and ‘x’ is the unit of time (value of the independent variable).
The values of the constants, ‘a’ and ‘b’, are determined by the following two
normal equations:

Σy = na + bΣx .................(i)
Σxy = aΣx + bΣx² ..............(ii)

When the origin is chosen so that Σx = 0, the values of the two constants reduce
to the following formulae:

a = Σy/N , and b = Σxy/Σx²
It is to be noted that when the number of time units involved is even, the point
of origin will have to be chosen between the two middle time units.
Illustration 2
Years    Sales (in ’000 tonnes)
1998 70
1999 75
2000 90
2001 98
2002 85
2003 91
2004 100
Solution: To find the straight line equation (Yc = a + bx) for the given time
series data, we have to substitute the values in the expressions already arrived
at, that is:

a = Σy/N , and b = Σxy/Σx²

In order to make the total of x zero, we must take the median year (i.e., 2001)
as the origin. Study the following table carefully to understand the procedure
for fitting the straight line.

Years    y     x    x²    xy
1998     70   –3    9    –210
1999     75   –2    4    –150
2000     90   –1    1     –90
2001     98    0    0       0
2002     85    1    1      85
2003     91    2    4     182
2004    100    3    9     300
        Σy = 609   Σx = 0   Σx² = 28   Σxy = 117

a = Σy/N = 609/7 = 87 ;  b = Σxy/Σx² = 117/28 = 4.18
Yc = 87 + 4.18x
From the above equation, we can also find the monthly increase in sales. The
trend values increase by a constant amount ‘b’ every year, so the annual
increase in sales is 4.18 thousand tonnes and the monthly increase is:

4,180 / 12 = 348.33 tonnes
Trend values are to be obtained as follows:
Y1998 = 87 + 4.18 (–3) = 74.5
Y1999 = 87 + 4.18 (–2) = 78.6 and so on ........
Predicting with decomposed components of the time series: The
management wants to estimate fertiliser sales for the years 2006 and 2008.
Estimation of sales for 2006, ‘x’ would be 5 (because for 2004 ‘x’ was 3).
Y2006 = 87 + 4.18 (5) = 107.9 thousand tonnes.
Estimation of sales for 2008, ‘x’ would be 7
Y2008 = 87 + 4.18 (7) = 116.3 thousand tonnes.
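The whole of Illustration 2, including the forward estimates, can be verified with a Python sketch (the function name is our own; it assumes an odd number of years, so the origin falls exactly on the middle year, as in the illustration):

```python
def fit_trend(years, y):
    """Least-squares trend Yc = a + bx, taking the middle of the period as the
    origin so that Σx = 0, which gives a = Σy/N and b = Σxy/Σx²."""
    n = len(y)
    origin = sum(years) / n                  # middle year (2001 here)
    x = [yr - origin for yr in years]        # deviations from the origin
    a = sum(y) / n
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    return a, b, origin

years = [1998, 1999, 2000, 2001, 2002, 2003, 2004]
sales = [70, 75, 90, 98, 85, 91, 100]
a, b, origin = fit_trend(years, sales)
print(round(a, 2), round(b, 2))              # 87.0 4.18
print(round(a + b * (2006 - origin), 1))     # estimate for 2006: 107.9
```

For an even number of years, the origin falls between the two middle years, and the book's convention of half-year time units applies instead.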
Years Production
1996 40
1997 60
1998 45
1999 83
2000 130
2001 135
2002 150
2003 120
2004 200
Years    y    x    x²    xy    yc
The quantitative values of the variable under study are denoted by y1, y2, y3, ...
and the corresponding time units are denoted as x1, x2, x3, ... . The variable ‘y’
shall have variations; you will see ups and downs in the values. There are a
number of causes during a given time period which affect the variable.
Therefore, time becomes the basis of analysis. Time is not the cause, and the
changes in the values of the variable are not the effect.
The causes which affect the variable gradually and permanently are termed as
Long-term causes. The causes which affect the variable only for the time being
are termed as Short-term causes. The time series are usually the result of the
effects of one or more of the four components. These are trend variations (T),
seasonal variations (S), Cyclical variations (C) and Irregular variations (I).
When we try to analyse the time series, we try to isolate and measure the
effects of various kinds of these components on a series.
1) Additive model, which considers the sum of the various components as resulting
in the given values of the overall time series data; symbolically it would be
expressed as: Y = T + C + S + I.
2) Multiplicative model, which considers the product of the various components;
symbolically it would be expressed as: Y = T × C × S × I.
The trend analysis brings out the effect of long-term causes. There are
different methods of isolating the trend; among these, we have discussed only
two methods which are usually used in research work, i.e., the free hand and
least squares methods.
Long-term predictions can be made on the basis of trends, and only the least
squares method of trend computation offers this possibility.
‘y’ 24 28 38 33 49 50 66 68
5) The production (in thousand tons) in a sugar factory during 1994 to 2001
has been as follows:
Year        1994  1995  1996  1997  1998  1999  2000  2001
Production    35    38    49    41    56    58    76    75
(Hint: The point of origin must be taken between 1997 and 1998).
i) Find the trend values by applying the method of least square.
ii) What is the monthly increase in production?
iii) Estimate the production of sugar for the year 2008.
6) The following data relates to a survey of used car sales in a city for the
period 1993-2001. Predict sales for 2006 by using the linear trend
equation.
Years 1993 1994 1995 1996 1997 1998 1999 2000 2001
Sales 214 320 305 298 360 450 340 500 520
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 12 INDEX NUMBERS
STRUCTURE
12.0 Objectives
12.1 Introduction
12.2 Meaning and Concept of Index Numbers
12.3 Uses of Index Numbers
12.4 Issues in Construction of Index Numbers
12.5 Classification of Index Numbers
12.6 Methods of Constructing Index Numbers
12.6.1 Unweighted Index Numbers
12.6.2 Weighted Index Numbers
12.7 Splicing and Deflating of Index Numbers
12.8 Let Us Sum Up
12.9 Key Words
12.10 Answers to Self Assessment Exercises
12.11 Terminal Questions/Exercises
12.12 Further Reading
12.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the meaning and concept of index numbers,
l describe the uses of index numbers,
l discuss the issues that arise in the construction of index numbers,
l classify index numbers into different types,
l explain the methods of constructing unweighted and weighted index numbers, and
l describe splicing and deflating of index numbers.
12.1 INTRODUCTION
In the previous unit (Unit 11), we discussed the analysis of time series. In this
unit we shall discuss the methods of constructing various types of index
numbers for different purposes. This device is an extension of the time series
analysis because an index number combines two or more time series variables
related to non-comparable units. You would have read in newspapers or heard
on the television/the radio that the cost of living index has increased by so
many points, hence for government employees another slab of Dearness
Allowance has been declared. Probably you might have wondered what is this
cost of living index?
Many of you must also be aware of the Stock Exchange Share Price Index –
commonly referred to as the BSE SENSEX or, on the National Stock Exchange, the NIFTY. In
fact, these various types of index series have come to be used in many
activities such as industrial production, export, prices, etc. In this Unit, you will
study and understand the meaning and uses of index numbers, the various
problems that arise in the construction and use of index numbers, the methods
of constructing various index numbers, and their limitations.
12.2 MEANING AND CONCEPT OF INDEX NUMBERS
When we say that the general level of industrial production has registered an
increase of 4 per cent, it is obvious that we are referring to the production of
all those items that are produced by the industrial sector. However, production
of some of these items may be increasing while that of others may be
decreasing or may remain constant. The rate of increase or decrease and the
units in which these items are expressed may differ. For instance, cement may
be quoted per kg, cloth per metre, cars per unit, etc. In such
a situation, when the purpose is to measure the changes in the average level of
prices or production of industrial products for comparing over a time or with
respect to geographic location, it is not appropriate to apply the technique of
measure of central tendency because it is not useful when series are expressed
in different units or/and in different items.
It is in these situations, that we need a specialised average, known as index
numbers. These are often termed as ‘economic barometers’.
An index number may be defined as a special average which helps in
comparison of the level of magnitude of a group of related variables under two
or more situations.
Index numbers are a series of numbers devised to measure changes over a
specified time period (the time period may be daily, weekly, monthly, yearly, or
any other regular time interval), or compare with reference to one variable or a
group of related variables. Thus, each number in a specified index number
series is:
a) A pure number i.e., it does not have any unit.
b) Calculated according to a pre-determined formula.
c) Generated at regular time intervals, sometimes during the same time interval at
different places.
d) The regular generation of numbers forms a chronological series.
e) Calculated with reference to a specified period (the base period) and a base
number, which is always 100. For example, if the consumer price index,
with base year 1996 is calculated to be 180 for the year 2003, it means that
consumer prices have increased by 80 per cent in 2003 as compared to the
prices prevalent in 1996.
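Because the base number is always 100, an index value can be read directly as a percentage change. A minimal sketch in Python (the function name is illustrative, not from the text):

```python
def percent_change(index_value, base_value=100.0):
    """Percentage change implied by an index number relative to its base.

    Index numbers are pure numbers with the base period fixed at 100,
    so an index of 180 means an 80 per cent rise over the base period.
    """
    return (index_value - base_value) / base_value * 100

# CPI of 180 for 2003 with base year 1996 (= 100), as in the example above
change = percent_change(180)
```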
12.3 USES OF INDEX NUMBERS
1) Index numbers help in measuring the purchasing power of money and the real
income of consumers. For example, an individual earns Rs. 100/- in the year
1970 and his earnings increase to Rs. 300/- in the year 1980. If during this
period the consumer price index increases from 100 to 400, then with Rs. 300
the consumer is not able to purchase the same quantity of different
commodities which he was able to purchase in the year 1970 with his income
of Rs. 100/-. This means the real income has declined. Thus, real income can
be calculated by dividing the actual income by the consumer price index:
Real income = (300 / 400) × 100 = Rs. 75/- with respect to 1970 as the base year.
Therefore, the consumer’s real income in the year 1980 is Rs. 75/- as compared
to his income of Rs. 100/- in the year 1970. We can also say that because of
price increase, even though his income has increased, his purchasing power has
decreased.
2) Different types of price indices are used for wage and salary negotiations, and
for compensating for price rise in the form of DA (Dearness Allowance).
3) Various indices are useful to the Government in framing policies. Some of these
include taxation policies, wage and salary policies, economic policies, custom
and tariffs policies etc.
4) Index numbers can also be used to compare cost of living across different cities
or regions for the purpose of making adjustments in house rent allowance, city
compensatory allowance, or some other special allowance.
5) Indices of Industrial Production, Agricultural Production, Business Activity,
Exports and Imports are useful for comparison across different places and are
also useful in framing industrial policies, import/export policies etc.
6) BSE SENSEX is an index of share prices for shares traded in the Bombay
Stock Exchange. This helps the authorities in regulating the stock market. This
index is also an indicator of general business activity and is used in framing
various government policies. For example, if the share prices of most of the
companies comprising any particular industry are continuously falling, the
government may think of changes in its policies specific to that industry with a
view to helping it.
7) Sometimes, it is useful to correlate the index of one industry with the index of
another industry or activity so as to understand and predict changes in the first
industry. For example, the cement industry can keep track of the index of
construction activity. If the index of construction activity is rising, the cement
industry can expect a rise in demand for cement.
12.4 ISSUES IN CONSTRUCTION OF INDEX NUMBERS
1) Collection of Data
Data collection through a sample method is one of the issues in the construction
of index numbers. The data has to be as reliable, adequate, accurate,
comparable, and representative, as possible. Here a large number of questions
need to be answered. The answers ultimately depend on the purpose and
individual judgement. For example, one needs to decide the following:
iii) Timings of Data Collection: It is also equally important to collect the data
at an appropriate time. Referring to the example of consumer price index,
prices are likely to vary on different days of the month. For certain
commodities, prices may vary at different times of the same day. For example,
vegetable prices are usually high in the morning when fresh vegetables
arrive and are low in the late evening when sellers are closing for the day and
wish to clear the perishable stock. For each commodity, individual judgement
needs to be exercised to represent reality and to serve the purpose for which
an index is to be used.
2) Selection of the Base Period
A base period is the reference period for comparing and analysing the changes
in prices or quantities in a given period. For many index number series, value of
a particular time period, usually a year, is taken as reference period against
which all subsequent index numbers in the series are calculated and compared.
In yet other cases, we may be required to compare one index number series
against another series. In such a context, a ‘base’ common to all series is more
appropriate.
3) Selection of a Suitable Formula
Different index number formulae give different results when applied to the same
data. Utmost care must be taken in selection of a formula which is the most
suitable for the purpose. Whether to use an unweighted or weighted index is a
difficult question to answer. It depends on the purpose for which the index
number is required to be used. For example, if we are interested in an index
for the purpose of negotiating wages or compensating for price rise, only a
weighted index would be worthwhile to use.
12.5 CLASSIFICATION OF INDEX NUMBERS
Price Indices: These are the most frequently used indices. Price indices
consider prices of a commodity or a group of commodities and compare
changes of prices from one period to another period and also compare the
difference in price from one place to another. For example, the familiar
Consumer Price Index measuring overall price changes of consumer
commodities and services is used to define the cost of living.
Value Indices: Value indices actually measure the combined effects of price
and quantity changes. For many situations either a price index or quantity index
may not be enough for the purpose of a comparison. For example, an index
may be needed to compare cost of living for a specific group of persons in a
city or a region. Here comparsion of expenditure of a typical family of the
group is more relevant. Since this involves comparing expenditure, it is the value
index which will have to be constructed. These indices are useful in production
decisions, because they avoid the effects of inflation.
Value Index (V01) = (Σp1q1 / Σp0q0) × 100
Self Assessment Exercise A
1) State with reasons, whether the following statements are TRUE or FALSE.
a) Index numbers are specialised averages.
b) The index number for a base year is always zero.
c) A value index measures either price or quantity changes.
d) In times of inflation, a quantity index provides a better measure of actual
output than a corresponding value index.
e) Through appropriate indices, normal increase can be transformed into real
income.
f) Probability sampling is the most appropriate method for selecting
commodities while constructing indices.
2) In magazines and newspapers you might have come across many index
numbers. Name four such index numbers and briefly state what each
one of them indicates.
..................................................................................................................
..................................................................................................................
..................................................................................................................
3) List out the problems that arise in connection with the construction of an
index number.
..................................................................................................................
..................................................................................................................
..................................................................................................................
4) Try to cite one example each where (a) price index, (b) quantity index,
and (c) value index is not appropriate.
..................................................................................................................
..................................................................................................................
..................................................................................................................
12.6 METHODS OF CONSTRUCTING INDEX NUMBERS
Different formulae have been introduced by statisticians for constructing
composite index numbers. They may be categorised into two broad groups:
(1) unweighted index numbers, and (2) weighted index numbers.
The formula and its use in constructing each category of indices, listed above,
are discussed in the following sections. Let us first acquaint ourselves with the
symbols used in the construction of index numbers. They are as follows:
P0 denotes the price per unit of a commodity in the base period, and q0
denotes the corresponding quantity.
P1 denotes the price per unit of the same commodity in the current period (the
period for which the index number is calculated with reference to the base
period), and q1 denotes the corresponding quantity.
Capital letters P, Q, and V are used for denoting price index, quantity index,
and value index numbers, respectively.
Thus, P01 refers to the price index for period 1 (P1) with respect to the base
period (P0). Similar meanings are assigned to quantity (Q01) and value (V01)
indices. It may be noted that indices are expressed in per cent.
12.6.1 Unweighted Index Numbers
These indices are also referred to as simple index numbers. In this method of
constructing indices, weights are not expressly assigned. They are further
classified under two categories:
1) Simple Aggregative Index
In this method, the total of current-period prices of the various commodities is
divided by the total of base-period prices, and the quotient is multiplied by 100:
P01 = (ΣP1 / ΣP0) × 100
Similarly, the quantity index may be expressed as:
Q01 = (Σq1 / Σq0) × 100
For example, consider the sample data given below for the year 1990 and 2000
for construction of price index and quantity index.
Illustration 1
The price index for the year 2000, with reference to base year 1990, by the
simple aggregative method is
P01 = (ΣP1 / ΣP0) × 100 = (2271.1 / 1450.8) × 100 = 156.54
Thus, the prices in respect of commodities considered in the index have shown
an increase of 56.54 per cent in 2000 as compared to 1990.
1) The unit size affects the index number. For instance, in the above illustration,
if the price of wheat were quoted per kg (Rs. 7/- in 1990 and Rs. 9.50 in
2000), the index might be very different.
2) Relative importance of different commodities is not reflected in the index. For
example, in the above illustration a total of Rs. 2,800/- is spent on wheat, which
is the most important item of expenditure. This is not reflected in this method.
Analogously, the Quantity Index by the simple aggregative method is:
Q01 = (Σq1 / Σq0) × 100
Using the data of Illustration 1 for the quantity index:
Q01 = (1045.5 / 839) × 100 = 124.61
Here, you should note that 'P' in the formulae of the price index is replaced
by 'q' when constructing a quantity index. This applies to the formulae of all
the methods discussed.
Limitation: The units of quantities being different cannot be added and the
quantities do not represent appropriate variables for the purpose of comparing
expenditure.
2) Simple Average of Relatives Index
In this method of constructing a price index, price relatives are first computed
for the different items included in the index, and then their average is
calculated. Symbolically,
P01 = Σ(P1/P0 × 100) / N = (Sum of price relatives) / (Number of items)
Using the same data, considering only the prices given in Illustration-1, the
computation of the price index as a simple average of price relatives is as follows:
Illustration-2
Table 12.2: Computation of Index by Simple Average of Relatives Method
P01 = Σ(P1/P0 × 100) / N = 763.9 / 5 = 152.78
Thus, the index of simple average of price relatives shows 52.78 per cent
increase in price.
Q01 = Σ(q1/q0 × 100) / N
which you may compute on your own by using the data given in Illustration-1.
This method also has its limitations. First, each price/quantity relative is given
equal importance, which is not realistic. Secondly, the arithmetic mean is not
the right type of average for ratios and percentages.
Calculate i) the price index number by the simple aggregative and average of
relatives methods from the following data (price per kg).
ii) What are the limitations of both the methods?
Apple 35 60
Mango 30 45
Watermelon 5 10
..................................................................................................................
..................................................................................................................
..................................................................................................................
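As an illustrative sketch (not part of the original text), both unweighted methods can be coded in a few lines; the sample prices are the fruit data from the exercise above:

```python
def simple_aggregative_index(p0, p1):
    """(Sum of current-period prices / sum of base-period prices) * 100."""
    return sum(p1) / sum(p0) * 100

def average_of_relatives_index(p0, p1):
    """Mean of the individual price relatives (P1/P0 * 100)."""
    relatives = [curr / base * 100 for base, curr in zip(p0, p1)]
    return sum(relatives) / len(relatives)

# Apple, mango, watermelon prices (Rs. per kg) in the base and current year
base_prices = [35, 30, 5]
current_prices = [60, 45, 10]

agg = simple_aggregative_index(base_prices, current_prices)
rel = average_of_relatives_index(base_prices, current_prices)
```

Note that the two methods disagree on the same data (roughly 164 versus 174 here): the aggregative method is dominated by high-priced items, while the relatives method treats every item equally. This is exactly why the choice of method matters.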
12.6.2 Weighted Index Numbers
In the earlier two methods each item received equal weight/importance in the
construction of an index, whereas in the weighted index methods weights are
expressly assigned to each item included in the index. This weighting allows us
to consider more information than just the change in price/quantity over time.
The only problem is to decide how much weight (importance) to give to each of
the items included in the sample. Weighted indices are further divided into two
methods: weighted aggregates and weighted average of relatives.
1) Weighted Aggregates Method
In this group, we shall study three specific indices commonly used in business
research: (a) Laspeyre's index, (b) Paasche's index, and (c) Fisher's index.
After understanding the concepts of the three indices, we will take up an
illustration for the construction of these indices.
a) Laspeyre's Index: This index uses base-period quantities (q0) as weights:
Price Index (P01La) = (ΣP1q0 / ΣP0q0) × 100, and
Quantity Index (Q01La) = (Σq1P0 / Σq0P0) × 100
Since each index number depends upon price and quantity of the same base
year, the researcher can compare the index of one period directly with the
index of another period. For instance, assume that the cement price index is
115 in 1995 and 143 in 2001, taking 1991 as the base year. The firm concludes
that the price level of cement has increased by 15 per cent from 1991 to 1995
and by 43 per cent from 1991 to 2001.
b) Paasche's Index: This index uses current-period quantities (q1) as weights:
Price Index (P01Pa) = (ΣP1q1 / ΣP0q1) × 100
Quantity Index (Q01Pa) = (Σq1P1 / Σq0P1) × 100
From the practical point of view, Laspeyre’s index is usually preferred over
Paasche’s index. This is because as long as base period is fixed, the weights
assigned will remain unchanged. Therefore, calculations and comparisons are
easier. On the other hand, weights in Paasche’s formula continue to change
with the change in the current year so that the price index for every year has
to be computed using fresh/different weights.
c) Fisher's Ideal Index: Irving Fisher used the geometric mean of the
Laspeyre's and Paasche's indices to overcome the shortcomings of both. Thus,
Fisher's Price Index (P01F) = √[(ΣP1q0 / ΣP0q0) × (ΣP1q1 / ΣP0q1)] × 100
Fisher's Quantity Index (Q01F) = √[(Σq1P0 / Σq0P0) × (Σq1P1 / Σq0P1)] × 100
Illustration-3
Table 12.3: Computation of Weighted Aggregates Index
i) Laspeyre's Price Index (P01La) = (ΣP1q0 / ΣP0q0) × 100
= (10824 / 9100) × 100 = 118.94
This shows that prices for the group (sample commodities) have increased by
18.94 per cent in 2000 as compared to those prevailing in 1995.
Laspeyre's Quantity Index is
Q01La = (Σq1P0 / Σq0P0) × 100
The sums Σq1P0 and Σq0P0 may be taken from Table 12.3, since Σp0q1 = Σq1p0
and ΣP0q0 = Σq0P0. Thus,
Q01La = (10900 / 9100) × 100 = 119.78
This shows a 19.78 per cent increase in aggregate quantity consumption for this
group in 2000 as compared to 1995.
ii) Paasche's Price Index (P01Pa) = (ΣP1q1 / ΣP0q1) × 100
= (13100 / 10900) × 100 = 120.18
Thus, according to the Paasche's index, prices show an increase of 20.18 per
cent in 2000 as against 1995.
Analogously, Paasche's quantity index is
Q01Pa = (Σq1P1 / Σq0P1) × 100
The values of Σq1P1 and Σq0P1 may be taken from Table 12.3, as they are
equivalent to ΣP1q1 and ΣP1q0, respectively. Thus,
Q01Pa = (13100 / 10824) × 100 = 121.03
It shows a 21.03 per cent increase in quantity consumption for this group in
2000 as compared to 1995.
iii) Fisher's Price Index (P01F) = √[(ΣP1q0 / ΣP0q0) × (ΣP1q1 / ΣP0q1)] × 100
= √[(10824 / 9100) × (13100 / 10900)] × 100
Fisher's Quantity Index (Q01F) = √[(Σq1P0 / Σq0P0) × (Σq1P1 / Σq0P1)] × 100
which you may compute and interpret on your own using the data in the Table
12.3.
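As a cross-check (not part of the original text), the three weighted-aggregate price indices can be computed directly from the four aggregate sums of Table 12.3:

```python
import math

def laspeyres_price_index(sum_p1q0, sum_p0q0):
    """Laspeyre's price index: base-period quantities as weights."""
    return sum_p1q0 / sum_p0q0 * 100

def paasche_price_index(sum_p1q1, sum_p0q1):
    """Paasche's price index: current-period quantities as weights."""
    return sum_p1q1 / sum_p0q1 * 100

def fisher_price_index(sum_p1q0, sum_p0q0, sum_p1q1, sum_p0q1):
    """Fisher's ideal index: geometric mean of Laspeyre's and Paasche's."""
    la = sum_p1q0 / sum_p0q0
    pa = sum_p1q1 / sum_p0q1
    return math.sqrt(la * pa) * 100

# Aggregate sums from Table 12.3 (Illustration-3)
p0q0, p1q0, p0q1, p1q1 = 9100, 10824, 10900, 13100

la = laspeyres_price_index(p1q0, p0q0)           # ~118.94
pa = paasche_price_index(p1q1, p0q1)             # ~120.18
fi = fisher_price_index(p1q0, p0q0, p1q1, p0q1)  # between the two
```

Since Fisher's index is a geometric mean, it always lies between the Laspeyre's and Paasche's values whenever they differ.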
2) Weighted Average of Relatives Method
In this method, the construction of the index number is similar to the simple
average of relatives method in respect of the computation of price relatives,
as discussed in Section 12.6.1. However, to overcome the limitation of the
simple average of relatives method, the weights used are the values of
consumption for each commodity, either in the base period or in the current period.
P01 = Σ[(P1/P0 × 100) × P0q0] / ΣP0q0, or in simple terms, P01 = ΣPV / ΣV
where P stands for the price relative (P1/P0 × 100) and V for the value weight (P0q0).
As an illustration let us consider the data given in Table 12.4 which also
contains required computations for constructing index number through weighted
average of relatives method.
Illustration-4
Table 12.4: Computation of Index Number through Weighted Average of Relatives Method

           1990 (base year)     2000 (current year)     V        Price Relative        PV
Items      Price P0   Qty. q0   Price P1   Qty. q1    (P0q0)    P = (P1/P0) × 100
A              7         25        12         21        175         171.43          30000.25
B              2         12        2.5        12         24         125.00           3000.00
C              3          4         5          3         12         166.67           2000.04
                                                     ΣV = 211                  ΣPV = 35000.29

Then, the price index (P01) = ΣPV / ΣV = 35000.29 / 211 = 165.88
This means that according to this method, the rise in prices in 2000 as
compared to the base year 1990 is 65.88 per cent. In this method, the index of
quantity relatives is expressed as:
Q01 = Σ[(q1/q0 × 100) × q0P0] / Σq0P0 = ΣqV / ΣV
where q stands for the quantity relative (q1/q0 × 100).
which you may compute and interpret on your own by using the data in
Table 12.4.
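The computation in Table 12.4 can be cross-checked with a short sketch (the function name is illustrative):

```python
def weighted_avg_of_relatives(p0, q0, p1):
    """Price index as a weighted average of price relatives.

    Weights are base-period values V = P0*q0, so the index is
    sum(P * V) / sum(V), with P = (P1/P0) * 100.
    """
    v = [price * qty for price, qty in zip(p0, q0)]
    relatives = [curr / base * 100 for base, curr in zip(p0, p1)]
    return sum(r * w for r, w in zip(relatives, v)) / sum(v)

# Prices and base-period quantities for items A, B, C from Table 12.4
p0 = [7, 2, 3]
q0 = [25, 12, 4]
p1 = [12, 2.5, 5]

index = weighted_avg_of_relatives(p0, q0, p1)  # ~165.88
```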
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
12.7 SPLICING AND DEFLATING OF INDEX NUMBERS
Splicing
Sometimes, a specific situation may arise for shifting the base period of an
index number series to some recent period. For instance, in course of time a
few commodities which are being considered for constructing indices may get
replaced with new commodities, as a result their relative weightage may also
change. In some cases, the weights may have become outdated and we may
take into account the revised weights. Whatever the reasons, the index number
series loses continuity, and we then have two different index number series
with different base periods which are not directly comparable. It is, therefore,
essential to connect these two different series of indices into one continuous
series. The statistical procedure involved in connecting the two series of
indices to maintain continuity is termed 'splicing'. Thus, splicing
means reducing two overlapping series of indices with different base periods
into a continuous index number series. In equation form, we can say,
Spliced Index Number = New Index No. of current period ×
(Old Index No. of new base period / 100)
Illustration 5
Table 12.5: Splicing the New Series of Indices with the Old Series of Indices
In the above illustration, the old series was discontinued in 1993 and the new
series was started in that year. As shown in Column No. 4, splicing took place
at the base year 1993 of the new series.
Alternatively, the old index number series may be spliced with the new index
number series. That is, instead of carrying the old series forward, the new
series may be extended backwards. To do this, the formula is:
Spliced Index Number = Old Index No. of current period ×
(100 / Old Index No. of new base period)
Under this approach (Splicing the old series with the new series) the spliced
indices are as follows:
Year 1990 1991 1992 1993 1994 1995 1996 1997 1998
The index A was started in 1995 and continued up to 1998, in which year
another index B was started. Splice the index B to index A so that a
continuous series of index numbers from 1995 up to date may be available.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
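The forward-splicing formula can be sketched in Python. The series values below are hypothetical; only the mechanics follow the formula given above:

```python
def splice_forward(new_series, old_index_at_new_base):
    """Splice a new index series onto an old one.

    Each value of the new series (whose base year equals 100) is scaled
    by (old index at the new base period / 100), expressing the new
    series on the old base so the combined series is continuous.
    """
    return {year: value * old_index_at_new_base / 100
            for year, value in new_series.items()}

# Hypothetical figures: the old series (base 1985) stood at 250.0 in 1993,
# the year in which the new series (base 1993 = 100) was started.
new_series = {1993: 100.0, 1994: 104.0, 1995: 110.0}
spliced = splice_forward(new_series, 250.0)
# spliced[1995] expresses the 1995 index on the old 1985 base
```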
Deflating
As we know, the prices of goods and services gradually increase; as a result,
the purchasing power of money (value of money) decreases. Consequently, the
real wage becomes less than the money wage. In such a situation the real
wage may be obtained by reducing the money wage to the extent the price
level has risen. Thus, the process of finding the real wage by applying
appropriate price indices to the money wages, so as to allow for the changes
in the price level, is called 'deflating'. We may express this process by the
following formulae:
Real wage = (Money wage / Price index) × 100, and
Real Wage Index = (Real wage of the current year / Real wage of the base year) × 100
Let us take an example consisting of the following data related to wages and
price index of different years. It would illustrate the procedure of constructing
real wage index numbers.
Illustration 6
Table 12.6: Construction of Real Wage Index
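The deflating procedure can be sketched with assumed wage and price-index figures (the 1980 row mirrors the Rs. 300 income and index 400 example from the uses section):

```python
def real_wage(money_wage, price_index):
    """Deflate a money wage by the price index (base period = 100)."""
    return money_wage / price_index * 100

# Assumed money wages and consumer price indices (base 1970 = 100)
wages = {1970: 100, 1975: 180, 1980: 300}
prices = {1970: 100, 1975: 200, 1980: 400}

real = {year: real_wage(wages[year], prices[year]) for year in wages}
# real[1980] is Rs. 75, matching the worked example earlier in the unit
```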
12.8 LET US SUM UP
Index numbers can be used in several ways: to study trends and tendencies of
business activities, to provide guidelines in framing suitable policies, to
measure the real purchasing power of money, to help in transforming nominal
wages into real wages, and so on. The researcher may face various problems
in the construction of different types of indices, such as selection of the base
period, collection of data, selection of commodities, choice of averages and
weights, and selection of an appropriate index. These issues must be clarified
before constructing indices.
There are three principal types of indices: (i) price indices, (ii) quantity
indices, and (iii) value indices. Among these three, price indices are the most
common in analysing data.
(Classification chart: Index Numbers are classified into Unweighted and Weighted indices.)
12.9 KEY WORDS
Cost of Living Index Numbers: Represent the average change in the prices
paid by the consumer for specified goods and services over a period of time;
popularly known as the "Consumer Price Index Number".
Base period: It is the reference period against which comparisons are made.
Price Index: A measure of how much a price variable changes over a
period of time.
Value Index: A measure of changes in total monetary worth over time.
12.11 TERMINAL QUESTIONS/EXERCISES
1) What do you mean by an index number? Explain the uses of index numbers for
analysing the data.
2) Discuss various issues that arise in connection with the construction of an index
number.
3) Briefly explain different methods for construction of indices and their
limitations.
4) Why do we consider Fisher’s index as an ideal index?
5) Write short notes on:
a) Price Index
b) Quantity Index
c) Splicing of Indices
d) Deflating of Indices.
6)
Material    2000                      2004
            Inventory   Price (Rs.)   Inventory   Price (Rs.)
A                96         45            108         41
B               495         26            523         32
C             1,425          5          1,608          8
D               208         12            196          9
Find the price indices and quantity indices by using the methods of unweighted
index numbers and comment on the results.
7) A department of Statistics has collected the following data describing the prices
and quantities of harvested crops for the years 1990, 2000 and 2004 (price
per quintal and production in tons).
Item 1990 2000 2004
Price Production Price Production Price Production
Paddy 200 1,050 500 1,300 600 1,450
Wheat 250 940 550 1,220 700 1,450
Groundnut 350 400 800 500 1,000 480
Construct the price and quantity indices of Laspeyre's Index, Paasche's Index
and Fisher's Index for 2000 and 2004, using 1990 as the base period. Give your
comments on the results.
8) From the given data in Problem No. 7, find out the following:
i) Weighted average of relative price index numbers for 2004, using 1990 and
2000 as the base.
ii) Weighted average of relative quantity index for 2004, using 2000 as the base.
iii) Give your comments on the price indices.
9) Two price index series of cement are given below. Splice the old series with
the new series. By what per cent did the price of cement rise between 1995
and 2000?
Year Old series New series
Base (1990) Base (1998)
1995 156.6 -
1996 174.8 -
1997 162.3 -
1998 160.0 100.0
1999 - 106.4
2000 - 114.1
2001 - 112.2
10) Given below is the annual income of an Engineer and the general index number
of prices during 1997–2004. Construct the index number to show the change in
the real income of the Engineer.
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 13 PROBABILITY AND PROBABILITY RULES
STRUCTURE
13.0 Objectives
13.1 Introduction
13.2 Meaning and History of Probability
13.3 Terminology
13.4 Fundamental Concepts and Approaches to Probability
13.5 Probability Rules
13.5.1 Addition Rule for Mutually Exclusive Events
13.5.2 Addition Rule for Non-mutually Exclusive Events
13.6 Probability Under Statistical Independence
13.7 Probability Under Statistical Dependence
13.8 Bayes' Theorem: Revision of A-Priori Probability
13.9 Let Us Sum Up
13.10 Key Words
13.11 Answers to Self Assessment Exercises
13.12 Terminal Questions/Exercises
13.13 Further Reading
13.0 OBJECTIVES
After studying this unit, you should be able to:
l comprehend the concept of probability,
l acquaint yourself with the terminology related to probability,
l understand the probability rules and their application in determining probability,
l differentiate between determination of probability under the condition of
statistical independence and statistical dependence,
l apply probability concepts and rules to real life problems, and
l appreciate the relevance of the study of probability in decision making.
13.1 INTRODUCTION
In the previous units we have discussed the application of descriptive statistics.
The subject matter of probability and probability rules provide a foundation for
Inferential Statistics. There are various business situations in which the decision
makers are forced to apply the concepts of probability. Decision making in
various situations is facilitated through formal and precise expressions for the
uncertainties involved. For instance, formal and precise expression of
stock-market price and product-quality uncertainties may go a long way to help
analyse, and facilitate decisions on, portfolio and sales planning respectively.
Probability theory provides us with the means to arrive at precise expressions
for taking care of uncertainties involved in different situations.
This unit starts with the meaning of probability and its brief historical
evolution. The next section covers the fundamental concepts of probability as
well as three approaches for determining probability. These approaches are:
i) the classical approach; ii) the relative frequency of occurrence approach;
and iii) the subjective approach.
Thereafter the addition rule for probability has been explained for both mutually
exclusive events and non-mutually exclusive events. Proceeding further the unit
addresses the important aspects of probability rules, the conditions of statistical
independence and statistical dependence. The concept of marginal, joint, and
conditional probabilities have been explained with suitable examples.
If the conditions of certainty alone were to prevail, life would be much
simpler. As is obvious, there are numerous real life situations in which
conditions of uncertainty and risk prevail. Consequently, we have to rely on the
theory of chance or probability in order to have a better idea about the possible
outcomes. There are social, economic and business sectors in which decision
making becomes a real challenge for the managers. They may be in the dark
about the possible consequences of their decisions and actions. Due to
increasing competitiveness the stakes have become higher and cost of making a
wrong decision has become enormous.
13.3 TERMINOLOGY
Before we proceed to discuss the fundamental concepts and approaches to
determining probability, let us now acquaint ourselves with the terminology
relevant to probability.
ii) Trial and Events: To conduct an experiment once is termed a trial, while the
possible outcomes, or combinations of outcomes, are termed events. For
example, a toss of a coin is a trial, and the occurrence of either a head or a
tail is an event.
iii) Sample Space: The set of all possible outcomes of an experiment is called the
sample space for that experiment. For example, in a single throw of a die, the
sample space is {1, 2, 3, 4, 5, 6}.
iv) Collectively Exhaustive Events: It is the set of all possible events that can
result from an experiment. It is obvious that the sum total of the probability
values of these events will always be one. For example, in a single toss of a
fair coin, the collectively exhaustive events are head and tail, since
P(Head) + P(Tail) = 0.5 + 0.5 = 1.
vi) Equally Likely Events: When all the possible outcomes of an experiment
have an equal probability of occurrence, such events are called equally likely
events. For example, in the case of tossing a fair coin, we have already seen
that
P(Head) = P (Tail) = 0.5
Many common experiments in real life also can have events, which have all of
the above properties. The best example is that of a single toss of a coin, where
both the possible outcomes or events of either head or tail coming on top are
collectively exhaustive, mutually exclusive and equally likely events.
(i) The value of probability of any event lies between 0 and 1. This may be
expressed as follows:
0 ≤ P (Event) ≤ 1
If the value of probability of an event is equal to zero, then the event is never
expected to occur and if the probability value is equal to one, the event is
always expected to occur.
(ii) The sum of the simple probabilities for all possible outcomes of an activity must
be equal to one.
Before proceeding further, first of all, let us discuss different approaches to
defining probability concept.
Approaches to Probability
There are three approaches to determine probability. These are :
a) Classical Approach: The classical approach to defining probability is based on
the premise that all possible outcomes or elementary events of an experiment
are mutually exclusive and equally likely. The term equally likely means that
each of all the possible outcomes has an equal chance of occurrence. Hence, as
per this approach, the probability of occurrence of any event ‘E’ is given as:
P(E) = Number of outcomes favourable to E / Total number of equally likely outcomes
Example: When we toss a fair coin, the probability of getting a head would be
1/2. Similarly, when a die is thrown, the probability of getting an odd number
is 3/6 or 1/2.
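The classical rule just illustrated can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text; the helper name `classical_probability` is ours.

```python
from fractions import Fraction

def classical_probability(favourable, total):
    """Classical approach: favourable outcomes / total equally likely outcomes."""
    return Fraction(favourable, total)

# Probability of a head on one toss of a fair coin: 1 favourable of 2 outcomes.
p_head = classical_probability(1, 2)

# Probability of an odd number on one throw of a fair die: {1, 3, 5} out of 6.
p_odd = classical_probability(3, 6)

print(p_head)  # 1/2
print(p_odd)   # 1/2
```

Using `Fraction` keeps the ratios exact, matching the hand computations above.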
The premise that all outcomes are equally likely assumes that the outcomes are
symmetrical. Symmetrical outcomes are possible when the coin or die being
tossed is fair. This requirement restricts the application of probability only to
such experiments which give rise to symmetrical outcomes. The classical
approach, therefore, provides no answer to problems involving asymmetrical
outcomes. And we do come across such situations more often in real life.
Thus, the classical approach to probability suffers from the following limitations:
i) The approach is not useful when the events cannot be considered “equally
likely”. ii) It fails to deal with questions like: what is the probability that a bulb
will burn out within 2,000 hours? What is the probability that a female will die
before the age of 50 years? etc.
This approach too has limited practical utility because the computation of
probability requires repetition of an experiment a large number of times. This is
particularly true where an event occurs only once, so that repetitive occurrence
under precisely the same conditions is neither possible nor desirable.
Importantly, these three approaches complement one another because where one
fails, the other takes over. However, all are identical in as much as probability
is defined as a ratio or a weight assigned to the likelihood of occurrence of
an event.
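The relative frequency of occurrence approach can be illustrated by simulation. The sketch below is a hypothetical illustration (the function name and seed are ours): repeating a fair-coin toss many times drives the observed frequency towards the classical value of 0.5.

```python
import random

def relative_frequency(trials, seed=42):
    """Relative-frequency approach: estimate P(head) by repeating the toss."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

# With many trials the estimate settles near the classical value of 0.5.
estimate = relative_frequency(100_000)
```

As the number of trials grows, the gap between the estimate and 0.5 shrinks, which is exactly the sense in which the two approaches complement each other.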
The following rules of probability are useful for calculating the probability of an
event/events under different situations.
If A and B are mutually exclusive events, the addition rule is
P(A or B) = P(A) + P(B). This rule is depicted in Figure 13.1 below, where
the two circles P(A) and P(B) do not overlap.
Figure: 13.1
The essential requirement for any two events to be mutually exclusive is that
there are no outcomes common to the occurance of both. This condition is
satisfied when sample space does not contain any outcome favourable to the
occurance of both A and B means A ∩ B = φ
So, P(E) = 1 − P(E′), where E′ (not-E) is the complement of event E.
Figure 13.2 depicts two overlapping circles P(A) and P(B), whose common
region is P(A and B) = P(A ∩ B).
Figure: 13.2
Thus, it is clear that the probability of the outcomes that are common to both
the events is to be subtracted from the sum of their simple probabilities.
Solution: These events are not mutually exclusive, so the required probability
of drawing a Jack or a spade is given by:
Solution: P(Male or over 35) = P(Male) + P(over 35) − P(Male and over 35)
= 3/5 + 2/5 − 1/5 = 4/5
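The addition rule for both cases can be written as one small function — a sketch of ours, not from the text — where the overlap term defaults to zero for mutually exclusive events.

```python
from fractions import Fraction

def p_or(p_a, p_b, p_a_and_b=Fraction(0)):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
    For mutually exclusive events the overlap term is zero."""
    return p_a + p_b - p_a_and_b

# Worked example from the text: P(Male or over 35) = 3/5 + 2/5 - 1/5.
p = p_or(Fraction(3, 5), Fraction(2, 5), Fraction(1, 5))
print(p)  # 4/5
```

Calling `p_or` without the third argument gives the mutually exclusive case directly.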
P (H) = ½ = 0.5
Another example: in a throw of a fair die, the marginal probability
of the face bearing the number 3 is:
P(3) = 1/6 = 0.166
Since, the tosses of the die are independent of each other, this is a case of
statistical independence.
Take another example: when a fair die is thrown twice in quick succession,
the probability of having a 2 in the first throw and a 4 in the second throw is
given as:
P(2 in 1st throw and 4 in 2nd throw)
= P(2 in the 1st throw) × P(4 in the 2nd throw)
= 1/6 × 1/6 = 1/36 = 0.028
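The multiplication rule under statistical independence, as in the two-throw die example above, reduces to a plain product of marginals (a minimal sketch of ours):

```python
from fractions import Fraction

# Under statistical independence the joint probability is the product of the
# marginals: P(2 on 1st throw and 4 on 2nd throw) = P(2) * P(4).
p_2_first = Fraction(1, 6)
p_4_second = Fraction(1, 6)
p_joint = p_2_first * p_4_second

print(p_joint)  # 1/36, which is approximately 0.028
```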
For example, suppose we want to find the probability of a head coming up
in the second toss of a fair coin, given that the first toss has already resulted
in a head. Symbolically, we can write it as:
P (H2/H1)
As, the two tosses are statistically independent of each other
so, P (H2/H1) = P (H2)
The following Table 13.1 summarizes these three types of probabilities, their
symbols and their mathematical formulae under statistical independence.
Table 13.1
Type of probability    Symbol    Formula
Marginal               P(A)      P(A)
Joint                  P(AB)     P(A) × P(B)
Conditional            P(A/B)    P(A)
For example, if the first child of a couple is a girl, find the probability that
the second child will be a boy. In this case:
P (B/G) = P (B)
As both the events are independent of each other, the conditional probability of
having the second child as a boy on condition that their first child was a girl, is
equal to the marginal probability of having the second child as a boy.
For example, take the case of rain in different states of India. Suppose, the
probability of having rain in different states of India is given as:
For example, an urn contains 3 white balls and 7 black balls. We draw a ball
from the urn, replace it, and then draw a second ball. Now, we have to
find the probability of drawing a black ball in the second draw, given that
the ball drawn in the first attempt was a white one.
a) The time until the failure of a watch and of a second watch marketed by
different companies – Yes/No
b) The life span of the current Indian PM and that of the current Pakistani
President – Yes/No
c) The takeover of a company and a rise in the price of its stock – Yes/No
3) What is the probability that in selecting two cards one at a time from a deck
with replacement, the second card is
4) A bag contains 32 marbles: 4 are red, 9 are black, 12 are blue, 6 are yellow and
1 is purple. Marbles are drawn one at a time with replacement. What is the
probability that:
a) The second marble is yellow given the first one was yellow?
b) The second marble is yellow given the first one was black?
c) The third marble is purple given both the first and second were purple?
There are three types of probability under statistical dependence case. They
are:
a) Conditional Probability;
b) Joint Probability;
c) Marginal Probability
Let us discuss the concept of the three types.
The conditional probability of event A, given that event B has already
occurred, can be calculated as follows:
P(A/B) = P(AB) / P(B)
(i) P(CD) = 3/10 = joint probability that the ball drawn is both a coloured
and a dotted one.
Similarly, P(CS) = 1/10, P(GD) = 2/10, and P(GS) = 4/10.
So, P(D/C) = P(DC) / P(C)
where P(C) = probability of drawing a coloured ball from the box = 4/10
(4 coloured balls out of 10 balls).
∴ P(D/C) = (3/10) / (4/10) = 0.75
4 / 10
ii) Similarly, P(S/C) = conditional probability of drawing a striped ball, given
that it is a coloured one:
P(S/C) = P(SC) / P(C) = (1/10) / (4/10) = 0.25
Thus, the probability that a coloured ball is dotted is 0.75. Similarly, the
probability that a coloured ball is striped is 0.25.
b) Continuing the same illustration, if we wish to find the probability of
(i) P (D/G) and (ii) P (S/G)
Solution: i) P(D/G) = P(DG) / P(G) = (2/10) / (6/10) = 1/3 = 0.33
ii) P(S/G) = P(SG) / P(G) = (4/10) / (6/10) = 2/3 = 0.66
Solution: (i) P(G/D) = P(GD) / P(D) = (2/10) / (5/10) = 0.4
and (ii) P(C/D) = P(CD) / P(D) = (3/10) / (5/10) = 0.6
Solution: (i) P(C/S) = P(CS) / P(S) = (1/10) / (5/10) = 0.2
(ii) P(G/S) = P(GS) / P(S) = (4/10) / (5/10) = 0.8
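The conditional probabilities above can all be derived mechanically from the joint probabilities of the 10-ball box. The sketch below assumes the joint values stated in the text (C = coloured, G = grey, D = dotted, S = striped); the helper names are ours.

```python
from fractions import Fraction

# Joint probabilities for the 10-ball box used in the text.
joint = {('C', 'D'): Fraction(3, 10), ('C', 'S'): Fraction(1, 10),
         ('G', 'D'): Fraction(2, 10), ('G', 'S'): Fraction(4, 10)}

def marginal(value, axis):
    """Sum the joint probabilities in which `value` occurs on the given axis
    (axis 0 = colour, axis 1 = pattern)."""
    return sum(p for key, p in joint.items() if key[axis] == value)

def conditional(pattern, colour):
    """P(pattern / colour) = P(pattern and colour) / P(colour)."""
    return joint[(colour, pattern)] / marginal(colour, 0)

print(conditional('D', 'C'))  # 3/4, i.e. the 0.75 computed in the text
print(conditional('S', 'C'))  # 1/4
```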
The formula for calculating the joint probability of two events under the
condition of statistical dependence is derived from the conditional probability
formula.
Therefore, the joint probability of two statistically dependent events A and B
is given by the following formula:
P(AB) = P(A/B) × P(B), and likewise P(BA) = P(B/A) × P(A)
Since P(AB) = P(BA), the products on the RHS of the two formulae must
also be equal to each other.
Notice that this formula is not the same as under conditions of statistical
independence, where P(BA) = P(B) × P(A). Continuing with our previous
illustration 4, of a box containing 10 balls, the value of different joint
probabilities can be calculated as follows:
Converting the above general formula, i.e., P(AB) = P(A/B) × P(B), to our
illustration and to the terms coloured, dotted, striped, and grey, we would have
calculated the joint probabilities of P(CD), P(GS), P(GD), and P(CS) as
follows:
Note: The values of P (C/D), P (G/S), P (G/D), and P (C/S) have been already
computed in conditional probability under statistical dependence.
c) Marginal Probability Under the Condition of Statistical Dependence
Finally, we discuss the concept of marginal probability under the condition of
statistical dependence. It can be computed by summing up all the probabilities
of those joint events in which that event occurs whose marginal probability we
want to calculate.
Solution: i) We can obtain the marginal probability of the event ‘dotted balls’ by
adding the probabilities of all the joint events in which dotted balls occurred.
In the same manner, we can compute the marginal probabilities of the
remaining events as follows:
iv) P (S) = P (CS) + P (GS) = 1/10 + 4/10 = 0.5
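The summation just performed can be checked in a couple of lines, using the joint values from the text (a minimal sketch of ours):

```python
from fractions import Fraction

# Marginal probability under dependence: sum the joint probabilities of all
# joint events in which the event of interest occurs.
p_CD, p_CS, p_GD, p_GS = (Fraction(3, 10), Fraction(1, 10),
                          Fraction(2, 10), Fraction(4, 10))

p_D = p_CD + p_GD   # dotted:  3/10 + 2/10 = 1/2
p_S = p_CS + p_GS   # striped: 1/10 + 4/10 = 1/2

print(p_D, p_S)  # 1/2 1/2
```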
The following Table 13.2 summarizes the three types of probabilities, their
symbols and their mathematical formulae under statistical dependence.
Table 13.2
Type of probability    Symbol    Formula
Marginal               P(A)      Sum of the joint probabilities in which A occurs
Joint                  P(AB)     P(A/B) × P(B)
Conditional            P(A/B)    P(AB) / P(B)
1) According to a survey, the probability that a family owns two cars, given that
its annual income is greater than Rs. 35,000, is 0.75. Of the households
surveyed, 60 per cent had incomes over Rs. 35,000 and 52 per cent had two
cars. What is the probability that a family has two cars and an income over
Rs. 35,000 a year?
2) Given that P(A) = 3/14, P (B) = 1/6, P(C) = 1/3, P (AC) = 1/7 and P (B/C)
= 5/21, find the following probabilities: P (A/C), P (C/A), P (BC), P (C/B).
3) At a restaurant, a social worker gathers the following data. Of those visiting
the restaurant, 59 per cent are male, 32 per cent are alcoholics and 21 per
cent are male alcoholics. What is the probability that a random male visitor to
the restaurant is an alcoholic?
A-priori probability → Bayes’ process (incorporating new information) →
Posterior probabilities
P(A/B) = P(AB) / P(B)
Now, we can find out the value of P (F/3), as well as P (L/3), by using the
formula
P(F/3) = P(F and 3) / P(3) = 0.083 / 0.383 = 0.216, and
P(L/3) = P(L and 3) / P(3) = 0.300 / 0.383 = 0.784
Our original estimate of the probability that the fair die was rolled was 0.5,
and similarly 0.5 for the loaded die. But with a single roll of the die, given
that 3 has appeared on top, the probability that the loaded die was rolled
increases to 0.78, while the probability that it was the fair one decreases to
0.22. This example illustrates the power of Bayes’ theorem.
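The fair-versus-loaded die revision can be reproduced in a few lines. This is a sketch of ours; the likelihoods 1/6 and 0.6 are assumptions chosen to be consistent with the joint values 0.083 and 0.300 used in the text.

```python
# Bayes' theorem for the fair (F) vs. loaded (L) die example.
prior_F, prior_L = 0.5, 0.5       # original estimates before the roll
like_F, like_L = 1 / 6, 0.6       # assumed P(3 comes up | die used)

joint_F = prior_F * like_F        # P(F and 3), approximately 0.083
joint_L = prior_L * like_L        # P(L and 3) = 0.300
p_3 = joint_F + joint_L           # P(3), approximately 0.383

post_F = joint_F / p_3            # posterior for the fair die, about 0.22
post_L = joint_L / p_3            # posterior for the loaded die, about 0.78
```

A single observation shifts the weight sharply towards the loaded die, exactly as the text describes.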
1. There are two machines, A and B, in a factory. As per the past information,
these two machines produced 30% and 70% of items of output respectively.
Further, 5% of the items produced by machine A and 1% produced by machine
B were defective. If a defective item is drawn at random, what is the
probability that the defective item was produced by machine A or machine B?
13.9 LET US SUM UP
At the beginning of this unit the historical evolution and the meaning of
probability were discussed, and the contribution of leading mathematicians was
highlighted. Fundamental concepts and approaches to determining probability
have been explained. The three approaches used to determine probability in
risky and uncertain situations, namely the classical, the relative frequency, and
the subjective approaches, have been discussed.
2. a) 1/2; b) 1/2.
3. a) P (Face2/Red1) = 3/13
b) P (Ace2/Face1) = 1/13
3. 0.356.
Supplementary Illustrations
Here, we have
P(A) = 0.25
P(B) = 0.40 and
P(A ∪ B) = 0.5, then P(A ∩ B) = ?
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
∴ P(A ∩ B) = 0.25 + 0.40 − 0.5 = 0.15
A non-leap year consists of 365 days, i.e., a total of 52 full weeks and one
extra day. So, a non-leap year contains 53 Mondays only when that extra
day is a Monday. But that extra day can be any one of the seven days,
viz., Sunday, Monday, Tuesday, Wednesday, Thursday, Friday or Saturday.
Hence the required probability is 1/7.
4) What is the probability of having at least one head on two tosses of a fair coin?
The possible ways in which a head may occur are H1 H2; H1 T2; T1 H2.
Each of these has a probability of 0.25, since the two tosses are statistically
independent events. Therefore, the probability of at least one head on two
tosses is 0.25 × 3 = 0.75.
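The same answer can be obtained by enumerating the sample space of two tosses — a minimal sketch of ours:

```python
from itertools import product

# Enumerate the sample space of two tosses of a fair coin. Each of the four
# equally likely outcomes (HH, HT, TH, TT) has probability 0.25.
outcomes = list(product('HT', repeat=2))
p_at_least_one_head = sum(0.25 for o in outcomes if 'H' in o)

print(p_at_least_one_head)  # 0.75
```

Enumeration scales to any small sample space and avoids missing a favourable outcome by hand.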
5) Suppose we are tossing an unfair coin, where the probability of getting a head
in a toss is 0.8. If we have to calculate the probability of having three heads
in three consecutive trials, then, as given,
P(3 heads) = 0.8 × 0.8 × 0.8 = 0.512
If we have to calculate the probability of having three consecutive tails in
three trials, it is (1 − 0.8)³ = 0.2 × 0.2 × 0.2 = 0.008.
Suppose at random one ball in picked out from the urn, then we have to
find out the probability that:
P (WL) = 0.4
P (YL) = 0.3
P (WN) = 0.2
P (YN) = 0.1
Also, P (W) = 0.6
P (Y) = 0.4
P (L) = 0.7
P (N) = 0.3
As we also know:
P (W) = P (WL) + P (WN) = 0.4 + 0.2 = 0.6
P (Y) = P (YL) + P (YN) = 0.3 + 0.1 = 0.4
P (L) = P (WL) + P (YL) = 0.4 + 0.3 = 0.7
and P (N) = P (WN) + P (YN) = 0.2 + 0.1 = 0.3
So, (i) P(L) = 0.7
(ii) P(L/Y) = P(LY) / P(Y) = 0.3 / 0.4 = 0.75
and P(C/D) = P(CD) / P(D) = (3/10) / (5/10) = 0.6
P(M1/D) = P(M1 and D) / P(D) = 0.018 / 0.098 = 0.1837
P(M2/D) = P(M2 and D) / P(D) = 0.030 / 0.098 = 0.3061
P(M3/D) = P(M3 and D) / P(D) = 0.050 / 0.098 = 0.5102
Total = 1.0000
These three conditional probabilities are called the posterior probabilities.
It is clear from the revised probability values that the probability that a
defective unit was produced by M1 is 0.18, by M2 0.31 and by M3 0.51,
against the past probabilities 0.3, 0.2, and 0.5 respectively. And the probability
that a unit produced by this firm is defective is 0.098.
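The posterior revision above needs only the joint probabilities given in the text; normalising them by their sum yields the posteriors (a minimal sketch of ours):

```python
# Posterior probabilities for the three machines, computed directly from the
# joint probabilities P(machine and defective) stated in the text.
joints = {'M1': 0.018, 'M2': 0.030, 'M3': 0.050}

p_D = sum(joints.values())                       # P(D) = 0.098
posteriors = {m: j / p_D for m, j in joints.items()}
# posteriors: M1 ~ 0.1837, M2 ~ 0.3061, M3 ~ 0.5102; they sum to 1.
```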
13.12 TERMINAL QUESTIONS/EXERCISES
4. State and prove the addition rule of probability for two mutually exclusive
events.
7. One ticket is drawn at random from an urn containing tickets numbered from 1
to 50. Find out the probability that:
i) It is a multiple of 5 or 7
ii) It is a multiple of 4 or 3
[Answer: i) 8/25, ii) 12/25]
8. If two dice are being rolled, then find out the probabilities that:
(b) If two coins are tossed once, what is the probability of getting
(i) Both heads
(ii) At least one head ?
[Answer: (a) 1/2 (b) (i) 1/4 (ii) 3/4]
10. Given that P(A) = 3/14, P(B) = 1/6, P(C) = 1/3, P(AC) = 1/7, P(B/C) = 5/21.
11. A T.V. manufacturing firm purchases a certain item from three suppliers X, Y
and Z. They supply 60%, 30% and 10% respectively. It is known that 2%, 5%
and 8% of the items supplied by the respective suppliers are defective. On a
particular day, the firm received items from three suppliers and the contents get
mixed. An item is chosen at random:
[Ans. P (D) = 0.035 P (X/D) = 0.34, P (Y/D) = 0.43, and P (Z/D) = 0.23].
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
Levin, R.I. and Rubin, D.S., 1991, Statistics for Management, PHI: New
Delhi.
Hooda, R.P., 2001, Statistics for Business and Economics, Macmillan India
Limited: Delhi.
Gupta, S.P., 2000, Statistical Methods, Sultan Chand & Sons: Delhi.
UNIT 14 PROBABILITY DISTRIBUTIONS
STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Types of Probability Distribution
14.3 Concept of Random Variables
14.4 Discrete Probability Distribution
14.4.1 Binomial Distribution
14.4.2 Poisson Distribution
14.5 Continuous Probability Distribution
14.5.1 Normal Distribution
14.5.2 Characteristics of Normal Distribution
14.5.3 Importance and Application of Normal Distribution
14.6 Let Us Sum Up
14.7 Key Words
14.8 Answers to Self Assessment Exercises
14.9 Terminal Questions/Exercises
14.10 Further Reading
14.0 OBJECTIVES
After studying this unit, you should be able to:
14.1 INTRODUCTION
A probability distribution is essentially an extension of the theory of probability
which we have already discussed in the previous unit. This unit introduces the
concept of a probability distribution and shows how the various basic
probability distributions (binomial, Poisson, and normal) are constructed. All
these probability distributions have immensely useful applications and explain a
wide variety of business situations which call for computation of desired
probabilities.
This means that the total probability (unity) of a certain experiment is
distributed over a set of disjoint events making up a complete group. In
general, a tabular recording of the probabilities of all the possible outcomes
that could result if a random (chance) experiment is done is called a
“probability distribution”. It is also termed a theoretical frequency distribution.
In a frequency distribution, the class frequencies add up to the total number
of observations (N), whereas in the case of a probability distribution the
possible outcomes (probabilities) add up to ‘one’. Like the former, a probability
distribution is also described by a curve and has its own mean, dispersion, and
skewness.
Table 14.2: Probability Distribution of the Possible No. of Heads from a
Two-toss Experiment of a Fair Coin
No. of Heads (H)    Outcomes           Probability P(H)
0                   (T, T)             1/4 = 0.25
1                   (H, T) + (T, H)    1/2 = 0.50
2                   (H, H)             1/4 = 0.25
We must note that the above tables are not the real outcome of tossing a fair
coin twice. They are theoretical outcomes, i.e., they represent the way in which
we expect our two-toss experiment of an unbiased coin to behave over time.
In the example given in the Introduction, we have seen that the outcomes of
the experiment of a two-toss of a fair coin were expressed in terms of the
number of heads. We found in the example that H (head) can assume values
of 0, 1 and 2 and that corresponding to each value, a probability is associated.
This uncertain real variable H, which assumes different numerical values
depending on the outcomes of an experiment, and to each of whose values a
probability assignment can be made, is known as a random variable. The
resulting representation of all the values with their probabilities is termed the
probability distribution of H.
H:     0     1     2
P(H):  0.25  0.50  0.25
In the above situations, we have seen that the random variable takes a limited
number of values. There are certain situations where the variable under
consideration may have infinite values. Consider for example, that we are
interested in ascertaining the probability distribution of the weight of one kg.
coffee packs. We have reasons to believe that the packing process is such that
a certain percentage of the packs weigh slightly below one kg., and some
packs weigh above one kg. It is easy to see that it is essentially by chance that the pack
will weigh exactly 1 kg., and there are an infinite number of values that the
random variable ‘weight’ can take. In such cases, it makes sense to talk of the
probability that the weight will be between two values, rather than the
probability of the weight taking any specific value. These types of random
variables which can take an infinitely large number of values are called
continuous random variables, and the resulting distribution is called a
continuous probability distribution. The function that specifies the probability
distribution of a continuous random variable is called the probability density
function (p.d.f.).
Table 14.4
Value (X)    Probability P(X)    X × P(X)
100          0.3                 30
110          0.6                 66
120          0.1                 12
Expected value: E(X) = 30 + 66 + 12 = 108
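The expected-value computation of Table 14.4 is a one-line sum of value × probability — a minimal sketch of ours using the table's figures:

```python
# Expected value of a discrete random variable: sum of value x probability.
values = [100, 110, 120]
probs = [0.3, 0.6, 0.1]

expected = sum(v * p for v, p in zip(values, probs))
# expected is close to 108, matching the hand computation from Table 14.4.
```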
Now, we will examine situations involving discrete random variables and
discuss the methods for assessing them.
14.4 DISCRETE PROBABILITY DISTRIBUTION
It is the basic and the most common probability distribution. It has been used to
describe a wide variety of processes in business. For example, a quality control
manager wants to know the probability of obtaining defective products in a
random sample of 10 products. If 10 per cent of the products are defective, he/
she can quickly obtain the answer, from tables of the binomial probability
distributions. It is also known as the Bernoulli Distribution, as it originated
with the Swiss mathematician James Bernoulli (1654–1705).
The binomial distribution describes discrete, not continuous, data resulting from
an experiment known as Bernoulli Process. Binomial distribution is a probability
distribution expressing the probability of one set of dichotomous alternatives, i.e.,
success or failure.
c) The trials are mutually independent i.e., the outcome of any trial is neither
affected by others nor affects others.
Assumptions: i) Each trial has only two possible outcomes, either yes or no,
success or failure, etc.
ii) Regardless of how many times the experiment is performed, the probability
of the outcome remains the same each time.
Hence the following form of the equation, for carrying out computations of the
binomial probability, is perhaps more convenient:
P(r) = [n! / (r! (n − r)!)] p^r q^(n−r)
If n is large, say in 50C3, then we can write (with the help of the above
explanation)
50C3 = (50 × 49 × 48) / (3 × 2 × 1)
Similarly,
75C5 = (75 × 74 × 73 × 72 × 71) / (5 × 4 × 3 × 2 × 1), and so on.
Illustration 1
A fair coin is tossed six times. What is the probability of obtaining four or more
heads?
Solution: When a fair coin is tossed, the probabilities of head and tail in the
case of an unbiased coin are equal, i.e.,
p = q = ½ or 0.5
P(r) = [n! / (r! (n − r)!)] p^r q^(n−r)
P(4) = [6! / (4! (6 − 4)!)] (0.5)^4 (0.5)^2
= [(6 × 5 × 4 × 3 × 2 × 1) / ((4 × 3 × 2 × 1)(2 × 1))] (0.0625)(0.25)
= [720 / (24 × 2)] (0.0625)(0.25) = 15 × 0.0625 × 0.25
= 0.234
The probability of obtaining 5 heads is:
P(5) = 6C5 (1/2)^5 (1/2)^(6−5)
= [6! / (5! (6 − 5)!)] (0.5)^5 (0.5)^1
= 6 × 0.03125 × 0.5
= 0.094
The probability of obtaining 6 heads is:
P(6) = 6C6 (1/2)^6 (1/2)^(6−6)
= [6! / (6! (6 − 6)!)] (0.5)^6 (0.5)^0
= 1 × 0.015625 × 1
= 0.016
∴ The probability of obtaining 4 or more heads is :
0.234 + 0.094 + 0.016 = 0.344
Illustration 2
q = 1 − 1/5 = 4/5
By the binomial probability law, the probability that out of 10 workers, ‘r’
workers suffer from a disease is given by:
P(r) = nCr p^r q^(n−r)
= 10Cr (1/5)^r (4/5)^(10−r); r = 0, 1, 2, …, 10
i) The required probability that exactly 2 workers will suffer from the disease
is given by:
P(2) = 10C2 (1/5)^2 (4/5)^8
ii) The required probability that not more than 2 workers will suffer from the
disease is given by:
P(0) = 10C0 (1/5)^0 (4/5)^10 = 0.107
P(1) = 10C1 (1/5)^1 (4/5)^9 = 0.269
P(2) = 10C2 (1/5)^2 (4/5)^8 = 0.302
We can represent the mean of the binomial distribution as:
Mean (µ) = np.
where, n = Number of trials; p = probability of success
And, we can calculate the standard deviation by:
σ = √(npq)
where, n = number of trials; p = probability of success; and q = probability of
failure = 1 − p
Illustration 3
If the probability of defective bolts is 0.1, find the mean and standard deviation
for the distribution of defective bolts in a total of 500.
Solution: Mean (µ) = np = 500 × 0.1 = 50
∴ σ = √(500 × 0.1 × 0.9) = √45 = 6.71
i) Determine the values of ‘p’ and ‘q’. If one of these values is known, the other
can be found out by the simple relationship p = 1–q and q = 1–p. If p and q are
equal, we can say, the distribution is symmetrical. On the other hand if ‘p’ and
‘q’ are not equal, the distribution is skewed. The distribution is positively
skewed, in case ‘p’ is less than 0.5, otherwise it is negatively skewed.
ii) Expand the binomial (p + q)n. The power ‘n’ is equal to one less than the
number of terms in the expanded binomial. For example, if 3 coins are tossed
(n = 3) there will be four terms, when 5 coins are tossed (n = 5) there will be 6
terms, and so on.
iii) Multiply each term of the expanded binomial by N (the total frequency), in
order to obtain the expected frequency in each category.
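The three fitting steps above can be sketched as a short function — an illustration of ours, not from the text — which expands (p + q)^n term by term and scales by N.

```python
from math import comb

def fit_binomial(n, p, N):
    """Expected frequencies: each term of the expansion of (p + q)^n times N."""
    q = 1 - p
    return [N * comb(n, r) * p**r * q**(n - r) for r in range(n + 1)]

# Eight fair coins tossed 256 times (the setting of Illustration 4 below).
expected = fit_binomial(8, 0.5, 256)

print(expected)  # [1.0, 8.0, 28.0, 56.0, 70.0, 56.0, 28.0, 8.0, 1.0]
```

Note the n + 1 = 9 terms, and that the expected frequencies sum back to N = 256.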
Let us consider an illustration for fitting a binomial distribution.
Illustration 4
Eight coins are tossed at a time 256 times. Number of heads observed at each
throw is recorded and the results are given below. Find the expected
frequencies. What are the theoretical values of mean and standard deviation?
Also calculate the mean and standard deviation of the observed frequencies.
σ = √(npq) = √(8 × ½ × ½) = √2 = 1.414
Note: The procedure for computation of mean and standard deviation of the
observed frequencies has been already discussed in Units 8 and 9 of this
course. Check these values by computing on your own.
3) The following data shows the result of the experiment of throwing 5 coins at a
time 3,100 times and the number of heads appearing in each throw. Find the
expected frequencies and comment on the results. Also calculate mean and
standard deviation of the theoretical values.
No. of heads: 0 1 2 3 4 5
frequency: 32 225 710 1,085 820 228
This would be comparatively simpler to deal with, and is given by the Poisson distribution formula as follows:
p(r) = (m^r e^–m) / r!
where, p(r) = probability of the desired number of successes (r)
c) It consists of a single parameter “m” only. So, the entire distribution can be
obtained by knowing this value only.
In the Poisson distribution, the mean (m) and the variance (σ²) have the same value, i.e.,
Mean = Variance = np = m
Since n is large and p is small, the Poisson distribution is applicable. Apply the formula:
p(r) = (m^r e^–m) / r!
p(5) = (m^5 e^–m) / 5!, where m = np = 200 × 0.02 = 4; e = 2.7183 (constant)
∴ P(5) = (4^5 × 2.7183^–4) / (5 × 4 × 3 × 2 × 1)
= (1024 × 1/2.7183^4) / 120
= (1024 × 0.0183) / 120 = 0.156
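The arithmetic can be verified with a short sketch; `poisson_pmf` is an illustrative helper, and `math.exp` replaces the rounded constant e = 2.7183:

```python
from math import exp, factorial

def poisson_pmf(r, m):
    """P(r) = m**r * e**(-m) / r!"""
    return m**r * exp(-m) / factorial(r)

# n = 200, p = 0.02, so m = np = 4
print(round(poisson_pmf(5, 4.0), 3))   # 0.156
```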
Illustration 6
P(4) = (m^4 e^–m) / 4!, where, m = np = 30 × 0.02 = 0.6
e = 2.7183 (constant)
Illustration 7
We can write P(r) = (0.439^r × e^–0.439) / r!. Substituting r = 0, 1, 2, 3, and 4, we get the probabilities for various values of r, as shown below:
(P0) = (m^r e^–m) / r! = (0.439^0 × 2.7183^–0.439) / 0!
= 1 × (0.6443) / 1 = 0.6443
N(P0) = (P0) × N = 0.6443 × 330 = 212.62
Thus, the expected frequencies as per the Poisson distribution are:
No. of defects (x):          0        1       2       3      4
Expected frequencies (f):    212.62   93.34   20.49   3.00   0.33
Note: We can use Appendix Table-2, given at the end of this block, to
determine poisson probabilities quickly.
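The expected frequencies can be reproduced with a short sketch (illustrative; the last decimals differ slightly from the table because the text works with the rounded constant e = 2.7183):

```python
from math import exp, factorial

def poisson_expected_frequencies(m, N, r_max):
    """Expected frequency of r occurrences: N * m**r * e**(-m) / r!"""
    return [N * m**r * exp(-m) / factorial(r) for r in range(r_max + 1)]

# m = 0.439 defects per unit, N = 330 units inspected
f = poisson_expected_frequencies(m=0.439, N=330, r_max=4)
print([round(v, 2) for v in f])
```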
3) Four hundred car air-conditioners are inspected as they come off the
production line and the number of defects per set is recorded below. Find
the expected frequencies by assuming the poisson model.
No. of defects : 0 1 2 3 4 5
Probability and Hypothesis Testing
14.5 CONTINUOUS PROBABILITY DISTRIBUTION
In the previous sections, we have examined situations involving discrete random
variables and the resulting probability distributions. Let us now consider a
situation, where the variable of interest may take any value within a given
range. Suppose that we are planning to release water for hydropower
generation and irrigation. Depending on how much water we have in the
reservoir, viz., whether it is above or below the ‘normal’ level, we decide on
the quantity of water and time of its release. The variable indicating the
difference between the actual level and the normal level of water in the
reservoir, can take positive or negative values, integer or otherwise. Moreover,
this value is contingent upon the inflow to the reservoir, which in turn is
uncertain. This type of random variable which can take an infinite number of
values is called a continuous random variable, and the probablity distribution
of such a variable is called a continuous probability distribution.
Now we present one important probability density function (p.d.f), viz., the
normal distribution.
The normal distribution is the most versatile of all the continuous probability
distributions. It is useful in statistical inferences, in characterising uncertainities
in many real life situations, and in approximating other probability distributions.
As stated earlier, the normal distribution is suitable for dealing with variables
whose magnitudes are continuous. Many statistical data concerning business
problems are displayed in the form of normal distribution. Height, weight and
dimensions of a product are some of the continuous random variables which are
found to be normally distributed. This knowledge helps us in calculating the
probability of different events in varied situations, which in turn is useful for
decision-making.
Now we turn to examine the characteristics of normal distribution with the help
of the figure 14.1, and explain the methods of calculating the probability of
different events using the distribution.
[Fig. 14.1: The normal probability distribution is symmetrical around a vertical line erected at the mean; the mean, median and mode coincide, and each tail extends indefinitely but never reaches the horizontal axis]
1) The curve has a single peak; thus it is unimodal, i.e., it has only one mode, and has a bell shape.
2) The mean, median and mode of the distribution coincide at the centre of the curve.
3) The two tails of the normal probability distribution extend indefinitely but never touch the horizontal axis.
Irrespective of the value of mean (µ) and standard deviation (σ), for a normal
distribution, the total area under the curve is 1.00. The area under the normal
curve is approximately distributed by its standard deviation as follows:
µ±1σ covers 68.27% area, i.e., 34.135% area will lie on either side of µ; µ±2σ covers 95.45% area; and µ±3σ covers 99.73% area.
Step 1: Convert the normal variate X into the standard normal variate Z:
Z = (X − µ) / σ
Where, X is the value of the random variable, µ is the mean and σ is the standard deviation of its distribution.
Step 2: Look up the probability of z value from the Appendix Table-3, given at
the end of this block, of normal curve areas. This Table is set up to
provide the area under the curve to any specified value of Z. (The
area under the normal curve is equal to 1. The curve is also called
the standard probability curve).
Let us consider the following illustration to understand how the table should be consulted in order to find the area under the normal curve.
Illustration 8
(a) Find the area under the normal curve for Z = 1.54.
Solution: Consulting the Appendix Table-3 given at the end of this block, we
find the entry corresponding to Z = 1.54 the area is 0.4382 and this measures
the Shaded area between Z = 0 and Z = 1.54 as shown in the following figure.
[Figure: shaded area of 0.4382 between Z = 0 (at µ) and Z = 1.54]
(b) Find the area under the normal curve for Z = –1.46.
Solution: Since the curve is symmetrical, we can obtain the area between Z = –1.46 and Z = 0 by considering the area corresponding to Z = 1.46. Hence,
when we look at Z of 1.46 in Appendix Table-3 given at the end of this block,
we see the probability value of 0.4279. This value is also the probability value
of Z = –1.46 which must be shaded on the left of the µ as shown in the
following figure.
[Figure: shaded area of 0.4279 between Z = –1.46 and Z = 0 (at µ)]
c) Find the area to the right of Z = 0.25.
Solution: The table value for Z = 0.25 is 0.0987. Since the area to the right of µ is 0.5000, the area to the right of Z = 0.25 is 0.5000 – 0.0987 = 0.4013, as shown in the following figure.
[Figure: shaded area of 0.4013 to the right of Z = 0.25]
d) Find the area to the left of Z = 1.83.
Solution: If we are interested in finding the area to the left of Z (positive
value), we add 0.5000 to the table value given for Z. Here, the table value for
Z (1.83) = 0.4664. Therefore, the total area to the left of Z = 0.9664 (0.5000 +
0.4664) i.e., equal to the shaded area as shown below:
[Figure: areas 0.5000 (left of µ) and 0.4664 (between µ and Z = 1.83) shaded]
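These table lookups can be checked with `math.erf`; `phi` and `area_mean_to` are illustrative helper names:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, giving the area to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_mean_to(z):
    """Area between Z = 0 and Z = z (the Appendix Table-3 entry)."""
    return phi(abs(z)) - 0.5

print(round(area_mean_to(1.54), 4))   # 0.4382, as in part (a)
a = area_mean_to(-1.46)               # symmetry: same as Z = +1.46 (table: 0.4279)
print(round(phi(1.83), 4))            # 0.9664 = 0.5000 + 0.4664, as in part (d)
```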
Illustration 9
The mean height of soldiers is 68.22 inches with a variance of 10.8 (inches)². Find the expected number of soldiers in a regiment of 1,000 who are over six feet (72 inches) tall.
Solution: Z = (X − µ) / σ
X = 72 inches; µ = 68.22 inches; and σ = √10.8 = 3.286
∴ Z = (72 − 68.22) / 3.286 = 1.15
[Figure: area 0.3749 between µ (68.22) and 72; area 0.1251 to the right of 72]
Area to the right of the ordinate at 1.15 from the normal table is (0.5–0.3749)
= 0.1251. Hence, the probability of getting soldiers above six feet is 0.1251 and
out of 1,000 soldiers, the expectation is 1,000 × 0.1251 = 125.1 or 125. Thus,
the expected number of soldiers over six feet tall is 125.
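A quick check of this illustration (an illustrative sketch using `math.erf` in place of the table):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Φ(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, variance, x = 68.22, 10.8, 72.0     # heights in inches; six feet = 72 inches
z = (x - mu) / sqrt(variance)           # ≈ 1.15
expected = 1000 * (1 - phi(z))          # soldiers expected to be over six feet
print(round(expected))                  # 125
```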
Illustration 10
(a) 15,000 students appeared for an examination. The mean marks were 49 and the
standard deviation of marks was 6. Assuming the marks to be normally
distributed, what proportion of students scored more than 55 marks?
Solution: Z = (X − µ) / σ
X = 55; µ = 49; σ = 6
∴ Z = (55 − 49) / 6 = 1
For Z = 1, the area is 0.3413 (as per Appendix Table-3). Therefore, the proportion of students scoring more than 55 marks is 0.5000 − 0.3413 = 0.1587, i.e., about 15.87 per cent of the students.
(b) If in the same examination, Grade ‘A’ is to be given to students scoring more
than 70 marks, what proportion of students will receive grade ‘A’?
Solution: Z = (X − µ) / σ
X = 70; µ = 49; σ = 6
∴ Z = (70 − 49) / 6 = 3.5
The table gives the area under the standard normal curve corresponding to Z = 3.5 as 0.4998. Hence, the proportion of students receiving grade ‘A’ is 0.5000 − 0.4998 = 0.0002, i.e., 0.02 per cent.
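Both parts of this illustration can be checked in code (illustrative sketch; `math.erf` replaces the table lookup):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Φ(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 49, 6
p_above_55 = 1 - phi((55 - mu) / sigma)   # part (a): Z = 1
p_above_70 = 1 - phi((70 - mu) / sigma)   # part (b): Z = 3.5
print(round(p_above_55, 4))               # 0.1587
print(round(p_above_70, 4))               # 0.0002
```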
Illustration 11
In a training programme (self-administered) to develop marketing skills of marketing
personnel of a company, the participants indicate that the mean time on the
programme is 500 hours and that this normally distributed random variable has a
standard deviation of 100 hours. Find out the probability that a participant selected
at random will take:
i) fewer than 570 hours to complete the programme, and
ii) between 430 and 580 hours to complete the programme.
Solution: (i) To get the Z value for the probability that a candidate selected at
random will take fewer than 570 hours, we have
Z = (x − µ) / σ = (570 − 500) / 100 = 70 / 100 = 0.7
The table value for Z = 0.7 is 0.2580, so
P(less than 570) = 0.5000 + 0.2580 = 0.7580
[Figure: area between µ and 570 (Z = 0.7) shaded; P(less than 570) = 0.7580]
Thus, the probability of a participant taking less than 570 hours to complete the programme is marginally higher than 75 per cent.
ii) In order to get the probability that a participant chosen at random will take between 430 and 580 hours to complete the programme, we must first compute the Z values for 430 and 580 hours.
Z = (x − µ) / σ
Z for 430 = (430 − 500) / 100 = −70 / 100 = −0.7
Z for 580 = (580 − 500) / 100 = 80 / 100 = 0.8
The table shows that the probability values for Z = –0.7 and Z = 0.8 are 0.2580 and 0.2881 respectively. This situation is shown in the following figure.
[Figure: areas between Z = –0.7 and Z = 0.8 shaded on either side of the mean]
Thus, the probability that the random variables lie between 430 and 580 hours is
0.5461 (0.2580 + 0.2881).
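Both probabilities can be reproduced in code (illustrative sketch; the table values round to 0.7580 and 0.5461):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, Φ(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 500, 100                    # programme hours
p_under_570 = phi((570 - mu) / sigma)   # part (i): Z = 0.7
p_between = phi((580 - mu) / sigma) - phi((430 - mu) / sigma)  # part (ii)
print(round(p_under_570, 4))            # 0.758
print(round(p_between, 4))
```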
(iii) To fit sampling distribution of various statistics like mean or variance etc.
Probability: Any numerical value between 0 and 1 both inclusive, telling about
the likelihood of occurrence of an event.
Probability Distribution: A curve that shows all the values that the random
variable can take and the likelihood that each will occur.
3) Define a binomial probability distribution. State the conditions under which the
binomial probability model is appropriate by illustrations.
a) What is the probability that at least one browsing customer will buy
something during a specified hour?
b) What is the probability that at least 4 browsing customers will buy
something during a specified hour?
c) What is the probability that no browsing customer will buy anything during a
specified hour?
d) What is the probability that not more than 4 browsing customers will
buy something during a specified hour?
[Ans. (a) .9953 (b) .7031 (c) .0047 (d) .5155]
10) Given a binomial distribution with n = 28 trials and p = .025, use the Poisson
approximation to the binomial to find:
11) The average number of customer arrivals per minute at a departmental stores is
2. Find the probability that during one particular minute:
12) A set of 5 fair coins was thrown 80 times, and the number of heads in each
throw was recorded and given in the following table. Estimate the probability of
the appearance of head in each throw for each coin and calculate the
theoretical frequency of each number of heads on the assumption that the
binomial law holds:
No. of heads: 0 1 2 3 4 5
Frequency: 6 20 28 12 8 6
13) Fit a poisson distribution to the following observed data and calculate the
expected frequencies:
Deaths: 0 1 2 3 4
Frequency: 122 60 15 2 1
14) Given that a random variable X has a binomial distribution with n = 50 trials and p = .25, use the normal approximation to the binomial to find:
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
UNIT 15 TESTS OF HYPOTHESIS–I
STRUCTURE
15.0 Objectives
15.1 Introduction
15.2 Point Estimation and Standard Errors
15.3 Interval Estimation
15.4 Confidence Limits, Confidence Interval and Confidence Co-efficient
15.5 Testing Hypothesis – Introduction
15.6 Theory of Testing Hypothesis – Level of Significance, Type-I and
Type-II Errors and Power of a Test
15.7 Two-tailed and One-tailed Tests
15.8 Steps to Follow for Testing Hypothesis
15.9 Tests of Significance for Population Mean–Z-test for variables
15.10 Tests of Significance for Population Proportion – Z-test for Attributes
15.11 Let Us Sum Up
15.12 Key Words and Symbols Used
15.13 Answers to Self Assessment Exercises
15.14 Terminal Questions/Exercises
15.15 Further Reading
15.0 OBJECTIVES
After studying this unit, you should be able to:
l estimate population characteristics (parameters) on the basis of a sample,
l get familiar with the criteria of a good estimator,
l differentiate between a point estimator and an interval estimator,
l comprehend the concept of statistical hypothesis,
l perform tests of significance of population mean and population proportion,
and
l make decisions on the basis of testing hypothesis.
15.1 INTRODUCTION
Let us suppose that we have taken a random sample from a population with a
view to knowing its characteristics, also known as its parameters. We are then
confronted with the problem of drawing inferences about the population on the
basis of the known sample drawn from it. We may look at two different
scenarios. In the first case, the population is completely unknown and we would
like to throw some light on its parameters with the help of a random sample
drawn from the population. Thus, if µ denotes the population mean, then we
intend to make a guess about it on the basis of a random sample. This is
known as estimation. For example, one may be interested to know the average
income of people living in the city of Delhi or the average life in burning hours
of a fluorescent tube light produced by ‘Indian Electrical’ or proportion of
people suffering from T.B. in city ‘B’ or the percentage of smokers in town
‘C’ and so on.
At this juncture, we must make a distinction between the two terms Estimator and Estimate. ‘T’ is defined to be an estimator of a parameter θ, if T estimates θ. Thus T is a statistic and its value may differ from one sample to another
sample. In other words, T may be considered as a random variable. The
probability distribution of T is known as sampling distribution of T. As already
discussed, the sample mean x is an estimator of population mean µ. The value
of the estimator, as obtained on the basis of a given sample, is known as its
estimate. Thus x is an estimator of µ, the average income of Delhi, and the
value of x i.e., Rs. 2,000/-, as obtained from the sample, is the estimate of µ.
In order to choose the best estimator among these estimators, along with ‘unbiasedness’ we introduce a second criterion, known as minimum variance.
A statistic T is defined to be a minimum variance unbiased estimator (MVUE)
of θ if T is unbiased for θ and T has minimum variance among all the
unbiased estimators of θ. We may note that sample mean ( x ) is an MVUE for
µ.
We know that x̄ = Σxi / n …(15.1)
∴ E(x̄) = E(Σxi / n)
= (1/n) [Σ E(xi)]
= (1/n) [Σ µ]   [x1, x2, … xn are taken from a population having µ as population mean]
= (1/n) . nµ
∴ E(x̄) = µ
Further, V(x̄) = V(Σxi / n)
= (1/n²) [Σ V(xi)]
= (1/n²) Σ σ²   [where σ² is population variance]
= (1/n²) . [nσ²] = σ²/n ……(15.2)
It can be proved that x̄ has the minimum variance among all the unbiased estimators of µ.
Consistency: If T is an estimator of θ, then it is obvious that T should be in
the neighbourhood of θ. T is known to be consistent for θ, if the difference
between T and θ can be made as small as we please by increasing the sample
size n sufficiently.
We can further add that T would be a consistent estimator of θ if
i) E(T) → θ, and
ii) V(T) → 0, as n → ∞.
E(p) = E(x/n) = nP/n = P …(15.3)
and V(p) = V(x/n) = V(x)/n²
= nP(1 − P)/n²
= P(1 − P)/n …(15.4)
Thus if we take a random sample of size ‘n’ from a population where the
proportion of population possessing a certain characteristic is ‘P’ and the
sample contains x units possessing that characteristic, then an estimate of
population proportion (P) is given by:
P̂ = x / n …(15.5)
n
In other words, the estimate of the population proportion is given by the
corresponding sample estimate i.e., P̂ = p …(15.6)
As V(p) = P(1 − P)/n → 0 as n → ∞,
it follows from Eq. (15.4) that p is a consistent estimator of P. We can further
establish that p is an efficient as well as a sufficient estimator of P. Thus we advocate the use of the sample proportion p to estimate the population proportion, as p satisfies all the desirable properties of an estimator.
In order to estimate the proportion of people suffering from T.B. in city B, if
we find the number of people suffering from T.B. is ‘x’ in a random sample of
size ‘n’, taken from city B, then sample estimate p = x/n would provide the
estimate of the proportion of people in that city suffering from T.B. Similarly,
the percentage of smokers as found from a random sample of people of town
C would provide the estimate of the percentage of smokers in town C.
C) Estimation of Population Variance and Standard Error: Standard error
of a statistic T, to be denoted by S.E. (T), may be defined as the standard
deviation of T as obtained from the sampling distribution of T. In order to
compute the standard error of the sample mean, it may be noted from Eq. (15.2) that
S.E.(x̄) = σ / √n for simple random sampling with replacement (SRSWR), and
S.E.(x̄) = (σ / √n) √((N − n)/(N − 1)) for simple random sampling without replacement (SRSWOR),
where σ is the population standard deviation (S.D.), n is the sample size, N is the population size and the factor √((N − n)/(N − 1)) is known as the finite population corrector (f.p.c.) or finite population multiplier (f.p.m.), which may be ignored for a large population.
In order to find S.E., it is necessary to estimate σ² or σ in case it is unknown. If x1, x2, …, xn denote n sample observations drawn from a population with mean µ and variance σ², then the sample variance:
S² = Σ(xi − x̄)² / n …(15.7)
may be considered to be an estimator of σ².
Since E(xi) = µ and V(xi) = E(xi − µ)² = σ² …(15.8)
We have
Σ(xi − x̄)² = Σ[(xi − µ) − (x̄ − µ)]²
= Σ(xi − µ)² − 2(x̄ − µ) . n(x̄ − µ) + n(x̄ − µ)²
[since Σ(xi − µ) = Σxi − Σµ = nx̄ − nµ = n(x̄ − µ)]
= Σ(xi − µ)² − 2n(x̄ − µ)² + n(x̄ − µ)²
= Σ(xi − µ)² − n(x̄ − µ)² …(15.9)
And E(x̄ − µ)² = V(x̄) = σ²/n …(15.10)
∴ E[Σ(xi − x̄)²] = Σ E(xi − µ)² − n . E(x̄ − µ)²
= Σσ² − n . (σ²/n) = nσ² − σ² = (n − 1)σ²
∴ E(S²) = ((n − 1)/n) σ² ≠ σ² …(15.11)
As E(S²) = ((n − 1)/n) σ²
∴ E[(n/(n − 1)) S²] = σ² …(15.12)
Thus (n/(n − 1)) S² = Σ(xi − x̄)²/(n − 1) = (S′)² is an unbiased estimator of σ² …(15.13)
Tests of Hypothesis–I
So, we use (S′)² = Σ(xi − x̄)²/(n − 1) as an estimator of σ², and
S′ = √(Σ(xi − x̄)²/(n − 1)) as an estimator of σ.
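A seeded simulation can illustrate why the divisor n − 1 is preferred; the parameter values (σ² = 4, n = 5) are arbitrary demonstration choices:

```python
import random

# Dividing by n underestimates σ² on average, E(S²) = ((n-1)/n)σ²,
# while dividing by n - 1 is unbiased.
random.seed(7)
n, trials = 5, 20000
biased_total, unbiased_total = 0.0, 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]   # population variance σ² = 4
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    biased_total += ss / n
    unbiased_total += ss / (n - 1)
print(biased_total / trials)     # near ((n-1)/n)σ² = 3.2
print(unbiased_total / trials)   # near σ² = 4.0
```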
An estimate of S.E.(x̄) is given by:
Ŝ.E.(x̄) = S′ / √n for SRSWR
= (S′ / √n) √((N − n)/(N − 1)) for SRSWOR ……(15.14)
From (15.4), it follows that V(p) = P(1 − P)/n, so that
S.E.(p) = √(P(1 − P)/n) for SRSWR
= √(P(1 − P)/n) . √((N − n)/(N − 1)) for SRSWOR ……(15.15)
An estimate of the standard error of the sample proportion is given by:
Ŝ.E.(p) = √(p(1 − p)/n) for SRSWR
= √(p(1 − p)/n) . √((N − n)/(N − 1)) for SRSWOR ……(15.16)
Let us consider the following illustrations to estimate variance from sample and
also estimate the standard error.
Illustration 1
A sample of 32 fluorescent lights taken from Indian Electricals was tested for
the lives of the lights in burning hours. The data are presented below:
Solution: Ŝ.E.(x̄) = s′ / √n
where, s′ = √(Σ(xi − x̄)²/(n − 1)) = √((Σxi² − n x̄²)/(n − 1))
and x̄ = Σxi / n, n = sample size.
We ignore the f.p.c. as the population of lights is very large.
ū = Σui / 32 = −490 / 32 = −15.3125
As ui = xi − 5000, ∴ ū = x̄ − 5000, so that x̄ = 5000 + ū = 5000 − 15.3125 = 4984.69 ≈ 4985 hours.
(s′)² = (Σui² − n ū²)/(n − 1) [the shift ui = xi − 5000 does not change the variance]
= (237242 − 7503.1248)/31 = 7410.9316
∴ s′ = 86.0868
Hence Ŝ.E.(x̄) = s′ / √n = 86.0868 / √32 = 15.2183
n 32
So the estimate of the average life of lights manufactured by Indian Electricals is 4985 hours, the estimate of the population variance is 7410.9316 (hours)² and the standard error is 15.2183 hours.
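The computation can be reproduced from the summary statistics. Note that Σui² is taken here as 237242, the value consistent with the printed variance of 7410.93:

```python
from math import sqrt

# Summary statistics for u_i = x_i - 5000 over the 32 lights.
n, sum_u, sum_u2 = 32, -490.0, 237242.0
u_bar = sum_u / n                          # -15.3125
x_bar = 5000 + u_bar                       # estimate of average life (hours)
s2 = (sum_u2 - n * u_bar**2) / (n - 1)     # unbiased variance estimate (s')²
se = sqrt(s2) / sqrt(n)                    # estimated S.E. of the sample mean
print(round(x_bar, 2), round(s2, 2), round(se, 4))   # 4984.69, ≈ 7410.93, ≈ 15.218
```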
Illustration 2
Out of 350 people interviewed at random in a city, 70 were found to be smokers. Estimate the proportion of smokers in the city and the standard error of the estimate.
Solution: We have p = x / n = 70 / 350 = 0.2
Hence the estimate of the proportion of smokers in the city is 0.2 or 20%. Further,
Ŝ.E.(p) = √(p(1 − p)/n) = √(0.2 × (1 − 0.2)/350) = 0.0214
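A one-line check of this estimate (illustrative sketch):

```python
from math import sqrt

x, n = 70, 350                 # smokers found in the sample
p = x / n                      # sample proportion
se = sqrt(p * (1 - p) / n)     # estimated S.E. of p
print(p, round(se, 4))         # 0.2 0.0214
```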
Self Assessment Exercise A
1) State, with reasons, whether the following statements are true or false.
..............................................................................................................
.............................................................................................................
3) In choosing between sample mean and sample median – which one would you
prefer?
..............................................................................................................
.............................................................................................................
4) The monthly earnings of 20 families, obtained from a random sample from a
village in West Bengal are given below:
Find an estimate of the average monthly earnings of the village. Also obtain an
estimate of the S.E. of the sample estimate.
..............................................................................................................
.............................................................................................................
..............................................................................................................
.............................................................................................................
..............................................................................................................
5) In a sample of 900 people, 429 people are found to be consumers of tea. Estimate
the proportion of consumers of tea in the population. Also find the corresponding
standard error.
.............................................................................................................
..............................................................................................................
.............................................................................................................
15.3 INTERVAL ESTIMATION
Regarding the estimation of the average income of the people of Delhi city, one
may argue that it would be better to provide an interval which is likely to
contain the population mean. Thus, instead of saying the estimate of the
average income of Delhi is Rs. 2,000/-, we may suggest that, in all probability,
the estimate of the average income of Delhi would be from Rs. 1,900/- to Rs.
2,100/-. In the second example of estimating the average life of lights produced
by Indian Electricals where the estimate came out to be 4985 hours, the point
estimation may be a bone of contention between the producer and the potential
buyer. The buyer may think that the average life is rather less than 4985 hours.
An interval estimation of the life of lights might satisfy both the parties. Figure
15.1 shows some intervals for θ on the basis of different samples of the same
size from a population characterized by a parameter θ. A few intervals do not
contain θ.
Fig. 15.1: Confidence Intervals for θ
P (t1 > θ) = α1 and P (t2 < θ) = α2,
where α1 and α2 are two small positive numbers. Combining these two conditions, we may write:
P [t1 ≤ θ ≤ t2] = 1 − α …(15.17)
where α = α1 + α2.
One may like to know why the term ‘confidence’ comes into the picture. If we choose α1 and α2 in such a way that α = 0.01, then the probability that θ would
belong to the random interval [t1, t2] is 0.99. In other words, one feels 99%
confident that [t1, t2] would contain the unknown parameter θ. Similarly if we
select α = 0.05, then P [t1 ≤ θ ≤ t2] = 0.95, thereby implying that we are 95%
confident that θ lies between t1 and t2. Eq. (15.17) suggests that as α decreases,
(1–α) increases and the probability that the confidence interval [t1, t2] would
include the parameter θ also increases. Hence our endeavour would be to
reduce ‘α’ and thereby increase the confidence co-efficient (1–α).
Referring to the estimation of the average life of lights (θ), if we observe that
θ lies between 4935 hours and 5035 hours with probability 0.98, then it would
imply that if repeated samples of a fixed size (say n = 32) are taken from the
population of lights, as manufactured by Indian Electricals, then in 98 per cent
of cases, the interval [4935 hours, 5035 hours] would contain θ, the average life
of lights in the population while in 2 per cent of cases, the interval would not
contain θ. In this case, the confidence interval for θ is [4935 hours, 5035
hours]. Lower Confidence Limit of θ is 4935 hours, Upper Confidence Limit of
θ is 5035 hours, and the Confidence Co-efficient is 98 per cent.
Our next task would be to select the basis for estimating confidence interval.
Let us assume that we have taken a random sample of size ‘n’ from a normal
population characterized by the two parameters µ and σ, the population mean
and standard deviation respectively. Thus, in the case of estimating a
Confidence Interval for average income of people dwelling in Delhi city, we
assume that the distribution of income is normal and we have taken a random
sample from the city. In another example concerning average life of fluorescent
lights as produced by Indian Electricals, we assume that the life of a
fluorescent light is normally distributed and we have taken a random sample
from the population of fluorescent lights manufactured by Indian Electricals.
Figure 15.2 shows percentage of area under Normal Curve. It can be shown
that if a random sample of size ‘n’ is drawn from a normal population with
mean ‘µ’ and variance σ2, then ( x ) , the sample mean also follows normal
distribution with ‘µ’ as mean and σ2/n as variance. Further as we have
observed in Section 15.2.
S.E.(x̄) = σ / √n
From the properties of normal distribution, it follows that the interval :
[µ − S.E. ( x ), µ + S.E. ( x )] covers 68.27% area.
The interval [µ − 2 S.E.(x̄), µ + 2 S.E.(x̄)] covers 95.45% area, and the interval [µ − 3 S.E.(x̄), µ + 3 S.E.(x̄)] covers 99.73% area.
[Fig. 15.2: Percentage of area under the normal curve: µ ± 1σ covers 68.27%, µ ± 2σ covers 95.45% and µ ± 3σ covers 99.73% of the area]
P [−u ≤ (x̄ − µ)/S.E.(x̄) ≤ u] = 1 − α
or, P [−u ≤ Z ≤ u] = 1 − α   [where Z = (x̄ − µ)/S.E.(x̄) is a standard normal variable]
or, 2Φ(u) = 2 − α
For α = 0.10, Φ(u) = 0.95, so that u = 1.645.
Thus the 100(1 − α)%, or 100(1 − 0.1)% or 90%, confidence interval for the population mean µ is given by:
[x̄ − 1.645 σ/√n, x̄ + 1.645 σ/√n]
Putting α = 0.05, 0.02 and 0.01 respectively in (15.19) and proceeding in a similar manner, we get:
95% Confidence Interval for µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] …(15.20)
98% Confidence Interval for µ = [x̄ − 2.33 σ/√n, x̄ + 2.33 σ/√n] …(15.21)
and 99% Confidence Interval for µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n] …(15.22)
Theoretically we may take any Confidence interval by choosing ‘u’ accordingly.
However in a majority of cases, we prefer 95% or 99% Confidence Interval.
These are shown in Figure 15.3 and Figure 15.4 below.
Fig. 15.3: 95% Confidence Interval for Population Mean [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]
Fig. 15.4: 99% Confidence Interval for Population Mean [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]
Thus, the 95% confidence interval for µ is [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n].
If the assumption of normality does not hold but ‘n’ is greater than 30, the
above 95% confidence interval still may be used for estimating population mean.
In case σ is unknown, it may be replaced by the corresponding unbiased
estimate of σ, namely S|, so long as ‘n’ exceeds 30. However, we may face a
difficult situation in case σ is unknown and ‘n’ does not exceed 30. This
problem has been discussed in the next unit (Unit-16). Similarly, 99%
confidence interval for µ is given by:
[x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]
In case σ is unknown and n > 30, the 99% confidence interval for µ is:
[x̄ − 2.58 S′/√n, x̄ + 2.58 S′/√n] …(15.23)
Interval Estimation of Unknown Population Proportion
It can be assumed that when n is large and neither ‘p’ nor (1 − p) is small (one may specify np ≥ 5 and n(1 − p) ≥ 5), the sample proportion p is asymptotically normal with mean P and S.E.(p) = √(P(1 − P)/n), P being the unknown population proportion in which we are interested. The estimate of S.E.(p) is given by:
Ŝ.E.(p) = √(p(1 − p)/n)
Hence, the 95% confidence interval for P is given by:
[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)] …(15.24)
and the 99% confidence interval for P by:
[p − 2.58 √(p(1 − p)/n), p + 2.58 √(p(1 − p)/n)] …(15.25)
Illustration 3
In a random sample of 1,000 families from the city of Delhi, it was found that the average income as obtained from the sample is Rs. 2,000/-. It is further known that the population S.D. is Rs. 258. Find the 95% as well as the 99% confidence interval for the population mean.
Solution: The 95% confidence interval for µ is [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n] and the 99% confidence interval is [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n].
∴ x̄ − 1.96 σ/√n = Rs. 2000 − 1.96 × 258/√1000 = Rs. 1984.01
x̄ + 1.96 σ/√n = Rs. 2000 + 1.96 × 258/√1000 = Rs. 2015.99
x̄ − 2.58 σ/√n = Rs. 2000 − 2.58 × 258/√1000 = Rs. 1979
and x̄ + 2.58 σ/√n = Rs. 2000 + 2.58 × 258/√1000 = Rs. 2021
Hence we have
95% confidence interval to average income for the people of Delhi = [Rs.
1984.01 to Rs. 2015.99] and 99% confidence interval to average income for the
people of Delhi = [Rs. 1979 to Rs. 2021].
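Both intervals can be reproduced in code (illustrative sketch):

```python
from math import sqrt

x_bar, sigma, n = 2000.0, 258.0, 1000      # Rs.; population S.D. known
se = sigma / sqrt(n)
ci95 = (x_bar - 1.96 * se, x_bar + 1.96 * se)
ci99 = (x_bar - 2.58 * se, x_bar + 2.58 * se)
print([round(v, 2) for v in ci95])         # [1984.01, 2015.99]
print([round(v, 2) for v in ci99])         # ≈ [1978.95, 2021.05], i.e. Rs. 1979 to Rs. 2021
```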
Illustration 4
Calculate the 95% and 99% confidence limits to the average life of fluorescent
lights produced by Indian Electricals.
Solution: The 95% confidence interval for µ = [x̄ − 1.96 S′/√n, x̄ + 1.96 S′/√n]
Similarly, the 99% confidence interval for µ = [x̄ − 2.58 S′/√n, x̄ + 2.58 S′/√n]
Where, x̄ = sample mean = 4985 hours, n = sample size = 32, and S′ = 86.0868 hours (as obtained in Illustration 1).
∴ x̄ − 1.96 S′/√n = 4985 − 1.96 × 86.0868/√32 = 4955.17 hours
x̄ + 1.96 S′/√n = 4985 + 1.96 × 86.0868/√32 = 5014.83 hours
x̄ − 2.58 S′/√n = 4985 − 2.58 × 86.0868/√32 = 4945.74 hours
and x̄ + 2.58 S′/√n = 4985 + 2.58 × 86.0868/√32 = 5024.26 hours
Illustration 5
While interviewing 350 people in a city, the number of smokers was found to be 70. Obtain the 99% lower confidence limit and the corresponding upper confidence limit to the proportion of smokers in the city.
Solution: The 99% Lower Confidence Limit to P is:
p − 2.58 √(p(1 − p)/n)
and the 99% Upper Confidence Limit to P is:
p + 2.58 √(p(1 − p)/n)
provided np ≥ 5 and n(1 − p) ≥ 5.
∴ p = x/n = 70/350 = 0.2
As np = 350 × 0.2 = 70 and n (1–p) = 350 × 0.8 = 280 are rather large, we
can apply the formula for 99% Confidence Limit as mentioned already.
∴ 99% Lower Confidence Limit to P is:
0.2 − 2.58 × √(0.2 × (1 − 0.2)/350) = 0.2 − 2.58 × 0.0214 = 0.2 − 0.0552 = 0.1448
and 99% Upper Confidence Limit to P is:
0.2 + 2.58 × √(0.2 × (1 − 0.2)/350) = 0.2 + 0.0552 = 0.2552
Hence the 99% Lower Confidence Limit and 99% Upper Confidence Limit for the proportion of smokers in the city are 0.1448 and 0.2552 respectively.
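A quick check of the limits (the 99% multiplier is 2.58, so the limits sit about 2.58 standard errors either side of p):

```python
from math import sqrt

p, n = 0.2, 350
se = sqrt(p * (1 - p) / n)               # ≈ 0.0214
lower, upper = p - 2.58 * se, p + 2.58 * se
print(round(lower, 4), round(upper, 4))  # 0.1448 0.2552
```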
Illustration 6
In a random sample of 19586 people from a town, 2358 people were found to
be suffering from T.B. With 95% Confidence as well as 98% Confidence, find
the limits between which the percentage of the population of the town suffering
from T.B. lies.
Solution: Let x be the number of people suffering from T.B. in the sample and
‘n’ as the number of people who were examined. Then the proportion of
people suffering from T.B. in the sample is given by:
p = x/n = 2358/19586 = 0.1204
As np = x = 2358 and n (1–p) = n–np = n–x
= 19586–2358 = 17228
are both very large numbers, we can apply the formula for finding Confidence
Interval as mentioned in the previous section. Thus 95% Confidence Interval to
P, the proportion of the population of the town suffering from T.B., is given by:
[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)] = [0.1158, 0.1249]
Similarly, the 98% Confidence Interval for P is given by:
[p − 2.33 √(p(1 − p)/n), p + 2.33 √(p(1 − p)/n)] = [0.1150, 0.1258]
Thereby, we can say with 95% confidence that the percentage of the population in the town suffering from T.B. lies between 11.58 and 12.49, and with 98% confidence that it lies between 11.50 and 12.58.
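The two intervals can be recomputed directly (illustrative sketch):

```python
from math import sqrt

x, n = 2358, 19586
p = x / n                                  # ≈ 0.1204
se = sqrt(p * (1 - p) / n)
ci95 = (p - 1.96 * se, p + 1.96 * se)
ci98 = (p - 2.33 * se, p + 2.33 * se)
print([round(v, 4) for v in ci95])
print([round(v, 4) for v in ci98])         # close to [0.1150, 0.1258]
```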
Illustration 7
A famous shoe company produces 80,000 pairs of shoes daily. From a sample of 800 pairs, 3% are found to be of poor quality. Find the limits for the number of substandard pairs of shoes that can be expected when the Confidence Level is 0.99.
Solution: Ŝ.E.(p) = √(p(1 − p)/n) = √(0.03 × (1 − 0.03)/800) = 0.0060
∴ The 99% confidence limits to P are 0.03 ± 2.58 × 0.0060, i.e., 0.0145 and 0.0455. Hence, the limits for the expected number of substandard pairs in the daily production of 80,000 pairs are 80,000 × 0.0145 = 1,160 and 80,000 × 0.0455 = 3,640 approximately.
Self Assessment Exercise B
1) State, with reasons, whether the following statements are true or false.
a) Confidence Interval provides a range of values that may not contain the
parameter.
b) Confidence Interval is a function of Confidence Co-efficient.
c) 95% Confidence Interval for population mean is x ± 1.96 S.E. ( x ) .
d) While computing Confidence Interval for population mean, if the population
S.D. is unknown, we can always replace it by the corresponding sample S.D.
e) 99% Upper Confidence Limit for population proportion is p + 1.96 √(p(1 − p)/n).
f) Confidence co-efficient does not contain Lower Confidence Limit and Upper
Confidence Limit.
g) If np ≥ 5 and np(1 − p) ≥ 5, one may apply the formula p ± zα √(p(1 − p)/n) for
computing Confidence Interval for population proportion.
h) The interval µ ± 3 S.E. ( x ) covers 96% area of the normal curve.
2) Differentiate between Point Estimation and Interval Estimation.
...............................................................................................................
..............................................................................................................
4. Out of 25,000 customer’s ledger accounts, a sample of 800 accounts was taken
to test the accuracy of posting and balancing and 50 mistakes were found.
Assign limits within which the number of wrong postings can be expected with
99% confidence.
...............................................................................................................
...............................................................................................................
...............................................................................................................
5. A sample of 20 items is drawn at random from a normal population comprising
200 items and having standard deviation as 10. If the sample mean is 40,
obtain 95% Interval Estimate of the population mean.
...............................................................................................................
...............................................................................................................
...............................................................................................................
6) A new variety of potato grown on 400 plots provided a mean yield of 980
quintals per acre with a S.D. of 15.34 quintals per acre. Find 99% Confidence
Limits for the mean yield in the population.
................................................................................................................
................................................................................................................
................................................................................................................
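Answers to such numerical exercises can be verified with a short script; for instance, exercise 6 above (the sample of 400 plots is large, so the 99% limits use z = 2.58 as in the text):

```python
import math

def mean_ci(xbar, s, n, z):
    """Large-sample confidence interval for a population mean."""
    se = s / math.sqrt(n)        # S.E.(x̄) = s/√n
    return xbar - z * se, xbar + z * se

# exercise 6: 99% limits for the mean yield (quintals per acre)
lo, hi = mean_ci(980, 15.34, 400, 2.58)
print(round(lo, 2), round(hi, 2))   # → 978.02 981.98
```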
In order to answer this question, let us familiarise ourselves with a few terms
associated with the problem. A statement like ‘The average income of the
people belonging to the city of Delhi is Rs. 3,000 per month’ is known as a
null hypothesis. Thus, a null hypothesis may be described as an assumption or
a statement regarding a parameter (population mean, ‘µ’, in this case) or about
the form of a population. The term ‘null’ is used as we test the hypothesis on
the assumption that there is no difference or, to be more precise, no significant
difference between the value of a parameter and that of an estimator as
obtained from a random sample taken from the population. A hypothesis may
be simple or composite.
H0 : µ = 3,000
i.e., the null hypothesis is that the population mean is Rs. 3,000. Generally, we
write
H0 : µ = µ0
i.e., the null hypothesis is that the population mean µ equals µ0, where µ0 may be
any value as specified in a given situation.
In the present problem, one may argue that since many people of Delhi city are
living in the slums and even on the pavements, the average income should be
less than Rs. 3000. So one alternative hypothesis may be :
H1 : µ < 3,000, i.e., the average income is less than Rs. 3,000, or symbolically,
H1 : µ < µ0, i.e., the population mean (µ) is less than µ0.
Again one may feel that since there are many multistoried buildings and many
new models of vehicles run through the streets of the city, the average income
must be more than Rs. 3,000. So another alternative hypothesis may be :
H2 : µ > 3000 i.e., the average income is more than Rs. 3,000.
Lastly, another group of people may opine that the average income is
significantly different from µ0. So the third alternative could be:
H3 : µ ≠ µ0, i.e., the average income is not Rs. 3,000.
Now while testing H0 we are liable to commit two types of errors. In the first
case, it may be that H0 is true but x falls on ω and as such, we reject H0.
This is known as type-I error or error of the first kind. Thus type-I error is
committed in rejecting a null hypothesis which is, in fact, true. Secondly, it may
be that H0 is false but x falls on A and hence we accept H0. This is known as
type-II error or error of the second kind. So type-II error may be described as
the error committed in accepting a null hypothesis which is, in fact, false. The
two kinds of errors are shown in Table 15.3.
It is obvious that we should take into account both types of errors and must try
to reduce them. Since committing these two types of errors may be regarded as
random events, we may modify our earlier statement and suggest that an
appropriate test of hypothesis should aim at reducing the probabilities of both
types of errors. Let 'α' (read as 'alpha') denote the probability of type-I error
and 'β' (read as 'beta') the probability of type-II error. Thus, by definition, we
have:
α = the probability of the sample point falling on the critical region when H0 is
true, i.e., the value of θ is θ0 = P (x ∈ ω | θ0) …(15.26)
and β = the probability of the sample point falling on the acceptance region when
H1 is true, i.e., the value of θ is θ1 = P (x ∈ A | θ1) …(15.27)
Surely, our objective would be to reduce both type-I and type-II errors. But
since we have taken recourse to sampling, it is not possible to reduce both
types of errors simultaneously for a fixed sample size. As we try to reduce ‘α’,
β increases and a reduction in the value of β results in an increase in the value
of 'α'. Thus, we fix α, the probability of type-I error, at a given level (say, 5
per cent or 1 per cent) and, subject to that fixation, we try to reduce β, the
probability of type-II error. 'α' is also known as the size of the critical region. It
is further known as the level of significance, as 'α' constitutes the basis for
treating the difference (θ − θ0) as significant. The selection of the level of
significance 'α' depends on the experimenter.
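The trade-off between α and β for a fixed sample size can be seen in a small simulation (a hypothetical setup: normal population with known σ = 1, n = 25, right-tailed Z-test with α fixed at 5%, and θ1 = 0.5 as the true mean under H1):

```python
import random

# Monte Carlo sketch of the two error probabilities for a right-tailed Z-test
# of H0: µ = 0 against H1: µ > 0. Hypothetical setup: σ = 1 known, n = 25,
# α fixed at 5% (critical value 1.645).
rng = random.Random(42)
n, sigma, z_crit, trials = 25, 1.0, 1.645, 20_000

def rejects(mu):
    """Draw one sample of size n from N(mu, sigma) and apply the test."""
    xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    z0 = (xbar - 0) / (sigma / n ** 0.5)
    return z0 >= z_crit

alpha_hat = sum(rejects(0.0) for _ in range(trials)) / trials     # type-I rate (H0 true)
beta_hat = 1 - sum(rejects(0.5) for _ in range(trials)) / trials  # type-II rate (H1 true)
print(alpha_hat, beta_hat)   # alpha_hat should land near 0.05, beta_hat near 0.2
```

Raising the critical value lowers alpha_hat but raises beta_hat, which is exactly the trade-off described above.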
The power of the test is given by:
1 − β = 1 − P (x ∈ A | θ = θ1)
Fig. 15.5: Power Curve of a Test (P(θ) plotted against θ, rising from 0 towards 1.0)
i.e., the probability that u0 would exceed uα/2 or u0 is less than u(1−α/2) is α.
In order to test H0 : θ = θ0 against H1 : θ ≠ θ0, if we select a low value of α,
say α = 0.01, then (15.32) suggests that the probability that u0 is greater than uα/2
or u0 is less than u(1−α/2) is 0.01, which is pretty low. So, on the basis of a random
sample drawn from the population, if it is found that u0 is greater than uα/2 or u0
is less than u(1−α/2), then we have rather strong evidence that H0 is not true.
Then we reject H0 : θ = θ0 and accept the alternative hypothesis H1 : θ ≠ θ0.
As shown in the following Figure 15.6, here the critical region lies on both tails
of the probability distribution of u.
Fig. 15.6: Critical region of a two-tailed Test (50α% area in each tail, below u(1−α/2) and above uα/2)
If the sample point x falls on one of the two tails, we reject H0 and accept H1
: θ ≠ θ0. The statistical test for H0 : θ = θ0 against H1 : θ ≠ θ0 is known as
both-sided test or two-tailed test as the critical region, ‘ω’ lies on both sides of
the probability curve, i.e., on the two tails of the curve. The critical region is
ω : u0 ≥ uα/2 or u0 ≤ u(1−α/2). It is obvious that a two-tailed test is
appropriate when there are reasons to believe that ‘u’ differs from θ0
significantly on both the left side and the right side, i.e., the value of the test
statistic ‘u’ as obtained from the sample is significantly either greater than or
less than the hypothetical value.
For testing the null hypothesis H0 : µ = 3000, i.e., the average income of the
people of Delhi city is Rs. 3000, one may think that the alternative hypothesis
would be H1 : µ ≠ 3000 i.e., the average income is not Rs. 3000 and as such,
we may advocate the application of a two-tailed test. Similarly, for testing the
null hypothesis that the average life of lights produced by Indian Electricals is
5,000 hours against the alternative hypothesis that the average life is not 5,000
hours, i.e., for testing H0 : µ = 5,000 against H1 : µ ≠ 5,000, we may prescribe
a two-tailed test. In the problem concerning the health of city B, we may be
interested in testing whether 20% of the population of city B really suffers from
T.B. i.e., testing H0 : P = 0.2 against H1 : P ≠ 0.2 and again a two-tailed test
is necessary and lastly regarding the harms of smoking, we may like to test H0
: P = 0.3 against H1 : P ≠ 0.3.
Right-tailed Tests
We may think of testing a null hypothesis against another pair of alternatives. If
we wish to test H0 : θ = θ0 against H1 : θ > θ0, then from (15.30) we have
P (u0 ≥ uα) = α. This suggests that a low value of α, say α = 0.01, implies
that the probability that u0 exceeds uα is 0.01. So the probability that u0
exceeds uα is rather small. Thus on the basis of a random sample drawn from
this population if it is found that u0 is greater than uα, then we have enough
evidence to suggest that H0 is not true. Then we reject H0 and accept H1. This
is exhibited in Figure 15.7 as shown below:
Fig. 15.7: Critical region of a right-tailed Test (100α% area beyond uα)
As shown in figure 15.7, the critical region lies on the right tail of the curve.
This is a one-sided test and as the critical region lies on the right tail of the
curve, it is known as right-tailed test or upper-tailed test. We apply a right-
tailed test when there is evidence to suggest that the value of the statistic u is
significantly greater than the hypothetical value θ0. In case of testing about the
average income of the citizens of Delhi, if one has prior information to suggest
that the average income of Delhi is more than Rs. 3,000, then we would like to
test H0 : µ = 3,000 against H1 : µ > 3,000 and we select the right-tailed test.
In a similar manner, for testing the hypothesis that the average life of lights
produced by Indian Electricals is more than 5,000 hours, or for testing the
hypothesis that more than 20 per cent suffer from T.B. in city B, or for testing
the hypothesis that the percentage of smokers in town C is more than 30, we
apply the right-tailed test.
Left-tailed Test
Lastly, we may be interested to test H0 : θ = θ0 against H2 : θ < θ0. From
(15.31), we have P (u0 ≤ u(1−α)) = α. Choosing α = 0.01, this implies that the
probability that u0 would be less than u(1−α) is 0.01, which is surely very low. So, if
on the basis of a random sample taken from the population, it is found that u0
is less than u(1−α), then we have very serious doubts about the validity of H0. In
this case, we reject H0 and accept H2 : θ < θ0. This is reflected in Figure 15.8
shown below.
shown below.
Fig. 15.8: Critical Region of a Left-tailed Test (100α% area below u(1−α))
2) Choose the appropriate test statistic ‘u’ and sampling distribution of ‘u’ under
H0. In most cases ‘u’ follows a standard normal distribution under H0 and
hence Z-test can be recommended in such a case.
3) Select α, the level of significance of the test if it is not provided in the given
problem. In most cases, we choose α = 0.05 and α = 0.01 which are known as
5% level of significance and 1% level of significance.
7) Draw your own conclusion in very simple language which should be understood
even by a layman.
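The steps above can be collected into one routine (a sketch; the income figures in the example call are hypothetical):

```python
import math

def z_test_mean(xbar, mu0, sigma, n, tail="two", alpha=0.05):
    """Z-test for a population mean, following the steps of the procedure."""
    z0 = math.sqrt(n) * (xbar - mu0) / sigma            # step 2: test statistic under H0
    crit = {("two", 0.05): 1.96, ("two", 0.01): 2.58,   # step 3: tabulated critical values
            ("right", 0.05): 1.645, ("right", 0.01): 2.33,
            ("left", 0.05): -1.645, ("left", 0.01): -2.33}[(tail, alpha)]
    if tail == "two":
        rejected = abs(z0) >= crit
    elif tail == "right":
        rejected = z0 >= crit
    else:
        rejected = z0 <= crit
    # step 7: a plain-language conclusion
    return z0, ("reject H0" if rejected else "accept H0")

# hypothetical figures: sample of 100 incomes averaging Rs. 3,120, σ = 800
z0, verdict = z_test_mean(3120, 3000, 800, 100, tail="right", alpha=0.05)
print(round(z0, 2), verdict)   # → 1.5 accept H0
```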
H : µ ≠ µ0 or,
H1 : µ > µ0 or,
H2 : µ < µ0.
As we have discussed in Section 15.2, the best statistic for the parameter µ is
x̄. It has been proved in that Section that E(x̄) = µ and
S.E.(x̄) = σ/√n
As such, the test statistic:
z = (x̄ − E(x̄))/S.E.(x̄) = (x̄ − µ)/(σ/√n)
is a standard normal variable. Under H0, i.e., assuming the null hypothesis to be
true,
Z0 = √n (x̄ − µ0)/σ
is a standard normal variable. As such, the test is known as standard normal
variate test or standard normal deviate test or Z-test. In order to find the
critical region for testing H0 against H, from (15.28) and (15.29) we find that:
P (u0 ≥ uα/2) = α/2
and P (u0 ≤ u(1−α/2)) = α/2
If we denote the standard normal variate by Z, and the upper α/2-point of the
standard normal distribution by Zα/2, then the lower α/2-point is Z(1−α/2) = −Zα/2
(as the standard normal distribution is symmetrical about 0), and the above two
equations reduce to:
P (Z0 ≥ Zα/2) = α/2 ……(15.33)
and P (Z0 ≤ −Zα/2) = α/2 ……(15.34)
From (15.33), we have:
1 − P (Z0 < Zα/2) = α/2
or 1 − φ (Zα/2) = α/2
or φ (Zα/2) = 1 − α/2
ω : |Z0| ≥ 1.96
where Z0 = √n (x̄ − µ0)/σ ……(15.36)
Proceeding in a similar manner, the critical region for the two-tailed test at 1%
level of significance is given by:
ω : |Z0| ≥ 2.58 ……(15.37)
or, 1 − φ (Zα) = α
or, φ (Zα) = 1 − α ……(15.38)
Putting α = 0.05 in (15.38), we get φ (Zα) = 0.95, i.e., Zα = 1.645.
Hence the critical region for this right-tailed test at 5% level of significance is:
ω : Z0 ≥ 1.645
Similarly, the critical region at 1% level of significance would be:
ω : Z0 ≥ 2.33
Finally if we make up our minds to test H0 against H2 : µ < µ0, then from
(15.35), we get
P (Z0 ≤ –Zα) = α
or, φ (− Z α ) = α
or, 1− φ ( Z α ) = α
or, φ (Z α ) = 1 − α
Figure 15.9: Two-tailed Critical Region for Testing Population Mean at 5% Level of Significance (acceptance region from −1.96 to 1.96 around µ0 covering 95% area; 2.5% area in each tail)
Figure 15.10: Two-tailed Critical Region for Testing Population Mean at 1% Level of Significance (acceptance region from −2.58 to 2.58 covering 99% area; 0.5% area in each tail)
Figure 15.11: Right-tailed Critical Region for Testing Population Mean at 5% Level of Significance (acceptance region covering 95% area; critical region ω : Z0 ≥ 1.645 with 5% area)
Figure 15.12: Right-tailed Critical Region for Testing Population Mean at 1% Level of Significance (acceptance region covering 99% area; critical region ω : Z0 ≥ 2.33 with 1% area)
Figure 15.13: Left-tailed Critical Region for Testing Population Mean at 5% Level of Significance (critical region ω : Z0 ≤ −1.645 with 5% area)
Figure 15.14: Left-tailed Critical Region for Testing Population Mean at 1% Level of Significance (critical region ω : Z0 ≤ −2.33 with 1% area)
When σ is unknown, we can replace it by
s′ = √(Σ(Xi − x̄)²/(n − 1))
in the test statistic used in Case-I, provided we have a sufficiently large sample
(as discussed earlier, n should exceed 30). Thus we consider
Z0 = √n (x̄ − µ0)/s′
The critical regions remain as before: for the two-tailed alternative,
ω : |Z0| ≥ 1.96 when the level of significance is 5%, and
ω : |Z0| ≥ 2.58 when the level of significance is 1%;
for the right-sided alternative H1 : µ > µ0,
ω : Z0 ≥ 1.645 when the level of significance is 5%,
and ω : Z0 ≥ 2.33 when the level of significance is 1%.
Lastly, the critical region for the left-sided alternative H2 : µ < µ0 would be
provided by:
ω : Z0 ≤ −1.645
For example, if we want to test whether a fresh coin just coming out of a
mint is unbiased, then we are to test H0 : P = 0.5. Similarly, the problem of
testing whether 20% of the population of city B is suffering from T.B. amounts to
testing H0 : P = 0.2, and testing whether 30% of the population of a town are
smokers is equivalent to testing H0 : P = 0.3.
Hence, it follows that the sample proportion p = x/n follows normal distribution
with mean P0 and S.D. √(P0(1 − P0)/n) under H0.
Thus Z0 = (p − P0)/√(P0(1 − P0)/n) = √n (p − P0)/√(P0(1 − P0))
is a standard normal variate and as such we can apply the Z-test for attributes.
Illustration 8
we use Z0 = √n (x̄ − 1900)/σ
The critical region for this right-sided alternative is given by:
ω : Z0 ≥ 1.645 at 5% level of significance, and ω : Z0 ≥ 2.33 at 1% level of significance.
As per the given data,
Z0 = √50 (1926 − 1900)/110 = 1.671
Thus, we reject H0 at 5% level of significance but accept the null hypothesis at
1% level of significance. On the basis of the given data, we thus conclude that
the manufacturer's claim is justifiable at 5% level of significance, but at 1%
level of significance we infer that the manufacturer has been unable to produce
cables with a higher breaking strength.
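The decision in Illustration 8 can be confirmed numerically (values as given in the text):

```python
import math

# Illustration 8: n = 50 cables, x̄ = 1926, µ0 = 1900, σ = 110, right-tailed Z-test
z0 = math.sqrt(50) * (1926 - 1900) / 110
print(round(z0, 3))                # → 1.671
print(z0 >= 1.645, z0 >= 2.33)     # → True False: reject at 5%, accept at 1%
```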
Illustration 9
A random sample of 500 flower stems has an average length of 11 cm. Can
this be regarded as a sample from a large population with mean as 10.8 cm
and standard deviation as 2.38 cm?
Solution: Let the length of the stem be denoted by x. Assume that µ denotes
the mean length of stems in the population. The sample size 500 being very large,
we apply the Z-test for testing H0 : µ = 10.8, i.e., the population mean is 10.8 cm,
against H1 : µ ≠ 10.8, i.e., the population mean is not 10.8 cm. We compute
Z0 = √n (x̄ − 10.8)/σ
and, choosing the level of significance as 5%, we note that the critical region is:
ω : |Z0| ≥ 1.96
As per the given data, n = 500, x̄ = 11 cm, σ = 2.38 cm, so that
Z0 = √500 (11 − 10.8)/2.38 = 1.879
which does not fall in the critical region.
Thus we accept H0. We conclude that on the basis of the given data, the
sample can be regarded as taken from a large population with mean as 10.8
cm and standard deviation as 2.38 cm.
Illustration 10
623, 648, 672, 685, 692, 650, 649, 666, 638, 629
We consider Z0 = √n (x̄ − 650)/σ
and recall that the critical region at 1% level of significance (selecting α =
0.01) for this left-tailed test is given by:
ω : Z0 ≤ −2.33
Since n = 10, σ = 12.83 hours, and
x̄ = (623 + 648 + 672 + 685 + 692 + 650 + 649 + 666 + 638 + 629)/10 = 655.2 hours
∴ Z0 = √10 (655.2 − 650)/12.83 = 1.282
As this does not fall in the critical region, H0 is accepted. Thus, on the basis of
the given sample, we conclude that the manufacturer's assertion was right.
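Illustration 10 can be reproduced from the raw data (a sketch; σ = 12.83 hours as given in the text):

```python
import math

# Illustration 10 recomputed from the raw lifetimes (hours)
data = [623, 648, 672, 685, 692, 650, 649, 666, 638, 629]
n, sigma, mu0 = len(data), 12.83, 650

xbar = sum(data) / n
z0 = math.sqrt(n) * (xbar - mu0) / sigma
print(xbar)            # → 655.2
print(round(z0, 3))    # → 1.282
print(z0 < -2.33)      # → False: Z0 is outside the critical region, so accept H0
```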
Illustration 11
The heights of 12 students taken at random from St. Nicholas College, which
has 1,000 students and a standard deviation of height as 10 inches, are
recorded in inches as 65, 67, 63, 69, 71, 70, 65, 68, 63, 72, 61 and 66.
Do the data support the hypothesis that the mean height of all the students in
that college is 68.2 inches?
Solution: Letting x stand for the height of the students of St. Nicholas College,
we would like to test H0 : µ = 68.2 against H1 : µ ≠ 68.2,
where Z0 = (x̄ − 68.2)/S.E.(x̄)
In this case,
x̄ = (65 + 67 + 63 + 69 + 71 + 70 + 65 + 68 + 63 + 72 + 61 + 66)/12 = 66.67 inches
S.E.(x̄) = (10/√12) × √((1000 − 12)/(1000 − 1)) = 2.8708 inches
∴ Z0 = (66.67 − 68.2)/2.8708 = −0.533
As |Z0| = 0.533 < 1.96, the data support the hypothesis that the mean height of
all the students in the college is 68.2 inches.
Since n = 950; nP0 = 950 × 0.5 = 475; and nP0 (1–P0) = 237.5, we can apply
Z-test for proportion. Thus we compute :
Z0 = √n (p − 0.5)/√(0.5 (1 − 0.5))
and note that the critical region at 1% level of significance for this two-tailed
test is:
ω : |Z0| ≥ 2.58
As p = x/n = 500/950 = 0.5263
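The computation the text leaves implicit here can be sketched as:

```python
import math

# n = 950, x = 500, P0 = 0.5, two-tailed Z-test for a proportion at the 1% level
n, x, P0 = 950, 500, 0.5
p = x / n                                            # sample proportion ≈ 0.5263
z0 = math.sqrt(n) * (p - P0) / math.sqrt(P0 * (1 - P0))
print(round(z0, 2))        # → 1.62
print(abs(z0) >= 2.58)     # → False: accept H0 at the 1% level
```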
Illustration 13
Solution: Let ‘p’ be the sample proportion of defectives and P, the proportion
of defective parts in the whole manufacturing process. Then we are to test
If we select α = 0.05, then the critical region for this right-tailed test is :
ω : Z0 ≥ 1.645
We have, as given, p = x/n = 60/800 = 0.075
Thus, Z0 falls in the acceptance region and we accept the null hypothesis. We
conclude that, on the basis of the given information, the manufacturer's claim is
valid.
Illustration 14
A family-planning activist claims that more than 33 per cent of the families in
her town have more than one child. A random sample of 160 families from the
town reveals that 50 families have more than one child. What is your inference?
Select α = 0.01.
Solution: If ‘P’ denotes the proportion of families in the town having more
than one child, then we want to test H0 : P = 0.33 against H1 : P > 0.33.
We consider Z0 = √n (p − 0.33)/√(0.33 (1 − 0.33)) as the test statistic and note
that at 1% level of significance the critical region is ω : Z0 ≥ 2.33.
Here, p = 50/160 = 0.3125 and n = 160, so that
Z0 = √160 (0.3125 − 0.33)/√(0.33 × 0.67) = −0.47
As Z0 does not fall in the critical region, we accept H0 and conclude that the
given data do not support the activist's claim.
We have concluded our discussion by conducting tests for population mean and
population proportion under different types of alternative hypothesis.
95% Confidence Interval to µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]
99% Confidence Interval to µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]
If σ is unknown, we replace it by
s′ = √(Σ(Xi − x̄)²/(n − 1))
provided 'n' exceeds 30.
Z-test for population mean: For testing H0 : µ = µ0, the test statistic is given by
Z0 = √n (x̄ − µ0)/σ
If σ is unknown and n > 30, we replace σ by s′ in the expression for Z0.
Z-test for population proportion: For testing H0 : P = P0, the test statistic is
given by
Z0 = √n (p − P0)/√(P0 (1 − P0)), provided n is large.
Under the assumption that the null hypothesis is true, Z0 follows standard
normal distribution. At 5% level of significance, the critical region for the two-
tailed test is given by
ω : |Zo| ≥ 1.96
The critical region for the right-tailed test is
ω : Zo ≥ 1.645
and the critical region for the left-tailed test is
ω : Zo ≤ –1.645
Similarly when the level of significance is 1%, the critical region for the two-
tailed test is
ω : |Zo| ≥ 2.58
For the right-tailed test, the critical region is
ω : Zo ≥ 2.33
and that for the left-tailed test is
ω : Zo ≤ –2.33
h) Yes.
4. Yes, Z0 = – 0.447
5. No, Z0 = 4.12
6. Yes, Z0 = 1.774
3) Discuss the role of normal distribution in interval estimation and also in testing
hypothesis.
5) Discuss how far the sample proportion satisfies the desirable properties of a
good estimator.
7) Describe how you could set confidence limits to population proportion on the
basis of a large sample.
9) Describe the different steps for testing the significance of population proportion.
10) 15 Life Insurance Policies in a sample of 250 taken out of 60,000 were found to
be insured for less than Rs. 7500. How many policies can be reasonably
expected to be insured for less than Rs. 7500 in the whole lot at 99%
confidence level?
(Ans: 1278 to 5922)
12) A manufacturer of ball-point pens claims that a certain type of pen produced by
him has a mean writing life of 550 pages with a S.D. of 35 pages. A purchaser
selects 20 such pens and the mean life is found to be 539 pages. At 5% level of
significance should the purchaser reject the manufacturer’s claim ?
(Ans: Yes, Z0 = –2.30)
13) In a sample of 550 guavas from a large consignment, 50 guavas are found to be
rotten. Estimate the percentage of defective guavas and assign limits within
which 95% of the rotten guavas would lie.
[Ans: (i) 9.09%; (ii) 0.0668 to 0.1150]
14) A die is thrown 59215 times out of which six appears 9500 times. Would you
consider the die to be unbiased ?
(Ans: No, Z0 = – 4.113)
15) A sample of 50 items is taken from a normal population with mean as 5 and
standard deviation as 3. The sample mean comes out to be 4.38. Can the
sample be regarded as a truly random sample?
(Ans: No, Z = –1.532)
16) A random sample of 600 apples was taken from a large consignment of 10,000
apples and 70 of them were found to be rotten. Show that the number of rotten
apples in the consignment with 95% confidence may be expected to be from
910 to 1,424.
17) The mean life of 500 bulbs, as obtained in a random sample manufactured by a
company, was found to be 900 hours with a standard deviation of 300 hours.
Test the hypothesis that the mean life is less than 900 hours. Select α = 0.05
and 0.01.
(Ans: Yes, Z0 = –3.7268)
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
Gupta, C.B., and Vijay Gupta, 1998, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.
UNIT 16 TESTS OF HYPOTHESIS – II
STRUCTURE
16.0 Objectives
16.1 Introduction
16.2 Small Samples versus Large Samples
16.3 Student’s t-distribution
16.4 Application of t-distribution to determine Confidence Interval for
Population Mean
16.5 Application of t-distribution for Testing Hypothesis Regarding Mean
16.6 t-test for Independent Samples
16.7 t-test for Dependent Samples
16.8 Let Us Sum Up
16.9 Key Words and Symbols
16.10 Answers to Self Assessment/Exercises
16.11 Terminal Questions/Exercises
16.12 Further Reading
16.0 OBJECTIVES
After studying this unit, you should be able to:
l differentiate between exact tests i.e., small sample tests and approximate tests,
i.e., large sample tests,
l be familiar with the properties and applications of t-distribution,
l find the interval estimation for mean using t-distribution,
l have an idea about the theory required for testing hypothesis using
t-distribution,
l apply t-test for independent samples, and
l apply t-test for dependent samples.
16.1 INTRODUCTION
In the previous unit, we considered different aspects of the problems of
inferences. We further noted the limitations of standard normal test or Z-test.
As discussed in Unit 15, we cannot apply the normal distribution for estimating
confidence intervals for the population mean when the population standard
deviation is unknown and the sample size does not exceed 30, i.e., small samples.
We may further recall that, as mentioned in Unit 15, we cannot test hypotheses
concerning the population mean when the sample is small and the population standard
deviation is unspecified. In a situation like this, we use the t-distribution, which is
also known as Student's t-distribution. The t-distribution was first applied by W.S.
Gosset, who worked at the Guinness Brewery in Dublin. The workers of the
Guinness Brewery were not allowed to publish their research work. Hence
Gosset was compelled to publish his research work under the pen name
'Student', and hence the distribution is known as Student's t-distribution or simply
Student's distribution. Before we discuss the t-distribution, let us differentiate
between exact tests and approximate tests.
16.2 SMALL SAMPLES VERSUS LARGE SAMPLES
Normally a sample is considered as small if its size is 30 or less whereas, the
sample with size exceeding 30 is considered as a large sample. All the tests
under consideration may be classified into two categories namely exact tests
and approximate tests. Exact tests are those tests that are based on the exact
sampling distribution of the test statistic and no approximation is made about the
form of the parent population or the sampling distribution of the test statistic. Since
exact tests are valid for any sample size, and cost as well as labour usually
increases with an increase in sample size, we prefer to take small samples for
conducting exact tests. Hence, the exact tests are also known as small sample
tests. It may be noted that while testing for population mean on the basis of a
random sample from a normal distribution, we apply exact tests or small sample
tests provided the population standard deviation is known. This was
demonstrated in Unit 15.
For a large sample, if T is a statistic for the parameter θ0 with standard error
S.E.(T), the approximate test statistic is
Z0 = (T − θ0)/S.E.(T) or Z0 = (T − θ0)/Ŝ.E.(T)
In particular, for the population mean we take
Z = √n (x̄ − µ)/s′
where s′ = √(Σ(xi − x̄)²/(n − 1))
which, for a large sample, is an approximate standard normal variable. Similarly,
for the population proportion,
Z0 = √n (p − P0)/√(P0 (1 − P0))
P0 being the specified population proportion, which again, for a large sample, is
an approximate standard normal variable.
16.3 STUDENT’S t-DISTRIBUTION
Since we cannot use Z-test, for a small sample, for population mean when the
population standard deviation is not known, we are on the look out for a new
test statistic. It is necessary to know a few terms first.
Again, x1² + x2² = Σ(i=1 to 2) xi² ~ χ²2 and, in general,
x1² + x2² + … + xn² = Σ(i=1 to n) xi² ~ χ²n ……(16.1)
If we write u = Σ(i=1 to n) xi², then the probability density function of u is that of
the χ²-distribution with n d.f., sketched in Figure 16.1.
Figure 16.1: χ² distribution
If x1, x2, x3, …, xn are 'n' independent variables, each following normal
distribution with mean (µ) and variance (σ²), then Xi = (xi − µ)/σ is a standard
normal variable and as such
u = Σ Xi² = Σ (xi − µ)²/σ² ~ χ²n ……(16.2)
Writing the sample variance as
S² = Σ(xi − x̄)²/n
we note that
Σ(xi − µ)²/σ² ~ χ²n
and n(x̄ − µ)²/σ² = (x̄ − µ)²/(σ²/n) = [(x̄ − µ)/(σ/√n)]² ~ χ²1
since x̄ ~ N(µ, σ/√n).
Hence it follows that nS²/σ² ~ χ²(n−1)
Student's t-distribution: Consider two independent variables 'y' and 'u' such
that 'y' follows standard normal distribution and 'u' follows χ²-distribution with
m d.f.
Then the ratio t = y/√(u/m) follows t-distribution with m d.f. The probability
density function of t is given by:
f(t) = const. (1 + t²/m)^−(m+1)/2, −∞ < t < ∞ ……(16.3)
where t = √n (x̄ − µ)/s′; const. = a constant required to make the area under the
curve equal to unity; and m = n − 1, the degrees of freedom of t.
Since we have:
f(t) = const. (1 + t²/m)^−(m+1)/2 for −∞ < t < ∞
∴ Log f = k − ((m + 1)/2) Log (1 + t²/m), where 'k' is a constant
= k − ((m + 1)/2) (t²/m − t⁴/2m² + …… to ∞)
[as Log (1 + x) = x − x²/2 + …… to ∞ for −1 < x ≤ 1, and t²/m is rather small
for a large m]
Hence Log f = k − ((m + 1)/m) · (t²/2) + ((m + 1)/4m²) t⁴ − …… to ∞
Since m is very large, (m + 1)/m tends to 1 and the remaining terms, containing
higher powers of m in the denominator, tend to zero. Thus we have:
Log f = k − t²/2
or, f = e^(k − t²/2) = e^k · e^(−t²/2) = const. e^(−t²/2)
which is the form of the standard normal density. Looking from another angle, as
the mean of the t-distribution is zero and its standard deviation = √(m/(m − 2)),
which tends to unity for a large m, the t-distribution approaches the standard
normal distribution when m is large.
If x1, x2, x3 …xn denote the n observations of a random sample drawn from a
normal population with mean as µ and the standard deviation as σ, then x1, x2,
x3 …xn can be described as ‘n’ independent random variables each following
normal distribution with the same mean µ and a common standard deviation σ.
If we consider the statistic:
t = √n (x̄ − µ)/s′
where x̄ = Σxi/n, the sample mean, and s′ = √(Σ(xi − x̄)²/(n − 1)) is the standard
deviation with divisor (n − 1) instead of n, then we may write:
√n (x̄ − µ)/s′ = [√n (x̄ − µ)/σ]/(s′/σ), dividing both numerator and denominator by σ
= [(x̄ − µ)/(σ/√n)]/√(Σ(xi − x̄)²/((n − 1)σ²)) = y/√(u/(n − 1))
where y = (x̄ − µ)/(σ/√n) is a standard normal variate.
Also, u = Σ(xi − x̄)²/σ² follows χ²-distribution with (n − 1) d.f.
Hence, by definition,
t = √n (x̄ − µ)/s′ ~ t(n−1)
We apply the t-distribution for finding the confidence interval for the mean as
well as for testing hypotheses regarding the mean. These are discussed in
Sections 16.4 and 16.5 respectively.
16.4 APPLICATION OF t-DISTRIBUTION TO DETERMINE CONFIDENCE INTERVAL FOR POPULATION MEAN
Let us assume that we have a random sample of size 'n' from a normal
population with mean as µ and standard deviation as σ. We consider the case
when both µ and σ are unknown. We are interested in finding confidence
interval for population mean. In view of our discussion in Section 16.3, we
know that :
t = √n (x̄ − µ)/s′
follows t-distribution with (n–1) d.f. We may recall here that x denotes the
sample mean and s|, the sample standard deviation with divisor as (n–1) and not
‘n’. We denote the upper α-point of t-distribution with (n–1) d.f as tα, (n–1).
Since t-distribution is symmetrical about t = 0, the lower α-point of t-distribution
with (n–1) d.f would be denoted by –tα, (n–1). As per our discussion in Unit
15, in order to get 100 (1–α)% confidence interval for µ, we note that :
P [−t(α/2, n−1) ≤ √n (x̄ − µ)/s′ ≤ t(α/2, n−1)] = 1 − α
or P [−x̄ − (s′/√n) t(α/2, n−1) ≤ −µ ≤ −x̄ + (s′/√n) t(α/2, n−1)] = 1 − α
or P [x̄ − (s′/√n) t(α/2, n−1) ≤ µ ≤ x̄ + (s′/√n) t(α/2, n−1)] = 1 − α
Thus, the 100(1 − α)% confidence interval to µ is:
[x̄ − (s′/√n) t(α/2, n−1), x̄ + (s′/√n) t(α/2, n−1)] ……(16.6)
100(1 − α)% Lower Confidence Limit to µ = x̄ − (s′/√n) t(α/2, n−1)
and 100(1 − α)% Upper Confidence Limit to µ = x̄ + (s′/√n) t(α/2, n−1)
Selecting α = 0.05, we may note that
95% Lower Confidence Limit to µ = x̄ − (s′/√n) t(0.025, n−1)
and 95% Upper Confidence Limit to µ = x̄ + (s′/√n) t(0.025, n−1) ……(16.7)
In a similar manner, setting α = 0.01, we get:
99% Lower Confidence Limit to µ = x̄ − (s′/√n) t(0.005, n−1)
and 99% Upper Confidence Limit to µ = x̄ + (s′/√n) t(0.005, n−1) ……(16.8)
Values of tα,m for m = 1 to 30 and for some selected values of α are provided
in Appendix Table 5. Figures 16.3, 16.4 and 16.5 exhibit confidence intervals to
µ applying the t-distribution, as follows:
Fig. 16.3: 100(1 − α)% Confidence Interval to µ, from µ = x̄ − (s′/√n) t(α/2, n−1) to µ = x̄ + (s′/√n) t(α/2, n−1), with α/2 area in each tail
Fig. 16.4: 95% Confidence Interval to µ, from µ = x̄ − (s′/√n) t(0.025, n−1) to µ = x̄ + (s′/√n) t(0.025, n−1), covering 95% area
Fig. 16.5: 99% Confidence Interval to µ, from µ = x̄ − (s′/√n) t(0.005, n−1) to µ = x̄ + (s′/√n) t(0.005, n−1), covering 99% area
Illustration 1
Following are the lengths (in ft.) of 7 iron bars obtained in a sample out of
100 such bars taken from SUR IRON FACTORY. We have to find the 95%
confidence interval for the mean length of iron bars produced by SUR IRON
FACTORY.
Solution: Let x denote the length of iron bars. We assume that x is normally
distributed with unknown mean µ and unknown standard deviation σ. Then
95% Lower Confidence Limit to µ = x̄ – (s′/√n) √((N–n)/(N–1)) t0.025,6
and 95% Upper Confidence Limit to µ = x̄ + (s′/√n) √((N–n)/(N–1)) t0.025,6
where x̄ = Σxi/n; s′ = √(Σ(xi – x̄)²/(n–1)); n = sample size = 7;
N = population size = 100;
and √((N–n)/(N–1)) = finite population correction (fpc)
Table 16.1: Computation of Sample Mean and S.D.
xi xi²
4.10 16.8100
3.98 15.8404
4.01 16.0801
3.95 15.6025
3.93 15.4449
4.12 16.9744
3.91 15.2881
Total 28.00 112.0404
Thus, we have: x̄ = 28/7 = 4
Σ(xi – x̄)² = Σxi² – n x̄² = 112.0404 – 7 × 4² = 0.0404
so that s′ = √(0.0404/6) = 0.082057
f.p.c. = √((100 – 7)/(100 – 1)) = 0.969223
From Appendix Table-5, t0.025,6 = 2.447.
Hence 95% Lower Confidence Limit to µ = 4 – (0.082057/√7) × 0.969223 × 2.447
= 4 – 0.073557 = 3.926443
So the 95% Confidence Interval for the mean length of iron bars = [3.93 ft, 4.07 ft].
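As a cross-check on Illustration 1, the interval can be recomputed directly from the raw lengths. This sketch applies the finite population correction and takes t0.025,6 = 2.447 from the t-table.

```python
from math import sqrt

lengths = [4.10, 3.98, 4.01, 3.95, 3.93, 4.12, 3.91]   # the 7 sampled bars
N, n = 100, len(lengths)                               # population and sample sizes
xbar = sum(lengths) / n
s_prime = sqrt(sum((x - xbar) ** 2 for x in lengths) / (n - 1))
fpc = sqrt((N - n) / (N - 1))                          # finite population correction
margin = (s_prime / sqrt(n)) * fpc * 2.447             # t_{0.025,6} = 2.447
ci = (xbar - margin, xbar + margin)
```

Running this gives an interval of roughly [3.93 ft, 4.07 ft] around the sample mean of 4 ft.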
Illustration 2
Find the 90% confidence interval to µ given sample mean and sample S.D. as 20.24
and 5.23 respectively, as computed on the basis of a sample of 11 observations
from a population containing 1000 units.
Solution: The 90% confidence interval to µ is:
[x̄ – (s′/√n) t0.05,10, x̄ + (s′/√n) t0.05,10]
As S = √(Σ(xi – x̄)²/n) is the sample standard deviation (S.D.),
nS² = Σ(xi – x̄)²
Hence (s′)² = Σ(xi – x̄)²/(n–1) = nS²/(n–1) [since Σ(xi – x̄)² = nS²]
or s′ = √(n/(n–1)) S = √(11/10) × 5.23 = 5.4853
Consulting Appendix Table-5, given at the end of this block, we find t0.05,10 = 1.812.
Thus the 90% confidence interval to µ is given by:
[20.24 – (5.4853/√11) × 1.812, 20.24 + (5.4853/√11) × 1.812] = [17.2432, 23.2368]
Illustration 3
The study hours per week of 17 teachers, selected at random from different
parts of West Bengal, were found to be:
6.6, 7.2, 6.8, 9.2, 6.9, 6.2, 6.7, 7.2, 9.7, 10.4, 7.4, 8.3, 7.0, 6.8, 7.6, 8.1, 7.8
Suppose, we are interested in computing 95% and 99% confidence intervals for
the average hours of study per week per teacher in the state of West Bengal.
Solution: If µ denotes the average hours of study per week per teacher in
West Bengal, then as discussed earlier,
95% confidence interval to µ = [x̄ – (s′/√n) t0.025,(n–1), x̄ + (s′/√n) t0.025,(n–1)]
and 99% confidence interval to µ = [x̄ – (s′/√n) t0.005,(n–1), x̄ + (s′/√n) t0.005,(n–1)]
Here n = 17 and x̄ = Σxi/n = 129.9/17 = 7.64 (approx.)
s′ = √(Σ(xi – x̄)²/(n–1)) = √((Σxi² – n x̄²)/(n–1))
= √((1014.41 – 17 × (7.64)²)/(17 – 1))
= √((1014.41 – 992.28)/16) = 1.1761
From Appendix Table-5, given at the end of this block, t0.025,16 = 2.120 and
t0.005,16 = 2.921.
Thus the 95% confidence interval to µ
= [(7.64 – (1.1761/√17) × 2.120) hours, (7.64 + (1.1761/√17) × 2.120) hours]
= [7.0353 hours, 8.2447 hours]
Similarly the 99% confidence interval to µ
= [(7.64 – (1.1761/√17) × 2.921) hours, (7.64 + (1.1761/√17) × 2.921) hours]
= [6.8068 hours, 8.4732 hours]
Illustration 4
The 90% confidence limits to µ, computed from a random sample of 26
observations drawn from a normal population, are 46.584 and 53.416. We have
to find the sample mean and the sample S.D. The 90% confidence interval is:
[x̄ – (s′/√n) t0.05,(n–1), x̄ + (s′/√n) t0.05,(n–1)]
In this case, n = 26. From Appendix Table-5, given at the end of this block,
t0.05,25 = 1.708.
Hence we have x̄ – (s′/√26) × 1.708 = 46.584
or x̄ – 0.33497 s′ = 46.584 …(1)
and x̄ + (s′/√26) × 1.708 = 53.416
or x̄ + 0.33497 s′ = 53.416 …(2)
On adding equations (1) and (2) we get
2x̄ = 100 or x̄ = 50
Replacing x̄ by 50 in equation (1), we have
50 – 0.33497 s′ = 46.584
or s′ = 3.416/0.33497 = 10.19793
Hence S = √((n–1)/n) s′ [from Illustration 2]
= 0.98058 × 10.19793 = 9.9999 ≈ 10
Thus the sample mean is 50 units and the sample S.D. is approximately 10 units.
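The back-solving in Illustration 4 can be sketched as follows; t0.05,25 = 1.708 is taken from the t-table.

```python
from math import sqrt

lower, upper, n, t_crit = 46.584, 53.416, 26, 1.708    # given 90% limits
xbar = (lower + upper) / 2                             # add equations (1) and (2)
s_prime = (upper - xbar) * sqrt(n) / t_crit            # from xbar + t*s'/sqrt(n) = upper
S = sqrt((n - 1) / n) * s_prime                        # convert back to divisor-n S.D.
```

This recovers x̄ = 50 and S ≈ 10, as in the text.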
Σ(xi – x̄)²/σ² ~ χ²n
k) Z test has the widest range of applicability among all the commonly used
tests.
4) A random sample of size 10 drawn from a normal population yields sample mean
as 85 and sample S.D as 8.7. Compute 90% and 95% confidence intervals to
population mean.
..........................................................................................................
..........................................................................................................
..........................................................................................................
5) Find 99% confidence limits for ‘µ’ given that a sample of 19 units drawn from a
population of 98 units provides sample mean as 15.627 and sample S.D as 2.348.
..........................................................................................................
..........................................................................................................
..........................................................................................................
6) A sample of size 10 drawn from a normal population produces the following results.
Σxi = 92 and Σxi² = 889
Obtain 95% confidence limits to µ.
..........................................................................................................
..........................................................................................................
..........................................................................................................
H0 : µ = µ 0
against H : µ ≠ µ0 i.e., the population mean is anything but µ0.
or H1 : µ > µ0 i.e., the population mean is greater than µ0.
or H2 : µ < µ0 i.e., the population mean is less than µ0.
As we have noted in Section 16.1, the proper test to apply in this situation is
undoubtedly the t-test. We denote the upper α-point of the t-distribution with
m d.f. by tα,m; as the t-distribution is symmetrical about 0, the lower α-point is
t1–α,m = –tα,m. For testing H0 against the both-sided alternative H : µ ≠ µ0, the
critical region consists of both tails:
ω : |t0| ≥ tα/2,m
This is shown in the following Figure 16.6. The critical region lies on both the tails.
Fig. 16.6: Critical Region for Both-tailed Test (acceptance region between –tα/2,m
and tα/2,m, with area α/2 in each tail)
Secondly, in order to test the null hypothesis against the right-sided alternative
i.e., to test H0 against H1 : µ > µ0, from (16.11) we note that, as before, if we
choose a small value of α, then the probability that the observed value of t,
would exceed the critical value tα, m is very low. Thus one may have serious
questions in this case, about the validity of H0 if the value of t, as obtained on
the basis of a small random sample, really exceeds tα, m. We then reject H0
and accept H1. The critical region
ω : t0 ≥ tα, m ………(16.15)
lies on the right-tail of the curve and the test as such is called right-tailed test.
This is shown in Figure 16.7.
Fig. 16.7: Critical Region for Right-tailed Test (acceptance region to the left of
tα,m, with area α in the right tail)
Lastly, when we proceed to test H0 against the left-sided alternative
H2 : µ < µ0, we note that (16.12) suggests that if α is small, then the
probability that t0 would be less than the critical value –tα, m is very small. So
if the value of t0 as computed, on the basis of a small sample, is found to be
less than –tα, m, we would doubt the validity of H0 and accept H2. The critical
region
ω : t0 ≤ – tα, m …(16.16)
would lie on the left-tail and the test would be left-tailed test. This is depicted
in Fig. 16.8.
Fig. 16.8: Critical Region for Left-tailed Test (acceptance region to the right of
–tα,m, with area α in the left tail)
4) Whether the sample drawn is a small one. Again if the answer is ‘no’, i.e., n >
30, we would be satisfied with the Z-test. However, if n ≤ 30 and the first three
conditions are fulfilled, we should recommend the t-test:
t = √n (x̄ – µ)/s′
where n = sample size; x̄ = sample mean; and s′ = sample S.D. with divisor
(n–1). The test statistic follows t-distribution with (n–1) d.f.
In order to test H0 : µ = µ0 against the both-sided alternative H : µ ≠ µ0 we
compute:
t0 = √n (x̄ – µ0)/s′
If t0 falls in the critical region defined by:
ω : |t0| ≥ tα/2,(n–1)
tα,m being the upper α-point of t-distribution with m d.f., then we reject H0. In
other words, H0 is rejected and H : µ ≠ µ0 is accepted if the absolute value of
t0, as computed from the sample, is greater than or equal to the critical value
tα/2,(n–1).
Figure 16.9 shows the critical region at 5% level of significance while Figure 16.10
shows the critical region at 1% level of significance.
Fig. 16.9: Critical Region for Both-tailed Test at 5% Level of Significance
(acceptance region covers 95% of the area; critical regions ω : t0 ≤ –t0.025,(n–1)
and ω : t0 ≥ t0.025,(n–1), with area 0.025 in each tail)
Fig. 16.10: Critical Region for Both-tailed Test at 1% Level of Significance
(acceptance region covers 99% of the area; critical regions ω : t0 ≤ –t0.005,(n–1)
and ω : t0 ≥ t0.005,(n–1), with area 0.005 in each tail)
Similarly, for testing H0 against the right-sided alternative H1 : µ > µ0, the
critical region is given by:
ω : t0 ≥ tα,(n–1)
which at the 5% level of significance becomes:
ω : t0 ≥ t0.05,(n–1)
The following Figures 16.11 and 16.12 show these two critical regions.
Fig. 16.11: Critical Region for Right-tailed Test at 5% Level of Significance
(acceptance region covers 95% of the area; critical region ω : t0 ≥ t0.05,(n–1),
with area 0.05 in the right tail)
Fig. 16.12: Critical Region for Right-tailed Test at 1% Level of Significance
(acceptance region covers 99% of the area; critical region ω : t0 ≥ t0.01,(n–1),
with area 0.01 in the right tail)
Lastly, when we test H0 against the left-sided alternative H2 : µ < µ0, the
critical region would be:
ω : t0 ≤ –tα,(n–1)
which at the 5% and 1% levels of significance becomes, respectively:
ω : t0 ≤ –t0.05,(n–1)
ω : t0 ≤ –t0.01,(n–1)
These are depicted in the following Figure 16.13 and Figure 16.14 respectively.
Fig. 16.13: Critical Region for Left-tailed Test at 5% Level of Significance
(area 0.05 in the left tail, to the left of –t0.05,(n–1))
Fig. 16.14: Critical Region for Left-tailed Test at 1% Level of Significance
(area 0.01 in the left tail, to the left of –t0.01,(n–1))
Illustration 5
The weights of 13 packed tins of oil, selected at random from the output of a
machine set to a specified mean weight of 10, were:
9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7
Solution: Let x denote the weight of the packed tins of oil. We wish to test
H0 : µ = 10 against the left-sided alternative H2 : µ < 10. Since the sample is
small and the population S.D. is unknown, we compute:
t0 = √n (x̄ – 10)/s′, where x̄ = Σxi/n and s′ = √((Σxi² – n x̄²)/(n–1))
The critical region for the left-sided alternative is:
ω : t0 ≤ –tα,(n–1)
Taking α = 0.05, t0.05,12 = 1.782, so ω : t0 ≤ –1.782.
From the given data, x̄ = 130.5/13 = 10.038 and s′ = 0.5501.
Hence t0 = √13 (10.038 – 10)/0.5501 = 0.252
which is greater than –1.782.
As t0 does not fall on the critical region w, we accept H0. So, on the basis of
the given data as obtained from the sample observations, we conclude that the
machine worked in accordance with the given specifications.
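The computation of Illustration 5 can be checked from the raw weights; small discrepancies with hand-rounded intermediate figures are to be expected. The critical value t0.05,12 = 1.782 is as in the text.

```python
from math import sqrt

def one_sample_t(sample, mu0):
    # t0 = sqrt(n)*(xbar - mu0)/s', with s' using divisor (n-1)
    n = len(sample)
    xbar = sum(sample) / n
    s_prime = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return sqrt(n) * (xbar - mu0) / s_prime

weights = [9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7]
t0 = one_sample_t(weights, 10)
accept_h0 = not (t0 <= -1.782)       # left-tailed critical region at the 5% level
```

Since t0 ≈ 0.25 does not fall in the critical region, H0 is accepted, matching the conclusion above.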
Illustration 6
Solution: Let x denote the inner diameter of steel tubes as produced by the
company. We are interested in testing
H0 : µ = 4 against
H :µ≠ 4
Assuming that x follows normal distribution, we note that the sample size is 15
(<30) and the population S.D. is unknown. All these factors justify the
application of t-distribution. Thus we compute our test statistic as:
t = √n (x̄ – 4)/s′
As given, x̄ = 3.96 and S = 0.032
∴ s′ = √(n/(n–1)) S = √(15/14) × 0.032 = 0.0331
So t0 = √15 (3.96 – 4)/s′ = –√14 × 0.04/0.032 = –4.677
Hence |t0| = 4.677
The critical region is ω : |t0| ≥ tα/2,(n–1)
Selecting the level of significance as 1%, from the t-table (Appendix Table-5),
we get t0.01/2,(15–1) = t0.005,14 = 2.977
Thus, ω : |t0| ≥ 2.977
Since the computed value |t0| = 4.677 falls in ω, we reject H0. Hence
the sample mean is significantly different from the population mean.
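The conversion from S (divisor n) to s′ (divisor n–1) and the test of Illustration 6 can be sketched from the summary statistics alone, taking t0.005,14 = 2.977 from the t-table.

```python
from math import sqrt

n, xbar, S, mu0 = 15, 3.96, 0.032, 4.0   # summary statistics of Illustration 6
s_prime = sqrt(n / (n - 1)) * S          # sample S.D. with divisor (n-1)
t0 = sqrt(n) * (xbar - mu0) / s_prime
reject_h0 = abs(t0) >= 2.977             # both-sided test at the 1% level
```

Note that √n/s′ simplifies to √(n–1)/S here, so t0 = –√14 × 0.04/0.032 ≈ –4.68 regardless of how s′ is rounded.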
Illustration 7
The mean weekly sales of detergent powder in the department stores of the
city of Delhi as produced by a company was 2,025 kg. The company carried
out a big advertising campaign to increase the sales of their detergent powder.
After the advertising campaign, the following figures were obtained from 20
departmental stores selected at random from all over the city (weight in kgs.).
t0 = √n (x̄ – 2025)/s′
and the critical region for the right-sided alternative is given by:
ω : t0 ≥ tα,(n–1)
or ω : t0 ≥ 1.729
[By selecting α = 0.05 and consulting Appendix Table-5, given at the end of
this block, we find that for m = 20–1 = 19 and for α = 0.05, the value of t is
1.729.]
xi ui = xi – 2000 ui²
2000 0 0
2023 23 529
2056 56 3136
2048 48 2304
2010 10 100
2025 25 625
2100 100 10000
2563 563 316969
2289 289 83521
2005 5 25
2082 82 6724
2056 56 3136
2049 49 2401
2020 20 400
2310 310 96100
2206 206 42436
2316 316 99856
2186 186 34596
2243 243 59049
2013 13 169
Total 2600 762076
From the above table, we have x̄ = 2000 + 2600/20 kg = 2130 kg
s′ = √((Σui² – n ū²)/(n–1))
= √((762076 – 20 × (130)²)/19) = 149.3981 kg
As t0 = √n (x̄ – 2025)/s′
∴ t0 = √20 (2130 – 2025)/149.3981 = 3.143
A glance at the critical region suggests that we reject H0 and accept H1. On
the basis of the given sample we, therefore, conclude that the advertising
campaign was successful in increasing the sales of the detergent powder
produced by the company.
Illustration 8
A random sample of 26 items taken from a normal population has the mean as
145.8 and S.D. as 15.62. At 1% level of significance, test the hypothesis that
the population mean is 150.
Solution: Here we would like to test H0 : µ = 150 i.e., the population mean is
150 against H : µ ≠ 150 i.e., the population mean is not 150. As the necessary
conditions for applying t-test are fulfilled, we compute
t0 = √n (x̄ – 150)/s′
and the critical region at 1% level of significance is:
ω : |t0| ≥ 2.787
∴ s′ = √(n/(n–1)) S = √(26/25) × 15.62 = 15.9293
So t0 = √26 (145.8 – 150)/15.9293 = –1.344
thereby |t0| = 1.344
Looking at the critical region, we find acceptance of H0. So on the basis of the
given data, we infer that the population mean is 150.
Similarly one may apply paired t-test to verify the necessity of a costly
management training for its sales personnel by recording the sales of the
selected trainees before and after the management training or the validity of
special coaching for a group of educationally backward students by verifying
their progress before and after the coaching programme or the increase in
productivity due to the application of a particular kind of fertiliser by recording
the productivity of a crop before and after applying this particular fertiliser and
so on.
Let us now discuss the theoretical background for the application of paired t-
test. In our earlier discussions, we were emphatic about the observations being
independent of each other. Now we consider a pair of random variables which
are dependent or correlated. Earlier, we considered normal distribution, to be
more precise, univariate normal distribution. Similarly, we may think of bivariate
normal distribution. Let x and y be two random variables following bivariate
normal distribution with means µ1 and µ2 respectively, standard deviations σ1 and
σ2 respectively and a correlation co-efficient (ρ).
Thus ‘x’ and ‘y’ may be the bodyweight of the babies before and after the
application of the restorative, sales before and after the training programme,
marks of the weak students before and after the coaching, yield of a crop
before and after applying the fertiliser and so on.
Let us consider ‘n’ pairs of observations on ‘x‘ and ‘y’ and denote the ‘n’
pairs by (xi, yi) for i = 1, 2, 3, …, n.
Thus testing H0 : µ1 = µ2, i.e., µu = 0, is analogous to testing for the population
mean when the population standard deviation is unknown. In view of our
discussion in Section 16.5, if the sample size is small, it is obvious that the
appropriate test statistic would be:
t = √n (ū – µu)/s′u ……(16.17)
where n = sample size; ū = Σui/n; u = x – y;
s′u = √(Σ(ui – ū)²/(n–1)) = √((Σui² – n ū²)/(n–1))
As before, under H0, t0 = √n ū/s′u follows t-distribution with (n–1) d.f.
Thus for testing H0 : µu = 0 against H : µu ≠ 0,
the critical region is provided by:
ω : |t0| ≥ tα/2,(n–1)
For testing H0 against H1 : µ1 > µ2 i.e., H1 : µu > 0
We consider the critical region
ω : t ≥ tα, (n–1)
When the sample size exceeds 30, the assumption of normality for u may be
avoided and the test statistic √n ū/s′u can be taken as a standard normal
variable; accordingly we may recommend the Z-test.
Illustration 9
Is it reasonable to believe that the drug has no effect on the change of blood-
pressure?
Solution: Let x denote blood-pressure before applying the drug and y, the
blood-pressure after applying the drug. Further let µ1 denote the average blood-
pressure in the population before applying the drug and µ2, the average blood-
pressure after applying the drug. Thus the problem is reduced to testing :
H0 : µ1 = µ2 (i.e., µu = 0) against H1 : µ1 > µ2 (i.e., µu > 0).
Under H0, t0 = √n ū/s′u follows t-distribution with (n–1) d.f.
Thus the critical region would be
ω : t0 ≥ tα,(n–1)
or ω : t0 ≥ 1.895
by taking α = 0.05, since tα,(n–1) = t0.05,7 = 1.895 from Appendix Table-5.
From the given data, we find that n = 8, Σui = 6, Σui² = 120
Hence ū = Σui/n = 6/8 = 0.75
s′u = √((Σui² – n ū²)/(n–1)) = √((120 – 8 × (0.75)²)/7) = √16.5 = 4.062
t0 = √n ū/s′u
∴ t0 = √8 × 0.75/4.062 = 0.522
Looking at the critical region, we find that H0 is accepted. Thus on the basis of
the given data we conclude that the drug has been unsuccessful in reducing
blood-pressure.
Illustration 10
A group of students was selected at random from the set of weak students in
statistics. They were given intensive coaching for three months. The marks in
statistics before and after the coaching are shown below.
Serial No. of student Marks before coaching Marks after coaching
1 19 32
2 38 36
3 28 30
4 32 30
5 35 40
6 10 25
7 15 30
8 29 20
9 16 15
Solution: Let x and y denote the marks in statistics before and after the
coaching respectively. If the corresponding mean marks in the population be µ1
and µ2 respectively, then we are to test :
H0 : µ1 = µ2, i.e., the coaching has made no difference, against the alternative
hypothesis H1 : µ1 < µ2, i.e., the coaching has really improved the standard of
the students.
We compute :
t0 = √n ū/s′u, which follows t-distribution with (n–1) d.f. under H0,
where n = no. of students selected = 9
u = x – y = difference in statistics marks
s′u = √((Σui² – n ū²)/(n–1))
since α = 0.05, n = 9,
consulting Appendix Table-5, we find that t0.05 , 8 = 1.86.
Thus the left-sided critical region is provided by ω : t0 ≤ –1.86.
Marks in Statistics
Serial No. of student Before coaching (xi) After coaching (yi) ui = xi – yi ui²
1 19 32 –13 169
2 38 36 2 4
3 28 30 –2 4
4 32 30 2 4
5 35 40 –5 25
6 10 25 –15 225
7 15 30 –15 225
8 29 20 9 81
9 16 15 1 1
Total – – –36 738
Thus ū = Σui/n = –36/9 = –4
s′u = √((738 – 9 × (–4)²)/8) = 8.6168
∴ t0 = √9 × (–4)/8.6168 = –1.393
A glance at the critical region suggests that we accept H0. On the basis of the
given data, therefore, we infer that the coaching has failed to improve the
standard of the students.
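The paired computation of Illustration 10 can be sketched as follows (t0.05,8 = 1.86); recomputing from the raw marks gives t0 ≈ –1.39, and the decision to accept H0 is unchanged.

```python
from math import sqrt

before = [19, 38, 28, 32, 35, 10, 15, 29, 16]   # marks before coaching
after = [32, 36, 30, 30, 40, 25, 30, 20, 15]    # marks after coaching
u = [x - y for x, y in zip(before, after)]      # u_i = x_i - y_i
n = len(u)
ubar = sum(u) / n
s_u = sqrt((sum(ui * ui for ui in u) - n * ubar ** 2) / (n - 1))
t0 = sqrt(n) * ubar / s_u
accept_h0 = not (t0 <= -1.86)                   # left-tailed test at the 5% level
```

The same function of the differences u serves any paired t-test: only the data and the critical value change.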
Illustration 11
Serial number of trainee Sales (’000 Rs.): Before the course, After the course
1 15 16
2 16 17
3 13 19
4 20 18
5 18 22.5
6 17 18.3
7 16 19.2
8 19 18
9 20 20
10 15.5 16
11 16.2 17
12 15.8 17
13 18.7 20
14 18.3 18
15 20 22
Was the training programme effective in promoting sales? Select α = 0.05.
Solution: We are to test H0 : µ1 = µ2 against
H1 : µ1 < µ2
µ1 and µ2 being the average sales in the population before the training and
after the training. As before the critical region is :
ω : t0 ≤ –1.761
as m = n–1 = 14 and t0.05, 14 = 1.761
Serial No. Before After ui = xi – yi ui²
1 15 16 –1 1
2 16 17 –1 1
3 13 19 –6 36
4 20 18 2 4
5 18 22.5 – 4.5 20.25
6 17 18.3 – 1.3 1.69
7 16 19.2 – 3.2 10.24
8 19 18 1 1
9 20 20 0 0
10 15.5 16 – 0.5 0.25
11 16.2 17 – 0.8 0.64
12 15.8 17 – 1.2 1.44
13 18.7 20 – 1.3 1.69
14 18.3 18 0.3 0.09
15 20 22 –2 4
Total – – –19.5 83.29
s′u = √((Σui² – n ū²)/(n–1))
From the above Table 16.5, we have
ū = –19.5/15 = –1.3
s′u = √((83.29 – 15 × (1.3)²)/14) = √(57.94/14) = 2.0343
Hence t0 = √n ū/s′u = √15 × (–1.3)/2.0343 = –2.475
t0 being less than –1.761, we reject H0. Thus on the basis of the given sample,
we conclude that the training programme was effective in promoting sales.
Illustration 12
Six pairs of husbands and wives were selected at random and their IQs were
recorded as follows:
Pair : 1 2 3 4 5 6
IQ of Husband : 105 112 98 92 116 110
IQ of Wife : 102 108 100 96 112 110
Do the data suggest that there is no significant difference in average IQ
between the husband and wife? Use 1% level of significance.
Solution: Let x denote the IQ of husband and y, that of wife. We would like
to test
H0 : µ1 = µ2, i.e., there is no difference in IQ, against H : µ1 ≠ µ2.
The critical region is:
ω : |t0| ≥ t0.01/2,(6–1)
i.e., ω : |t0| ≥ t0.005,5
i.e., ω : |t0| ≥ 4.032
Pair IQ of Husband (xi) IQ of Wife (yi) ui = xi – yi ui²
1 105 102 3 9
2 112 108 4 16
3 98 100 –2 4
4 92 96 –4 16
5 116 112 4 16
6 110 110 0 0
Total – – 5 61
125
From the above Table, we get
ū = 5/6 = 0.8333
s′u = √((Σui² – n ū²)/(n–1)) = √((61 – 6 × (0.8333)²)/5) = 3.3715
t0 = √n ū/s′u = √6 × 0.8333/3.3715 = 0.605
Therefore, we accept H0 and conclude that, on the basis of the given sample,
there is no reason to believe that IQs of husbands and wives are different.
2) Describe the different steps one should undertake in order to apply t-test.
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) A certain diet was introduced to increase the weight of pigs. A random sample of
12 pigs was taken and weighed before and after applying the new diet. The
differences in weights were :
7, 4, 6, 5, – 6, – 3, 1, 0, –5, –7, 6, 2
Can we conclude that the diet was successful in increasing the weight of the pigs?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
The critical region for the both-sided alternative is ω : |t0| ≥ tα/2,(n–1), for the
right-sided alternative ω : t0 ≥ tα,(n–1), and for the left-sided alternative
ω : t0 ≤ –tα,(n–1).
We have concluded our discussion by describing the paired t-test and the
critical region for t-tests applied to dependent samples.
16.9 KEY WORDS AND SYMBOLS
Chi-square Distribution: If x1, x2, …, xm are ‘m’ independent standard normal
variables, then u = Σxi² follows χ²-distribution with m d.f. and this is denoted by
u ~ χ²m.
Degree of Freedom (d.f.): no. of observations – no. of constraints.
Large Sample: when sample size (n) is more than 30.
Large Sample Tests or Approximate Tests: tests based on large samples.
Paired Samples: Another term used for dependent samples.
Small Sample: when sample size (n) is at most 30.
Small Sample Tests or Exact Tests: tests based on small samples only.
t-distribution: If x is a standard normal variable and u is a chi-square with
m d.f., and x and u are independent variables, then the ratio x/√(u/m) follows
t-distribution with m d.f. and is denoted by t ~ tm.
100(1–α)% confidence interval to µ:
[x̄ – tα/2,(n–1) × s′/√n, x̄ + tα/2,(n–1) × s′/√n]
For testing the population mean from independent samples, we use the test
statistic
t0 = √n (x̄ – µ0)/s′
and for testing for a particular effect, we use
t0 = √n ū/s′u
where µ0 = specified value for the mean; s′ = sample S.D. with (n–1) divisor;
u = x – y = difference in paired sample; and s′u = sample S.D. of u with (n–1)
divisor.
5. No, t0 = – 0.226
6. No, t0 = 0.518
2) How would you distinguish between a t-test for independent sample and a paired
t-test?
6) A technician is making engine parts with axle diameter of 0.750 inch. A random
sample of 14 parts shows a mean diameter of 0.763 inch and a S.D. of 0.0528
inch.
7) St. Nicholas college has 500 students. The heights (in cm.) of 11 students chosen
at random provides the following results:
175, 173, 165, 170, 180, 163, 171, 174, 160, 169, 176
Determine the limits of mean height of the students of St. Nicholas college at 1%
level of significance.
(Ans: 164.6038 cm. and 176.4870 cm.)
8) For a sample of 15 units drawn from a normal population of 150 units, the mean
and S.D. are found to be 10.8 and 3.2 respectively. Find the confidence level for
the following confidence intervals.
(i) 9.415, 12.185
(ii) 9.113, 12.487
[Ans: (i) 90% (ii) 95%]
Sales
(’000 Rs.)
After 17 17 12 15 20 19 14 15 24 12 10 12 18 17 34
campaign
11. A suggestion was made that husbands are more intelligent than wives. A social
worker took a sample of 12 couples and applied I.Q. Tests to both husbands and
wives. The results are shown below:
Sl.No. I.Q. of
Husbands Wives
1. 110 115
2. 115 113
3. 102 104
4. 98 90
5. 90 93
6. 105 103
7. 104 106
8. 116 118
9. 109 110
10. 111 110
11. 87 100
12. 100 98
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
16.12 FURTHER READING
The following text books may be used for more in-depth study on the topics
dealt with in this unit.
Levin and Rubin, 1996, Statistics for Management, Prentice-Hall of India Pvt. Ltd.,
New Delhi.
Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India Ltd.,
Delhi.
Gupta, S.P., 1999, Statistical Methods, Sultan Chand & Sons, New Delhi.
UNIT 17 CHI-SQUARE TEST
STRUCTURE
17.0 Objectives
17.1 Introduction
17.2 Chi-Square Distribution
17.3 Chi-Square Test for Independence of Attributes
17.4 Chi-Square Test for Goodness of Fit
17.5 Conditions for Applying Chi-Square Test
17.6 Cells Pooling
17.7 Yates Correction
17.8 Limitations of Chi-Square Test
17.9 Let Us Sum Up
17.10 Key Words
17.11 Answers to Self Assessment Exercises
17.12 Terminal Questions/Exercises
17.13 Further Reading
Appendix Tables
17.0 OBJECTIVES
After studying this unit, you should be able to:
l explain and interpret interaction among attributes,
l use the chi-square distribution to see if two classifications of the same data
are independent of each other,
l use the chi-square statistic in developing and conducting tests of goodness-
of-fit, and
l analyse the independence of attributes by using the chi-square test.
17.1 INTRODUCTION
Chi-square tests enable us to test whether more than two population proportions
are equal. Also, if we classify a consumer population into several categories
(say high/medium/low income groups and strongly prefer/moderately prefer/
indifferent/do not prefer a product) with respect to two attributes (say consumer
income and consumer product preference), we can then use chi-square test to
test whether two attributes are independent of each other. In this unit you will
learn the chi-square test, its applications and the conditions under which the chi-
square test is applicable.
17.2 CHI-SQUARE DISTRIBUTION
The chi-square distribution is a probability distribution. Under some proper
conditions the chi-square distribution can be used as a sampling distribution of
chi-square. You will learn about these conditions in section 17.5 of this unit.
The chi-square distribution is known by its only parameter – number of degrees
of freedom. The meaning of degrees of freedom is the same as the one you
have used in student t-distribution. Figure 17.1 shows the three different chi-
square distributions for three different degrees of freedom.
Figure 17.1: Chi-Square Sampling Distributions for df = 2, 3 and 4 (probability
density curves of χ² for the three degrees of freedom)
It is to be noted that when the number of degrees of freedom is very small, the
chi-square distribution is heavily skewed to the right. As the number of degrees of
freedom increases, the curve rapidly approaches symmetric distribution. You
may be aware that when the distribution is symmetric, it can be approximated
by normal distribution. Therefore, when the degrees of freedom increase
sufficiently, the chi-square distribution approximates the normal distribution. This
is illustrated in Figure 17.2.
Figure 17.2: Chi-Square Sampling Distributions for df = 2, 4, 10, and 20 (the
curves become increasingly symmetric as df increases)
Like student t-distribution there is a separate chi-square distribution for each
number of degrees of freedom. Appendix Table-1 gives the most commonly
used tail areas that are used in tests of hypothesis using chi-square distribution.
We will explain how to use this table to test the hypothesis when we deal with
examples in the subsequent sections of this unit.
17.3 CHI-SQUARE TEST FOR INDEPENDENCE OF
ATTRIBUTES
Many times, researchers may like to know whether the differences they
observe among several sample proportions are significant or only due to chance.
Suppose a sales manager wants to know the preferences of consumers, located
in different geographic regions of a country, for a particular brand of a product.
In case the manager finds that the difference in product preference among the
people located in different regions is significant, he/she may like to change the
brand name according to the consumer preferences. But
if the difference is not significant then the manager may conclude that the
difference, if any, is only due to chance and may decide to sell the product
with the same name. Therefore, we are trying to determine whether the two
attributes (geographical region and the brand name) are independent or
dependent. It should be noted that the chi-square test only tells us whether two
principles of classification are significantly related or not, but not a measure of
the degree or form of relationship. We will discuss the procedure of testing the
independence of attributes with illustrations. Study them carefully to understand
the concept of χ2 test.
Illustration 1
Suppose in our example of consumer preference explained above, we divide
India into 6 geographical regions (south, north, east, west, central and north
east). We also have two brands of a product brand A and brand B.
The survey results can be classified according to the region and brand
preference as shown in the following table.
Consumer preference
Region Brand A Brand B Total
South 64 16 80
North 24 6 30
East 23 7 30
West 56 44 100
Central 12 18 30
North-east 12 18 30
Total 191 109 300
Under the null hypothesis that the two attributes are independent, the expected
frequency of a cell is obtained as E = (row total × column total)/grand total.
For example, the expected frequency of the cell in row-1 and column-1 of the
6×2 contingency table referred to earlier is:
E = (80 × 191)/300 = 15280/300 = 50.93
Accordingly, the following table gives the calculated expected frequencies for
the rest of the cells of the 6x2 contingency table.
Consumer Preference
Region Brand A Brand B Total
South (80×191)/300 = 50.93 (80×109)/300 = 29.07 80
North (30×191)/300 = 19.10 (30×109)/300 = 10.90 30
East (30×191)/300 = 19.10 (30×109)/300 = 10.90 30
West (100×191)300= 63.67 (100×109)/300 =36.33 100
Central (30×191)300 = 19.10 (30×109)/300 = 10.90 30
North-east (30×191)/300 = 19.10 (30×109)/300 = 10.90 30
Total 191 109 300
We use the following formula for calculating the chi-square value:

χ² = Σ (Oi − Ei)² / Ei

1) Subtract Ei from Oi for each of the 12 cells and square each of these
differences, (Oi − Ei)².

2) Divide each squared difference by Ei and obtain the total, i.e., Σ (Oi − Ei)²/Ei.

This gives the value of chi-square, which may range from zero to infinity. Thus,
the value of χ² is always positive.
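As a concrete check of the two steps above, the χ² statistic for the Illustration 1 table can be computed directly. This is a minimal Python sketch; the variable names are ours, not from the text.

```python
# Chi-square statistic for the 6x2 brand-preference table (Illustration 1).
observed = [
    [64, 16],   # South
    [24, 6],    # North
    [23, 7],    # East
    [56, 44],   # West
    [12, 18],   # Central
    [12, 18],   # North-east
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Steps 1 and 2: sum the squared differences divided by expected frequencies
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi_square, 2))  # about 31.95
```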
Probability and Hypothesis Testing

Now we rearrange the data given in the above two tables for comparing the
observed and expected frequencies. The rearranged observed frequencies,
expected frequencies and the calculated χ² value are given in the following
table.
Illustration 2

A TV channel programme manager wants to know whether there are any
significant differences between male and female viewers in the type of
programme they watch. A survey conducted for the purpose gives the
following results.
Chi-Square Test
Type of TV programme    Viewers' sex
                        Male    Female    Total
News                    30      10        40
Serials                 20      40        60
Total                   50      50        100
Since we have a 2x2 contingency table, the degrees of freedom will be (r–1) ×
(c–1) = (2–1) × (2–1) = 1 × 1 = 1. At 1 degree of freedom and 0.10
significance level the table value (from Appendix Table-4) is 2.706. Since the
calculated χ² value (16.66) is greater than the table value of χ² (2.706), we reject
the null hypothesis and conclude that the type of TV programme is dependent
on viewers' sex. It should, therefore, be noted that when the value of χ² is
greater than the table value of χ², the difference between theory and observation
is significant.
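The calculated χ² value of 16.66 quoted above can be reproduced from the 2x2 table. A minimal Python sketch:

```python
# Verifying the chi-square value for the 2x2 TV-programme table.
observed = [[30, 10],   # News: male, female
            [20, 40]]   # Serials: male, female

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]
chi_square = sum((o - e) ** 2 / e
                 for orow, erow in zip(observed, expected)
                 for o, e in zip(orow, erow))
print(round(chi_square, 2))  # 16.67 (the text truncates this to 16.66)
```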
Self Assessment Exercise A
1) The following are the independent testing situations, calculated chi-
square values and the significance levels. (i) state the null hypothesis,
(ii) determine the number of degrees of freedom, (iii) calculate the
corresponding table value, and (iv) state whether you accept or reject
the null hypothesis.
a) Type of the car (small, family, luxury) versus attitude by sex
(preferred, not preferred). χ2 = 10.25 and α = 0.05.
b) Income distribution per month (below Rs 10000, Rs 10000-20000,
Rs 20000-30000, Rs 30000 and above) and preference for type of
house with number of bed rooms (1, 2, 3, 4 and above). χ2 = 28.50
and α = 0.01.
c) Attitude towards going to a movie or for shopping versus sex (male,
female). χ2 = 8.50 and α = 0.01.
d) Educational level (illiterate, literate, high school, graduate) versus
political affiliation (CPI, Congress, BJP, BSP). χ2 = 12.65 and α =
0.10.
.........................................................................................................
.........................................................................................................
.........................................................................................................
.........................................................................................................
.........................................................................................................
The logic inherent in the chi-square test allows us to compare the observed
frequencies (Oi) with the expected frequencies (Ei). The expected frequencies
are calculated on the basis of our theoretical assumptions about the population
distribution. Let us explain the procedure of testing by going through some
illustrations.
Illustration 3

A salesman has 3 products to sell and there is a 40% chance of selling each
product when he meets a customer. The following is the frequency distribution
of sales.
H0: The sales of the three products have a binomial distribution with P = 0.40.
H1: The sales of the three products do not have a binomial distribution with P = 0.40.

No. of products sold (x)    Binomial probability
0                           0.216
1                           0.432
2                           0.288
3                           0.064
Total                       1.000
We now calculate the expected frequency of sales for each situation. There are
130 customers visited by the salesman. We multiply each probability by 130 (no.
of customers visited) to arrive at the respective expected frequency. For
example, 0.216 × 130 = 28.08.
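The binomial probabilities above, and the expected frequencies obtained by multiplying them by 130, can be sketched as follows (illustrative Python; `comb` is the binomial coefficient):

```python
from math import comb

# Binomial probabilities for selling x of the 3 products with p = 0.40,
# and expected frequencies for the 130 customers visited.
n, p, customers = 3, 0.40, 130

probs = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
expected = [pr * customers for pr in probs]

for x, (pr, e) in enumerate(zip(probs, expected)):
    print(x, round(pr, 3), round(e, 2))   # e.g. 0 0.216 28.08
```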
The following table shows the observed frequencies and the expected
frequencies.
χ² = Σ (Oi − Ei)² / Ei
Observed frequencies (Oi)    Expected frequencies (Ei)    (Oi–Ei)    (Oi–Ei)²    (Oi–Ei)²/Ei
Illustration 4
In order to plan how much cash to keep on hand, a bank manager is interested
in seeing whether the average deposit of a customer is normally distributed with
mean Rs. 15000 and standard deviation Rs. 6000. The following information is
available with the bank.
Calculate the χ² statistic and test whether the data follow a normal distribution
with mean Rs. 15000 and standard deviation Rs. 6000 (take the level of
significance (α) as 0.10).
H0: The sample data of deposits is from a population having normal distribution
with mean Rs.15000 and standard deviation Rs.6000.
H1: The sample data of deposits is not from a population having normal
distribution with mean Rs.15000 and standard deviation Rs.6000.
For example, to obtain the area for deposits less than Rs. 10000, we calculate
the normal deviate as follows:

z = (x – μ)/σ = (10000 – 15000)/6000 = –0.83
From Appendix Table-3 (given at the end of this unit), this value (–0.83)
corresponds to a lower tail area of 0.5000–0.2967 = 0.2033. Multiplying 0.2033
by the sample size (150), we obtain the expected frequency 0.2033 × 150 =
30.50 depositors.
The calculations of the remaining expected frequencies are shown in the
following table.
(The probabilities total 1.0000 and the expected frequencies total 150.)
We should note that from Appendix Table-3, for 0.83 the area to the left of x is
0.5000 + 0.2967 = 0.7967, and for ∞ the area to the left of x is 0.5000 + 0.5000 =
1.0000. Similarly, the area of the deposit range for normal deviate 0.83 is 0.7967 –
0.2033 = 0.5934, and for ∞ it is 1.0000 – 0.7967 = 0.2033.
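Instead of reading Appendix Table-3, the same lower-tail area can be computed from the error function. A sketch; we round z to two decimals first, as the table lookup does:

```python
from math import erf, sqrt

# Lower-tail area under the standard normal curve, computed with erf.
def phi(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mean, sd, sample = 15000, 6000, 150
z = round((10000 - mean) / sd, 2)   # -0.83, as read from the table
area = phi(z)                       # about 0.2033 (Appendix Table-3 gives 0.2033)
print(z, round(area, 4), round(area * sample, 1))
```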
Once the expected frequencies are calculated, the procedure for calculating χ2
statistic will be the same as we have seen in illustration 3.
χ² = Σ (Oi − Ei)² / Ei
The following table gives the calculation of chi-square.
Illustration 5

A small car company wishes to determine the frequency distribution of
warranty financed repairs per car for its new model car. On the basis of past
experience the company believes that the pattern of repairs follows a Poisson
distribution with mean number of repairs (λ) as 3. A sample data of 400
observations is provided below:

No. of repairs per car    0    1    2    3    4    5 or more
No. of cars               20   57   98   85   78   62
H0: The number of repairs per car during warranty period follows a Poisson
probability distribution.
H1: The number of repairs per car during warranty period does not follow a Poisson
probability distribution.
As usual the expected frequencies are determined by multiplying the probability
values (in this case Poisson probability) by the total sample size of observed
frequencies. Appendix Table-2 provides the Poisson probability values. For
λ = 3.0 and for different x values we can directly read the probability values.
For example for λ = 3.0 and x = 0 the Poisson probability value is 0.0498, for
λ = 3.0 and x = 1 the Poisson probability value is 0.1494 and so on … .
x             Poisson probability    Expected frequency (probability × 400)
0             0.0498                 19.92
1             0.1494                 59.76
2             0.2240                 89.60
3             0.2240                 89.60
4             0.1680                 67.20
5 or more     0.1848                 73.92
Total         1.0000                 400
It is to be noted that from Appendix Table-2 for λ = 3.0 we have taken the
Poisson probability values directly for x = 0, 1, 2, 3 and 4. For x = 5 or more we
added the rest of the probability values (for x = 5 to x = 12) so that the sum
of all the probabilities for x = 0 to x = 5 or more will be 1.0000.
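The Poisson probabilities for λ = 3.0 and the expected frequencies for 400 cars can be sketched as follows. The computed values differ from the tabulated ones only in the second decimal, because the text multiplies the rounded table probabilities:

```python
from math import exp, factorial

# Poisson probabilities for lambda = 3.0 and expected repair frequencies
# for 400 cars; '5 or more' takes the remaining tail probability.
lam, cars = 3.0, 400

def poisson(x, lam):
    return exp(-lam) * lam**x / factorial(x)

probs = [poisson(x, lam) for x in range(5)]      # x = 0..4
probs.append(1 - sum(probs))                      # x = 5 or more
expected = [round(p * cars, 2) for p in probs]

print([round(p, 4) for p in probs])
print(expected)
```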
As usual we use the following formula for calculating the chi-square (χ²) value:

χ² = Σ (Oi − Ei)² / Ei
The following table gives the calculated χ2 value
Illustration 6

In order to know the brand preference of two washing detergents, a sample of
1000 consumers was surveyed. 56% of the consumers preferred Brand X
and 44% of the consumers preferred Brand Y. Do these data conform to the
idea that consumers have no special preference for either brand? Take the
significance level as 0.05.
Observed frequencies (Oi)    Expected frequencies (Ei)    (Oi–Ei)    (Oi–Ei)²    (Oi–Ei)²/Ei
560                          500                          60         3600        7.2
440                          500                          –60        3600        7.2
1000                         1000                                                χ² = 14.4
The table value (by consulting the Appendix Table-4) at 5% significance level
and n–1 = 2–1 = 1 degree of freedom is 3.841. Since the value of calculated
χ2 is 14.4 which is greater than table value, we reject the null hypothesis and
conclude that the brand names have special significance for consumer
preference.
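The χ² value of 14.4 can be verified directly from the observed and expected frequencies:

```python
# Chi-square for Illustration 6: with no preference, the expected split of
# 1000 consumers is 500/500; observed are 56% and 44% of 1000.
observed = [560, 440]
expected = [500, 500]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # 14.4, against a table value of 3.841 at alpha = 0.05, 1 d.f.
```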
17.5 CONDITIONS FOR APPLYING CHI-SQUARE TEST

a) Random Sample: In the chi-square test the data set used is assumed to be a random
sample that represents the population. As with all significance tests, if you have
random sample data that represent the population, then any differences between the
table values and the calculated values are real and therefore significant. On the
other hand, if you have non-random sample data, significance cannot be established,
though the tests are nonetheless sometimes utilised as crude "rules of thumb" any
way. For example, we reject the null hypothesis if the difference between observed
and expected frequencies is too large. But if the chi-square value is zero, we
should be careful in interpreting that absolutely no difference exists between
observed and expected frequencies. Then we should verify the quality of the data
collected, i.e., whether the sample data represent the population or not.
b) Large Sample Size: To use the chi-square test you must have a sample size
large enough to guarantee a close approximation between the theoretical
chi-square distribution and the sampling distribution of the chi-square
statistic. Applying the chi-square test to small samples exposes the researcher
to an unacceptable rate of type-II errors. However, there is no accepted cutoff
sample size; many researchers set the minimum sample size at 50. Remember that
the chi-square test statistic must be calculated on actual count data (nominal,
ordinal or interval data) and not on substituted percentages, which would have
the effect of projecting the sample size as 100.
c) Adequate Cell Sizes: You have seen above that a small sample size leads to
type-II error. That is, when the expected cell frequencies are too small, the
value of chi-square will be overestimated. This in turn will result in too
many rejections of the null hypothesis. To avoid making incorrect inferences
from chi-square tests we follow a general rule that the expected frequency
in any cell should be a minimum of 5.
17.6 CELLS POOLING
In the previous section we have seen that the expected cell frequency should be
at least 5. When a contingency table contains one or more cells with an
expected frequency of less than 5, this requirement may be met by
combining two rows or columns before calculating χ². We must combine these
cells in order to get an expected frequency of 5 or more in each cell. This
practice is also known as grouping the frequencies together. But in doing this,
we reduce the number of categories of data and will gain less information from the
contingency table. In addition, we also lose 1 or more degrees of freedom due
to pooling. With this practice, it should be noted that the number of degrees of
freedom is determined by the number of classes after the regrouping. In the
special case of a 2 × 2 contingency table, the degrees of freedom are 1. Suppose
in any cell the frequency is less than 5; applying the pooling method here
results in 0 degrees of freedom (due to the loss of 1 df), which is
meaningless. When the assumption of a minimum cell frequency of 5 is not
maintained in the case of a 2 × 2 contingency table, we apply Yates correction.
You will learn about Yates correction in section 17.7. Let us take an illustration
to understand the cell pooling method.
Illustration 7
A company marketing manager wishes to determine whether there are any
significant differences between regions in terms of a new product acceptance.
The following is the data obtained from interviewing a sample of 190
consumers.
Degree of acceptance    South    North    East    West    Total
Strong                  30       25       20      30      105
Moderate                15       15       20      20      70
Poor                    5        10       0       0       15
Total                   50       50       40      50      190
Calculate the chi-square statistic. Test the independence of the two attributes at
the 0.05 level of significance.
The expected frequency of each cell is (row total × column total)/n:

Degree of acceptance    South    North    East    West    Total
Strong                  27.63    27.63    22.11   27.63   105
Moderate                18.42    18.42    14.74   18.42   70
Poor                    3.95     3.95     3.16    3.95    15
Total                   50       50       40      50      190

Since the expected frequencies in the 'Poor' row are less than 5, we pool the
'Moderate' and 'Poor' rows. The pooled observed frequencies are:
Degree of acceptance    South    North    East    West    Total
Strong                  30       25       20      30      105
Moderate and poor       20       25       20      20      85
Total                   50       50       40      50      190
The corresponding pooled expected frequencies are:

Degree of acceptance    South    North    East    West    Total
Strong                  27.63    27.63    22.11   27.63   105
Moderate and poor       22.37    22.37    17.89   22.37   85
Total                   50       50       40      50      190
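Since the worked χ² table for this illustration is not reproduced above, here is a sketch of the remaining computation on the pooled 2x4 table, assuming, as before, expected frequency = row total × column total / n:

```python
# Chi-square for the pooled 2x4 table of Illustration 7 (a sketch).
observed = [[30, 25, 20, 30],   # Strong
            [20, 25, 20, 20]]   # Moderate and poor (pooled)

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for orow, erow in zip(observed, expected)
                 for o, e in zip(orow, erow))
df = (len(observed) - 1) * (len(observed[0]) - 1)   # (2-1)(4-1) = 3
print(round(chi_square, 3), df)
```

The calculated value (about 1.92) is well below the table value of 7.815 at the 0.05 level and 3 degrees of freedom, so the null hypothesis of independence would be accepted.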
Illustration 8
The following table gives the number of typing errors per page in a 40 page
report. Test whether the typing errors per page have a Poisson distribution with
mean (λ) number of errors is 3.0.
No. of typing errors per page    0    1    2    3    4    5    6    7    8    9    10 or more
No. of pages                     5    9    6    8    4    3    2    1    1    0    1
Since the expected frequencies of the first row are less than 5, we pool the first
and second rows of observed and expected frequencies. Similarly, the expected
frequencies of the last 6 rows (with 5, 6, 7, 8, 9, and 10 or more errors) are less
than 5. Therefore we pool these rows with the row for 4 errors, giving a pooled
category of '4 or more' typing errors.
As usual we use the following formula for calculating the chi-square (χ²) value:

χ² = Σ (Oi − Ei)² / Ei
The following table gives the calculated χ² value after pooling cells.
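The pooled calculation described above can be sketched as follows. The text's result table is not reproduced in this copy, so the printed value is our own computation under the stated pooling:

```python
from math import exp, factorial

# Pooled chi-square for the typing-errors data (Illustration 8):
# rows 0 and 1 are pooled, and rows 4, 5, ..., 10+ are pooled into '4 or more'.
pages = 40
counts = [5, 9, 6, 8, 4, 3, 2, 1, 1, 0, 1]          # errors 0..9, 10 or more
lam = 3.0

def poisson(x):
    return exp(-lam) * lam**x / factorial(x)

# Pooled observed frequencies: {0-1}, {2}, {3}, {4 or more}
observed = [counts[0] + counts[1], counts[2], counts[3], sum(counts[4:])]

p = [poisson(x) for x in range(4)]                   # P(0)..P(3)
probs = [p[0] + p[1], p[2], p[3], 1 - sum(p)]        # pooled probabilities
expected = [pr * pages for pr in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))
```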
17.7 YATES CORRECTION

Suppose for a 2 × 2 contingency table, the four cell values a, b, c and d are
arranged in the following order:

a b
c d

With n = a + b + c + d, the Yates-corrected chi-square value is:

χ² = n( |ad − bc| − n/2 )² / [(a + b)(c + d)(a + c)(b + d)]
Illustration 9
Suppose we have the following data on the consumer preference of a new
product collected from the people living in north and south India.
                                    South India    North India    Row total
Number of consumers who prefer
the present product                 4              51             55
Number of consumers who prefer
the new product                     14             38             52
Column total                        18             89             107
Do the data suggest that the new product is preferred by the people
independent of their region? Use α = 0.05.
H0: PS = PN (the proportion of people who prefer new product among south and north
India are the same).
H1: PS ≠ PN (the proportion of people who prefer new product among south and north
India are not the same).
In this illustration, (i) the sample size (n) = 107 (ii) the cell values are: a = 4,
b = 51, c = 14, d = 38, (iii) The corresponding row totals are: (a + b) = 55 and
(c + d) = 52, and column totals are (a + c) = 18 and (b + d) = 89.
χ² = 107( |4 × 38 − 51 × 14| − 107/2 )² / (55 × 52 × 18 × 89)
   = 107( |152 − 714| − 53.5 )² / 4581720
   = 107 × (508.5)² / 4581720 = 6.0386
The table value for degrees of freedom (2–1) (2–1) = 1 and significance level
α = 0.05 is 3.841. Since the calculated value of chi-square is 6.0386, which is
greater than the table value we can reject H0 and accept H1 and conclude that
the preference for the new product is not independent of the geographical
region.
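The Yates-corrected calculation for Illustration 9 can be reproduced directly:

```python
# Yates-corrected chi-square for the 2x2 table of Illustration 9.
a, b, c, d = 4, 51, 14, 38
n = a + b + c + d                                    # 107

numerator = n * (abs(a * d - b * c) - n / 2) ** 2
denominator = (a + b) * (c + d) * (a + c) * (b + d)  # 55*52*18*89 = 4581720
chi_square = numerator / denominator
print(round(chi_square, 4))  # 6.0386
```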
It may be observed that when N is large, the Yates correction will not make much
difference in the chi-square value. However, if N is small, the Yates correction
may overstate the probability.
a) As explained in section 17.5 (conditions for applying the chi-square test), the
chi-square test is highly sensitive to the sample size. As the sample size increases,
absolute differences become a smaller and smaller proportion of the expected value.
This means that a reasonably strong association may not come up as significant if
the sample size is small. Conversely, in a large sample we may find statistical
significance when the findings are small and unimportant; that is, the findings are
not substantively significant, although they are statistically significant.
b) Chi-square test is also sensitive to small frequencies in the cells of contingency
table. Generally, when the expected frequency in a cell of a table is less than 5,
chi-square can lead to erroneous conclusions as explained in section 17.5. The
rule of thumb here is that if either (i) an expected value in a cell in a 2 × 2 contingency
table is less than 5 or (ii) the expected values of more than 20% of the cells in a
greater than 2 × 2 contingency table are less than 5, then chi square test should not
be applied. If at all a chi-square test is applied then appropriately either Yates
correction or cell pooling should also be applied.
c) No directional hypothesis is assumed in the chi-square test. Chi-square tests the
hypothesis that two attributes/variables are related only by chance. That is, if a
significant relationship is found, this is not equivalent to establishing the researcher's
hypothesis that attribute A causes attribute B, or that attribute B causes attribute A.
Self Assessment Exercise B
1) While calculating the expected frequencies of a chi-square distribution it was found
that some of the cells of expected frequencies have value below 5. Therefore,
some of the cells are pooled. The following statements tell you the size of the
contingency table before pooling and the rows/columns pooled. Determine the
number of degrees of freedom.
a) 5 × 4 contingency table. First two and last two rows are pooled.
b) 4 × 6 contingency table. First two and last two columns are pooled.
c) 6 × 3 contingency table. First two rows are pooled. 4th, 5th, and 6th rows
are pooled.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
The chi-square test for testing the goodness-of-fit establishes whether the
sample data supports the assumption that a particular distribution applies to the
parent population. It should be noted that the statistical procedures are based on
some assumptions such as normal distribution of population. A chi-square
procedure allows for testing the null hypothesis that a particular distribution
applies. We also use the chi-square test to test whether the classification
criteria are independent or not.
When performing chi-square test using contingency tables, it is assumed that all
cell frequencies are a minimum of 5. If this assumption is not met we may use
the pooling method but then there is a loss of information when we use this
method. In a 2 × 2 contingency table if one or more cell frequencies are less
than 5 we should apply Yates correction for computing the chi-square value.
In a chi-square test for goodness of-fit, the degrees of freedom are number of
categories – 1 (n–1). In a chi-square test for independence of attributes, the
degrees of freedom are (number of rows–1) × (number of columns–1). That is,
(r–1) × (c–1).
Cells Pooling: When a contingency table contains one or more cells with
expected frequency less than 5, we combine two rows or columns before
calculating χ2. We combine these cells in order to get an expected frequency of
5 or more in each cell.
Goodness of Fit: The chi-square test procedure used for the validation of our
assumption about the probability distribution is called goodness of fit.
3. a. (Row, Column)    Observed frequency (Oi)    Expected frequency (Ei)    (Oi–Ei)    (Oi–Ei)²    (Oi–Ei)²/Ei
3. b. H0: The preference for the brand is distributed independent of the consumers’
education level.
3. c. Table value χ2 at 3 d.f and α = 0.05 is 7.815. Since calculated value (7.1178)
is less than the table value of χ2 (7.815), we accept the H0.
B) 1. a) 6, b) 9, c) 4
2. a) 20.090, b)22.362, c) 23.542, d) 8.558
3. i) Poisson probabilities and expected values
No. of repairs per car (x)    Poisson probability    Expected frequency Ei = (2) × 150
(1)                           (2)                    (3)
0 0.0498 7.47
1 0.1494 22.41
2 0.2240 33.60
3 0.2240 33.60
4 0.1680 25.20
5 or more 0.1848 27.72
3. ii) Chi-square value
3.iii) At 0.05 significance level and 4 degrees of freedom the table value is 9.488.
Since the calculated chi-square value is greater than the table value we reject
the null hypothesis that the frequency of telephone calls follows Poisson
distribution.
9) The following table gives the number of telephone calls attended by a credit card
information attendant.
Day                      Sunday  Monday  Tuesday  Wednesday  Thursday  Friday  Saturday
No. of calls attended    45      50      24       36         33        27      42
Test whether the telephone calls are uniformly distributed. Use the 0.10
significance level.
10) The following data give the preference of car makes by type of customer.
(a) Test the independence of the two attributes. Use 0.05 level of significance.
(b) Draw your conclusions.
11) A bath soap manufacturer introduced a new brand of soap in 4 colours. The
following data gives information on the consumer preference of the brand.
Good 20 10 20 30 80
Fair 20 10 10 30 70
Poor 10 45 35 10 100
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
17.13 FURTHER READING
A number of good textbooks are available on the topics dealt with in this unit. The
following books may be used for more in-depth study.
Appendix Table-1 Binomial Probabilities
p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
2 0 .980 .902 .810 .723 .640 .563 .490 .423 .360 .303 .250 .203 .160 .123 .090 .063 .040 .023 .010 .002
1 .020 .095 .180 .255 .320 .375 .420 .455 .480 .495 .500 .495 .480 .455 .420 .375 .320 .255 .180 .095
2 .000 .002 .010 .023 .040 .063 .090 .123 .160 .203 .250 .303 .360 .423 .490 .563 .640 .723 .810 .902
3 0 .970 .857 .729 .614 .512 .422 .343 .275 .216 .166 .125 .091 .064 .043 .027 .016 .008 .003 .001 .000
1 .029 .135 .243 .325 .384 .422 .441 .444 .432 .408 .375 .334 .288 .239 .189 .141 .096 .057 .027 .007
2 .000 .007 .027 .057 .096 .141 .189 .239 .288 .334 .375 .408 .432 .444 .441 .422 .384 .325 .243 .135
3 .000 .000 .001 .003 .008 .016 .027 .043 .064 .091 .125 .166 .216 .275 .343 .422 .512 .614 .729 .857
4 0 .961 .815 .656 .522 .410 .316 .240 .179 .130 .092 .062 .041 .026 .015 .008 .004 .002 .001 .000 .000
1 .039 .171 .292 .368 .410 .422 .412 .384 .346 .300 .250 .200 .154 .112 .076 .047 .026 .011 .004 .000
2 .001 .014 .049 .098 .154 .211 .265 .311 .346 .368 .375 .368 .346 .311 .265 .211 .154 .098 .049 .014
3 .000 .000 .004 .011 .026 .047 .076 .112 .154 .200 .250 .300 .346 .384 .412 .422 .410 .368 .292 .171
4 .000 .000 .000 .001 .002 .004 .008 .015 .026 .041 .062 .092 .130 .179 .240 .316 .410 .522 .656 .815
5 0 .951 .774 .590 .444 .328 .237 .168 .116 .078 .050 .031 .019 .010 .005 .002 .001 .000 .000 .000 .000
1 .048 .204 .328 .392 .410 .396 .360 .312 .259 .206 .156 .113 .077 .049 .028 .015 .006 .002 .000 .000
2 .001 .021 .073 .138 .205 .264 .309 .336 .346 .337 .312 .276 .230 .181 .132 .088 .051 .024 .008 .001
3 .000 .001 .008 .024 .051 .088 .132 .181 .230 .276 .312 .337 .346 .336 .309 .264 .205 .138 .073 .021
4 .000 .000 .000 .002 .006 .015 .028 .049 .077 .113 .156 .206 .259 .312 .360 .396 .410 .392 .328 .204
5 .000 .000 .000 .000 .000 .001 .002 .005 .010 .019 .031 .050 .078 .116 .168 .237 .328 .444 .590 .774
6 0 .941 .735 .531 .377 .262 .178 .118 .075 .047 .028 .016 .008 .004 .002 .001 .000 .000 .000 .000 .000
1 .057 .232 .354 .399 .393 .356 .303 .244 .187 .136 .094 .061 .037 .020 .010 .004 .002 .000 .000 .000
2 .001 .031 .098 .176 .246 .297 .324 .328 .311 .278 .234 .186 .138 .095 .060 .033 .015 .006 .001 .000
3 .000 .002 .015 .042 .082 .132 .185 .236 .276 .303 .312 .303 .276 .236 .185 .132 .082 .042 .015 .002
4 .000 .000 .001 .006 .015 .033 .060 .095 .138 .186 .234 .278 .311 .328 .324 .297 .246 .176 .098 .031
5 .000 .000 .000 .000 .002 .004 .010 .020 .037 .061 .094 .136 .187 .244 .303 .356 .393 .399 .354 .232
6 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .016 .028 .047 .075 .118 .178 .262 .377 .531 .735
7 0 .932 .698 .478 .321 .210 .133 .082 .049 .028 .015 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000
1 .066 .257 .372 .396 .367 .311 .247 .185 .131 .087 .055 .032 .017 .008 .004 .001 .000 .000 .000 .000
2 .002 .041 .124 .210 .275 .311 .318 .299 .261 .214 .164 .117 .077 .047 .025 .012 .004 .001 .000 .000
3 .000 .004 .023 .062 .115 .173 .227 .268 .290 .292 .273 .239 .194 .144 .097 .058 .029 .011 .003 .000
4 .000 .000 .003 .011 .029 .058 .097 .144 .194 .239 .273 .292 .290 .268 .227 .173 .115 .062 .023 .004
5 .000 .000 .000 .001 .004 .012 .025 .047 .077 .117 .164 .214 .261 .299 .318 .311 .275 .210 .124 .041
6 .000 .000 .000 .000 .000 .001 .004 .008 .017 .032 .055 .087 .131 .185 .247 .311 .367 .396 .372 .257
7 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .015 .028 .049 .082 .133 .210 .321 .478 .698
8 0 .923 .663 .430 .272 .168 .100 .058 .032 .017 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000 .000
1 .075 .279 .383 .385 .336 .267 .198 .137 .090 .055 .031 .016 .008 .003 .001 .000 .000 .000 .000 .000
2 .003 .051 .149 .238 .294 .311 .296 .259 .209 .157 .109 .070 .041 .022 .010 .004 .001 .000 .000 .000
3 .000 .005 .033 .084 .147 .208 .254 .279 .279 .257 .219 .172 .124 .081 .047 .023 .009 .003 .000 .000
4 .000 .000 .005 .018 .046 .087 .136 .188 .232 .263 .273 .263 .232 .188 .136 .087 .046 .018 .005 .000
5 .000 .000 .000 .003 .009 .023 .047 .081 .124 .172 .219 .257 .279 .279 .254 .208 .147 .084 .033 .005
6 .000 .000 .000 .000 .001 .004 .010 .022 .041 .070 .109 .157 .209 .259 .296 .311 .294 .238 .149 .051
7 .000 .000 .000 .000 .000 .000 .001 .003 .008 .016 .031 .055 .090 .137 .198 .267 .336 .385 .383 .279
8 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .017 .032 .058 .100 .168 .272 .430 .663
Appendix Table-1 Binomial Probabilities (continued)
p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
9 0 .914 .630 .387 .232 .134 .075 .040 .021 .010 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000
1 .083 .299 .387 .368 .302 .225 .156 .100 .060 .034 .018 .008 .004 .001 .000 .000 .000 .000 .000 .000
2 .003 .063 .172 .260 .302 .300 .267 .216 .161 .111 .070 .041 .021 .010 .004 .001 .000 .000 .000 .000
3 .000 .008 .045 .107 .176 .234 .267 .272 .251 .212 .164 .116 .074 .042 .021 .009 .003 .001 .000 .000
4 .000 .001 .007 .028 .066 .117 .172 .219 .251 .260 .246 .213 .167 .118 .074 .039 .017 .005 .001 .000
5 .000 .000 .001 .005 .017 .039 .074 .118 .167 .213 .246 .260 .251 .219 .172 .117 .066 .028 .007 .001
6 .000 .000 .000 .001 .003 .009 .021 .042 .074 .116 .164 .212 .251 .272 .267 .234 .176 .107 .045 .008
7 .000 .000 .000 .000 .000 .001 .004 .010 .021 .041 .070 .111 .161 .216 .267 .300 .302 .260 .172 .063
8 .000 .000 .000 .000 .000 .000 .000 .001 .004 .008 .018 .034 .060 .100 .156 .225 .302 .368 .387 .299
9 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .010 .021 .040 .075 .134 .232 .387 .630
10 0 .904 .599 .349 .197 .107 .056 .028 .014 .006 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .091 .315 .387 .347 .268 .188 .121 .072 .040 .021 .010 .004 .002 .000 .000 .000 .000 .000 .000 .000
2 .004 .075 .194 .276 .302 .282 .233 .176 .121 .076 .044 .023 .011 .004 .001 .000 .000 .000 .000 .000
3 .000 .010 .057 .130 .201 .250 .267 .252 .215 .166 .117 .075 .042 .021 .009 .003 .001 .000 .000 .000
4 .000 .001 .011 .040 .088 .146 .200 .238 .251 .238 .205 .160 .111 .069 .037 .016 .006 .001 .000 .000
5 .000 .000 .001 .008 .026 .058 .103 .154 .201 .234 .246 .234 .201 .154 .103 .058 .026 .008 .001 .000
6 .000 .000 .000 .001 .006 .016 .037 .069 .111 .160 .205 .238 .251 .238 .200 .146 .088 .040 .011 .001
7 .000 .000 .000 .000 .001 .003 .009 .021 .042 .075 .117 .166 .215 .252 .267 .250 .201 .130 .057 .010
8 .000 .000 .000 .000 .000 .000 .001 .004 .011 .023 .044 .076 .121 .176 .233 .282 .302 .276 .194 .075
9 .000 .000 .000 .000 .000 .000 .000 .000 .002 .004 .010 .021 .040 .072 .121 .188 .268 .347 .387 .315
10 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .006 .014 .028 .056 .107 .197 .349 .599
11 0 .895 .569 .314 .167 .086 .042 .020 .009 .004 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .099 .329 .384 .325 .236 .155 .093 .052 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000
2 .005 .087 .213 .287 .295 .258 .200 .140 .089 .051 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000
3 .000 .014 .071 .152 .221 .258 .257 .225 .177 .126 .081 .046 .023 .010 .004 .001 .000 .000 .000 .000
4 .000 .001 .016 .054 .111 .172 .220 .243 .236 .206 .161 .113 .070 .038 .017 .006 .002 .000 .000 .000
5 .000 .000 .002 .013 .039 .080 .132 .183 .221 .236 .226 .193 .147 .099 .057 .027 .010 .002 .000 .000
6 .000 .000 .000 .002 .010 .027 .057 .099 .147 .193 .226 .236 .221 .183 .132 .080 .039 .013 .002 .000
7 .000 .000 .000 .000 .002 .006 .017 .038 .070 .113 .161 .206 .236 .243 .220 .172 .111 .054 .016 .001
8 .000 .000 .000 .000 .000 .001 .004 .010 .023 .046 .081 .126 .177 .225 .257 .258 .221 .152 .071 .014
9 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .051 .089 .140 .200 .258 .295 .287 .213 .087
10 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .052 .093 .155 .236 .325 .384 .329
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .004 .009 .020 .042 .086 .167 .314 .569
12 0 .886 .540 .282 .142 .069 .032 .014 .006 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .107 .341 .377 .301 .206 .127 .071 .037 .017 .008 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
2 .006 .099 .230 .292 .283 .232 .168 .109 .064 .034 .016 .007 .002 .001 .000 .000 .000 .000 .000 .000
3 .000 .017 .085 .172 .236 .258 .240 .195 .142 .092 .054 .028 .012 .005 .001 .000 .000 .000 .000 .000
4 .000 .002 .021 .068 .133 .194 .231 .237 .213 .170 .121 .076 .042 .020 .008 .002 .001 .000 .000 .000
5 .000 .000 .004 .019 .053 .103 .158 .204 .227 .223 .193 .149 .101 .059 .029 .011 .003 .001 .000 .000
6 .000 .000 .000 .004 .016 .040 .079 .128 .177 .212 .226 .212 .177 .128 .079 .040 .016 .004 .000 .000
Appendix Table-1 Binomial Probabilities (continued)
p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
7 .000 .000 .000 .001 .003 .011 .029 .059 .101 .149 .193 .223 .227 .204 .158 .103 .053 .019 .004 .000
8 .000 .000 .000 .000 .001 .002 .008 .020 .042 .076 .121 .170 .213 .237 .231 .194 .133 .068 .021 .002
9 .000 .000 .000 .000 .000 .000 .001 .005 .012 .028 .054 .092 .142 .195 .240 .258 .236 .172 .085 .017
10 .000 .000 .000 .000 .000 .000 .000 .001 .002 .007 .016 .034 .064 .109 .168 .232 .283 .292 .230 .099
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .008 .017 .037 .071 .127 .206 .301 .377 .341
12 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .006 .014 .032 .069 .142 .282 .540
15 0 .860 .463 .206 .087 .035 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .130 .366 .343 .231 .132 .067 .031 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
2 .009 .135 .267 .286 .231 .156 .092 .048 .022 .009 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
3 .000 .031 .129 .218 .250 .225 .170 .111 .063 .032 .014 .005 .002 .000 .000 .000 .000 .000 .000 .000
4 .000 .005 .043 .116 .188 .225 .219 .179 .127 .078 .042 .019 .007 .002 .001 .000 .000 .000 .000 .000
5 .000 .001 .010 .045 .103 .165 .206 .212 .186 .140 .092 .051 .024 .010 .003 .001 .000 .000 .000 .000
6 .000 .000 .002 .013 .043 .092 .147 .191 .207 .191 .153 .105 .061 .030 .012 .003 .001 .000 .000 .000
7 .000 .000 .000 .003 .014 .039 .081 .132 .177 .201 .196 .165 .118 .071 .035 .013 .003 .001 .000 .000
8 .000 .000 .000 .001 .003 .013 .035 .071 .118 .165 .196 .201 .177 .132 .081 .039 .014 .003 .000 .000
9 .000 .000 .000 .000 .001 .003 .012 .030 .061 .105 .153 .191 .207 .191 .147 .092 .043 .013 .002 .000
10 .000 .000 .000 .000 .000 .001 .003 .010 .024 .051 .092 .140 .186 .212 .206 .165 .103 .045 .010 .001
11 .000 .000 .000 .000 .000 .000 .001 .002 .007 .019 .042 .078 .127 .179 .219 .225 .188 .116 .043 .005
12 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .014 .032 .063 .111 .170 .225 .250 .218 .129 .031
13 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .009 .022 .048 .092 .156 .231 .286 .267 .135
14 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .031 .067 .132 .231 .343 .366
15 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .035 .087 .206 .463
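The binomial entries above can be reproduced directly from the probability mass function P(r) = C(n, r) p^r (1 − p)^(n − r). A minimal Python sketch for spot-checking the table (the function name and check values are our own illustration, not part of the original table source):

```python
from math import comb

def binom_pmf(n: int, r: int, p: float) -> float:
    """Binomial probability of exactly r successes in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Cross-check against tabulated entries (rounded to 3 decimals):
print(round(binom_pmf(15, 3, 0.10), 3))  # n = 15, r = 3, p = .10 -> 0.129
print(round(binom_pmf(12, 6, 0.50), 3))  # n = 12, r = 6, p = .50 -> 0.226
```

Any other cell of the table can be verified the same way by substituting its n, r and p.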
Appendix Table-2 Direct Values for Determining Poisson Probabilities
For a given value of µ, entry indicates the probability of obtaining a specified value of X.
µ
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679
1 0.0905 0.1637 0.2222 0.2681 0.3033 0.3293 0.3476 0.3595 0.3659 0.3679
2 0.0045 0.0164 0.0333 0.0536 0.0758 0.0988 0.1217 0.1438 0.1647 0.1839
3 0.0002 0.0011 0.0033 0.0072 0.0126 0.0198 0.0284 0.0383 0.0494 0.0613
4 0.0000 0.0001 0.0003 0.0007 0.0016 0.0030 0.0050 0.0077 0.0111 0.0153
5 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004 0.0007 0.0012 0.0020 0.0031
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0005
7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
µ
x 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
0 0.3329 0.3012 0.2725 0.2466 0.2231 0.2019 0.1827 0.1653 0.1496 0.1353
1 0.3662 0.3614 0.3543 0.3452 0.3347 0.3230 0.3106 0.2975 0.2842 0.2707
2 0.2014 0.2169 0.2303 0.2417 0.2510 0.2584 0.2640 0.2678 0.2700 0.2707
3 0.0738 0.0867 0.0998 0.1128 0.1255 0.1378 0.1496 0.1607 0.1710 0.1804
4 0.0203 0.0260 0.0324 0.0395 0.0471 0.0551 0.0636 0.0723 0.0812 0.0902
5 0.0045 0.0062 0.0084 0.0111 0.0141 0.0176 0.0216 0.0260 0.0309 0.0361
6 0.0008 0.0012 0.0018 0.0026 0.0035 0.0047 0.0061 0.0078 0.0098 0.0120
7 0.0001 0.0002 0.0003 0.0005 0.0008 0.0011 0.0015 0.0020 0.0027 0.0034
8 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0003 0.0005 0.0006 0.0009
9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002
µ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 0.1225 0.1108 0.1003 0.0907 0.0821 0.0743 0.0672 0.0608 0.0550 0.0498
1 0.2572 0.2438 0.2306 0.2177 0.2052 0.1931 0.1815 0.1703 0.1596 0.1494
2 0.2700 0.2681 0.2652 0.2613 0.2565 0.2510 0.2450 0.2384 0.2314 0.2240
3 0.1890 0.1966 0.2033 0.2090 0.2138 0.2176 0.2205 0.2225 0.2237 0.2240
4 0.0992 0.1082 0.1169 0.1254 0.1336 0.1414 0.1488 0.1557 0.1622 0.1680
5 0.0417 0.0476 0.0538 0.0602 0.0668 0.0735 0.0804 0.0872 0.0940 0.1008
6 0.0146 0.0174 0.0206 0.0241 0.0278 0.0319 0.0362 0.0407 0.0455 0.0504
7 0.0044 0.0055 0.0068 0.0083 0.0099 0.0118 0.0139 0.0163 0.0188 0.0216
8 0.0011 0.0015 0.0019 0.0025 0.0031 0.0038 0.0047 0.0057 0.0068 0.0081
9 0.0003 0.0004 0.0005 0.0007 0.0009 0.0011 0.0014 0.0018 0.0022 0.0027
10 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0004 0.0005 0.0006 0.0008
11 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002
12 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
µ
x 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
0 0.0450 0.0408 0.0369 0.0334 0.0302 0.0273 0.0247 0.0224 0.0202 0.0183
1 0.1397 0.1304 0.1217 0.1135 0.1057 0.0984 0.0915 0.0850 0.0789 0.0733
2 0.2165 0.2087 0.2008 0.1929 0.1850 0.1771 0.1692 0.1615 0.1539 0.1465
3 0.2237 0.2226 0.2209 0.2186 0.2158 0.2125 0.2087 0.2046 0.2001 0.1954
4 0.1734 0.1781 0.1823 0.1858 0.1888 0.1912 0.1931 0.1944 0.1951 0.1954
5 0.1075 0.1140 0.1203 0.1264 0.1322 0.1377 0.1429 0.1477 0.1522 0.1563
6 0.0555 0.0608 0.0662 0.0716 0.0771 0.0826 0.0881 0.0936 0.0989 0.1042
7 0.0246 0.0278 0.0312 0.0348 0.0385 0.0425 0.0466 0.0508 0.0551 0.0595
8 0.0095 0.0111 0.0129 0.0148 0.0169 0.0191 0.0215 0.0241 0.0269 0.0298
9 0.0033 0.0040 0.0047 0.0056 0.0066 0.0076 0.0089 0.0102 0.0116 0.0132
10 0.0010 0.0013 0.0016 0.0019 0.0023 0.0028 0.0033 0.0039 0.0045 0.0053
11 0.0003 0.0004 0.0005 0.0006 0.0007 0.0009 0.0011 0.0013 0.0016 0.0019
12 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006
13 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued…)
µ
x 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
0 0.0166 0.0150 0.0136 0.0123 0.0111 0.0101 0.0091 0.0082 0.0074 0.0067
1 0.0679 0.0630 0.0583 0.0540 0.0500 0.0462 0.0427 0.0395 0.0365 0.0337
2 0.1393 0.1323 0.1254 0.1188 0.1125 0.1063 0.1005 0.0948 0.0894 0.0842
3 0.1904 0.1852 0.1798 0.1743 0.1687 0.1631 0.1574 0.1517 0.1460 0.1404
4 0.1951 0.1944 0.1933 0.1917 0.1898 0.1875 0.1849 0.1820 0.1789 0.1755
5 0.1600 0.1633 0.1662 0.1687 0.1708 0.1725 0.1738 0.1747 0.1753 0.1755
6 0.1093 0.1143 0.1191 0.1237 0.1281 0.1323 0.1362 0.1398 0.1432 0.1462
7 0.0640 0.0686 0.0732 0.0778 0.0824 0.0869 0.0914 0.0959 0.1002 0.1044
8 0.0328 0.0360 0.0393 0.0428 0.0463 0.0500 0.0537 0.0575 0.0614 0.0653
9 0.0150 0.0168 0.0188 0.0209 0.0232 0.0255 0.0280 0.0307 0.0334 0.0363
10 0.0061 0.0071 0.0081 0.0092 0.0104 0.0118 0.0132 0.0147 0.0164 0.0181
11 0.0023 0.0027 0.0032 0.0037 0.0043 0.0049 0.0056 0.0064 0.0073 0.0082
12 0.0008 0.0009 0.0011 0.0014 0.0016 0.0019 0.0022 0.0026 0.0030 0.0034
13 0.0002 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013
14 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005
15 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002
µ
x 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
0 0.0061 0.0055 0.0050 0.0045 0.0041 0.0037 0.0033 0.0030 0.0027 0.0025
1 0.0311 0.0287 0.0265 0.0244 0.0225 0.0207 0.0191 0.0176 0.0162 0.0149
2 0.0793 0.0746 0.0701 0.0659 0.0618 0.0580 0.0544 0.0509 0.0477 0.0446
3 0.1348 0.1293 0.1239 0.1185 0.1133 0.1082 0.1033 0.0985 0.0938 0.0892
4 0.1719 0.1681 0.1641 0.1600 0.1558 0.1515 0.1472 0.1428 0.1383 0.1339
5 0.1753 0.1748 0.1740 0.1728 0.1714 0.1697 0.1678 0.1656 0.1632 0.1606
6 0.1490 0.1515 0.1537 0.1555 0.1571 0.1584 0.1594 0.1601 0.1605 0.1606
7 0.1086 0.1125 0.1163 0.1200 0.1234 0.1267 0.1298 0.1326 0.1353 0.1377
8 0.0692 0.0731 0.0771 0.0810 0.0849 0.0887 0.0925 0.0962 0.0998 0.1033
9 0.0392 0.0423 0.0454 0.0486 0.0519 0.0552 0.0586 0.0620 0.0654 0.0688
10 0.0200 0.0220 0.0241 0.0262 0.0285 0.0309 0.0334 0.0359 0.0386 0.0413
11 0.0093 0.0104 0.0116 0.0129 0.0143 0.0157 0.0173 0.0190 0.0207 0.0225
12 0.0039 0.0045 0.0051 0.0058 0.0065 0.0073 0.0082 0.0092 0.0102 0.0113
13 0.0015 0.0018 0.0021 0.0024 0.0028 0.0032 0.0036 0.0041 0.0046 0.0052
14 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013 0.0015 0.0017 0.0019 0.0022
15 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009
16 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003
17 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001
µ
x 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0
0 0.0022 0.0020 0.0018 0.0017 0.0015 0.0014 0.0012 0.0011 0.0010 0.0009
1 0.0137 0.0126 0.0116 0.0106 0.0098 0.0090 0.0082 0.0076 0.0070 0.0064
2 0.0417 0.0390 0.0364 0.0340 0.0318 0.0296 0.0276 0.0258 0.0240 0.0223
3 0.0848 0.0806 0.0765 0.0726 0.0688 0.0652 0.0617 0.0584 0.0552 0.0521
4 0.1294 0.1249 0.1205 0.1162 0.1118 0.1076 0.1034 0.0992 0.0952 0.0912
5 0.1579 0.1549 0.1519 0.1487 0.1454 0.1420 0.1385 0.1349 0.1314 0.1277
6 0.1605 0.1601 0.1595 0.1586 0.1575 0.1562 0.1546 0.1529 0.1511 0.1490
7 0.1399 0.1418 0.1435 0.1450 0.1462 0.1472 0.1480 0.1486 0.1489 0.1490
8 0.1066 0.1099 0.1130 0.1160 0.1188 0.1215 0.1240 0.1263 0.1284 0.1304
9 0.0723 0.0757 0.0791 0.0825 0.0858 0.0891 0.0923 0.0954 0.0985 0.1014
10 0.0441 0.0469 0.0498 0.0528 0.0558 0.0588 0.0618 0.0649 0.0679 0.0710
11 0.0245 0.0265 0.0285 0.0307 0.0330 0.0353 0.0377 0.0401 0.0426 0.0452
12 0.0124 0.0137 0.0150 0.0164 0.0179 0.0194 0.0210 0.0227 0.0245 0.0264
13 0.0058 0.0065 0.0073 0.0081 0.0089 0.0098 0.0108 0.0119 0.0130 0.0142
14 0.0025 0.0029 0.0033 0.0037 0.0041 0.0046 0.0052 0.0058 0.0064 0.0071
15 0.0010 0.0012 0.0014 0.0016 0.0018 0.0020 0.0023 0.0026 0.0029 0.0033
16 0.0004 0.0005 0.0005 0.0006 0.0007 0.0008 0.0010 0.0011 0.0013 0.0014
17 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006
18 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002
19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued…)
µ
x 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0
0 0.0008 0.0007 0.0007 0.0006 0.0006 0.0005 0.0005 0.0004 0.0004 0.0003
1 0.0059 0.0054 0.0049 0.0045 0.0041 0.0038 0.0035 0.0032 0.0029 0.0027
2 0.0208 0.0194 0.0180 0.0167 0.0156 0.0145 0.0134 0.0125 0.0116 0.0107
3 0.0492 0.0464 0.0438 0.0413 0.0389 0.0366 0.0345 0.0324 0.0305 0.0286
4 0.0874 0.0836 0.0799 0.0764 0.0729 0.0696 0.0663 0.0632 0.0602 0.0573
5 0.1241 0.1204 0.1167 0.1130 0.1094 0.1057 0.1021 0.0986 0.0951 0.0916
6 0.1468 0.1445 0.1420 0.1394 0.1367 0.1339 0.1311 0.1282 0.1252 0.1221
7 0.1489 0.1486 0.1481 0.1474 0.1465 0.1454 0.1442 0.1428 0.1413 0.1396
8 0.1321 0.1337 0.1351 0.1363 0.1373 0.1382 0.1388 0.1392 0.1395 0.1396
9 0.1042 0.1070 0.1096 0.1121 0.1144 0.1167 0.1187 0.1207 0.1224 0.1241
10 0.0740 0.0770 0.0800 0.0829 0.0858 0.0887 0.0914 0.0941 0.0967 0.0993
11 0.0478 0.0504 0.0531 0.0558 0.0585 0.0613 0.0640 0.0667 0.0695 0.0722
12 0.0283 0.0303 0.0323 0.0344 0.0366 0.0388 0.0411 0.0434 0.0457 0.0481
13 0.0154 0.0168 0.0181 0.0196 0.0211 0.0227 0.0243 0.0260 0.0278 0.0296
14 0.0078 0.0086 0.0095 0.0104 0.0113 0.0123 0.0134 0.0145 0.0157 0.0169
15 0.0037 0.0041 0.0046 0.0051 0.0057 0.0062 0.0069 0.0075 0.0083 0.0090
16 0.0016 0.0019 0.0021 0.0024 0.0026 0.0030 0.0033 0.0037 0.0041 0.0045
17 0.0007 0.0008 0.0009 0.0010 0.0012 0.0013 0.0015 0.0017 0.0019 0.0021
18 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
19 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0003 0.0004
20 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002
21 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001
µ
x 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
0 0.0003 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001
1 0.0025 0.0023 0.0021 0.0019 0.0017 0.0016 0.0014 0.0013 0.0012 0.0011
2 0.0100 0.0092 0.0086 0.0079 0.0074 0.0068 0.0063 0.0058 0.0054 0.0050
3 0.0269 0.0252 0.0237 0.0222 0.0208 0.0195 0.0183 0.0171 0.0160 0.0150
4 0.0544 0.0517 0.0491 0.0466 0.0443 0.0420 0.0398 0.0377 0.0357 0.0337
5 0.0882 0.0849 0.0816 0.0784 0.0752 0.0722 0.0692 0.0663 0.0635 0.0607
6 0.1191 0.1160 0.1128 0.1097 0.1066 0.1034 0.1003 0.0972 0.0941 0.0911
7 0.1378 0.1358 0.1338 0.1317 0.1294 0.1271 0.1247 0.1222 0.1197 0.1171
8 0.1395 0.1392 0.1388 0.1382 0.1375 0.1366 0.1356 0.1344 0.1332 0.1318
9 0.1256 0.1269 0.1280 0.1290 0.1299 0.1306 0.1311 0.1315 0.1317 0.1318
10 0.1017 0.1040 0.1063 0.1084 0.1104 0.1123 0.1140 0.1157 0.1172 0.1186
11 0.0749 0.0776 0.0802 0.0828 0.0853 0.0878 0.0902 0.0925 0.0948 0.0970
12 0.0505 0.0530 0.0555 0.0579 0.0604 0.0629 0.0654 0.0679 0.0703 0.0728
13 0.0315 0.0334 0.0354 0.0374 0.0395 0.0416 0.0438 0.0459 0.0481 0.0504
14 0.0182 0.0196 0.0210 0.0225 0.0240 0.0256 0.0272 0.0289 0.0306 0.0324
15 0.0098 0.0107 0.0116 0.0126 0.0136 0.0147 0.0158 0.0169 0.0182 0.0194
16 0.0050 0.0055 0.0060 0.0066 0.0072 0.0079 0.0086 0.0093 0.0101 0.0109
17 0.0024 0.0026 0.0029 0.0033 0.0036 0.0040 0.0044 0.0048 0.0053 0.0058
18 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019 0.0021 0.0024 0.0026 0.0029
19 0.0005 0.0005 0.0006 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014
20 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0005 0.0006
21 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003
22 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
µ
x 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0
0 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0000
1 0.0010 0.0009 0.0009 0.0008 0.0007 0.0007 0.0006 0.0005 0.0005 0.0005
2 0.0046 0.0043 0.0040 0.0037 0.0034 0.0031 0.0029 0.0027 0.0025 0.0023
3 0.0140 0.0131 0.0123 0.0115 0.0107 0.0100 0.0093 0.0087 0.0081 0.0076
4 0.0319 0.0302 0.0285 0.0269 0.0254 0.0240 0.0226 0.0213 0.0201 0.0189
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued…)
5 0.0581 0.0555 0.0530 0.0506 0.0483 0.0460 0.0439 0.0418 0.0398 0.0378
6 0.0881 0.0851 0.0822 0.0793 0.0764 0.0736 0.0709 0.0682 0.0656 0.0631
7 0.1145 0.1118 0.1091 0.1064 0.1037 0.1010 0.0982 0.0955 0.0928 0.0901
8 0.1302 0.1286 0.1269 0.1251 0.1232 0.1212 0.1191 0.1170 0.1148 0.1126
9 0.1317 0.1315 0.1311 0.1306 0.1300 0.1293 0.1284 0.1274 0.1263 0.1251
10 0.1198 0.1210 0.1219 0.1228 0.1235 0.1241 0.1245 0.1249 0.1250 0.1251
11 0.0991 0.1012 0.1031 0.1049 0.1067 0.1083 0.1098 0.1112 0.1125 0.1137
12 0.0752 0.0776 0.0799 0.0822 0.0844 0.0866 0.0888 0.0908 0.0928 0.0948
13 0.0526 0.0549 0.0572 0.0594 0.0617 0.0640 0.0662 0.0685 0.0707 0.0729
14 0.0342 0.0361 0.0380 0.0399 0.0419 0.0439 0.0459 0.0479 0.0500 0.0521
15 0.0208 0.0221 0.0235 0.0250 0.0265 0.0281 0.0297 0.0313 0.0330 0.0347
16 0.0118 0.0127 0.0137 0.0147 0.0157 0.0168 0.0180 0.0192 0.0204 0.0217
17 0.0063 0.0069 0.0075 0.0081 0.0088 0.0095 0.0103 0.0111 0.0119 0.0128
18 0.0032 0.0035 0.0039 0.0042 0.0046 0.0051 0.0055 0.0060 0.0065 0.0071
19 0.0015 0.0017 0.0019 0.0021 0.0023 0.0026 0.0028 0.0031 0.0034 0.0037
20 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019
21 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
22 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004
23 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
24 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001
µ
x 11 12 13 14 15 16 17 18 19 20
0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0010 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
3 0.0037 0.0018 0.0008 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
4 0.0102 0.0053 0.0027 0.0013 0.0006 0.0003 0.0001 0.0001 0.0000 0.0000
5 0.0224 0.0127 0.0070 0.0037 0.0019 0.0010 0.0005 0.0002 0.0001 0.0001
6 0.0411 0.0255 0.0152 0.0087 0.0048 0.0026 0.0014 0.0007 0.0004 0.0002
7 0.0646 0.0437 0.0281 0.0174 0.0104 0.0060 0.0034 0.0018 0.0010 0.0005
8 0.0888 0.0655 0.0457 0.0304 0.0194 0.0120 0.0072 0.0042 0.0024 0.0013
9 0.1085 0.0874 0.0661 0.0473 0.0324 0.0213 0.0135 0.0083 0.0050 0.0029
10 0.1194 0.1048 0.0859 0.0663 0.0486 0.0341 0.0230 0.0150 0.0095 0.0058
11 0.1194 0.1144 0.1015 0.0844 0.0663 0.0496 0.0355 0.0245 0.0164 0.0106
12 0.1094 0.1144 0.1099 0.0984 0.0829 0.0661 0.0504 0.0368 0.0259 0.0176
13 0.0926 0.1056 0.1099 0.1060 0.0956 0.0814 0.0658 0.0509 0.0378 0.0271
14 0.0728 0.0905 0.1021 0.1060 0.1024 0.0930 0.0800 0.0655 0.0514 0.0387
15 0.0534 0.0724 0.0885 0.0989 0.1024 0.0992 0.0906 0.0786 0.0650 0.0516
16 0.0367 0.0543 0.0719 0.0866 0.0960 0.0992 0.0963 0.0884 0.0772 0.0646
17 0.0237 0.0383 0.0550 0.0713 0.0847 0.0934 0.0963 0.0936 0.0863 0.0760
18 0.0145 0.0256 0.0397 0.0554 0.0706 0.0830 0.0909 0.0936 0.0911 0.0844
19 0.0084 0.0161 0.0272 0.0409 0.0557 0.0699 0.0814 0.0887 0.0911 0.0888
20 0.0046 0.0097 0.0177 0.0286 0.0418 0.0559 0.0692 0.0798 0.0866 0.0888
21 0.0024 0.0055 0.0109 0.0191 0.0299 0.0426 0.0560 0.0684 0.0783 0.0846
22 0.0012 0.0030 0.0065 0.0121 0.0204 0.0310 0.0433 0.0560 0.0676 0.0769
23 0.0006 0.0016 0.0037 0.0074 0.0133 0.0216 0.0320 0.0438 0.0559 0.0669
24 0.0003 0.0008 0.0020 0.0043 0.0083 0.0144 0.0226 0.0328 0.0442 0.0557
25 0.0001 0.0004 0.0010 0.0024 0.0050 0.0092 0.0154 0.0237 0.0336 0.0446
26 0.0000 0.0002 0.0005 0.0013 0.0029 0.0057 0.0101 0.0164 0.0246 0.0343
27 0.0000 0.0001 0.0002 0.0007 0.0016 0.0034 0.0063 0.0109 0.0173 0.0254
28 0.0000 0.0000 0.0001 0.0003 0.0009 0.0019 0.0038 0.0070 0.0117 0.0181
29 0.0000 0.0000 0.0001 0.0002 0.0004 0.0011 0.0023 0.0044 0.0077 0.0125
30 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0013 0.0026 0.0049 0.0083
32 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007 0.0015 0.0030 0.0054
32 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0004 0.0009 0.0018 0.0034
33 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0010 0.0020
34 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0012
35 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007
36 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004
37 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002
38 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
39 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
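Each entry of Table-2 is the Poisson probability P(X = x) = e^(−µ) µ^x / x!. A small Python sketch for spot-checking the table (our own illustration; the function name is ours):

```python
import math

def poisson_pmf(x: int, mu: float) -> float:
    """Poisson probability of observing exactly x events when the mean is mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

# Cross-check against tabulated entries (rounded to 4 decimals):
print(round(poisson_pmf(0, 1.0), 4))  # mu = 1.0, x = 0 -> 0.3679
print(round(poisson_pmf(2, 2.4), 4))  # mu = 2.4, x = 2 -> 0.2613
```

The same two lines of arithmetic underlie every column of the table; only µ changes.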
Appendix Table-3 Areas of a Standard Normal Probability Distribution Between the Mean and Positive Values of z.
[Figure: standard normal curve showing 0.4429 of the area between the mean and z = 1.58]
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998
3.6 .4998 .4998 .4998 .4999 .4999 .4999 .4999 .4999 .4999 .4999
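The entries of Table-3 — the area under the standard normal curve between the mean and a positive z — equal ½·erf(z/√2), so they can be recomputed from the standard error function. A minimal Python sketch (the function name is our own):

```python
import math

def area_mean_to_z(z: float) -> float:
    """Area under the standard normal curve between the mean (z = 0) and z > 0."""
    return 0.5 * math.erf(z / math.sqrt(2))

print(round(area_mean_to_z(1.58), 4))  # matches the illustrated value 0.4429
print(round(area_mean_to_z(1.00), 4))  # -> 0.3413
```

For a two-tailed probability, double the value; for a full cumulative probability, add 0.5.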
Appendix Table-4 Area in the Right Tail of a Chi-Square (χ²) Distribution
Degrees Area in right tail
of
freedom .99 .975 .95 .90 .80 .20 .10 .05 .025 .01
1 0.00016 0.00098 0.00393 0.0158 0.0642 1.642 2.706 3.841 5.024 6.635
2 0.0201 0.0506 0.103 0.211 0.446 3.219 4.605 5.991 7.378 9.210
3 0.115 0.216 0.352 0.584 1.005 4.642 6.251 7.815 9.348 11.345
4 0.297 0.484 0.711 1.064 1.649 5.989 7.779 9.488 11.143 13.277
5 0.554 0.831 1.145 1.610 2.343 7.289 9.236 11.071 12.833 15.086
6 0.872 1.237 1.635 2.204 3.070 8.558 10.645 12.592 14.449 16.812
7 1.239 1.690 2.167 2.833 3.822 9.803 12.017 14.067 16.013 18.475
8 1.646 2.180 2.733 3.490 4.594 11.030 13.362 15.507 17.535 20.090
9 2.088 2.700 3.325 4.168 5.380 12.242 14.684 16.919 19.023 21.666
10 2.558 3.247 3.940 4.865 6.179 13.442 15.987 18.307 20.483 23.209
11 3.053 3.816 4.575 5.578 6.989 14.631 17.275 19.675 21.920 24.725
12 3.571 4.404 5.226 6.304 7.807 15.812 18.549 21.026 23.337 26.217
13 4.107 5.009 5.892 7.042 8.634 16.985 19.812 22.362 24.736 27.688
14 4.660 5.629 6.571 7.790 9.467 18.151 21.064 23.685 26.119 29.141
15 5.229 6.262 7.261 8.547 10.307 19.311 22.307 24.996 27.488 30.578
16 5.812 6.908 7.962 9.312 11.152 20.465 23.542 26.296 28.845 32.000
17 6.408 7.564 8.672 10.085 12.002 21.615 24.769 27.587 30.191 33.409
18 7.015 8.231 9.390 10.865 12.857 22.760 25.989 28.869 31.526 34.805
19 7.633 8.907 10.117 11.651 13.716 23.900 27.204 30.144 32.852 36.191
20 8.260 9.591 10.851 12.443 14.578 25.038 28.412 31.410 34.170 37.566
21 8.897 10.283 11.591 13.240 15.445 26.171 29.615 32.671 35.479 38.932
22 9.542 10.982 12.338 14.041 16.314 27.301 30.813 33.924 36.781 40.289
23 10.196 11.689 13.091 14.848 17.187 28.429 32.007 35.172 38.076 41.638
24 10.856 12.401 13.848 15.658 18.062 29.553 33.196 36.415 39.364 42.980
25 11.524 13.120 14.611 16.473 18.940 30.675 34.382 37.652 40.647 44.314
26 12.198 13.844 15.379 17.292 19.820 31.795 35.563 38.885 41.923 45.642
27 12.879 14.573 16.151 18.114 20.703 32.912 36.741 40.113 43.195 46.963
28 13.565 15.308 16.928 18.939 21.588 34.027 37.916 41.337 44.461 48.278
29 14.256 16.047 17.708 19.768 22.475 35.139 39.087 42.557 45.722 49.588
30 14.953 16.791 18.493 20.599 23.364 36.250 40.256 43.773 46.979 50.892
Source: From Table IV of Fisher and Yates, Statistical Tables for Biological, Agricultural and Medical Research, published by Longman Group Ltd (previously published by Oliver and Boyd, Edinburgh, 1963).
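Conversely, given a critical value from Table-4, the right-tail area can be recovered numerically, because a chi-square variable with m degrees of freedom is a Gamma(m/2) variable on the scale x/2. The sketch below uses the standard power-series expansion of the regularized lower incomplete gamma function; the function name and tolerance are our own choices:

```python
import math

def chi2_right_tail(x: float, df: int) -> float:
    """Right-tail area P(chi-square with df degrees of freedom > x), computed
    via the series for the regularized lower incomplete gamma P(a, t),
    with a = df/2 and t = x/2."""
    a, t = df / 2.0, x / 2.0
    if t <= 0.0:
        return 1.0
    term = 1.0 / a        # first term of the series t^n / (a (a+1) ... (a+n))
    total = term
    n = 0
    while term > 1e-14 * total:
        n += 1
        term *= t / (a + n)
        total += term
    p_lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return 1.0 - p_lower

print(round(chi2_right_tail(3.841, 1), 3))   # df = 1, critical value 3.841 -> 0.05
print(round(chi2_right_tail(18.307, 10), 3))  # df = 10, critical value 18.307 -> 0.05
```

Reading the function the other way round reproduces the table: the critical value in the .05 column is the x at which this right-tail area equals 0.05.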
Appendix Table-5 Table of t (One Tail Area)
[Figure: t-distribution curve with the right-tail area α beyond tα]
Values of tα, m
UNIT 18 INTERPRETATION OF STATISTICAL DATA
STRUCTURE
18.0 Objectives
18.1 Introduction
18.2 Meaning of Interpretation
18.3 Why Interpretation?
18.4 Essentials for Interpretation
18.5 Precautions in Interpretation
18.6 Concluding Remarks on Interpretation
18.7 Conclusions and Generalizations
18.8 Methods of Generalization
18.8.1 Logical Method
18.8.2 Statistical Method
18.9 Statistical Fallacies
18.10 Conclusions
18.11 Let Us Sum Up
18.12 Key Words
18.13 Answers to Self Assessment Exercises
18.14 Terminal Questions
18.15 Further Reading
18.0 OBJECTIVES
After studying this unit, you should be able to:
• define interpretation,
• explain the need for interpretation,
• state the essentials for interpretation,
• narrate the precautions to be taken before interpretation,
• describe conclusions and generalizations,
• explain the methods of generalization, and
• illustrate statistical fallacies.
18.1 INTRODUCTION
We have studied in the previous units the various methods applied in the
collection and analysis of statistical data. Statistics are not an end in themselves
but they are a means to an end, the end being to draw certain conclusions
from them. This has to be done very carefully, otherwise misleading conclusions
may be drawn and the whole purpose of doing research may get vitiated.
a) The data are homogeneous: It is necessary to ascertain that the data are
strictly comparable. We must be careful to compare the like with the like and
not with the unlike.
b) The data are adequate: Sometimes it happens that the data are incomplete
or insufficient and it is neither possible to analyze them scientifically nor is it
possible to draw any inference from them. Such data must be completed
first.
c) The data are suitable: Before considering the data for interpretation, the
researcher must confirm the required degree of suitability of the data.
Inappropriate data are like no data. Hence, no conclusion is possible with unsuitable data.
1) Interpretation means:
....................................................................................................................
....................................................................................................................
....................................................................................................................
In everyday life, we often make generalizations. We believe that what is true of the observed instances will be true of the unobserved instances. Since we have had a uniform experience, we expect that we shall have it even in the future. We are quite conscious of the fact that the observed instances do not constitute all the members of the class concerned. But we have a tendency to generalize. A generalization is a statement whose scope is wider than the available evidence. For example: A is a crow, it is black. B is a crow, it is black. C is a crow, it is also black. Therefore, it can be generalized that "all crows are black". Similarly, all swans are white; all rose plants possess thorns; and so on. The process by which such generalizations are made is known as induction by simple enumeration.
This method was first introduced by John Stuart Mill, who said that generalization should be based on logical processes. Mill thought that discovering causal connections is the fundamental task in generalization. If causal connections hold good, generalization can be done with confidence. Mill gave five methods of experimental enquiry, which serve the purpose of discovering causal connections. These methods are as follows.
A+B+C Produce X
A+P+Q Produce X
M + N + Non-A Produce Non-X
G + H + Non-A Produce Non-X
∴ A and X are causally connected.
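The elimination behind the schema above — finding the one antecedent present in every case that produced X and absent from every case that did not — can be sketched as simple set operations. The instance data below are the hypothetical antecedent sets from the schema itself:

```python
# Instances where the effect X occurred, each listed with its antecedents.
positives = [{"A", "B", "C"}, {"A", "P", "Q"}]
# Instances where A was absent and X did not occur.
negatives = [{"M", "N"}, {"G", "H"}]

# Agreement: the antecedent common to every positive instance.
common = set.intersection(*positives)

# Difference: that antecedent is missing from every negative instance.
absent_everywhere = all(common.isdisjoint(case) for case in negatives)

print(common, absent_everywhere)  # {'A'} True
```

With both conditions satisfied, A is the only candidate left standing as the cause of X — which is exactly the conclusion the schema draws.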
Interpretation and Reporting
iv) The Method of Residues: This method is based on the principle of elimination. The statement of this method is: subtract from any phenomenon such part as is known by previous inductions to be the effect of certain antecedents, and the residue of the phenomenon is the effect of the remaining antecedents. For example: a loaded lorry weighs 11 tons. The dead weight of the lorry is 1 ton. The weight of the load = 11 – 1 = 10 tons.
i) Collection of Data: The facts pertaining to the problem under study are to be collected either by the survey method, by the observation method, by experiment, or from a library. (This was discussed in Unit 3.)
iii) Analysis of Data: The processed data should then be properly analyzed with the help of statistical tools, such as measures of central tendency, measures of variation, measures of skewness, correlation, time series, index numbers, etc. (This was discussed in Units 8, 9, 10, 11 and 12 of this course.)
....................................................................................................................
5) Fill in the blanks with appropriate word(s):
i) Extending the conclusion from observed instances to unobserved instances
is also called ____________.
ii) Logical method is associated with the name of ______________.
Unconscious bias is even more insidious. Perhaps all statistical reports contain some unconscious bias, since statistical results are, after all, interpreted by human beings. Each person may look at things in terms of his own experience and his attitude towards the problem under study. People suffer from several inhibitions, prejudices, ideologies and hardened attitudes. They cannot help reflecting these in their interpretation of results. For example: a pessimist will see the future as being dark, whereas an optimist may see it as being bright.
Failure to Comprehend the Data: Very often figures are interpreted without comprehending the total background of the data, and this may lead to wrong conclusions. For example, consider the following interpretations:
– The death rate in the army is 9 per thousand, whereas in the city of Delhi it is 15 per thousand. Therefore, it is safer to be in the army than in the city.
– Most of the patients who were admitted to the intensive care (IC) ward of a hospital died. Therefore, it is unsafe to be admitted to the intensive care ward of that hospital.
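The army/city comparison fails because the two populations are not homogeneous: the army consists almost entirely of healthy young adults, while the city contains all age groups. A small sketch with purely hypothetical age-specific rates shows how identical risks within each age group can still yield very different crude rates:

```python
# Hypothetical age-specific death rates per thousand,
# assumed identical in both populations.
rates = {"young": 4.0, "old": 30.0}

army = {"young": 1000, "old": 0}   # hypothetical: only young adults
city = {"young": 600, "old": 400}  # hypothetical: mixed ages

def crude_rate(population):
    """Overall deaths per thousand for the whole population."""
    total = sum(population.values())
    return sum(population[g] * rates[g] for g in population) / total

print(crude_rate(army))  # 4.0  -- looks "safer"
print(crude_rate(city))  # 14.4 -- same per-group risk, but an older population
```

The lower crude rate for the army reflects its age composition, not a lower risk of death — exactly the error the two quoted interpretations commit.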
18.10 CONCLUSIONS
Statistical methods and techniques are only tools. As such, they may very often be misused. Some people believe that "figures can prove anything". "Figures don't lie, but liars can figure." Some people regard statistics as the worst type of lies. That is why it is said that "an ounce of truth can be produced from tons of statistics". Mere quantitative results, or a huge body of data, without any definite purpose, can never help to explain anything. The misuse of statistics may arise due to:
i) analysis without any definite purpose, and
ii) carelessness or bias in the collection and interpretation of data.
As a principle, statistics cannot prove anything, but they can be made to prove anything, because statistics are like clay with which one can make God or the Devil. The fault lies not with statistics but with the person who is using statistics. The interpreter must carefully look into these points before he sets about the task of interpretation. We may conclude with the words of Marshall, who said: "Statistical arguments are often misleading at first, but free discussion clears away statistical fallacies."
Note: These questions/exercises will help you to understand the unit better. Try
to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
18.15 FURTHER READING
The following textbooks may be used for more in-depth study of the topics dealt with in this unit.
B.N. Gupta, Statistics, Sahitya Bhavan, Agra.
S.P. Gupta, Statistical Methods, Sultan Chand & Sons, New Delhi.
B.N. Agarwal, Basic Statistics, Wiley Eastern Ltd.
P. Saravanavel, Research Methodology, Kitab Mahal, Allahabad.
C.R. Kothari, Research Methodology (Methods and Techniques), New Age International Pvt. Ltd, New Delhi.
UNIT 19 REPORT WRITING
STRUCTURE
19.0 Objectives
19.1 Introduction
19.2 Purpose of a Report
19.3 Meaning
19.4 Types of Reports
19.5 Stages in Preparation of a Report
19.6 Characteristics of a Good Report
19.7 Structure of the Research Report
19.7.1 Prefactory Items
19.7.2 The Text/Body of the Report
19.7.3 Terminal Items
19.8 Check List for the Report
19.9 Let Us Sum Up
19.10 Key Words
19.11 Answers to Self Assessment Exercises
19.12 Terminal Questions
19.13 Further Reading
19.0 OBJECTIVES
After going through this unit, you should be able to:
l define a Report,
l explain the need for reporting,
l discuss the subject matter of various types of reports,
l identify the stages in preparation of a report,
l explain the characteristics of a good report,
l explain different parts of a report, and
l distinguish between a good and bad report.
19.1 INTRODUCTION
The final phase of the journey in research is the writing of the report.
After the collected data has been analyzed and interpreted and generalizations
have been drawn, the report has to be prepared. The task of research is
incomplete till the report is presented.
Writing a report is the last step in a research study and requires a set of skills
somewhat different from those called for in the earlier stages of
research. The researcher should accomplish this task with utmost care.
19.3 MEANING
Reporting simply means communicating or informing through reports. The
researcher has collected some facts and figures, analyzed them and arrived
at certain conclusions. He has to report these to the parties
interested. Therefore, “reporting is communicating the facts, data and information
through reports to the persons for whom such facts and data are collected and
compiled”.
A report is not a complete description of what has been done during the period
of survey/research. It is only a statement of the most significant facts that are
necessary for understanding the conclusions drawn by the investigator. Thus,
“a report, by definition, is simply an account”. The report is thus an account
describing the procedure adopted, the findings arrived at and the conclusions
drawn by the investigator of a problem.
b) Written Report: Written reports are more formal, authentic and popular.
Written reports may be of the following types:
i) Journalistic Report
ii) Business Report
iii) Project Report
iv) Dissertation
v) Enquiry Report (Commission Report), and
vi) Thesis
Designing the Final Outline of the Report: This is the second stage in writing
the report. Having understood the subject matter, the next step is structuring
the report, ordering its parts and sketching them. This stage can also be
called the planning and organization stage. Many ideas may pass through the
author’s mind, but unless he first makes his plan/sketch/design he will be unable
to achieve a harmonious succession and will not even know where to begin and
how to end. Better communication of research results is partly a matter of
language but mostly a matter of planning and organizing the report.
Preparation of the Rough Draft: The third stage is the writing/drafting of
the report. This is the most crucial stage for the researcher, as he/she now sits
down to write what he/she has done in the research study and what and
how he/she wants to communicate it. Here the clarity in
communicating/reporting is influenced by factors such as who the readers
are, how technical the problem is, the researcher’s hold over the facts and
techniques, the researcher’s command over language (communication skills),
the completeness of his/her notes and documentation, and the availability
of analyzed results. Depending on these factors, some authors may be able
to write the report in one or two drafts. Others, with less
command over language or less clarity about the problem and subject matter,
may take more time and have to prepare more drafts (first draft, second draft,
third draft and so on).
Finalization of the Report: This is the last and perhaps the most difficult
stage of all formal writing. It is easy to build the structure, but polishing and
giving the finishing touches take much more time. Take, for example, the
construction of a house: the work up to the roofing (structure) stage is very
quick, but finishing the building takes up a great deal of time.
i) It must be clear in informing the what, why, who, whom, when, where and how
of the research study.
ii) It should be neither too short nor too long. One should keep in mind the fact
that it should be long enough to cover the subject matter but short enough to
sustain the reader’s interest.
iii) It should be written in an objective style and simple language; correctness,
precision and clarity should be the watchwords of the scholar. Wordiness,
indirection and pompous language are barriers to communication.
iv) A good report must combine clear thinking, logical organization and sound
interpretation.
v) It should not be dull. It should be such as to sustain the reader’s interest.
vi) It must be accurate. Accuracy is one of the requirements of a report. It should
be factual with objective presentation. Exaggerations and superlatives should
be avoided.
vii) Clarity is another requirement of presentation. It is achieved by using familiar
words and unambiguous statements, explicitly defining new concepts and
unusual terms.
viii) Coherence is an essential part of clarity. There should be a logical flow of
ideas (i.e. continuity of thought) and sequence of sentences. Each sentence must
be linked with the others so as to move the thought smoothly.
ix) Readability is an important requirement of good communication. Even a
technical report should be easily understandable. Technicalities should be
translated into language understandable by the readers.
x) A research report should be prepared according to the best composition
practices. Ensure readability through proper paragraphing, short sentences,
illustrations, examples, section headings, use of charts, graphs and diagrams.
xi) Draw sound inferences/conclusions from the statistical tables. But don’t repeat
the tables in text (verbal) form.
xii) Footnote references should be in proper form. The bibliography should be
reasonably complete and in proper form.
xiii) The report must be attractive in appearance, neat and clean whether typed or
printed.
xiv) The report should be free from mistakes of all types, viz. language mistakes,
factual mistakes, spelling mistakes, calculation mistakes, etc.
The researcher should try to achieve these qualities in his report as far as possible.
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) What is meant by coherence?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
7. Table of contents
8. List of tables
9. List of graphs/charts/figures
10. List of cases, if any
11. Abstract or highlights (optional)
Let us discuss these items one by one in detail.
19.7.1 Prefatory Items
The various preliminaries to be included in the front pages of the report are
briefly narrated hereunder:
1) Title Page: The first page of the report is the title page. The title page should
carry a concise and adequately descriptive title of the research study, the name
of the author, the name of the institution to which it is submitted, and the date
of presentation.
4) Dedication: If the author wants to dedicate the work to whomsoever he/she
likes, he/she may do so.
7) List of Tables: The researcher will have collected a lot of data, analyzed it
and presented it in the form of tables. These tables may be listed chapter-wise,
with page numbers, for easy location and reference.
After the preliminary items, the body of the report is presented. It is the major
and main part of the report, consisting of the text and context chapters of the
study. Normally the body may be divided into three parts.
i) Introduction
Generally this is the first chapter in the body of the report. It is devoted to
introducing the theoretical background of the problem and the methodology
adopted for attacking the problem.
This is the major and main part of the report. It is divided into several chapters
depending upon the number of objectives of the study, each chapter being
devoted to presenting the results pertaining to some aspect. The chapters should
be well balanced, mutually related and arranged in a logical sequence. The results
should be reported as accurately and completely as possible, explaining their
bearing on the research questions and hypotheses.
Each chapter should be given an appropriate heading. Depending upon the need,
a chapter may also be divided into sections. The entire verbal presentation
should run in an independent stream and must be written according to best
composition rules. Each chapter should end with a summary and lead into the
next chapter with a smooth transition sentence.
While dealing with the subject matter of the text, the following aspects should
be taken care of:
1) Headings
2) Quotations
3) Footnotes
4) Exhibits
Centre Head. A centre head is typed in all capital letters. If the title is long,
the inverted pyramid style (i.e., the second line shorter than the first, the third
line shorter than the second) is used. All-caps headings are not underlined;
underlining is unnecessary because capital letters are enough to attract the
reader’s attention.
Example
Centre Subhead. The first letter of the first and the last word and all nouns,
adjectives, verbs and adverbs in the title are capitalized. Articles, prepositions
and conjunctions are not capitalized.
Example
Side Heads. Words in the side head are either written in all capitals or
capitalized as in the centre subhead, and underlined.
Example: Import Substitution and Export Promotion
Paragraph Head. Words in a paragraph head are capitalized as in the centre
subhead and underlined. At the end, a colon appears, and then the paragraph
starts.
Example: Import Substitution and Export Promotion: The Seventh Five-Year
Plan of India has attempted ……
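The capitalization rule for a centre subhead described above can be sketched in code. The following Python function is only an illustration of the rule, not a tool prescribed by this unit; the list of minor words (articles, prepositions, conjunctions) is an assumption and is far from exhaustive.

```python
# Sketch of the centre-subhead rule: capitalize the first and last words
# and all major words; leave articles, prepositions and conjunctions in
# lower case. MINOR_WORDS is an illustrative assumption, not a standard.

MINOR_WORDS = {"a", "an", "the", "and", "but", "or", "nor", "for",
               "of", "in", "on", "at", "to", "by", "with", "from"}

def centre_subhead(title: str) -> str:
    words = title.lower().split()
    result = []
    for i, word in enumerate(words):
        first_or_last = (i == 0 or i == len(words) - 1)
        if first_or_last or word not in MINOR_WORDS:
            result.append(word.capitalize())
        else:
            result.append(word)
    return " ".join(result)

print(centre_subhead("import substitution and export promotion"))
# -> Import Substitution and Export Promotion
```

Note that "and" stays lower case in the middle of the title, but any word, however minor, would be capitalized in the first or last position.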
2) Quotations
3) Foot Notes
Types of Footnotes: A footnote either indicates the source of a reference
or provides an explanation that is not important enough to include in the text.
In the traditional system, both kinds of footnotes are treated in the same form
and are included either at the bottom of the page or at the end of the chapter
or book.
In the modern system, explanatory footnotes are put at the bottom of the page
and are linked with the text with a footnote number. But source references are
incorporated within the text and are supplemented by a bibliographical note at
the end of the chapter or book.
Where to put the Footnote: Footnotes appear at the bottom of the page or
at the end of the chapter (before the appendices section).
b) In the text, Arabic numerals are used for footnoting. Each new chapter begins
with number 1.
c) The number is typed half a space above the line or within parentheses. No
space is given between the number and the word. No punctuation mark is used
after the number.
d) The number is placed at the end of a sentence or, if necessary to clarify the
meaning, at the end of the relevant word or phrase. Commonly, the number
appears after the last quotation mark. In an indented paragraph, the number
appears at the end of the last sentence in the quotation.
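The numbering convention in points (b) above, where Arabic numerals restart from 1 with each new chapter, can be sketched as a small counter. This is an illustrative sketch only; the class and method names are invented for the example.

```python
# Minimal sketch of per-chapter footnote numbering: Arabic numerals,
# restarting from 1 at the beginning of every new chapter.

class FootnoteCounter:
    def __init__(self):
        self.number = 0

    def new_chapter(self):
        """Each new chapter begins again with number 1."""
        self.number = 0

    def next(self) -> int:
        self.number += 1
        return self.number

fc = FootnoteCounter()
chapter_one = [fc.next(), fc.next(), fc.next()]  # 1, 2, 3
fc.new_chapter()
chapter_two = [fc.next(), fc.next()]             # numbering restarts: 1, 2
print(chapter_one, chapter_two)
```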
4) Exhibits
Tables:
Tables can be numbered consecutively within each chapter as 1.1, 1.2, 1.3, …,
wherein the first number refers to the chapter and the second number to the
table.
b) For the title and subtitle, all capital letters are used.
c) Abbreviations and symbols are not used in the title or subtitle.
1) Have the explanation and reference to the table been given in the text?
2) Is it essential to have the table for clarity and extra information?
3) Is the representation of the data comprehensive and understandable?
4) Is the table number correct?
5) Are the title and subtitle clear and concise?
6) Are the column headings clearly classified?
7) Are the row captions clearly classified?
8) Are the data accurately entered and represented?
9) Are the totals and other computations correct?
10) Has the source been given?
11) Have all the uncommon abbreviations been spelt out?
12) Have all footnote entries been made?
13) If column rules are used, have all rules been properly drawn?
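The thirteen-point checklist above lends itself to being kept as data, so that each table in a draft can be ticked off item by item. The sketch below is only one way of doing this; the wording of the keys paraphrases the checklist and the function name is invented.

```python
# Hedged sketch: the table checklist above represented as data, so the
# items still unresolved for a given table can be listed mechanically.

TABLE_CHECKLIST = [
    "Explanation and reference to the table given in the text",
    "Table essential for clarity and extra information",
    "Representation of the data comprehensive and understandable",
    "Table number correct",
    "Title and subtitle clear and concise",
    "Column headings clearly classified",
    "Row captions clearly classified",
    "Data accurately entered and represented",
    "Totals and other computations correct",
    "Source given",
    "Uncommon abbreviations spelt out",
    "Footnote entries made",
    "Column rules properly drawn",
]

def unresolved(checked: set) -> list:
    """Return the checklist items not yet ticked off for a table."""
    return [item for item in TABLE_CHECKLIST if item not in checked]

done = {"Table number correct", "Source given"}
print(len(unresolved(done)))  # 11 of the 13 items remain open
```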
1) Appendices
1) Original data
2) Long tables
3) Long quotations
4) Supportive legal decisions, laws and documents
5) Illustrative material
6) Extensive computations
7) Questionnaires and letters
8) Schedules or forms used in collecting data
9) Case studies / histories
10) Transcripts of interviews
2) Bibliographies
A bibliography contains the source of every reference cited in the footnotes and
any other relevant works that the author has consulted. It gives the reader an
idea of the literature available on the subject that has influenced or aided the
author.
Bibliographical Information: The following information must be given for
each bibliographical reference.
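A bibliography is conventionally arranged in alphabetical order of its entries. As a rough sketch, each reference can be kept as a record and formatted in the author, title, publisher, place style used in the Further Reading lists of this block; the field order here is an assumption, not a prescribed standard.

```python
# Sketch only: bibliography entries as records, formatted and then
# arranged alphabetically, as a bibliography conventionally is.

def bib_entry(author: str, title: str, publisher: str, place: str) -> str:
    return f"{author}, {title}, {publisher}, {place}."

entries = [
    bib_entry("S.P. Gupta", "Statistical Methods", "Sultan Chand & Sons", "New Delhi"),
    bib_entry("B.N. Gupta", "Statistics", "Sahitya Bhavan", "Agra"),
]
for line in sorted(entries):  # alphabetical ordering by author
    print(line)
```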
3) Glossary
4) Index
An index may be either a subject index or an author index. An author index
consists of the important names of persons discussed in the report, arranged in
alphabetical order. A subject index includes a detailed reference to all important
matters discussed in the report, such as places, events, definitions, concepts,
etc., presented in alphabetical order. An index is not generally included in
graduate/postgraduate students’ research reports. However, if the report is
prepared for publication or intended as a work of reference, an index is desirable.
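The alphabetical arrangement described above can be sketched by keeping the index as a mapping from each term to the pages on which it is discussed, then emitting the terms in sorted order. The terms and page numbers below are invented purely for illustration.

```python
# Sketch: a subject index as term -> pages, printed in alphabetical
# order of terms. All entries here are illustrative assumptions.

index = {
    "Footnotes": [27, 28],
    "Bibliography": [29],
    "Appendices": [28],
}
for term in sorted(index):  # alphabetical order, as an index requires
    pages = ", ".join(str(p) for p in index[term])
    print(f"{term}: {pages}")
```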
Typing Instructions: For typing a report, the following points should be kept in
mind.
Paper: Quarto-size (A4 size) white, thick, unruled paper is used.
Typing: Typing is done on only one side of the paper in double space.
Margins: Left side 1.5 inches, right side 0.5 inch, top and bottom 1.0 inch. But
on the first page of every major division (for example, at the beginning of a
chapter), give 3 inches of space at the top.
19.9 LET US SUM UP
The final stage of a research investigation is reporting. The research results,
findings and conclusions drawn have to be communicated. This can be done in
two ways, i.e. orally or in writing. Written reports are more popular and
authentic, even though oral reporting also has its place. Based on requirements,
reports can be of two types, viz. technical reports and popular reports.
The total structure of a report can be divided into three main parts.
19.11 ANSWERS TO SELF ASSESSMENT
EXERCISES
Self Assessment Exercise C
Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
19.13 FURTHER READING
The following textbooks may be used for a more in-depth study of the topics
dealt with in this unit.
1) V.P. Michael, Research Methodology in Management, Himalaya Publishing
House, Bombay.
2) O.R. Krishna Swamy, Methodology of Research in Social Sciences,
Himalaya Publishing House, Mumbai.
3) C.R. Kothari, Research Methodology, Wiley Eastern, New Delhi.
4) Berenson, Conrad and Raymond Cotton, Research and Report Writing for
Business and Economics, Random House, New York.