
Data Collection Methods in Statistics

The document discusses the systematic collection of data for statistical investigations, emphasizing the importance of data quality and the methods of data collection, which include primary and secondary data. It details various methods for collecting primary data such as direct personal interviews, indirect personal interviews, mailed questionnaires, and information from local agents, along with their merits and demerits. Additionally, it covers the design of questionnaires and schedules, measures of central tendency, and specific statistical measures like arithmetic mean, median, mode, and geometric mean.

Uploaded by

iitzraaj
Copyright
© © All Rights Reserved

Data Collection:

Statistical investigation is based on the systematic collection of data. The reliability of conclusions drawn from the
sample data depends to a great extent on the quality of the data. The systematic, planned and meaningful way of collecting
information is known as collection of data. The method of data collection depends on various aspects such as the
objective, scope and nature of the problem under study. Data can be collected from two main sources:

(i). primary data and (ii). secondary data

Primary Data: Primary data are statistical data which are collected for the first time and are original in nature. Primary
data are collected originally by the authorities who are required to collect them. The source from which primary data are
collected is called a primary source. Primary data are collected by field workers, investigators and enumerators. In India, the
sources of primary data include the Census of India published by the Government, the Reserve Bank of India Bulletin published
by the RBI, etc.

The primary data may be collected by any one of the following methods.

(i) Direct personal interview


(ii) Indirect personal interview
(iii) Mailed questionnaire method
(iv) Information from local agents and correspondents.

(i). Direct personal interview: According to this method, the investigator personally approaches each respondent and
gathers first-hand information. The reliability of data depends upon the training and attitude of the investigator and
supporting attitude of the respondent.

Merits:

1. In this method, the data obtained are original, accurate and exact.
2. This method yields more reliable information, since the investigator can clear up the doubts and misunderstandings
of the respondents.
3. Supplementary information can also be collected about the respondent’s personal characteristics and environment.
This helps in interpreting results.

Demerits:

1. The method is not suitable when the number of respondents is very large.
2. The method is costly and time-consuming.
3. Skilled investigators are required to collect the data.
4. The success of the survey depends on the personal qualities of the investigator.

(ii). Indirect personal interview: This method is used when the respondents are not willing to provide information directly.
When the field of investigation is very large, information about a large number of respondents can be obtained indirectly
from one person, such as a community leader or the head of an organisation. It is generally used by the C.B.I. and the police for the
collection of information.

Merits:

1. If the area of investigation is very large, then this method is suitable.


2. If the respondent is unwilling to give the information to the investigator personally, it may be collected
from a third person.

Demerits:

1. In the absence of direct contact between investigator and respondent, important information may be lost.
2. The information given by the third person may be biased.
3. The information collected from different persons may not be consistent or comparable.

(iii). Mailed questionnaire method: In this method, a set of questions is prepared and sent by mail to the respondents.
The respondents are expected to fill in the questionnaire and mail it back to the investigating agency. It is very useful when the
respondents are educated and the area of investigation is very wide.

Merits:

1. It is useful when the area is large.


2. It is useful when all the respondents are educated and aware.
3. The information collected by this method is free from the bias of investigators.

Demerits:

1. It is applicable only to educated respondents.


2. Some of the respondents may not return the questionnaire.
3. Some of the respondents may send incomplete questionnaires.

(iv). Information through local agents (or) correspondents: In this method, local agents or correspondents are appointed
in different parts of the area under investigation. These agents send the required information at regular intervals of time.
This method is generally used by newspapers.

Merits:

1. This method is quite ideal when information is needed from a wide area.
2. It is economical in terms of time and money.

Demerits:

1. The information may not be reliable.


2. The data may be affected by the bias of the local agents.

Secondary Data: Secondary data are data which were collected earlier by some agency and are used and analysed by
another for its own purposes. The sources of secondary data are of two kinds:

1. Published Sources, 2. Unpublished Sources.

1. Published sources: There are a number of national and international agencies which collect statistical data relating to
business, trade, labour, prices, consumption, production, agriculture, industry, income, health, population and a number of
socio-economic characteristics, and publish their findings in statistical reports on a regular basis, i.e., monthly, quarterly,
annually, etc. The following are some of the important published sources of secondary data.

(i). Government publications: The following are various government organisations which collect and publish statistical
data on various fields.

1. Central Statistical Organisation (CSO)

2. National Sample Survey Organisation (NSSO)

3. Office of the Registrar General and Census Commissioner of India, New Delhi

4. Directorate of Economics and Statistics

5. Labour Bureau, Ministry of Labour.

(ii). International publications: Various foreign governments and international agencies like UNO, World Bank,
International Monetary Fund (IMF) regularly publish reports on the data collected by them on various aspects.

(iii). Semi-official publications: Various local bodies such as District Boards, Municipal Corporations, Banking
Organisations, etc., publish periodicals providing information about vital events, socio-economic characteristics, etc.

(iv). Private publications: The following private publications may also be used as secondary sources of the data.

1. Publications of professional bodies like ISI (Indian Statistical Institute), CSIR, ICAR, NCERT, etc.,

2. Annual Reports of private banks

3. Information published in newspapers, books, magazines, etc.,

4. Reports prepared by research scholars of the university.

2. Unpublished sources: Sources such as diaries, letters, unpublished biographies and
autobiographies, etc., are called unpublished sources. Unpublished data may also be available with scholars, trade
associations and individuals.

Precautions for using secondary data: The following precautions must be taken before using secondary data.

1. The organisation must check whether the data are reliable and suitable for the statistical survey.
2. The investigating team should check whether the data are sufficient for the present investigation.

DESIGNING A QUESTIONNAIRE:

Questionnaire: Collection of data through questionnaires is the most popular method of collecting primary data. A
questionnaire is a well-prepared list of questions regarding the subject of the survey. In this method, a questionnaire is sent
to various respondents, who answer the questions and return the questionnaires. This method is extensively employed in
various economic and business surveys.

Merits:

1. This method is very economical when the universe is large and the area is wide.

2. The respondents can furnish well-considered answers, which leads to more accurate results.
3. The data may be collected conveniently from the rural and remote areas.
4. The data is more reliable.

Demerits:

1. Sometimes the respondents may not return the questionnaire.


2. Some questionnaires may not be filled in properly, and incomplete responses reduce efficiency.
3. This method cannot be used for illiterate respondents.
4. Once the questionnaires are sent to the respondents, the investigating agency cannot change or modify the
questions.
5. The method is not flexible. In case of inadequate or incomplete answers, it is difficult to obtain supplementary
information.
6. This method is likely to be the most time-consuming, since the respondents can take as much time as they wish
to return the questionnaire.

Features of a good questionnaire: In order to make the questionnaire more effective, it must be very carefully drafted.
The following are the qualities of a good questionnaire.

1. The size of the questionnaire should be as small as possible.


2. It should be simple, clear and unambiguous.
3. The questions should be brief.
4. The questions should be arranged in a logical order.
5. Questions should be dichotomous (i.e., yes or no type) or multiple choice, and should not require lengthy answers.
6. Questions of a sensitive or personal nature should be avoided.
7. Questions should preferably not be open-ended; appropriate answer choices should be given.

DESIGNING A SCHEDULE: In this method, a team of enumerators is selected, given special training, and appointed to
record the answers given by respondents. The difference between the schedule and questionnaire methods is that
schedules are filled in by enumerators whereas questionnaires are filled in by respondents. The enumerators explain clearly the objective
of the survey, the definitions of basic concepts and the rules to the respondents, and record their responses. Censuses around the
world are usually conducted using this method.

Merits:

1. This method can be used for illiterate population.


2. The data collected by this method is more accurate and reliable.
3. In this method, information given by respondents can be checked on the spot by cross questioning.
4. Non-response in this method is very little.
5. The identity of the respondent is known in this method whereas it is not clear in the case of mail questionnaire
method.

Demerits:

1. It is the most expensive method among all methods of collecting primary data.
2. It is more time consuming.
3. The success of the method mainly depends on the efficiency and skill of the enumerators.
4. The success of the method depends entirely on the preparation of the schedule.

MEASURES OF CENTRAL TENDENCY

Definition:

Measures of central tendency or measures of location give an idea about the concentration of the values in the
central part of the distribution. A measure of central tendency is a statistical average, that is, a single value which represents
the entire distribution.

The following are some important measures of central tendency

(i). Arithmetic Mean (ii). Median (iii). Mode (iv). Geometric Mean and (v). Harmonic mean

(i). Arithmetic Mean:

Arithmetic mean of a set of observations is the sum of all observations divided by the number of observations. It
is denoted by $\bar{X}$.

Mean for Ungrouped data:

If $x_1, x_2, \ldots, x_n$ are $n$ observations, then the mean is defined as

$$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where $n$ is the number of observations.
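
As a minimal sketch of the ungrouped formula, the following Python snippet divides the sum of the observations by their count. The data values and the function name `arithmetic_mean` are illustrative assumptions and do not come from the original notes.

```python
# Hypothetical observations, chosen only for illustration.
observations = [100, 108, 116, 124, 132]


def arithmetic_mean(values):
    """Return the sum of all observations divided by their number."""
    return sum(values) / len(values)


mean_value = arithmetic_mean(observations)
print(mean_value)
```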

Mean for grouped data:

(a). Mean for Discrete data:

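If the variable takes the value $x_i$ with frequency $f_i$, $i = 1, 2, \ldots, n$, the mean is defined as

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{n} f_i x_i, \quad \text{where } N = \sum_{i=1}^{n} f_i \text{ is the total frequency.}$$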
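
The discrete formula can be sketched in Python as a frequency-weighted sum divided by the total frequency. The values, frequencies and the name `grouped_mean` below are hypothetical, introduced only to illustrate the calculation.

```python
# Hypothetical discrete frequency distribution (value -> frequency).
values = [1, 2, 3, 4]
frequencies = [2, 3, 4, 1]


def grouped_mean(values, frequencies):
    """Return sum(f_i * x_i) / N, where N is the total frequency."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * x for x, f in zip(values, frequencies))
    return weighted_sum / total_frequency


discrete_mean = grouped_mean(values, frequencies)
print(discrete_mean)
```

Here the weighted sum is 24 and the total frequency is 10, so the mean is 2.4.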

(b). Mean for Continuous Data:

If class intervals and frequencies are given, then the following formula is used to calculate the mean:

$$\bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times C$$

where $A$ = assumed mean,

$d_i = \dfrac{x_i - A}{C}$, $x_i$ = mid value of the $i$-th class,

$N$ = total frequency $= \sum_{i=1}^{n} f_i$, and $C$ = class interval.
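
The following Python sketch applies this step-deviation formula to a small frequency table. The class limits, frequencies, choice of assumed mean and the function name are all assumptions made for illustration; any mid value may serve as the assumed mean without changing the result.

```python
# Hypothetical continuous frequency table with equal class width.
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 8, 15, 16, 6]


def step_deviation_mean(class_limits, frequencies, assumed_mean, class_width):
    """Return A + (sum(f_i * d_i) / N) * C with d_i = (x_i - A) / C."""
    midpoints = [(low + high) / 2 for low, high in class_limits]
    deviations = [(x - assumed_mean) / class_width for x in midpoints]
    total_frequency = sum(frequencies)
    sum_fd = sum(f * d for f, d in zip(frequencies, deviations))
    return assumed_mean + (sum_fd / total_frequency) * class_width


continuous_mean = step_deviation_mean(class_limits, frequencies,
                                      assumed_mean=25, class_width=10)
print(continuous_mean)
```

With these figures the sum of $f_i d_i$ is 10 and $N$ is 50, so the mean is 25 + (10/50) × 10 = 27, which matches the direct calculation from the mid values.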

Merits of Arithmetic Mean:

1. It is easy to define.
2. It is easy to understand and easy to calculate
3. It is based on all the observations.
4. It is suitable for further mathematical treatment
5. It is more accurate and more reliable.
6. It is more reliable for comparative purpose

Demerits of Arithmetic Mean:

1. Arithmetic mean cannot be used if we are dealing with qualitative characteristics which cannot be measured
quantitatively, for example, intelligence.
2. If a single observation is missing or lost, the arithmetic mean cannot be obtained accurately.
3. It is affected very much by extreme values.
4. It cannot be calculated for open-end classes.
5. It may lead to wrong conclusions if the details of the data from which the arithmetic mean is computed are not given.

(ii). Median:

Median of a distribution is the value of the variable which divides it into two equal parts, i.e., median is the middle
value of the distribution. Median is also called a ‘Positional Average’.

Median for Ungrouped data:

Arrange the given numbers in ascending or descending order. If the number of observations is odd, then the middle
value is treated as the median. If the number of observations is even, then the median is the arithmetic mean of the two middle terms.
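
A minimal Python sketch of this rule follows. The function name and the sample data are illustrative assumptions.

```python
def median_ungrouped(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data: one odd-sized and one even-sized set of observations.
odd_median = median_ungrouped([7, 3, 9, 1, 5])
even_median = median_ungrouped([7, 3, 9, 1])
print(odd_median, even_median)
```

For the first set the sorted data are 1, 3, 5, 7, 9 and the median is 5; for the second the two middle terms are 3 and 7, so the median is 5.0.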

Median for grouped data:

(a). Median for Discrete data:

(i). Arrange the values of the variable in ascending or descending order of magnitudes.

(ii). Find the cumulative frequency

(iii). Find $\dfrac{N}{2}$, where $N = \sum f$.

(iv). Find the cumulative frequency just greater than $\dfrac{N}{2}$ and determine the corresponding value of the variable.

(v). The value obtained in step (iv) above is the required median.
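
The steps above can be sketched in Python as follows. The data and the function name `discrete_median` are hypothetical; the loop simply accumulates frequencies until the running total first exceeds $N/2$.

```python
# Hypothetical discrete frequency distribution.
values = [1, 2, 3, 4, 5]
frequencies = [3, 5, 9, 6, 2]


def discrete_median(values, frequencies):
    """Return the value whose cumulative frequency is just greater than N/2."""
    half = sum(frequencies) / 2
    cumulative = 0
    # Step (i): process the values in ascending order.
    for x, f in sorted(zip(values, frequencies)):
        cumulative += f  # step (ii): running cumulative frequency
        if cumulative > half:  # step (iv): first one exceeding N/2
            return x
    raise ValueError("frequencies must contain at least one positive value")


median_value = discrete_median(values, frequencies)
print(median_value)
```

Here $N = 25$, $N/2 = 12.5$, and the cumulative frequencies are 3, 8, 17, 23, 25, so the median is the value 3.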

(b). Median for Continuous Data:

In this case the data are given in the form of a frequency table with class intervals, and the following formula is
used to calculate the median.

$$\text{Median } M = l + \left(\frac{\frac{N}{2} - cf}{f}\right) \times C$$

Where $l$ = lower limit of the median class

$N$ = total frequency $= \sum f$

$f$ = frequency of the median class

$cf$ = cumulative frequency of the class preceding the median class

$C$ = class interval
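
A hedged Python sketch of the formula is given below. It assumes the median class is the first class whose cumulative frequency reaches $N/2$ (the notes do not spell this out), and the table values and function name are invented for illustration.

```python
# Hypothetical frequency table: classes 0-10, 10-20, ..., 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 15, 16, 6]
class_width = 10


def grouped_median(lower_limits, frequencies, class_width):
    """Apply M = l + ((N/2 - cf) / f) * C to a continuous frequency table."""
    half = sum(frequencies) / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= half:
            # Median class found: `lower` is l, `freq` is f, `cumulative` is cf.
            return lower + (half - cumulative) / freq * class_width
        cumulative += freq
    raise ValueError("frequencies must contain at least one positive value")


continuous_median = grouped_median(lower_limits, frequencies, class_width)
print(continuous_median)
```

With $N = 50$ the median class is 20-30 ($l = 20$, $cf = 13$, $f = 15$), giving 20 + (12/15) × 10 = 28.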

Merits of Median:

1. Median is not influenced by extreme values because it is a positional average


2. Median can be calculated in case of distribution with open-end intervals.
3. Median can be located even if the data are incomplete.
4. Median can be located even for qualitative factors such as ability, honesty, etc.

Demerits of Median:

1. A slight change in the series may bring a considerable change in the median value.
2. In the case of an even number of items or a continuous series, the median is an estimated value which may not equal any value in the
series.
3. It is not suitable for further mathematical treatment, except for its use in mean deviation.
4. It does not take into account all the observations.

(iii). Mode:

A third measure of central tendency is the mode, defined simply as the value (or attribute) which occurs
most often, that is, with the highest frequency. It is applicable to both quantitative and qualitative data.

Mode for ungrouped data:

The calculation of mode is very easy. It depends upon the frequencies. The data, therefore, should be grouped in
a discrete or continuous series, and the item value with the highest frequency is the mode.

Mode for grouped data:

(a). Mode for discrete data:

In this case one can find mode by inspection. The variate value having the maximum frequency is the mode value.
(b). Mode for continuous data:

If the data are given with class intervals, then the following formula is used for the calculation of mode.

$$\text{Mode } Z = l + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times C$$

Where $l$ = lower limit of the mode class

$f_1$ = frequency of the mode class

$f_0$ = frequency of the class prior to the mode class

$f_2$ = frequency of the class subsequent to the mode class

$C$ = width of the mode class
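
The formula can be illustrated with the short Python sketch below. The frequency table and the function name are hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch rather than something stated in the notes.

```python
# Hypothetical frequency table: classes 0-10, 10-20, ..., 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 15, 16, 6]
class_width = 10


def grouped_mode(lower_limits, frequencies, class_width):
    """Apply Z = l + (f1 - f0) / (2*f1 - f0 - f2) * C to the class of highest frequency."""
    modal_index = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[modal_index]
    # Assumption: a neighbouring class that does not exist counts as frequency 0.
    f0 = frequencies[modal_index - 1] if modal_index > 0 else 0
    f2 = frequencies[modal_index + 1] if modal_index < len(frequencies) - 1 else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is ill-defined: neighbouring frequencies equal f1")
    return lower_limits[modal_index] + (f1 - f0) / denominator * class_width


mode_value = grouped_mode(lower_limits, frequencies, class_width)
print(mode_value)
```

Here the mode class is 30-40 with $f_1 = 16$, $f_0 = 15$ and $f_2 = 6$, so $Z = 30 + \frac{1}{11} \times 10 \approx 30.91$.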

Merits of Mode:

1. It is easy to calculate and in some cases it can be located by mere inspection.
2. Mode is not at all affected by extreme values.
3. It can be calculated for open-end classes.
4. It is usually an actual value of an important part of the series.
5. In some circumstances it is the best representative of data.

Demerits of Mode:

1. It is not based on all observations.


2. It is not capable of further mathematical treatment.
3. Mode is ill-defined in general; in some cases it is not possible to find a mode.
4. As compared with the mean, the mode is affected to a great extent by sampling fluctuations.
5. It is unsuitable in cases where relative importance of items has to be considered.

(iv). Geometric Mean: The Geometric mean of a set of n observations is the nth root of their product. It is denoted by G.

Geometric mean for ungrouped data:

If $x_1, x_2, \ldots, x_n$ are $n$ observations, then the geometric mean $G$ is calculated by

$$\log G = \frac{1}{n}\left(\log x_1 + \log x_2 + \cdots + \log x_n\right) = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$

or

$$G = \text{Antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$$

Where $n$ = number of observations.
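
The log and antilog route can be sketched in Python as below. Base-10 logarithms are used here as an assumption (any base gives the same result), and the data and function name are illustrative only.

```python
import math


def geometric_mean(values):
    """Return Antilog of the mean of log10(x_i); all values must be positive."""
    log_mean = sum(math.log10(x) for x in values) / len(values)
    return 10 ** log_mean  # antilog in base 10


# Hypothetical observations whose product is 64, so the cube root is 4.
gm_value = geometric_mean([2, 4, 8])
print(gm_value)
```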

Geometric mean for grouped data:

If the values $x_i$ occur with frequencies $f_i$, $i = 1, 2, \ldots, n$, in a grouped frequency distribution, the geometric mean $G$ is calculated by the following
formula.

$$\log G = \frac{1}{N}\left(f_1 \log x_1 + f_2 \log x_2 + \cdots + f_n \log x_n\right) = \frac{1}{N}\sum_{i=1}^{n} f_i \log x_i$$

$$G = \text{Antilog}\left(\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right)$$

Where $N = \sum_{i=1}^{n} f_i$
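
For a grouped distribution, a hedged Python sketch weights each logarithm by its frequency. The data, the function name and the use of base-10 logarithms are assumptions made for illustration.

```python
import math


def grouped_geometric_mean(values, frequencies):
    """Return Antilog((1/N) * sum(f_i * log10(x_i))), with N the total frequency."""
    total_frequency = sum(frequencies)
    weighted_log_sum = sum(f * math.log10(x) for x, f in zip(values, frequencies))
    return 10 ** (weighted_log_sum / total_frequency)


# Hypothetical distribution: 2 once, 4 twice, 8 once (product 256, N = 4).
grouped_gm = grouped_geometric_mean([2, 4, 8], [1, 2, 1])
print(grouped_gm)
```

The fourth root of 256 is 4, which is the value the sketch should return up to rounding error.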

Merits of Geometric Mean:

1. It is rigidly defined.
2. It is based on all items.
3. It is suitable for finding averages for ratios, rates and percentages.
4. It is capable of further mathematical treatment.

Demerits of Geometric Mean:

1. Geometric mean is not easy to calculate or understand for a non-mathematical person.
2. If any observation is zero, the geometric mean becomes zero, and if any observation is negative, it may become imaginary, regardless of
the magnitude of the other items.

(v). Harmonic Mean: Harmonic mean of a number of observations is the reciprocal of the arithmetic mean of the reciprocals
of the given values. It is denoted by H.

Harmonic mean for ungrouped data: Let $x_1, x_2, \ldots, x_n$ be $n$ observations, then the harmonic mean $H$ is given by

$$H = \frac{1}{\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{x_i}}, \quad \text{where } n = \text{number of observations.}$$

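Harmonic mean for a grouped frequency distribution: If the values $x_i$ occur with frequencies $f_i$, $i = 1, 2, \ldots, n$, in a grouped frequency distribution, then the
harmonic mean $H$ is calculated by

$$H = \frac{1}{\dfrac{1}{N}\sum_{i=1}^{n} \dfrac{f_i}{x_i}}, \quad \text{where } N = \sum_{i=1}^{n} f_i$$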
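
Both harmonic mean formulas can be covered by one small Python sketch, since the ungrouped case is the grouped case with every frequency equal to 1. The function name, the optional `frequencies` parameter and the data are hypothetical.

```python
def harmonic_mean(values, frequencies=None):
    """Return N / sum(f_i / x_i); with no frequencies given, every f_i is 1."""
    if frequencies is None:
        frequencies = [1] * len(values)
    total_frequency = sum(frequencies)
    reciprocal_sum = sum(f / x for x, f in zip(values, frequencies))
    return total_frequency / reciprocal_sum


# Hypothetical non-zero observations.
hm_ungrouped = harmonic_mean([1, 2, 4])
hm_grouped = harmonic_mean([1, 2, 4], [1, 2, 1])
print(hm_ungrouped, hm_grouped)
```

For the ungrouped data the sum of reciprocals is 1.75, so $H = 3/1.75 = 12/7$; for the grouped data the weighted sum is 2.25, so $H = 4/2.25 = 16/9$.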

Merits of Harmonic Mean:

1. It is rigidly defined.
2. It is based on all the observations.

3. It is suitable for further mathematical treatment.
4. It is not affected by fluctuations of sampling.
5. It gives greater importance to small items and is useful only when small items have to be given greater
weightage.

Demerits of Harmonic Mean:

1. It is not easy to understand.


2. It is difficult to calculate.
3. It gives greater importance to small items.

UNIT-II

MEASURES OF DISPERSION

INTRODUCTION:

Measures of central tendency help us to represent an entire set of data by a single value. Can the central
tendency describe the characteristics of the data fully? Consider the following two sets of observations, which have the same
mean and median, so that their averages do not differ.

Series      Observations               Total
Series-A    116  116  116  116  116    580
Series-B    100  108  116  124  132    580

However, it may be noticed that the observations in series-A are all the same, while the observations in series-B are widely
dissimilar. It can be said that the variability of series-B is more than that of series-A. Thus, averages alone are not sufficient
to study several other characteristics of data, and hence the necessity of another measure called the measure of dispersion.

Definition: The degree to which numerical data tend to spread about an average value is called the dispersion of the data.

The following are the important measures of dispersion.

(i). Range

(ii). Quartile Deviation

(iii). Mean deviation

(iv). Standard Deviation

(v). Coefficient of variation

Range:

It is the simplest measure of dispersion. It is the difference between the maximum and minimum value of the given
series.

Range (R)=Largest Value (L)-Smallest Value (S).
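
As a small illustration, the Python sketch below compares the two series from the introduction: it takes Series-A as five observations equal to 116 (consistent with its stated total of 580) and Series-B as listed. The function name `value_range` is an assumption of this sketch.

```python
from statistics import mean, median

# Series from the introduction; Series-A taken as five equal observations.
series_a = [116, 116, 116, 116, 116]
series_b = [100, 108, 116, 124, 132]


def value_range(values):
    """Return Range R = largest value L - smallest value S."""
    return max(values) - min(values)


for name, series in (("Series-A", series_a), ("Series-B", series_b)):
    print(name, mean(series), median(series), value_range(series))
```

Both series have mean 116 and median 116, but the range is 0 for Series-A and 32 for Series-B, which is the difference in variability that the averages alone do not show.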


Merits:

1. It is simple to understand
2. It is easy to calculate
3. It is useful in statistical quality control, forecasting, price analysis, etc.

Demerits:

1. It is based on only two observations.


2. It cannot be calculated for open end classes.
3. It is not suitable for further mathematical treatment.
4. It is much affected by the extreme values.
5. It is a very rarely used measure.

Quartile Deviation:

Quartile deviation is a measure of dispersion based on the Upper Quartile (Q3) and Lower Quartile (Q1) of a series.
It is half of the difference between the upper and lower quartiles.

Quartile deviation for Ungrouped data:

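$$\text{Quartile deviation } Q = \frac{Q_3 - Q_1}{2}$$

Where $Q_1$ = value of the $\left(\dfrac{n}{4}\right)$-th observation and $Q_3$ = value of the $\left(\dfrac{3n}{4}\right)$-th observation when the data are arranged in ascending order, and $n$ = number of observations.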
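
A minimal Python sketch of the ungrouped quartile deviation follows. It assumes that $Q_1$ and $Q_3$ are the $(n/4)$-th and $(3n/4)$-th observations of the sorted data and, to keep the positions whole, that $n$ is a multiple of 4; other textbooks use $(n+1)/4$ with interpolation, which this sketch does not attempt. The data and function name are hypothetical.

```python
def quartile_deviation(values):
    """Return (Q3 - Q1) / 2 using the (n/4)-th and (3n/4)-th sorted observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("this sketch assumes n is a positive multiple of 4")
    q1 = ordered[n // 4 - 1]      # (n/4)-th observation, counting from 1
    q3 = ordered[3 * n // 4 - 1]  # (3n/4)-th observation, counting from 1
    return (q3 - q1) / 2


# Hypothetical data with n = 8: Q1 is the 2nd value, Q3 is the 6th value.
qd_value = quartile_deviation([2, 4, 6, 8, 10, 12, 14, 16])
print(qd_value)
```

Here $Q_1 = 4$ and $Q_3 = 12$, so the quartile deviation is 4.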

Quartile deviation for grouped data:

(a). Quartile deviation for discrete data:

In this method, quartile deviation $Q = \dfrac{Q_3 - Q_1}{2}$

Where $Q_1 = \left(\dfrac{N+1}{4}\right)^{\text{th}}$ term

$Q_3 = 3\left(\dfrac{N+1}{4}\right)^{\text{th}}$ term, where $N$ = total frequency

(b). Quartile deviation for continuous data:

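In this method, quartile deviation $Q = \dfrac{Q_3 - Q_1}{2}$

Where

$$Q_1 = l + \frac{\frac{N}{4} - cf}{f} \times C, \qquad Q_3 = l + \frac{\frac{3N}{4} - cf}{f} \times C, \qquad N = \text{total frequency}$$

Here $l$, $cf$, $f$ and $C$ are defined as for the median, but with respect to the class containing the quartile concerned.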
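
The continuous-data quartiles can be sketched in Python as follows. The sketch assumes the quartile class is the first class whose cumulative frequency reaches $N/4$ (or $3N/4$); the frequency table and the helper name `grouped_quantile` are hypothetical.

```python
# Hypothetical frequency table: classes 0-10, 10-20, ..., 40-50.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 8, 15, 16, 6]
class_width = 10


def grouped_quantile(lower_limits, frequencies, class_width, position):
    """Return l + ((position - cf) / f) * C for the class containing `position`."""
    cumulative = 0  # cumulative frequency of the preceding classes (cf)
    for lower, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= position:
            return lower + (position - cumulative) / freq * class_width
        cumulative += freq
    raise ValueError("position lies beyond the total frequency")


total_frequency = sum(frequencies)
q1 = grouped_quantile(lower_limits, frequencies, class_width, total_frequency / 4)
q3 = grouped_quantile(lower_limits, frequencies, class_width, 3 * total_frequency / 4)
quartile_deviation_grouped = (q3 - q1) / 2
print(q1, q3, quartile_deviation_grouped)
```

With $N = 50$: $Q_1 = 10 + \frac{12.5 - 5}{8} \times 10 = 19.375$, $Q_3 = 30 + \frac{37.5 - 28}{16} \times 10 = 35.9375$, and the quartile deviation is 8.28125.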

Merits:

1. It is simple and easy to calculate


2. It is not affected by extreme values
3. It can be calculated for the data with open end classes.

Demerits:

1. It is not based on all the observations.


2. It is not suitable for further mathematical treatment.
3. It is affected by fluctuations of sampling.

Mean Deviation:

Mean deviation is the arithmetic mean of absolute deviations from the mean or median or mode. It is usually
denoted by M.D.

Mean deviation for Ungrouped data:

$$\text{M.D.} = \frac{1}{n}\sum_{i=1}^{n} \left| x_i - A \right|$$

Where $n$ = number of observations, $A$ = mean or median or mode.
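
As a hedged illustration, the Python sketch below computes the mean deviation about a chosen centre $A$. The function name `mean_deviation`, its `centre` parameter and the data are assumptions introduced for the example.

```python
from statistics import mean, median


def mean_deviation(values, centre):
    """Return the arithmetic mean of the absolute deviations |x_i - A|."""
    return sum(abs(x - centre) for x in values) / len(values)


# Hypothetical observations; here the mean and the median are both 6.
observations = [2, 4, 6, 8, 10]
md_about_mean = mean_deviation(observations, mean(observations))
md_about_median = mean_deviation(observations, median(observations))
print(md_about_mean, md_about_median)
```

The absolute deviations from 6 are 4, 2, 0, 2, 4, so the mean deviation is 12/5 = 2.4 in both cases.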

Mean deviation for grouped data:

(a). Mean deviation for discrete data:

If the values $x_i$ occur with frequencies $f_i$, $i = 1, 2, \ldots, n$, in a grouped frequency distribution, then the mean deviation is given by

$$\text{M.D.} = \frac{1}{N}\sum_{i=1}^{n} f_i \left| x_i - A \right|$$

Where $N$ = total frequency, $A$ = arithmetic mean or median or mode

(b). Mean deviation for continuous data:

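If class intervals and frequencies are given, then

$$\text{M.D.} = \frac{1}{N}\sum_{i=1}^{n} f_i D_i, \quad \text{where } D_i = \left| x_i - M \right|,$$

$x_i$ = mid value of the $i$-th class, $M$ = median and $N$ = total frequency.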

Merits:

1. It is easy to understand and easy to calculate


2. It is based on all the observations
3. It is less affected by the extreme values
4. It is a good measure for comparative studies.

Demerits:

1. It is not an accurate measure of dispersion


2. It is not suitable for further mathematical treatment
3. It is rarely used and is not as popular as the standard deviation.

Standard deviation:

It is the positive square root of the arithmetic mean of the squares of the deviations of the given values from their
arithmetic mean. It is denoted by S.D. or $\sigma$.

Standard deviation for ungrouped data:

In this case the standard deviation $\sigma$ is given by

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( x_i - \bar{X} \right)^2},$$

Where $n$ = number of observations, $\bar{X}$ = arithmetic mean
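
A minimal Python sketch of the ungrouped standard deviation is shown below. It divides by $n$, as in the definition given here, rather than by $n - 1$; the data and function name are illustrative assumptions.

```python
import math


def standard_deviation(values):
    """Return the square root of the mean of squared deviations from the mean."""
    n = len(values)
    mean_value = sum(values) / n
    return math.sqrt(sum((x - mean_value) ** 2 for x in values) / n)


# Hypothetical observations with mean 5 and squared deviations summing to 32.
sd_value = standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(sd_value)
```

The mean of the squared deviations is 32/8 = 4, so the standard deviation is 2.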

Standard deviation for grouped data:

(a). S.D. for discrete data:

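If the values $x_i$ occur with frequencies $f_i$ in a grouped frequency distribution, then the standard deviation is given by

$$\sigma = \sqrt{\frac{\sum f_i x_i^2}{N} - \left( \frac{\sum f_i x_i}{N} \right)^2}, \quad N = \text{total frequency}$$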
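
The discrete formula can be sketched in Python as the mean of the squares minus the square of the mean, both weighted by frequency. The distribution and the function name are hypothetical.

```python
import math

# Hypothetical discrete frequency distribution.
values = [1, 2, 3, 4, 5]
frequencies = [1, 2, 4, 2, 1]


def grouped_standard_deviation(values, frequencies):
    """Return sqrt(sum(f*x^2)/N - (sum(f*x)/N)^2), with N the total frequency."""
    total_frequency = sum(frequencies)
    mean_of_squares = sum(f * x * x for x, f in zip(values, frequencies)) / total_frequency
    mean_value = sum(f * x for x, f in zip(values, frequencies)) / total_frequency
    return math.sqrt(mean_of_squares - mean_value ** 2)


discrete_sd = grouped_standard_deviation(values, frequencies)
print(discrete_sd)
```

Here $\sum f_i x_i^2 / N = 10.2$ and $\sum f_i x_i / N = 3$, so $\sigma = \sqrt{1.2} \approx 1.095$.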

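(b). S.D. for continuous data: If class intervals and frequencies are given, then the standard deviation is given by

$$\sigma = C \times \sqrt{\frac{\sum f_i d_i^2}{N} - \left( \frac{\sum f_i d_i}{N} \right)^2},$$

where $N$ = total frequency, $d_i = \dfrac{x_i - A}{C}$, $x_i$ = mid value of the $i$-th class, $A$ = assumed mean and $C$ = class interval.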
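
The step-deviation form can be illustrated with the Python sketch below. The frequency table, the assumed mean of 25 and the function name are assumptions made for the example; the result does not depend on which mid value is chosen as the assumed mean.

```python
import math

# Hypothetical continuous frequency table with equal class width.
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 8, 15, 16, 6]


def step_deviation_sd(class_limits, frequencies, assumed_mean, class_width):
    """Return C * sqrt(sum(f*d^2)/N - (sum(f*d)/N)^2) with d = (x - A) / C."""
    midpoints = [(low + high) / 2 for low, high in class_limits]
    deviations = [(x - assumed_mean) / class_width for x in midpoints]
    total_frequency = sum(frequencies)
    mean_d = sum(f * d for f, d in zip(frequencies, deviations)) / total_frequency
    mean_d_squared = sum(f * d * d for f, d in zip(frequencies, deviations)) / total_frequency
    return class_width * math.sqrt(mean_d_squared - mean_d ** 2)


continuous_sd = step_deviation_sd(class_limits, frequencies,
                                  assumed_mean=25, class_width=10)
print(continuous_sd)
```

With these figures $\sum f_i d_i^2 / N = 1.36$ and $\sum f_i d_i / N = 0.2$, so $\sigma = 10\sqrt{1.32} \approx 11.49$.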

Merits:

1. It is easy to define
2. It is based on all the observations
3. It is suitable for further mathematical treatment
4. It is less affected by the fluctuations of sampling
5. The actual signs of the deviations are used in the calculation of the standard deviation.
6. Standard deviation is the basis for calculating the variance, correlation coefficient, etc.

Demerits:

1. It is not easy to understand and is difficult to calculate.
2. It gives more weight to extreme values.
3. Being an absolute measure, it cannot be used directly for comparative studies.
Coefficient of variation:

It is a relative measure of dispersion. It is generally denoted by C.V. and is given by the formula:

$$\text{Coefficient of variation: C.V.} = \frac{\sigma}{\bar{X}} \times 100$$

Where $\sigma$ = standard deviation, $\bar{X}$ = arithmetic mean
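
A short Python sketch of the coefficient of variation is given below. The first data set reuses the Series-B values from the introduction to dispersion; the second data set and the function name are hypothetical. The standard deviation divides by $n$, as in the definition used in these notes.

```python
import math


def coefficient_of_variation(values):
    """Return (standard deviation / arithmetic mean) * 100 for non-zero mean."""
    n = len(values)
    mean_value = sum(values) / n
    sd = math.sqrt(sum((x - mean_value) ** 2 for x in values) / n)
    return sd / mean_value * 100


series_b = [100, 108, 116, 124, 132]       # Series-B from the introduction
constant_series = [50, 50, 50, 50]         # hypothetical series with no spread
cv_series_b = coefficient_of_variation(series_b)
cv_constant = coefficient_of_variation(constant_series)
print(cv_series_b, cv_constant)
```

For Series-B the mean is 116 and $\sigma = \sqrt{128}$, so C.V. is about 9.75 percent, while a series with identical observations has C.V. equal to 0. Because C.V. is a pure number, it allows series measured in different units to be compared.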
