Econ 656: Research Methodology and
Seminar in Economics
Part V
Data Processing and Analysis
Content of the Lecture
1. Introduction
2. Data Preparation and Processing
3. Analysis of Data
   3.1 Univariate analysis
   3.2 Bivariate analysis
   3.3 Multivariate analysis
Introduction
Once data is acquired you will need to use it to help you
address your research questions.
For the data to be used meaningfully, you need to:
Ensure that the data is complete.
Know your data: become familiar with what you have.
Organize your data.
Analysis is the most rewarding part of your research project.
There is a sense of relief, excitement and satisfaction that
your work is meaningful.
Data analysis is the process of working with the data to describe,
discuss, interpret, evaluate and explain it in terms of the research
questions or hypotheses.
It involves computing certain indices or measures and
searching for patterns of relationships.
It ranges from very simple summary statistics to extremely
complex multivariate analyses.
Most quantitative data analysis is conducted using
software programs, so the collected data must first be
converted into a machine-readable, numeric format.
Numerical data can be analyzed quantitatively using statistical
tools in two different ways:
Descriptive analysis: statistically describing, aggregating, and
presenting the constructs of interest.
Inferential analysis: the statistical testing of hypotheses
(theory testing).
Data Preparation and Processing
Data processing starts with editing, coding, classifying and
tabulating the collected data.
Editing is the process of examining the collected raw data to
detect errors and omissions.
It involves a careful scrutiny of the completed
questionnaire to assure that the data are:
Accurate
Consistent with other facts gathered
Uniformly entered, etc.
Two levels of editing: field and central levels.
Field-level editing: after an interview, field workers should
review their reporting forms, complete what was
abbreviated, translate personal shorthand, rewrite illegible
entries, and make callbacks if necessary.
Central editing: takes place when all forms have been completed
and returned to the office.
Data editors correct obvious errors, such as an entry in the
wrong place or a value recorded in the wrong units.
Checking questionnaires: Identifiers
Each questionnaire or case needs a unique identifier.
Sometimes this will be assigned prior to collecting the
data by numbering the questionnaires.
If this has not been done, then an identifier should be
given to each data source.
Example: all questionnaires from people in Addis
may start with ‘1’ (101, 102, 103, etc.) and those from
Bahir Dar with ‘2’ (201, 202, 203, etc.).
This makes it easier to sort out where questionnaires
have come from.
It also allows analysis to be carried out on the two
sets of questionnaires separately.
This information also enables the researcher to refer
back to the participants (if, for example, the researcher
wishes to involve them in further research).
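The identifier scheme described above can be sketched in Python. This is only an illustration under stated assumptions: the site codes follow the Addis and Bahir Dar example, while the helper name `make_identifier`, the limit of 99 questionnaires per site, and the sample list of returned questionnaires are hypothetical.

```python
# Site prefixes follow the example in the text: Addis = 1, Bahir Dar = 2.
SITE_PREFIX = {"Addis": 1, "Bahir Dar": 2}


def make_identifier(site, sequence_number):
    """Build an identifier such as 101 or 203 (hypothetical helper).

    Assumes fewer than 100 questionnaires per site, as in the example.
    """
    if not 1 <= sequence_number <= 99:
        raise ValueError("sequence number must be between 1 and 99")
    return SITE_PREFIX[site] * 100 + sequence_number


# Hypothetical questionnaires, listed in the order they were returned.
returned_from = ["Addis", "Addis", "Bahir Dar", "Addis", "Bahir Dar"]

counters = {site: 0 for site in SITE_PREFIX}
identifiers = []
for site in returned_from:
    counters[site] += 1
    identifiers.append(make_identifier(site, counters[site]))

# The leading digit recovers the site, so the two sets of
# questionnaires can be analysed separately.
addis_ids = [i for i in identifiers if i // 100 == SITE_PREFIX["Addis"]]
bahir_dar_ids = [i for i in identifiers if i // 100 == SITE_PREFIX["Bahir Dar"]]

print(identifiers)
print(addis_ids, bahir_dar_ids)
```

Encoding the site in the identifier is a convenience; storing the site in its own column as well would serve the same purpose.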
What to do with partial responses – missing responses
If a questionnaire is only partially completed there may
be a number of reasons for this.
Maybe the length of the questionnaire deterred your
participants from completing it, or
they did not wish to answer a particular question or
section.
For example, they may have found the topic sensitive
and decided not to complete the questionnaire.
What we do with the data depends on the number of missing
cases and the possible reasons for incompletion.
You will need to decide whether to reject the incomplete
questionnaires or whether to include the partial
information.
Whichever you choose, you must state your decision clearly
when writing up and discussing your findings.
Inconsistent data
Sometimes you will find that the information given by a
respondent within a questionnaire is inconsistent.
This can be the case with both factual and value data.
Example: a participant may give a date of birth
of 1990 but also record having children born
in 2000.
This type of inconsistency could have occurred for a number
of reasons.
Either (or both) of the dates given may be incorrect or
may have been misread or mis-heard, or
perhaps the participant has misunderstood the question
and recorded his brothers and sisters who live with him
as his children, or is referring to his stepchildren, etc.
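An inconsistency like the date-of-birth example can be flagged automatically before analysis. The sketch below is a minimal illustration: the field names (`birth_year`, `children_birth_years`) and the plausibility threshold `min_parent_age` are assumptions, not values from the lecture. A flag only marks a record for checking; it does not say which entry is wrong.

```python
def birth_year_flags(record, min_parent_age=12):
    """Return a list of warnings for one questionnaire record.

    `record` is a dict with 'birth_year' and 'children_birth_years'
    (hypothetical field names). `min_parent_age` is an assumed
    plausibility threshold.
    """
    flags = []
    for child_year in record["children_birth_years"]:
        if child_year - record["birth_year"] < min_parent_age:
            flags.append(
                f"child born {child_year}, respondent born {record['birth_year']}"
            )
    return flags


records = [
    # The inconsistent case from the example: born 1990, child born 2000.
    {"id": 101, "birth_year": 1990, "children_birth_years": [2000]},
    # A consistent, hypothetical case.
    {"id": 102, "birth_year": 1975, "children_birth_years": [2000, 2004]},
]

# Keep only the records that raised at least one warning.
flagged = {}
for record in records:
    warnings = birth_year_flags(record)
    if warnings:
        flagged[record["id"]] = warnings

print(flagged)
```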
As with missing information, you will need to consider
whether:
the data can be checked in some way (by referring to other
questions or by contacting the participant); and
whether the data is usable in your analysis.
The default way of handling missing values in some
software programs is simply to drop the entire observation
if it contains even a single missing value (listwise deletion).
But such deletion can significantly shrink the sample size
and make it extremely difficult to detect small effects.
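How quickly dropping incomplete observations shrinks a sample can be seen in a small sketch. The responses below are hypothetical, and `None` is used here as an assumed marker for a missing answer.

```python
# Hypothetical responses to three items; None marks a missing answer.
responses = [
    {"id": 101, "q1": 4, "q2": 5, "q3": 3},
    {"id": 102, "q1": 2, "q2": None, "q3": 4},
    {"id": 103, "q1": None, "q2": 3, "q3": 3},
    {"id": 201, "q1": 5, "q2": 4, "q3": 4},
    {"id": 202, "q1": 3, "q2": 3, "q3": None},
]


def is_complete(row):
    """True if no value in the row is missing."""
    return all(value is not None for value in row.values())


# Drop every observation that has at least one missing value.
complete_cases = [row for row in responses if is_complete(row)]

missing_cells = sum(v is None for row in responses for v in row.values())
total_cells = sum(len(row) - 1 for row in responses)  # item cells, excluding 'id'

print(f"missing answers: {missing_cells} of {total_cells}")
print(f"observations kept: {len(complete_cases)} of {len(responses)}")
```

In this made-up data only 3 of 15 answers are missing, yet 3 of the 5 observations are dropped.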
Some software programs allow the option of replacing
missing values with an estimated value via a process called
imputation.
For instance, the imputed value could be the average of
the respondent’s responses to remaining items.
But such imputation can be biased if the values are missing
systematically rather than at random.
Other procedures, such as multiple imputation, are also available.
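The simple imputation rule mentioned above, replacing a missing answer with the average of the same respondent's remaining answers, might look as follows. The function name, the item names and the sample row are hypothetical, and the sketch does nothing about the bias that arises when values are missing systematically.

```python
def impute_with_respondent_mean(row, items):
    """Fill missing items with the mean of the respondent's answered items.

    `row` is a dict of answers and `items` lists the keys that belong to
    the scale being imputed (hypothetical structure). Returns a new dict.
    """
    answered = [row[item] for item in items if row[item] is not None]
    if not answered:
        return dict(row)  # nothing to average, leave the row unchanged
    mean_answer = sum(answered) / len(answered)
    return {
        key: (mean_answer if key in items and value is None else value)
        for key, value in row.items()
    }


row = {"id": 102, "q1": 2, "q2": None, "q3": 4}
imputed = impute_with_respondent_mean(row, items=["q1", "q2", "q3"])
print(imputed)
```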
Coding: Many data collection instruments include open
questions.
i.e., questions that do not have a preset range of
answers.
In order to be able to work with this data using statistical
analysis the data from open questions need to be coded.
Coding refers to the process of assigning numerals to
answers so that responses can be put into a limited number
of categories or classes, which are recorded on a coding sheet.
Coding is the process of converting data into numeric format.
This enables you to enter the data quickly, using the
numeric keypad on your keyboard, and with fewer
errors.
Coding is especially important for large, complex studies
involving many variables and measurement items.
A codebook, a comprehensive document containing a
detailed description of each variable, should be
created.
The coding must be:
Exhaustive: there must be a class for every data item.
Mutually exclusive: each specific answer can be placed
in one and only one cell in a given category set.
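The two coding rules above can be illustrated with a small codebook. Everything in this sketch is hypothetical: the question (main occupation), the categories, the numeric codes, and the catch-all code `OTHER`, which is one assumed way of keeping the scheme exhaustive.

```python
# Hypothetical codebook for an open question on main occupation.
CODEBOOK = {
    "farmer": 1,
    "trader": 2,
    "civil servant": 3,
}
OTHER = 9  # assumed catch-all code, so that every answer receives a code


def code_answer(answer):
    """Return the numeric code for one answer.

    A dictionary lookup returns exactly one code per answer, so an answer
    cannot fall into two categories (mutually exclusive). Answers not in
    the codebook receive OTHER, so no answer is left uncoded (exhaustive).
    """
    return CODEBOOK.get(answer.strip().lower(), OTHER)


answers = ["Farmer", "trader ", "Civil servant", "student"]
codes = [code_answer(a) for a in answers]
print(codes)
```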
Multiple codes
Some questions can ask for more than one answer.
In this case there is more than one variable attached to
the question.
Example
You can consider each of the listed foods as a variable and
code each variable as 1 if it is ticked, 2 if it is not ticked.
You could then count how many people eat, for example,
cereal more than twice a week.
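The multiple-code scheme in this example can be sketched as one variable per food. Only "cereal" comes from the example; the other food names and the three respondents are hypothetical.

```python
# Only "cereal" is taken from the example; the other foods are hypothetical.
FOODS = ["cereal", "bread", "fruit"]
TICKED, NOT_TICKED = 1, 2  # codes as described in the example


def code_ticks(ticked_foods):
    """Turn the set of ticked foods into one coded variable per food."""
    return {
        food: (TICKED if food in ticked_foods else NOT_TICKED)
        for food in FOODS
    }


# Hypothetical respondents and the boxes each one ticked.
respondents = [
    code_ticks({"cereal", "fruit"}),
    code_ticks({"bread"}),
    code_ticks({"cereal"}),
]

# Count how many respondents ticked cereal.
cereal_count = sum(1 for r in respondents if r["cereal"] == TICKED)
print(cereal_count)
```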
Data entry: Coded data can be entered into a spreadsheet,
database, text file, or directly into a statistical program like
Stata or SPSS.
Each observation can be entered as one row in the
spreadsheet and each measurement item can be represented
as one column.
The entered data should be checked for accuracy, via
occasional spot checks on a set of items or observations,
during and after entry.
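The row-per-observation layout and two simple accuracy checks can be sketched with Python's standard `csv` module. The column names, the entered values and the 1 to 5 answer scale are assumptions made for this illustration.

```python
import csv
import io
import random

# Hypothetical entered data: one row per observation, one column per item.
entered = io.StringIO(
    "id,age,q1\n"
    "101,34,4\n"
    "102,29,2\n"
    "201,41,7\n"  # 7 lies outside the assumed 1-5 answer scale
    "202,38,5\n"
)
rows = list(csv.DictReader(entered))

# Range check: list the identifiers whose q1 value is not on the scale.
out_of_range = [row["id"] for row in rows if not 1 <= int(row["q1"]) <= 5]

# Spot check: draw a few rows at random to compare by hand
# against the paper questionnaires.
random.seed(0)  # fixed seed so the same rows are drawn on every run
spot_check = random.sample(rows, k=2)

print(out_of_range)
print([row["id"] for row in spot_check])
```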
Analysis of quantitative data
Analysis is a process of summarizing, describing and
explaining the data in terms of the research questions or
hypothesis.
So, analysis of data is more than simply summarizing and
tabulating the data that has been collected.
As a researcher, you must act as an intermediary between
the data you have gathered and the people who will be
interested in what you have found out.
Analysis of quantitative data: Univariate analysis
With respect to the number of variables, three types of statistical
analysis can be distinguished:
Univariate analysis: only one variable
Bivariate analysis: two variables
Multivariate analysis: more than two variables
Univariate analysis refers to a set of statistical techniques that
can describe the general properties of one variable.
Univariate statistics include: (1) frequency distribution, (2)
central tendency, and (3) dispersion.
Examples: means, medians and modes, variances, and
percentiles.
Whatever statistical analysis you have in mind, you are likely
to begin by producing some frequency tables.
For each variable, a frequency table shows how many times
each answer or code has been given.
You can then take a look at the way in which the answers to
your questions are distributed, and
identify potentially interesting distributions which you may
wish to explore further.
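A frequency table of this kind can be produced with `collections.Counter` from the Python standard library. The coded answers and the meaning of the codes below are hypothetical.

```python
from collections import Counter

# Hypothetical coded answers to one question
# (1 = never, 2 = sometimes, 3 = often).
codes = [1, 2, 2, 3, 2, 1, 3, 2, 2, 1]

frequencies = Counter(codes)
total = len(codes)

# Print each code with its count and its share of all answers.
print("code  count  percent")
for code in sorted(frequencies):
    count = frequencies[code]
    print(f"{code:>4}  {count:>5}  {100 * count / total:>6.0f}%")
```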
The distribution, or the ‘shape’, of your data can also be
depicted in the form of graphs and charts.
Bar charts and histograms can help you to visualize the shape
or distribution of the values for each of your variables.
Graphs are an effective way of summarizing your data and
helping you to identify interesting or anomalous features
within the data.
They also help you to begin to explore relationships between
variables.
[Figure: Example frequency distribution of religiosity: how many times a sample of respondents attend religious services]
Measures of central tendency: a value typical for the data.
The mean, median and mode are methods of
summarizing the data relating to one variable.
Measures of dispersion: measure the amount of variation in
the data.
Similar measures of central tendency may come from
very different distributions.
Range, variance and standard deviation.
The variance is the average squared deviation from the
mean; the standard deviation is its square root.
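The point that similar measures of central tendency can come from very different distributions can be checked with the standard `statistics` module. The two samples and the variable names are hypothetical. The sketch uses the population variance (dividing by n); `statistics.variance` would give the sample variance (dividing by n - 1).

```python
import statistics

# Two hypothetical samples with the same mean and median.
village_a = [48, 49, 50, 51, 52]
village_b = [10, 30, 50, 70, 90]

for name, values in [("A", village_a), ("B", village_b)]:
    print(
        name,
        "mean:", statistics.mean(values),
        "median:", statistics.median(values),
        "range:", max(values) - min(values),
        "variance:", statistics.pvariance(values),
        "std dev:", round(statistics.pstdev(values), 2),
    )

# The mode is the most frequent value (hypothetical coded answers).
answers = [2, 3, 3, 4, 3]
print("mode:", statistics.mode(answers))
```

Both samples have mean 50 and median 50, but the second is far more dispersed.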
Analysis of quantitative data: Bivariate analysis
Bivariate analysis examines how two variables are
related to each other, for example whether they are correlated.
Correlation can be shown by:
Scatter plot/diagram: the values of the two variables are
plotted on the X and Y axes.
Strong relationships can be identified from scatter
diagrams.
[Figure: Scatter plot of a positive association between income (horizontal axis, 0 to 1200) and livestock ownership (vertical axis, 0 to 60)]
[Figure: Scatter plot of a negative association between income (horizontal axis, 0 to 1200) and the rate of illiteracy in % (vertical axis, 0 to 100)]
[Figure: Scatter plot of no association between income (horizontal axis, 0 to 1200) and household size (vertical axis, 0 to 12)]
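Correlation analysis: The most common bivariate statistic is the
bivariate correlation, which is a number between -1 and +1.
The correlation coefficient is a numerical value reflecting the
strength and direction of the relationship: values near +1 or -1
indicate a strong positive or negative linear relationship, and
values near 0 indicate little or no linear relationship.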
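[Figure: Scatter plot divided into four quadrants (I upper right, II upper left, III lower left, IV lower right) by the lines at mean x and mean y]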
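The correlation coefficient can be computed directly from deviations around the two means. The sketch below implements the Pearson correlation coefficient, assuming that is the bivariate correlation meant here; the data values are hypothetical. Each product of deviations is positive for a point in quadrant I or III and negative for a point in quadrant II or IV, which is why the sign of the coefficient shows the direction of the association.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Positive terms come from quadrants I and III, negative terms
    # from quadrants II and IV (relative to the two means).
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data in the spirit of the scatter plots.
income = [200, 400, 600, 800, 1000]
livestock = [10, 25, 20, 40, 45]    # tends to rise with income
illiteracy = [90, 70, 50, 30, 10]   # falls steadily with income

r_positive = pearson_r(income, livestock)
r_negative = pearson_r(income, illiteracy)
print(round(r_positive, 3), round(r_negative, 3))
```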
Analysis of quantitative data: Multivariate analysis
Multivariate analysis: the analysis of the relationship between
three or more variables.
Some of the relationships identified in bivariate analysis can
be spurious: the two variables appear related even though there
is no real relationship between them, for example because both
depend on a third variable.
Analysis should therefore control for the effects of additional
variables.
Multiple regression analysis (econometrics) controls for
all important variables on which data are available.
General Linear Model: Most statistical procedures are derived
from a general family of statistical models called the general
linear model (GLM).
Y = β0 + β1*X1 + β2*X2 + … + βn*Xn + ε
or, more compactly, Y = β0 + Σ βi*Xi + ε, with the sum taken over i = 1, …, n
where Y is the dependent variable and X1, …, Xn are the independent variables
βi = coefficient parameters to be estimated; βi is the increase/decrease in Y
when Xi changes by one unit (controlling for other factors)
ε = random error term; the difference between the actual value of Y
and the value given by the rest of the equation. The model also requires
assumptions on ε.
[Figure: Two-variable linear model]
How are the parameters (βi) estimated?
The most widely used method is ordinary least squares (OLS).
OLS chooses the parameter values that minimise the sum of the
squared differences between the actual values of Y and the
values of Y predicted by the regression, i.e. the sum of squared
residuals.
Other estimation methods are also available, such as maximum
likelihood estimation (MLE) and the generalized method of
moments (GMM).
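For the two-variable case, least squares has a simple closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the two means. The sketch below uses small hypothetical data and plain Python; with several independent variables a statistical package would normally be used instead.

```python
def ols_two_variable(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    b1 = s_xy / s_xx             # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = ols_two_variable(x, y)

# Residuals: actual values minus the values predicted by the fitted line.
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
rss = sum(e ** 2 for e in residuals)  # the quantity that OLS minimises

print(round(b0, 3), round(b1, 3), round(rss, 3))
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data.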
Various hypothesis tests can be carried out.
Overall test (F-test): the null hypothesis for the overall test
is that all the slope coefficients of the regression are zero (no
explanatory power).
H0: β1 = β2 = β3 = … = βn = 0
Test for a single variable (t-test): does a particular
independent variable add significantly to the explanation?
H0: βi = 0
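The two test statistics can be computed by hand for a two-variable regression. The sketch below uses small hypothetical data and the standard textbook formulas; it stops at the statistics themselves, which would then be compared with critical values of the t and F distributions (not computed here). With a single independent variable, the F statistic equals the square of the t statistic.

```python
import math

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((a - mean_x) ** 2 for a in x)
s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

# Least squares estimates for y = b0 + b1*x + error.
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual and total sums of squares.
rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
tss = sum((b - mean_y) ** 2 for b in y)

# t-test of H0: b1 = 0, with n - 2 degrees of freedom.
error_variance = rss / (n - 2)
se_b1 = math.sqrt(error_variance / s_xx)
t_stat = b1 / se_b1

# Overall F-test with 1 and n - 2 degrees of freedom.
f_stat = ((tss - rss) / 1) / (rss / (n - 2))

print(round(t_stat, 3), round(f_stat, 3))
```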
Several econometric problems may also arise:
Sample Selectivity
Misspecification
Omitted Variables
Fixed Effects
Endogenous Variables
Appropriate tests and remedial measures need to be
considered for these problems.