Institute: University School of Business
Department: AIT-MBA
Subject: Marketing Research Analytics
Topic: Data Preparation, Frequency Distribution, Hypothesis Testing, etc.
Faculty Name : Dr. Lalit Singla (Associate Professor)
Email: [email protected]
Introduction to Data Preparation and Analysis of Data
The data, after collection, has to be processed and analyzed in accordance
with the outline laid down for the purpose at the time of developing the
research plan. This is essential for a scientific study and for ensuring that
we have all relevant data for making contemplated comparisons and
analysis.
Technically speaking, processing implies editing, coding, classification
and tabulation of collected data so that they are amenable to analysis. The
term analysis refers to the computation of certain measures along with
searching for patterns of relationship that exist among data-groups.
Data Preparation
Data preparation is the process of preparing raw data so that it is suitable
for further processing and analysis. Key steps include collecting,
cleaning, and labeling raw data into a form suitable for machine learning
(ML) algorithms and then exploring and visualizing the data.
Steps in Data Preparation
1) Questionnaire Checking
Questionnaire checking involves eliminating unacceptable questionnaires. A
questionnaire may be unacceptable because it is incomplete, the instructions were
not followed, the responses show little variance, pages are missing, it was
received after the cutoff date, or the respondent was not qualified.
2) Editing
Editing of data is a process of examining the collected raw data (especially in
surveys) to detect errors and omissions and to correct these when possible. As
a matter of fact, editing involves a careful scrutiny of the completed
questionnaires and/or schedules.
Editing is done to assure that the data are accurate, consistent with other facts
gathered, uniformly entered, as complete as possible and have been well
arranged to facilitate coding and tabulation.
Field editing consists of a review of the reporting forms by the investigator to
complete (translate or rewrite) what the interviewer has written in abbreviated
and/or illegible form at the time of recording the respondents’ responses. This
type of editing is necessary in view of the fact that individual writing styles
often can be difficult for others to decipher. This sort of editing should be done
as soon as possible after the interview, preferably on the very day or on the next
day. While doing field editing, the investigator must restrain himself and must
not correct errors of omission by simply guessing what the informant would
have said if the question had been asked.
Central editing should take place when all forms or schedules have been
completed and returned to the office. This type of editing implies that all forms
should get a thorough editing by a single editor in a small study and by a team
of editors in case of a large inquiry. Editor(s) may correct the obvious errors
such as an entry in the wrong place, entry recorded in months when it should
have been recorded in weeks, and the like. In case of inappropriate or missing
replies, the editor can sometimes determine the proper answer by reviewing the
other information in the schedule. At times, the respondent can be contacted for
clarification. The editor must strike out the answer if the same is inappropriate
and he has no basis for determining the correct answer or the response. In such a
case an editing entry of ‘no answer’ is called for. All the wrong replies, which
are quite obvious, must be dropped from the final results, especially in the
context of mail surveys.
Editors must keep in view several points while performing their work:
(a) They should be familiar with instructions given to the interviewers and
coders as well as with the editing instructions supplied to them for the purpose.
(b) While crossing out an original entry for one reason or another, they should
just draw a single line on it so that the same may remain legible.
(c) They must make entries (if any) on the form in some distinctive colour and
that too in a standardized form.
(d) They should initial all answers which they change or supply.
(e) Editor’s initials and the date of editing should be placed on each completed
form or schedule.
3) Coding
Coding refers to the process of assigning numerals or other symbols to answers
so that responses can be put into a limited number of categories or classes. Such
classes should be appropriate to the research problem under consideration. They
must also possess the characteristic of exhaustiveness (i.e., there must be a class
for every data item) and also that of mutual exclusivity, which means that a
specific answer can be placed in one and only one cell in a given category set.
Another rule to be observed is that of unidimensionality by which is meant that
every class is defined in terms of only one concept.
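The idea of a code frame with exhaustive, mutually exclusive classes can be illustrated with a short sketch. This is only an illustration, assuming Python with pandas (the source prescribes no tool); the question, responses and code values are hypothetical.

```python
import pandas as pd

# Hypothetical raw survey responses (illustrative only)
responses = pd.DataFrame({
    "purchase_intent": ["Definitely", "Probably", "Not sure",
                        "Probably", "Definitely not"]
})

# Code frame: exhaustive and mutually exclusive categories,
# each answer mapped to exactly one numeral
code_frame = {
    "Definitely": 5,
    "Probably": 4,
    "Not sure": 3,
    "Probably not": 2,
    "Definitely not": 1,
}

# Assign numeric codes; any answer outside the code frame becomes NaN
# and can be flagged for editing
responses["purchase_intent_code"] = responses["purchase_intent"].map(code_frame)
print(responses)
```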
4) Classification
Most research studies result in a large volume of raw data which must be
reduced into homogeneous groups if we are to get meaningful relationships.
This fact necessitates classification of data which happens to be the process of
arranging data in groups or classes on the basis of common characteristics. Data
having a common characteristic are placed in one class and in this way the entire
data get divided into a number of groups or classes.
• According to attributes (descriptive characteristics such as gender, literacy or occupation)
• According to class-intervals (numerical characteristics grouped into ranges, such as income groups)
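A minimal sketch of both kinds of classification, assuming Python with pandas (not prescribed by the source); the variables and figures are hypothetical.

```python
import pandas as pd

# Hypothetical respondent data (illustrative only)
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M", "F"],
    "monthly_income": [12000, 45000, 30500, 78000, 23000, 51000, 66000],
})

# Classification according to attributes: group by a descriptive characteristic
by_attribute = df["gender"].value_counts()

# Classification according to class-intervals: group a numerical variable into ranges
bins = [0, 25000, 50000, 75000, 100000]
df["income_class"] = pd.cut(df["monthly_income"], bins=bins)
by_interval = df["income_class"].value_counts().sort_index()

print(by_attribute)
print(by_interval)
```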
5) Tabulation
When a mass of data has been assembled, it becomes necessary for the researcher to
arrange the same in some kind of concise and logical order. This procedure is referred to
as tabulation. Thus, tabulation is the process of summarizing raw data and displaying
the same in compact form (i.e., in the form of statistical tables) for further analysis. In a
broader sense, tabulation is an orderly arrangement of data in columns and rows.
Tabulation is essential because of the following reasons.
1. It conserves space and reduces explanatory and descriptive statements to a minimum.
2. It facilitates the process of comparison.
3. It facilitates the summation of items and the detection of errors and omissions.
4. It provides a basis for various statistical computations.
Tabulation can be done by hand or by mechanical or electronic devices. The choice
depends on the size and type of study, cost considerations, time pressures and the
availability of tabulating machines or computers. In relatively large inquiries, we may
use mechanical or computer tabulation if other factors are favourable and necessary
facilities are available.
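As a minimal sketch of computer tabulation, assuming Python with pandas (the source refers only to machines or computers in general), a one-way table and a two-way cross table with totals could be produced as follows; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical survey data (illustrative only)
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai", "Delhi"],
    "preferred_brand": ["A", "B", "A", "A", "B", "B"],
})

# Simple (one-way) table: count of respondents preferring each brand
one_way = df["preferred_brand"].value_counts()

# Cross (two-way) table: brand preference by city, with row and column totals
two_way = pd.crosstab(df["city"], df["preferred_brand"],
                      margins=True, margins_name="Total")

print(one_way)
print(two_way)
```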
Generally accepted principles of tabulation
1. Every table should have a clear, concise and adequate title so as to make the table intelligible
without reference to the text and this title should always be placed just above the body of
the table.
2. Every table should be given a distinct number to facilitate easy reference.
3. The column headings (captions) and the row headings (stubs) of the table should be clear
and brief.
4. The units of measurement under each heading or sub-heading must always be indicated.
5. Explanatory footnotes, if any, concerning the table should be placed directly beneath the
table, along with the reference symbols used in the table.
6. Source or sources from where the data in the table have been obtained must be indicated
just below the table.
7. Usually the columns are separated from one another by lines which make the table more
readable and attractive. Lines are always drawn at the top and bottom of the table and
below the captions.
8. There should be thick lines to separate the data under one class from the data under
another class and the lines separating the sub-divisions of the classes should be comparatively
thin lines.
9. The columns may be numbered to facilitate reference.
10. Those columns whose data are to be compared should be kept side by side. Similarly,
percentages and/or averages must also be kept close to the data.
11. It is generally considered better to approximate figures before tabulation as the same would
reduce unnecessary details in the table itself.
12. In order to emphasize the relative significance of certain categories, different kinds of type,
spacing and indentations may be used.
13. It is important that all column figures be properly aligned. Decimal points and (+) or (–)
signs should be in perfect alignment.
14. Abbreviations should be avoided to the extent possible and ditto marks should not be used
in the table.
15. Miscellaneous and exceptional items, if any, should be usually placed in the last row of the
table.
16. Table should be made as logical, clear, accurate and simple as possible. If the data happen
to be very large, they should not be crowded in a single table for that would make the table
unwieldy and inconvenient.
17. Total of rows should normally be placed in the extreme right column and that of columns
should be placed at the bottom.
18. The arrangement of the categories in a table may be chronological, geographical,
alphabetical or according to magnitude to facilitate comparison. Above all, the table must suit
the needs and requirements of an investigation.
6) Graphical Representation
7) Data Cleaning
8) Data Adjusting
Data Preparation Steps in Marketing Research
Questionnaire checking: Unacceptable questionnaires (incomplete, instructions not
followed, little variance, missing pages, past the cutoff date, or respondent not
qualified) are eliminated.
Editing: Editing looks to correct illegible, incomplete, inconsistent and
ambiguous answers.
Coding: Coding typically assigns alpha or numeric codes to answers that do not
already have them so that statistical techniques can be applied.
Transcribing: Transcribing data involves transferring data so as to make it
accessible to people or applications for further processing.
Cleaning: Cleaning reviews data for inconsistencies. Inconsistencies may arise
from faulty logic, out-of-range values or extreme values.
Statistical adjustments: Statistical adjustments apply to data that require
weighting and scale transformations.
Analysis strategy selection: Finally, selection of a data analysis strategy is
based on earlier work in designing the research project but is finalized after
consideration of the characteristics of the data that has been gathered.
Problems in Processing of data
(a) The problem concerning “Don’t know” (or DK) responses
While processing the data, the researcher often comes across some responses
that are difficult to handle. One category of such responses may be ‘Don’t Know
Response’ or simply DK response. When the DK response group is small, it is of
little significance. But when it is relatively big, it becomes a matter of major
concern in which case the question arises: Is the question which elicited DK
response useless? The answer depends on two possibilities: the respondent
actually may not know the answer, or the researcher may have failed to obtain the
appropriate information. In the first case the concerned question is considered
all right and the DK response is taken as a legitimate DK response. But in the second
case, DK response is more likely to be a failure of the questioning process.
(b) Use of percentages: Percentages are often used in data presentation for they simplify
numbers, reducing all of them to a 0 to 100 range. Through the use of percentages, the data are
reduced in the standard form with base equal to 100 which fact facilitates relative comparisons.
While using percentages, the following rules should be kept in view by researchers:
1. Two or more percentages must not be averaged unless each is weighted by the group size
from which it has been derived (a brief worked sketch follows this list).
2. Use of too large percentages should be avoided, since a large percentage is difficult to
understand and tends to confuse, defeating the very purpose for which percentages are
used.
3. Percentages hide the base from which they have been computed. If this is not kept in view,
the real differences may not be correctly read.
4. Percentage decreases can never exceed 100 per cent and as such for calculating the
percentage of decrease, the higher figure should invariably be taken as the base.
5. Percentages should generally be worked out in the direction of the causal-factor in case of
two-dimension tables and for this purpose we must select the more significant factor out of
the two given factors as the causal factor.
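A brief worked sketch of rule 1, assuming Python (any language or hand calculation would do); the group sizes and percentages are hypothetical. Averaging the two percentages without weights understates the overall figure.

```python
# Hypothetical example for rule 1: two groups of very different size
# Group A: 200 respondents, 60% aware of the brand
# Group B:  50 respondents, 20% aware of the brand
sizes = [200, 50]
percentages = [60.0, 20.0]

# Naive (unweighted) average of the two percentages
naive = sum(percentages) / len(percentages)                              # 40.0

# Correct average, each percentage weighted by its group size
weighted = sum(p * n for p, n in zip(percentages, sizes)) / sum(sizes)   # 52.0

print(f"Unweighted average: {naive:.1f}%")
print(f"Weighted average:   {weighted:.1f}%")
```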
Analysis of Data
As stated earlier, by analysis we mean the computation of certain indices or
measures along with searching for patterns of relationship that exist among the
data groups. Analysis, particularly in case of survey or experimental data,
involves estimating the values of unknown parameters of the population
and testing of hypotheses for drawing inferences. Analysis may, therefore, be
categorized as descriptive analysis and inferential analysis (Inferential analysis
is often known as statistical analysis).
Types of Analysis
(a) Multiple regression analysis
This analysis is adopted when the researcher has one dependent variable which is
presumed to be a function of two or more independent variables. The objective of
this analysis is to make a prediction about the dependent variable based on its
covariance with all the concerned independent variables.
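A minimal sketch of multiple regression by ordinary least squares, assuming Python with numpy (the source names no software); the variables (sales, advertising spend, price) and all figures are hypothetical.

```python
import numpy as np

# Hypothetical data: sales (y) as a function of advertising spend and price (X)
X = np.array([[10, 5.0],
              [12, 4.8],
              [15, 4.5],
              [18, 4.2],
              [20, 4.0]])                      # independent variables
y = np.array([100, 112, 130, 148, 160])        # dependent variable

# Add an intercept column and fit by ordinary least squares
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept, b_advertising, b_price = coefs
print(intercept, b_advertising, b_price)

# Prediction for a new observation (advertising = 16, price = 4.3)
new_obs = np.array([1.0, 16, 4.3])
print("Predicted y:", new_obs @ coefs)
```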
(b) Multiple discriminant analysis
This analysis is appropriate when the researcher has a single dependent variable
that cannot be measured, but can be classified into two or more groups on the
basis of some attribute. The object of this analysis happens to be to predict an
entity’s possibility of belonging to a particular group based on several predictor
variables.
(c) Multivariate analysis of variance (or multi-ANOVA)
This analysis is an extension of two-way ANOVA, wherein the ratio of among
group variance to within group variance is worked out on a set of variables.
(d) Canonical analysis
This analysis can be used in case of both measurable and non-measurable
variables for the purpose of simultaneously predicting a set of dependent
variables from their joint covariance with a set of independent variables.
Inferential analysis is concerned with the various tests of significance for
testing hypotheses in order to determine with what validity data can be said to
indicate some conclusion or conclusions. It is also concerned with the
estimation of population values. It is mainly on the basis of inferential
analysis that the task of interpretation (i.e., the task of drawing inferences and
conclusions) is performed.
Introduction to Frequency Distribution
Frequency Distribution
It is a statistical technique used to organize and summarize data. It represents the
count or frequency of each value or range of values in a dataset. The primary
purpose of a frequency distribution is to provide an overview of the distribution
and pattern of the data.
A frequency distribution consists of two main components:
Variable: The variable of interest, which can be categorical or numerical.
Frequency: The number of occurrences of each value or range of values in
the dataset.
By creating a frequency distribution, researchers can gain insights into the
central tendency, variability, and shape of the data, enabling them to make
informed decisions and draw meaningful conclusions.
Creating a Frequency Distribution involves several steps:
Identify the range of values: Determine the minimum and maximum values in the dataset.
This helps define the range or interval within which the values will be grouped.
Determine the number of intervals or classes: Decide on the number of intervals or classes
to divide the data. The number of classes should be sufficient to capture the variation in the data
while maintaining clarity and interpretability.
Calculate the interval width: Divide the range of values by the number of intervals to
determine the width of each interval. This ensures that each interval is of equal width.
Count the frequency: Count the number of data points falling within each interval. This
involves examining each value in the dataset and determining its placement within the
appropriate interval.
Represent the frequency distribution: Present the frequency distribution using a table or a
visual representation such as a histogram or bar chart. The table should include the intervals,
the frequency or count for each interval, and, if applicable, cumulative frequency and relative
frequency.
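The five steps above can also be carried out programmatically. Below is a minimal sketch assuming Python with numpy and pandas (not prescribed by the source); the ages used are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: ages of 20 survey respondents
ages = np.array([22, 25, 31, 29, 35, 41, 27, 38, 45, 52,
                 23, 34, 39, 44, 28, 33, 47, 50, 26, 36])

# Steps 1-2: identify the range and choose a number of intervals (classes)
num_classes = 5

# Step 3: interval width = range of values / number of intervals
width = (ages.max() - ages.min()) / num_classes
print("Interval width:", width)

# Steps 4-5: count the frequency in each interval and present it as a table
counts, bin_edges = np.histogram(ages, bins=num_classes)
table = pd.DataFrame({
    "interval": [f"{bin_edges[i]:.0f}-{bin_edges[i+1]:.0f}"
                 for i in range(num_classes)],
    "frequency": counts,
})
table["relative_frequency"] = table["frequency"] / table["frequency"].sum()
table["cumulative_frequency"] = table["frequency"].cumsum()
print(table)
```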
Associated Statistics with Frequency Distribution
A frequency distribution provides insights into
various statistical measures, including:
Measures of central tendency:
Mean: The average value of the dataset.
Median: The middle value that separates the higher half from the lower half of the dataset.
Mode: The value(s) that appear most frequently in the dataset.
Measures of variability:
Range: The difference between the maximum and minimum values in the dataset.
Standard Deviation: A measure of the dispersion or spread of the values around the
mean.
Variance: The average squared deviation from the mean.
These statistics help researchers summarize and describe the dataset, providing a more comprehensive
understanding of its characteristics.
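These measures can be computed directly from the raw data; a minimal sketch assuming Python's standard statistics module, with hypothetical data:

```python
import statistics

# Hypothetical dataset (same style as the frequency-distribution example)
data = [22, 25, 31, 29, 35, 41, 27, 38, 45, 52]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.multimode(data)      # all most-frequent values (may be several)

# Measures of variability (sample standard deviation and variance)
data_range = max(data) - min(data)
stdev = statistics.stdev(data)
variance = statistics.variance(data)

print(mean, median, mode, data_range, stdev, variance)
```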
Hypothesis Testing
Hypothesis testing is a statistical method used to make inferences about a
population based on sample data. It involves formulating a hypothesis,
collecting and analyzing data, and drawing conclusions about the population
parameter.
Basic Concepts Concerning Hypothesis Testing
1) Null hypothesis and alternative hypothesis
In the context of statistical analysis, we often talk about the null hypothesis
and the alternative hypothesis. If we are to compare method A with method B
regarding their relative superiority and we proceed on the assumption that both
methods are equally good, then this assumption is termed the null hypothesis. As
against this, if we think that method A is superior or that method B is inferior,
we are then stating what is termed the alternative hypothesis. The null
hypothesis is generally symbolized as H0 and the alternative hypothesis as Ha.
2) The level of significance
This is a very important concept in the context of hypothesis testing.
It is always some percentage (usually 5%) which should be chosen with
great care, thought and reason. In case we take the significance level at 5
per cent, then this implies that H0 will be rejected when the sampling result
(i.e., observed evidence) has a less than 0.05 probability of occurring if H0
is true. In other words, the 5 per cent level of significance means that
researcher is willing to take as much as a 5 per cent risk of rejecting the null
hypothesis when it (H0) happens to be true. Thus the significance level is
the maximum value of the probability of rejecting H0 when it is true and is
usually determined in advance before testing the hypothesis.
3) Decision rule or test of hypothesis
Given a hypothesis H0 and an alternative hypothesis Ha, we make a rule
which is known as decision rule according to which we accept H0 (i.e., reject
Ha) or reject H0 (i.e., accept Ha). For instance, suppose H0 is that a certain lot is
good (there are very few defective items in it) and Ha is that the lot is not
good (there are too many defective items in it); then we must decide the
number of items to be tested and the criterion for accepting or rejecting the
hypothesis. We might test 10 items in the lot and plan our decision saying that
if there are none or only 1 defective item among the 10, we will accept H0
otherwise we will reject H0 (or accept Ha). This sort of basis is known as
decision rule.
4) Type I and Type II errors
In the context of testing of hypotheses, there are basically two types of errors
we can make. We may reject H0 when H0 is true and we may accept H0 when
in fact H0 is not true. The former is known as Type I error and the latter as
Type II error. In other words, Type I error means rejection of hypothesis which
should have been accepted and Type II error means accepting the hypothesis
which should have been rejected. Type I error is denoted by α (alpha), known as
the α error and also called the level of significance of the test; Type II error
is denoted by β (beta), known as the β error.
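Continuing the lot-inspection example from the decision rule above, the following sketch (assuming Python with scipy; the 10% and 30% defect rates are hypothetical) shows how the rule "accept H0 if at most 1 of 10 sampled items is defective" translates into Type I and Type II error probabilities.

```python
from scipy.stats import binom

n = 10          # items tested
accept_max = 1  # accept H0 if 0 or 1 defective items are found

# Hypothetical defect rates: 10% if the lot is good (H0), 30% if it is not (Ha)
p_good, p_bad = 0.10, 0.30

# Type I error (alpha): probability of REJECTING H0 when the lot is actually good
alpha = 1 - binom.cdf(accept_max, n, p_good)

# Type II error (beta): probability of ACCEPTING H0 when the lot is actually bad
beta = binom.cdf(accept_max, n, p_bad)

print(f"alpha (Type I error)  = {alpha:.3f}")
print(f"beta  (Type II error) = {beta:.3f}")
```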
5) Two-tailed and One-tailed tests
In the context of hypothesis testing, these two terms are quite important and
must be clearly understood. A two-tailed test rejects the null hypothesis if,
say, the sample mean is significantly higher or lower than the hypothesized
value of the mean of the population. Such a test is appropriate when the
null hypothesis states that the population mean equals some specified value and
the alternative hypothesis states that it is not equal to that value. A one-tailed
test, in contrast, is appropriate when we are interested only in whether the mean
is higher (or only in whether it is lower) than the hypothesized value, so the
rejection region lies entirely in one tail of the distribution.
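As an illustration of a two-tailed test, here is a minimal sketch of a one-sample t-test assuming Python with scipy; the satisfaction scores and the hypothesized mean of 50 are hypothetical.

```python
from scipy import stats

# Hypothetical sample: customer satisfaction scores
sample = [48, 52, 55, 47, 53, 49, 51, 56, 50, 54]

# H0: population mean = 50   vs   Ha: population mean != 50 (two-tailed)
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0 at the 5% level of significance")
else:
    print("Fail to reject H0 at the 5% level of significance")
```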
Hypothesis testing helps researchers make evidence-based decisions, validate
assumptions, and determine the statistical significance of their findings.
Parametric Test
In statistics, generalizations about the mean of the original population are made
using a parametric test, which is a kind of hypothesis test. A common example is
Student's t-test.
The t-test rests on the underlying assumption that the variable is normally
distributed. Here, the value of the mean is known, or is assumed or taken to be
known, and the population variance is estimated in order to draw inferences about
the population from the sample. The variables of concern are measured on an
interval scale, and hypotheses are framed about the population parameters.
Non-Parametric Test
A non-parametric test makes no assumption about the distribution of the
population.
It is a type of hypothesis test that does not depend on any underlying
distributional assumption.
In a non-parametric test, the test typically depends on the value of the median
rather than the mean.
This method of testing is also known as distribution-free testing.
Test values are based on data measured at the ordinal or nominal level.
A non-parametric test is usually performed when the variables of interest are
non-metric (nominal or ordinal).
Parametric Tests
Student's T-Test:- This test is used when the samples are small and population variances are unknown. The test is used to
do a comparison between two means and proportions of small independent samples and between the population mean and
sample mean.
1 Sample T-Test:- Through this test, the mean of a single group of observations is compared with a specified value.
Unpaired 2 Sample T-Test:- The test is performed to compare the means of two independent samples. The samples are assumed to
come from normal populations having equal but unknown variances.
Paired 2 Sample T-Test:- In the case of paired data of observations from a single sample, the paired 2 sample t-test is used.
ANOVA:- Analysis of variance is used to test for differences among the mean values of more than two groups.
One Way ANOVA:- This test is useful when different testing groups differ by only one factor.
Two Way ANOVA:- When various testing groups differ by two or more factors, then a two way ANOVA test is used.
Pearson's Correlation Coefficient:- This coefficient estimates the strength of the linear relationship between two variables. The test is
used in finding the relationship between two continuous, quantitative variables.
Z - Test:- The test is used to compare two means when the population variances are known or the samples are large.
Z - Proportionality Test:- It is used in calculating the difference between two proportions.
All these tests are based on the assumption of normality i.e., the source of data is considered to be normally distributed. In
some cases the population may not be normally distributed, yet the tests will be applicable on account of the fact that we
mostly deal with samples and the sampling distributions closely approach normal distributions.
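Several of these parametric tests are available in common statistical libraries; below is a minimal sketch assuming Python with scipy (the source names no software), using hypothetical samples.

```python
from scipy import stats

# Hypothetical samples (illustrative only)
group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 29, 33, 35, 30, 32]
group_c = [40, 38, 42, 39, 41, 37]

# Unpaired 2-sample t-test: compare the means of two independent samples
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# One-way ANOVA: test whether the means of more than two groups differ
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)

# Pearson's correlation coefficient between two continuous variables
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]
r, p_r = stats.pearsonr(x, y)

print(f"t-test:  t = {t_stat:.3f}, p = {p_t:.3f}")
print(f"ANOVA:   F = {f_stat:.3f}, p = {p_f:.3f}")
print(f"Pearson: r = {r:.3f},  p = {p_r:.3f}")
```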
Non-Parametric Tests
1 Sample Sign Test:- In this test, the median of a population is calculated and is
compared to the target value or reference value.
1 Sample Wilcoxon Signed Rank Test:- Through this test also, the population median
is calculated and compared with the target value but the data used is extracted from the
symmetric distribution.
Friedman Test:- This test compares differences across related groups when the dependent variable is
ordinal; it can also be applied to continuous data that do not meet parametric assumptions.
Goodman and Kruskal's Gamma:- It is a test of association used for ranked (ordinal) variables.
Kruskal-Wallis Test:- This test is used to determine whether the medians of two or more groups differ. The
calculations in this test are based on the ranks of the data points.
The Mann-Kendall Trend Test:- The test helps in finding the trends in time-series data.
Mann-Whitney Test:- To compare differences between two independent groups, this
test is used. The condition used in this test is that the dependent values must be
continuous or ordinal.
Mood's Median Test:- This test is used when there are two independent samples.
Spearman Rank Correlation:- This technique is used to estimate the relation between
two sets of data.
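A parallel sketch for a few of the non-parametric tests, again assuming Python with scipy and hypothetical data:

```python
from scipy import stats

# Hypothetical samples (illustrative only)
group_a = [12, 15, 14, 10, 18, 13]
group_b = [22, 19, 25, 21, 24, 20]
group_c = [30, 28, 33, 29, 31, 27]

# Mann-Whitney test: compare two independent groups without assuming normality
u_stat, p_u = stats.mannwhitneyu(group_a, group_b)

# Kruskal-Wallis test: compare the medians of two or more groups using ranks
h_stat, p_h = stats.kruskal(group_a, group_b, group_c)

# Spearman rank correlation: estimate the relation between two sets of ranked data
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 6, 5]
rho, p_rho = stats.spearmanr(x, y)

print(f"Mann-Whitney:   U = {u_stat:.1f}, p = {p_u:.3f}")
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_h:.3f}")
print(f"Spearman:       rho = {rho:.3f}, p = {p_rho:.3f}")
```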
Using Excel for Statistical Analysis
Excel is a widely used tool for statistical analysis due to its accessibility and versatility. It offers various
functions and features that facilitate data management, calculation of descriptive statistics, and conducting
basic statistical tests.
Key features of Excel for statistical analysis include:
Data entry and organization: Excel provides a structured format for entering and organizing data,
making it easier to perform calculations and create charts.
Descriptive statistics: Excel offers functions such as AVERAGE, MEDIAN, MODE, STDEV, VAR, and
COUNT to calculate measures of central tendency, variability, and frequency.
Data visualization: Excel provides a range of chart types, including histograms, scatter plots, and box
plots, to visually represent data distributions and relationships.
Basic statistical tests: Excel supports common statistical tests such as t-tests, ANOVA, and correlation
analysis through built-in functions and the Analysis ToolPak add-in.
While Excel is suitable for basic statistical analysis, more advanced statistical techniques may require
specialized software like SPSS.
Using SPSS for Statistical Analysis
SPSS (Statistical Package for the Social Sciences) is a comprehensive software package specifically designed
for statistical analysis. It provides a wide range of tools and features for data management, exploratory data
analysis, hypothesis testing, and advanced statistical modeling.
Key features of SPSS for statistical analysis include:
Data management: SPSS allows for data cleaning, recoding variables, and handling missing values.
Descriptive statistics: SPSS provides various functions for calculating measures of central tendency,
variability, and distributional characteristics.
Hypothesis testing: SPSS supports a wide range of statistical tests, including t-tests, ANOVA, chi-square tests,
regression analysis, and factor analysis.
Data visualization: SPSS offers powerful charting options, allowing users to create meaningful visual
representations of their data.
Advanced analytics: SPSS includes advanced statistical techniques such as multivariate analysis, cluster
analysis, and structural equation modeling.
SPSS is widely used in academia and industry for complex statistical analysis and is a valuable tool for
researchers and analysts in various fields.
Thanks for Your Patient Listening