Summarizing Data
Introduction
• A measure of central tendency is a single value that attempts
to describe a set of data by identifying the central position
within that set of data.
• Measures of central tendency are sometimes called
measures of central location.
• They are also classed as summary statistics.
• The mean (often called the average) is probably the
measure of central tendency that you are most familiar with,
but there are others, such as the median and the mode.
• The mean, median and mode are all valid measures of
central tendency, but under different conditions, some
measures of central tendency become more appropriate to
use than others.
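As a minimal sketch of the three measures named above, the snippet below uses Python's standard `statistics` module. The `scores` list is made-up illustrative data, not taken from these slides. Note how the single large value pulls the mean above the median.

```python
import statistics

# Hypothetical exam scores, chosen only for illustration.
scores = [55, 60, 60, 65, 70, 75, 100]

mean_score = statistics.mean(scores)      # sum of the values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequently occurring value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```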
Variable
• A variable is not only something that we measure, but
also something that we can manipulate and something
we can control for.
• Dependent and independent variables
• Experimental and non-experimental research
• Classification of variables as categorical or continuous
Dependent and Independent Variables
• An independent variable is also called an experimental or predictor
variable.
• It is a variable that is being manipulated in an experiment in order to
observe the effect on a dependent variable, sometimes called
an outcome variable.
• The dependent variable (or outcome variable) is simply that: a variable
whose value depends on one or more independent variables.
• Therefore, the aim of the researcher's investigation is to examine
whether these independent variables result in a change in the
dependent variable.
• However, it is also worth noting that whilst this is the main aim of the
experiment, the researcher may also be interested to know whether the
different independent variables are themselves connected in some way.
Experimental and Non-Experimental Research
• Experimental research:
• The aim is to manipulate an independent variable(s) and then
examine the effect that this change has on a dependent
variable(s).
• Since it is possible to manipulate the independent variable(s),
experimental research has the advantage of enabling a
researcher to identify a cause-and-effect relationship between variables.
• Non-experimental research:
• The researcher does not manipulate the independent variable(s).
• This is not to say that it is impossible to do so, but doing so would be
impractical or unethical.
Categorical and Continuous Variables
• The mean (or average) is the most popular and well-known measure
of central tendency.
• It can be used with both discrete and continuous data, although it is
most often used with continuous data.
• The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set.
• So, if we have n values in a data set and they have
values x1, x2, …, xn, the sample mean, usually denoted by
x̄ (pronounced "x bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• You may have noticed that the above formula refers to the sample
mean. So, why have we called it a sample mean?
Mean (Arithmetic)
• We call it the sample mean because, in statistics, samples and
populations have very different meanings, and these
differences are very important even though, in the case of
the mean, both are calculated in the same way.
• To indicate that we are calculating the population
mean and not the sample mean, we use the Greek
lower case letter "mu", written μ, where N is the number of
values in the population:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
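The definition of the mean (sum of the values divided by the number of values) can be sketched directly in code. The function name `mean` and the `sample` data below are illustrative assumptions, not part of the slides. The same arithmetic applies whether the values form a sample or an entire population.

```python
def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


# Hypothetical data, chosen only for illustration.
sample = [2, 4, 4, 5, 10]
x_bar = mean(sample)  # (2 + 4 + 4 + 5 + 10) / 5

print("mean:", x_bar)
```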
Mean (Arithmetic)
• The mean is essentially a model of the data set.
• It is a single value chosen to represent the whole data set; it is not
necessarily the most common value (that is the mode).
• However, the mean is not often one of the actual values
observed in a data set.
• Even so, one of its important properties is that it minimizes
error in the prediction of any one value in your data set.
• That is, of all the single values you could use to stand in for the data,
the mean produces the lowest total squared error.
• An important property of the mean is that it includes every
value in the data set as part of the calculation.
• In addition, the mean is the only measure of central tendency
where the sum of the deviations of each value from the mean
is always zero.
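The two properties above can be checked numerically. This is a small sketch under stated assumptions: the `data` list is made up, "error" is taken to mean squared difference, and the helper `sum_squared_error` is a hypothetical name introduced for illustration.

```python
# Hypothetical data, chosen only for illustration.
data = [3, 7, 8, 12, 20]
m = sum(data) / len(data)  # the mean

# Property 1: the deviations from the mean sum to zero.
deviations = [x - m for x in data]
total_deviation = sum(deviations)


def sum_squared_error(center, values):
    """Total squared difference between each value and a candidate center."""
    return sum((x - center) ** 2 for x in values)


# Property 2: no other candidate center gives a smaller squared error.
sse_at_mean = sum_squared_error(m, data)
candidates = [m - 1, m - 0.5, m + 0.5, m + 1, 8]  # 8 is the median of data
mean_is_best = all(sse_at_mean <= sum_squared_error(c, data) for c in candidates)

print("sum of deviations:", total_deviation)
print("squared error at the mean:", sse_at_mean)
print("mean beats all candidates:", mean_is_best)
```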
When not to use the mean
• Variance is the average squared difference of the values from the mean.
• Unlike the previous measures of variability, the variance includes all
values in the calculation by comparing each value to the mean.
• To calculate this statistic, you calculate a set of squared differences
between the data points and the mean, sum them, and then divide by
the number of observations. Hence, it’s the average squared difference.
• There are two formulas for the variance depending on whether you are
calculating the variance for an entire population or using a sample to
estimate the population variance.
Population and Sample Variance
Population variance
The formula for the variance of an entire population is the following:
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
In the equation, σ² is the population parameter for the variance, μ is the parameter for the
population mean, and N is the number of data points, which should include the entire population.
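A minimal sketch of the population variance calculation follows, assuming the listed values make up the entire population. The function name `population_variance` and the data are illustrative. The result is compared with `statistics.pvariance` from the standard library.

```python
import statistics


def population_variance(values):
    """Average squared difference from the population mean (divides by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("population_variance requires at least one value")
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Hypothetical population, chosen only for illustration.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sigma_squared = population_variance(population)

print("population variance:", sigma_squared)
print("statistics.pvariance:", statistics.pvariance(population))
```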
Sample variance
To use a sample to estimate the variance for a population, use the following
formula. Using the previous equation with sample data tends to
underestimate the variability. Because it’s usually impossible to measure an
entire population, statisticians use the equation for sample variances much
more frequently.
$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$
In the equation, s² is the sample variance, M is the sample mean, and N is the
number of observations in the sample. N − 1 in the denominator corrects for
the tendency of a sample to underestimate the population variance.
Example of calculating the sample variance
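As a hedged illustration, the steps of the sample variance calculation can be sketched as below. The data values and the function name `sample_variance` are assumptions made for this example. The result is checked against `statistics.variance`, which also divides by N − 1.

```python
import statistics


def sample_variance(values):
    """Estimate the population variance from a sample (divides by N - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample_variance requires at least two values")
    m = sum(values) / n                                 # step 1: sample mean
    squared_diffs = [(x - m) ** 2 for x in values]      # step 2: squared differences
    return sum(squared_diffs) / (n - 1)                 # step 3: sum, divide by N - 1


# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s_squared = sample_variance(sample)

print("sample variance:", s_squared)
print("statistics.variance:", statistics.variance(sample))
```

Dividing the same sum of squared differences by N instead of N − 1 would give a smaller number, which is the underestimate the correction is meant to offset.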