Summarizing Data

Measures of Central Tendency

Introduction
• A measure of central tendency is a single value that attempts
to describe a set of data by identifying the central position
within that set of data.
• Measures of central tendency are sometimes called
measures of central location.
• They are also classed as summary statistics.
• The mean (often called the average) is most likely the
measure of central tendency that you are most familiar with,
but there are others, such as the median and the mode.
• The mean, median and mode are all valid measures of
central tendency, but under different conditions, some
measures of central tendency become more appropriate to
use than others.
Variable
• A variable is not only something that we measure, but
also something that we can manipulate and something
we can control for.
• Dependent and independent variables
• Experimental and non-experimental research
• Classification of variables as categorical or continuous
Dependent and Independent
Variables
• An independent variable is also called an experimental or predictor
variable.
• It is a variable that is being manipulated in an experiment in order to
observe the effect on a dependent variable, sometimes called
an outcome variable.
• The dependent variable that is also called outcome variable is simply
that, a variable that is dependent on an independent variable(s).
• Therefore, the aim of the researcher's investigation is to examine
whether these independent variables result in a change in the
dependent variable.
• However, it is also worth noting that whilst this is the main aim of the
experiment, the researcher may also be interested to know if different
independent variables are also connected in some way.
Experimental and Non-Experimental Research

• Experimental research:
• The aim is to manipulate an independent variable(s) and then
examine the effect that this change has on a dependent
variable(s).
• Since it is possible to manipulate the independent variable(s),
experimental research has the advantage of enabling a
researcher to identify a cause and effect between variables.
• Non-experimental research:
• The researcher does not manipulate the independent variable(s).
• This is not to say that it is impossible to do so, but that it would be
impractical.
Categorical and Continuous Variables

• Categorical variables are also known as discrete or qualitative variables.


• Categorical variables can be further categorized as either nominal, ordinal
or dichotomous.
1. Nominal variables
• are variables that have two or more categories, but which do not have an intrinsic order.
• Of note, the different categories of a nominal variable can also be referred to as groups or
levels of the nominal variable.
2. Dichotomous variables
• are nominal variables which have only two categories or levels.
• Gender "male" or "female" is an example of a dichotomous variable (and also a nominal
variable) "Yes" or "No“ is another example.
3. Ordinal variables
• are variables that have two or more categories just like nominal variables
• The categories can also be ordered or ranked
• However, whilst we can rank the levels, we cannot place a "value" to them; we cannot say
that "They are OK" is twice as positive as "Not very much" for example.
Continuous variables
• are also known as quantitative variables.
• Continuous variables can be further categorized as either interval or ratio variables.
1. Interval variables
• are variables for which their central characteristic is that they can be measured along a
continuum and they have a numerical value
• (for example, temperature measured in degrees Celsius or Fahrenheit). So the difference
between 20°C and 30°C is the same as the difference between 30°C and 40°C.
• However, temperature measured in degrees Celsius or Fahrenheit is NOT a ratio variable.
2. Ratio variables
• are interval variables, but with the added condition that 0 (zero) of the measurement
indicates that there is none of that variable.
• So, temperature measured in degrees Celsius or Fahrenheit is not a ratio variable
because 0°C does not mean there is no temperature.
• However, temperature measured in Kelvin is a ratio variable as 0 Kelvin (often called
absolute zero) indicates that there is no temperature whatsoever.
• Other examples of ratio variables include height, mass, distance and many more.
• The name "ratio" reflects the fact that you can use the ratio of measurements. So, for
example, a distance of ten metres is twice the distance of 5 metres.
Mean (Arithmetic)

• The mean (or average) is the most popular and well known measure
of central tendency.
• It can be used with both discrete and continuous data, although its
use is most often with continuous data
• The mean is equal to the sum of all the values in the data set
divided by the number of values in the data set.
• So, if we have n values in a data set with values x1, x2, …, xn, the
sample mean, usually denoted by x̄ (pronounced "x bar"), is:

    x̄ = (x1 + x2 + ⋯ + xn) / n

• This formula is usually written in a slightly different manner using
the Greek capital letter ∑, pronounced "sigma", which means "sum of":

    x̄ = (∑x) / n

• You may have noticed that the above formula refers to the sample
mean. So, why have we called it a sample mean?
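As a minimal sketch in Python (on hypothetical example scores), the sample mean is just a sum divided by a count:

```python
# Sample mean: x̄ = (Σx) / n, on hypothetical example scores.
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

mean = sum(scores) / len(scores)  # Σx = 649, n = 11, so x̄ = 59.0
```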
Mean (Arithmetic)
• It is sample mean because, in statistics, samples and
populations have very different meanings and these
differences are very important, even if, in the case of
the mean, they are calculated in the same way.
• To acknowledge that we are calculating the population
mean and not the sample mean, we use the Greek
lower case letter "mu", denoted as μ
Mean (Arithmetic)
• The mean is essentially a model of your data set.
• It is the value that is most common.
• However, the mean is not often one of the actual values
observed in a data set.
• However, one of its important properties is that it minimizes
error in the prediction of any one value in your data set.
• It is the value that produces the lowest amount of error from
all other values in the data set.
• An important property of the mean is that it includes every
value in the data set as part of the calculation.
• In addition, the mean is the only measure of central tendency
where the sum of the deviations of each value from the mean
is always zero.
When not to use the mean

• The mean has one main disadvantage:


• it is particularly susceptible to the influence of outliers.
• These are values that are unusual compared to the rest of the
data set by being especially small or large in numerical value.
• For example, consider the wages of staff at a factory below:
Staff    1    2    3    4    5    6    7    8    9    10
Salary  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
• The mean salary for these ten staff is $30.7k.
• However, inspecting the raw data suggests that this mean
value might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries in the
$12k to 18k range.
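A short Python sketch using the salary table above makes the outlier effect concrete; the median stays near the typical salary while the mean is dragged upward:

```python
import statistics

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]  # in $k, from the table above

mean = statistics.mean(salaries)      # 30.7 — dragged up by the 90k and 95k outliers
median = statistics.median(salaries)  # 15.5 — close to the typical salary
```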
When not to use the mean…..
• The mean is being skewed by the two large salaries.
• Therefore, in this situation, a better measure of central tendency is the
median.
• The median is also preferred to the mean (or mode) when data is skewed (i.e.,
the frequency distribution for the data is skewed).
• If we consider the normal distribution - as this is the most frequently assessed
in statistics - when the data is perfectly normal, the mean, median and mode
are identical.
• Moreover, they all represent the most typical value in the data set.
• However, as the data becomes skewed the mean loses its ability to provide
the best central location for the data because the skewed data is dragging it
away from the typical value.
• However, the median best retains this position and is not as strongly
influenced by the skewed values.
Median
• The median is the middle score for a set of data that has been arranged
in order of magnitude. The median is less affected by outliers and
skewed data. In order to calculate the median, suppose we have the
data below:
65 55 89 56 35 14 56 55 87 45 92

• We first need to rearrange that data into order of magnitude (smallest
first):

14 35 45 55 55 56 56 65 87 89 92

• Our median mark is the middle mark - in this case, 56.
• It is the middle mark because there are 5 scores before it and 5 scores
after it.
• This works fine when you have an odd number of scores, but what
happens when you have an even number of scores?
• What if you had only 10 scores? Well, you simply have to take the
middle two scores and average the result.
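The odd/even rule above can be sketched as a small function (a hand-rolled version of what Python's `statistics.median` does):

```python
def median(values):
    """Middle value of the sorted data; mean of the middle two when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                 # odd count: single middle score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average middle two

odd_median = median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92])  # 56, as above
even_median = median([65, 55, 89, 56, 35, 14, 56, 55, 87, 45])     # (55 + 56) / 2 = 55.5
```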
Mode

• The mode is the most frequent score in our data set.


• It represents the highest bar in a bar chart or histogram.
• You can, therefore, sometimes consider the mode as
being the most popular option.
• Normally, the mode is used for
categorical data where we wish
to know which is the most
common category
Mode…..
• Mode usually leaves us with a problem when we have two
or more values that share the highest frequency
• We are now stuck as to which mode best describes the
central tendency of the data.
• This is particularly problematic when we have continuous
data because we are more likely not to have any one value
that is more frequent than the others.
• This is why the mode is very rarely used with continuous
data.
• Another problem with the mode is that it will not provide us
with a very good measure of central tendency when the
most common mark is far away from the rest of the data in
the data set.
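The tie problem described above is easy to demonstrate with Python's `collections.Counter`:

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency (there may be several)."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

single = modes(["red", "blue", "red", "green"])  # ['red'] — one clear mode
tied = modes([1, 1, 2, 2, 3])                    # [1, 2] — two values tie for most frequent
```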
Mode…
• In this diagram the mode has a value of 2.
• We can clearly see, however, that the
mode is not representative of the data,
which is mostly concentrated around the
20 to 30 value range.
• To use the mode to describe the central
tendency of this data set would be
misleading.
Summary of when to use the mean, median and
mode

• Please use the following summary table to know what the best measure
of central tendency is with respect to the different types of variable:

Type of variable                    Best measure of central tendency
Nominal                             Mode
Ordinal                             Median
Interval/Ratio (not skewed)         Mean
Interval/Ratio (skewed)             Median
MEASURE OF VARIATION
• A measure of variability is a summary statistic that represents the
amount of dispersion in a dataset.
• How spread out are the values?
• While a measure of central tendency describes the typical value,
measures of variability define how far away the data points tend to fall
from the center.
• We talk about variability in the context of a distribution of values.
• A low dispersion indicates that the data points tend to be clustered
tightly around the center.
• High dispersion signifies that they tend to fall further away.
MEASURE OF VARIATION…
• In statistics, variability, dispersion, and spread are synonyms that denote the width
of the distribution.
• Just as there are multiple measures of central tendency, there are several
measures of variability.
• The most common measures of variability are:
• The range,
• The interquartile range,
• variance,
• Standard deviation.
• The two plots below show the difference graphically for distributions with the
same mean but more and less dispersion.
• The panel on the left shows a distribution that is tightly clustered around the
average, while the distribution in the right panel is more spread out.
Importance of Variability

• Analysts frequently use the mean to summarize the center of a


population or a process.
• While the mean is relevant, people often react to variability even
more.
• When a distribution has lower variability, the values in a dataset are
more consistent.
• However, when the variability is higher, the data points are more
dissimilar and extreme values become more likely.
• Consequently, understanding variability helps you grasp the likelihood
of unusual events and gives you the full picture of a dataset.
Range
• It is the most straightforward measure of variability to calculate and
the simplest to understand.
• The range of a dataset is the difference between the largest and
smallest values in that dataset.
• For example, in the two datasets below, dataset 1 has a range of 38 –
20 = 18 while dataset 2 has a range of 52 – 11 = 41.
• Dataset 2 has a broader range and, hence, more variability than
dataset 1.
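A quick sketch in Python; the slide's datasets are not reproduced here, so the values below are hypothetical ones matching the stated minimums and maximums:

```python
dataset_1 = [25, 36, 38, 20, 32, 27, 24, 28, 33, 31]  # hypothetical, spanning 20–38
dataset_2 = [11, 52, 30, 25, 48, 15, 40, 35, 22, 45]  # hypothetical, spanning 11–52

range_1 = max(dataset_1) - min(dataset_1)  # 38 - 20 = 18
range_2 = max(dataset_2) - min(dataset_2)  # 52 - 11 = 41
```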
Range……
• Range is based on only the two most extreme values in the dataset, which makes it
very susceptible to outliers.
• If one of those numbers is unusually high or low, it affects the entire range even if it
is atypical.
• Additionally, the size of the dataset affects the range.
• In general, smaller samples are less likely to include extreme values.
• However, as you increase the sample size, you have more opportunities to obtain
these extreme values.
• Consequently, when you draw random samples from the same population, the range
tends to increase as the sample size increases.
• Consequently, use the range to compare variability only when the sample sizes are
similar.
The Interquartile Range (IQR)
and Percentiles..
• The interquartile range is the middle half of the data.
• To visualize it, think about the median value that splits the dataset in half.
• Similarly, you can divide the data into quarters.
• Statisticians refer to these quarters as quartiles and denote them from low to high as Q1,
Q2, Q3, and Q4.
• The lowest quartile (Q1) contains the quarter of the dataset with the smallest values.
• The upper quartile (Q4) contains the quarter of the dataset with the highest values.
• The interquartile range is the middle half of the data that is in between the upper and
lower quartiles.
• In other words, the interquartile range includes the 50% of data points that fall between
the 25th and 75th percentiles.
• The IQR is the red area in the graph below.
The Interquartile Range (IQR)
and Percentiles..
• The interquartile range is a robust measure of variability in a similar manner that
the median is a robust measure of central tendency.
• Neither measure is influenced dramatically by outliers because they don’t depend
on every value.
• Additionally, the interquartile range is excellent for skewed distributions, just like
the median.
• In a normal distribution, the standard deviation tells you the percentage of
observations that fall specific distances from the mean.
• However, this doesn’t work for skewed distributions, and the IQR is a great
alternative.
• The interquartile range (IQR) extends from the low end of Q2 to the upper limit of
Q3. For the dataset in the graph, it spans from 21 to 39 (IQR = 39 - 21 = 18).
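In code, the quartile cut points (the 25th, 50th, and 75th percentiles) can be obtained with Python's `statistics.quantiles`; the data below is hypothetical:

```python
import statistics

data = [5, 7, 10, 12, 13, 14, 16, 18, 21, 24, 28]  # hypothetical sorted sample

q1, q2, q3 = statistics.quantiles(data, n=4)  # 25th, 50th, 75th percentile cut points
iqr = q3 - q1                                 # width of the middle half of the data
```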
Using other percentiles

• When there is a skewed distribution, reporting the median with the


interquartile range is a particularly good combination.
• The interquartile range is equivalent to the region between the 75th
and 25th percentile (75 – 25 = 50% of the data).
• Other percentiles can be used to determine the spread of different
proportions.
• For example, the range between the 97.5th percentile and the 2.5th
percentile covers 95% of the data. The broader these ranges, the
higher the variability in your dataset.
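One way to get those percentiles with the standard library is to cut the data into 40 slices, which yields cut points every 2.5% (hypothetical data below):

```python
import statistics

data = list(range(1, 101))  # hypothetical values 1..100

points = statistics.quantiles(data, n=40)  # 39 cut points at 2.5%, 5%, ..., 97.5%
p2_5, p97_5 = points[0], points[-1]        # the 2.5th and 97.5th percentiles
middle_95 = p97_5 - p2_5                   # this range covers ~95% of the data
```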
Variance

• Variance is the average squared difference of the values from the mean.
• Unlike the previous measures of variability, the variance includes all
values in the calculation by comparing each value to the mean.
• To calculate this statistic, you calculate a set of squared differences
between the data points and the mean, sum them, and then divide by
the number of observations. Hence, it’s the average squared difference.
• There are two formulas for the variance depending on whether you are
calculating the variance for an entire population or using a sample to
estimate the population variance.
Population and sample Variance

Population variance
The formula for the variance of an entire population is the following:

    σ² = ∑(X − μ)² / N

In the equation, σ² is the population parameter for the variance, μ is the parameter for the
population mean, and N is the number of data points, which should include the entire population.
Sample variance
To use a sample to estimate the variance for a population, use the following
formula:

    s² = ∑(X − M)² / (N − 1)

Using the population equation with sample data tends to underestimate the
variability. Because it's usually impossible to measure an entire population,
statisticians use the equation for the sample variance much more frequently.
In the equation, s² is the sample variance, and M is the sample mean. N − 1
in the denominator corrects for the tendency of a sample to underestimate
the population variance.
Example of calculating the sample variance

• Here is an example using the sample formula on a dataset with 17
observations in the table below.
• The numbers in parentheses represent the corresponding table column
number.
• The procedure involves taking each observation (1), subtracting the sample
mean (2) to calculate the difference (3), and squaring that difference (4).
• Then, I sum the squared differences at the bottom of the table.
• Finally, I take the sum and divide by 16 because I’m using the sample
variance equation with 17 observations (17 – 1 = 16).
• The variance for this dataset is 201.
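The slide's 17-row table is not reproduced here, but the procedure itself takes only a few lines (a sketch checked against `statistics.variance`, which uses the same n − 1 formula):

```python
import statistics

def sample_variance(values):
    """Average squared difference from the sample mean, using an n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]  # like column (4) in the table
    return sum(squared_diffs) / (n - 1)                # divide by n - 1, not n

observations = [12, 15, 9, 21, 30, 18, 25, 11]  # hypothetical data
var = sample_variance(observations)
```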
Example of calculating the
sample variance….
• Because the calculations use the squared differences, the variance is
in squared units rather than the original units of the data.
• While higher values of the variance indicate greater variability, there
is no intuitive interpretation for specific values.
• Despite this limitation, various statistical tests use the variance in
their calculations.
• While it is difficult to interpret the variance itself, the standard
deviation resolves this problem!
Standard Deviation
• The standard deviation is the standard or typical difference between
each data point and the mean.
• When the values in a dataset are grouped closer together, you have a
smaller standard deviation.
• On the other hand, when the values are spread out more, the
standard deviation is larger because the standard distance is greater.
• Conveniently, the standard deviation uses the original units of the
data, which makes interpretation easier.
• Consequently, the standard deviation is the most widely used measure
of variability. For example, a standard deviation of 5 indicates that the
typical data point falls within plus or minus 5 minutes of the mean.
• It's often reported along with the mean: 20 minutes (s.d. 5).
Standard Deviation
In the variance section, we calculated a variance of 201 in the table.
Therefore, the standard deviation for that dataset is √201 ≈ 14.177.
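That last step is just a square root, which returns the measure to the data's original units:

```python
import math

variance = 201            # sample variance from the slide's 17-observation example
sd = math.sqrt(variance)  # ≈ 14.177
```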

• The standard deviation is similar to the mean absolute deviation. Both
use the original units of the data and both compare the data values to the
mean to assess variability. However, there are differences.
• People often confuse the standard deviation with the standard error of
the mean.
• Both measures assess variability, but they have extremely different
purposes.
The Empirical Rule for the Standard Deviation
of a Normal Distribution

• When you have normally distributed data, or approximately so, the


standard deviation becomes particularly valuable.
• You can use it to determine the proportion of the values that fall
within a specified number of standard deviations from the mean.
• For example, in a normal distribution, 68% of the values will fall
within +/- 1 standard deviation from the mean.
• This property is part of the Empirical Rule.
• This rule describes the percentage of the data that fall within specific
numbers of standard deviations from the mean for bell-shaped
curves.
The Empirical Rule for the
Standard Deviation of a Normal
Distribution
• Let’s take another look at the pizza delivery example where we have a
mean delivery time of 20 minutes and a standard deviation of 5
minutes. Using the Empirical Rule, we can use the mean and standard
deviation to determine that 68% of the delivery times will fall
between 15-25 minutes (20 +/- 5) and 95% will fall between 10-30
minutes (20 +/- 2*5).
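These percentages can be checked against the normal CDF using Python's `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

delivery = NormalDist(mu=20, sigma=5)  # pizza delivery times from the example

within_1_sd = delivery.cdf(25) - delivery.cdf(15)  # ≈ 0.6827, the "68%"
within_2_sd = delivery.cdf(30) - delivery.cdf(10)  # ≈ 0.9545, the "95%"
```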
Which is Best—the Range, Interquartile Range,
or Standard Deviation?

• Variance is not included because the variance is in squared units and


doesn’t provide an intuitive interpretation.
• When you are comparing samples that are the same size, consider
using the range as the measure of variability. It’s a reasonably
intuitive statistic.
• Just be aware that a single outlier can throw the range off. The range
is particularly suitable for small samples when you don’t have enough
data to calculate the other measures reliably, and the likelihood of
obtaining an outlier is also lower.
Which is Best—the Range,
Interquartile Range, or
Standard Deviation?
• When you have a skewed distribution, the median is a better measure
of central tendency, and it makes sense to pair it with either the
interquartile range or other percentile-based ranges because all of
these statistics divide the dataset into groups with specific proportions.
• For normally distributed data, or even data that aren’t terribly skewed,
using the tried and true combination reporting the mean and the
standard deviation is the way to go. This combination is by far the most
common. You can still supplement this approach with percentile-based
ranges as you need.
• Except for the variance, the statistics in this post are absolute measures of
variability because they use the original variable's measurement units.
Assignment
• mean absolute deviation (MAD).
• Standard Error of the Mean.
• coefficient of variation
• Analyze Descriptive Statistics in Excel/spss
• Some variation is inevitable, but problems occur at the extremes.
• Distributions with greater variability produce observations with
unusually large and small values more frequently than distributions
with less variability.
• Variability helps in assessing the sample's heterogeneity.
MEASURES OF VARIATION
• Variability is the spread or dispersion of scores.
• Measures of variability tell us:
• The extent to which the scores differ from each other or how spread
out the scores are
• How accurately the measure of central tendency describes the
distribution
• The shape of the distribution
• There are a few ways to measure variability and they include:
• 1) The Range
• 2) The Mean Deviation
• 3) The Standard Deviation
• 4) The Variance
