
Unit 3 - BA - July 2022


Unit 3

Exploring Data
© 2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a
publicly accessible website, in whole or in part.

BUSINESS ANALYTICS:
DATA ANALYSIS AND DECISION MAKING

Chapter 2
Describing the Distribution of a Single Variable


Introduction
(slide 1 of 2)

• The goal is to present data in a form that makes sense to people. Tools that are used to do this include:
  - Graphs: bar charts, pie charts, histograms, scatterplots, time series graphs
  - Numerical summary measures: counts, percentages, averages, measures of variability
  - Tables of summary measures: totals, averages, counts, grouped by categories
• It is a challenge to summarize data so that the important information stands out clearly.
Introduction
(slide 2 of 2)

• There are four steps in data analysis:
  1. Recognize a problem that needs to be solved.
  2. Gather data to help understand and then solve the problem.
  3. Analyze the data.
  4. Act on this analysis.
• It is up to you to ask good questions, and then to take advantage of the most appropriate tools to answer them.

Populations and Samples
• A population includes all of the entities of interest in a study (people, households, machines, etc.).
  - Examples:
    - All potential voters in a presidential election
    - All subscribers to cable television
    - All invoices submitted for Medicare reimbursement by nursing homes
• A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.
  - Examples: Gallup, Harris, other polls today
Data Sets, Variables, and Observations

• A data set is usually a rectangular array of data, with variables in columns and observations in rows.
• A variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or salary.
• An observation (or case or record) is a list of all variable values for a single member of a population.

Example 2.1:
Questionnaire [Link]
• Objective: To illustrate variables and observations in a typical data set.
• Solution: Data set includes observations on 30 people who responded to a questionnaire on the president’s environmental policies.
  - Variables include: age, gender, state, children, salary, opinion.
• Include a row that lists variable names.
• Include a column that shows an index of the observation.

Types of Data
(slide 1 of 5)

• A variable is numerical if meaningful arithmetic can be performed on it.
  - Otherwise, the variable is categorical.
• There is also a third data type, a date variable.
  - Excel® stores dates as numbers, but dates are treated differently from typical numbers.
• A categorical variable is ordinal if there is a natural ordering of its possible values.
  - If there is no natural ordering, it is nominal.

Types of Data
(slide 2 of 5)

• Categorical variables can be coded numerically or left uncoded.
• A dummy variable is a 0–1 coded variable for a specific category.
  - It is coded as 1 for all observations in that category and 0 for all observations not in that category.
• Categorizing a numerical variable by putting the data into discrete categories (called bins) is called binning or discretizing.
  - A variable that has been categorized in this way is called a binned or discretized variable.
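Dummy coding and binning can be sketched in a few lines of Python. The values, the cutoffs (35 and 60), and the bin names below are assumptions made for illustration; they are not taken from the questionnaire data.

```python
# Hypothetical observations (not from the questionnaire file).
ages = [23, 35, 47, 52, 61, 68]
genders = ["Male", "Female", "Female", "Male", "Female", "Male"]

# Dummy (0-1) variable: 1 if the observation is in the category, 0 otherwise.
is_female = [1 if g == "Female" else 0 for g in genders]

def bin_age(age):
    """Place a numerical age into one of three illustrative bins."""
    if age < 35:
        return "Young"
    if age < 60:
        return "Middle-aged"
    return "Elderly"

# Binned (discretized) version of the numerical Age variable.
age_bins = [bin_age(a) for a in ages]

print(is_female)
print(age_bins)
```

Note that the binned variable is categorical (and ordinal), even though it was built from a numerical one.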
Environmental Data
Using a Different Coding (slide 3 of 5)

Types of Data
(slide 4 of 5)

• A numerical variable is discrete if it results from a count, such as the number of children.
• A continuous variable is the result of an essentially continuous measurement, such as weight or height.
• Cross-sectional data are data on a cross section of a population at a distinct point in time.
• Time series data are data collected over time.

Typical Time Series Data Set
(slide 5 of 5)

Descriptive Measures for
Categorical Variables

• There are only a few possibilities for describing a categorical variable, all based on counting:
  - Count the number of categories.
  - Give the categories names.
  - Count the number of observations in each category (referred to as the count of categories).
• Once you have the counts, you can display them graphically, usually in a column chart or a pie chart.

Example 2.2:
Supermarket [Link] (slide 1 of 3)

• Objective: To summarize categorical variables in a large data set.
• Solution: Data set contains transactions made by supermarket customers over a two-year period.
  - Children, Units Sold, and Revenue are numerical.
  - Purchase Date is a date variable.
  - Transaction and Customer ID are used only to identify.
  - All of the other variables are categorical.

Example 2.2:
Supermarket [Link] (slide 2 of 3)

• To get the counts in column S, use Excel’s COUNTIF function.
• To get the percentages in column T, divide each count by the total number of observations.
• When creating charts, be careful to use appropriate scales.

Example 2.2:
Supermarket [Link] (slide 3 of 3)

• Another efficient way to find counts for a categorical variable is to use dummy (0–1) variables.
  - Recode each variable so that one category is replaced by 1 and all others by 0.
  - This can be done using a simple IF formula.
  - Find the count of that category by summing the 0s and 1s.
  - Find the percentage of that category by averaging the 0s and 1s.
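Both counting approaches can be compared side by side in Python. This is a minimal sketch on made-up category values, not the supermarket data; `Counter` plays the role that COUNTIF plays in Excel.

```python
from collections import Counter

# Hypothetical category values for a handful of transactions.
gender = ["F", "M", "F", "F", "M", "F", "F", "M"]

# Approach 1: count each category directly, then divide by the total.
counts = Counter(gender)
percentages = {cat: n / len(gender) for cat, n in counts.items()}

# Approach 2: recode to a 0-1 dummy, then sum for the count
# and average for the fraction in the category.
dummy_f = [1 if g == "F" else 0 for g in gender]
count_f = sum(dummy_f)
fraction_f = sum(dummy_f) / len(dummy_f)

print(counts["F"], count_f)
print(percentages["F"], fraction_f)
```

The two approaches agree because the average of a 0–1 variable is exactly the fraction of 1s.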

Descriptive Measures for
Numerical Variables

• There are many ways to summarize numerical variables, both with numerical summary measures and with charts.
• To learn how the values of a variable are distributed, ask:
  - What are the most “typical” values?
  - How spread out are the values?
  - What are the “extreme” values on either end?
  - Is the chart of the values symmetric about some middle value, or is it skewed in some direction? Does it have any other peculiar features besides possible skewness?
Example 2.3:
Baseball Salaries [Link] (slide 1 of 2)

• Objective: To learn how salaries are distributed across all 2011 MLB players.
• Solution: Data set contains data on 843 Major League Baseball players in the 2011 season.
  - Variables are player’s name, team, position, and salary.
• Create summary measures of baseball salaries using Excel functions.

Example 2.3:
Baseball Salaries [Link] (slide 2 of 2)

Measures of Central Tendency
(slide 1 of 3)

• The mean is the average of all values.
  - If the data set represents a sample from some larger population, this measure is called the sample mean and is denoted by X̄ (X-bar).
  - If the data set represents the entire population, it is called the population mean and is denoted by μ.
• In Excel, the mean can be calculated with the AVERAGE function.
Measures of Central Tendency
(slide 2 of 3)

• The median is the middle observation when the data are sorted from smallest to largest.
  - If the number of observations is odd, the median is literally the middle observation.
  - If the number of observations is even, the median is usually defined as the average of the two middle observations.
• In Excel, the median can be calculated with the MEDIAN function.

Measures of Central Tendency
(slide 3 of 3)

• The mode is the value that appears most often.
• In most cases where a variable is essentially continuous, the mode is not very interesting because it is often the result of a few lucky ties.
• However, it is not always a result of luck and may reveal interesting information.
• In Excel, the mode can be calculated with the MODE function.
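The three measures of central tendency can be illustrated with Python's standard `statistics` module. The salary values are hypothetical; the single large value is there to show how the mean and median can differ.

```python
import statistics

salaries = [40, 45, 45, 50, 60, 300]  # hypothetical values, in $1000s

mean = statistics.mean(salaries)      # counterpart of Excel's AVERAGE
median = statistics.median(salaries)  # counterpart of MEDIAN; averages 45 and 50 here
mode = statistics.mode(salaries)      # counterpart of MODE; 45 appears twice

print(mean, median, mode)
```

With an even number of observations, the median is the average of the two middle values, and one extreme salary pulls the mean (90) well above the median (47.5).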

Minimum, Maximum,
Percentiles, and Quartiles

• For any percentage p, the pth percentile is the value such that a percentage p of all values are less than it.
• The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.
  - The first, second, and third quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.
  - By definition, the second quartile (p = 50%) is equal to the median.
• The minimum and maximum values can be calculated with Excel’s MIN and MAX functions, and the percentiles and quartiles with Excel’s PERCENTILE and QUARTILE functions.
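A short Python sketch of quartiles follows. It uses `statistics.quantiles` with `method="inclusive"`, which interpolates between sorted data points; software packages differ in their percentile conventions, so results on other data may not match a given Excel function exactly. The values are hypothetical.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(min(values), q1, q2, q3, max(values))
print(q2 == statistics.median(values))  # the second quartile is the median
```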

Measures of Variability
(slide 1 of 3)

• The range is the maximum value minus the minimum value.
• The interquartile range (IQR) is the third quartile minus the first quartile.
  - Thus, it is the range of the middle 50% of the data.
  - It is less sensitive to extreme values than the range.
• The variance is essentially the average of the squared deviations from the mean.
  - If Xᵢ is a typical observation, its squared deviation from the mean is (Xᵢ – mean)².
Measures of Variability
(slide 2 of 3)

• The sample variance is denoted by s², and the population variance by σ².
  - If all observations are close to the mean, their squared deviations from the mean, and therefore the variance, will be relatively small.
  - If at least a few of the observations are far from the mean, their squared deviations from the mean, and therefore the variance, will be large.
• In Excel, use the VAR function to obtain the sample variance and the VARP function to obtain the population variance.
Measures of Variability
(slide 3 of 3)

• A fundamental problem with variance is that it is in squared units (e.g., $ → $²).
• A more natural measure is the standard deviation, which is the square root of variance.
  - The sample standard deviation, denoted by s, is the square root of the sample variance.
  - The population standard deviation, denoted by σ, is the square root of the population variance.
• In Excel, use the STDEV function to find the sample standard deviation or the STDEVP function to find the population standard deviation.
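The variance and standard deviation calculations can be worked by hand in Python and checked against the standard `statistics` module. The data are hypothetical. The sketch relies on the usual convention that the population variance divides the sum of squared deviations by n, while the sample variance divides by n − 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population versions divide by n; sample versions divide by n - 1.
population_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# The library functions should agree with the hand calculation.
print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
print(population_sd, sample_sd)
```

Unlike the variance, the standard deviation is in the original units of the data.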
Calculating Variance and
Standard Deviation

Empirical Rules for Interpreting Standard Deviation
(slide 1 of 3)

• The interpretation of the standard deviation can be stated as three empirical rules.
• If the values of a variable are approximately normally distributed (symmetric and bell-shaped), then the following rules hold:
  - Approximately 68% of the observations are within one standard deviation of the mean.
  - Approximately 95% of the observations are within two standard deviations of the mean.
  - Approximately 99.7% of the observations are within three standard deviations of the mean.
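The empirical rules can be checked on simulated data. The sketch below draws roughly normal values with Python's `random.gauss`; the mean of 100, standard deviation of 15, sample size, and seed are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
values = [random.gauss(100, 15) for _ in range(20000)]  # simulated data

mean = statistics.mean(values)
sd = statistics.stdev(values)

def fraction_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 3))
```

The printed fractions should land close to 0.68, 0.95, and 0.997. Running the same check on clearly skewed data would typically give noticeably different fractions.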

Empirical Rules for Baseball Salaries
(slide 2 of 3)

• The empirical rules should be applied with caution, especially when the data are clearly skewed, as illustrated by the calculations for baseball salaries below.

Empirical Rules for Interpreting Standard Deviation
(slide 3 of 3)

• The mean absolute deviation (MAD) is the average of the absolute deviations from the mean.
• In Excel, use the AVEDEV function to calculate MAD.
• There is another empirical rule for MAD: For many variables, the standard deviation is approximately 25% larger than MAD.
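MAD is simple to compute directly. The sketch below defines a hypothetical helper, `mean_absolute_deviation`, applies it to a small made-up data set, and then checks the 25% rule on simulated, roughly normal data. Small or skewed data sets can depart from the rule considerably.

```python
import random
import statistics

def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mean = statistics.mean(values)
    return sum(abs(v - mean) for v in values) / len(values)

small = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
small_mad = mean_absolute_deviation(small)

random.seed(1)  # fixed seed so the sketch is repeatable
simulated = [random.gauss(0, 1) for _ in range(20000)]
ratio = statistics.stdev(simulated) / mean_absolute_deviation(simulated)

print(small_mad)
print(round(ratio, 2))  # expected to be near 1.25 for roughly normal data
```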

Measures of Shape
(slide 1 of 2)

• Skewness occurs when there is a lack of symmetry.
  - A variable can be skewed to the right (or positively skewed) because of some really large values (e.g., really large baseball salaries).
  - Or it can be skewed to the left (or negatively skewed) because of some really small values (e.g., temperature lows in Antarctica).
• In Excel, a measure of skewness can be calculated with the SKEW function.
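One common way to measure skewness is the adjusted sample skewness, sketched below. The function name `sample_skewness` and the data are hypothetical, and the formula is one standard choice among several; it is offered as an illustration of the sign convention rather than as a guaranteed match for any particular software.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]   # one really large value
left_skewed = [-20, -3, -2, -2, -1, -1]  # one really small value

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
print(sample_skewness(left_skewed))
```

A value near zero indicates symmetry, a positive value indicates skewness to the right, and a negative value indicates skewness to the left.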

Measures of Shape
(slide 2 of 2)

• Kurtosis has to do with the “fatness” of the tails of the distribution relative to the tails of a normal distribution.
  - A distribution with high kurtosis has many more extreme observations.
• In Excel, kurtosis can be calculated with the KURT function.

Numerical Summary Measures in the
Status Bar and with StatTools

• If you select multiple cells, summary measures appear for the selected cells in the status bar at the bottom of the Excel window.
  - You can choose the summary measures that appear by right-clicking the status bar and selecting your favorites.
• Although Excel’s built-in functions can be used to calculate a number of summary measures, a much quicker way is to use the StatTools add-in.

Example 2.3 (Continued):
Baseball Salaries [Link]
• Objective: To learn the fundamentals of StatTools and use it to generate summary measures of baseball salaries.
• Solution: First, define a StatTools data set by selecting any cell in the data set and clicking the Data Set Manager button.
  - Then generate summary measures for the Salary variable by selecting One-Variable Summary from the Summary Statistics dropdown list and filling in the dialog box that appears.

Charts for Numerical Variables
• There are many graphical ways to indicate the distribution of a numerical variable.
  - For cross-sectional variables:
    - Histograms
    - Box plots
  - For time series variables:
    - Time series graphs

Histograms

• A histogram is the most common type of chart for showing the distribution of a numerical variable.
  - It is based on binning the variable, that is, dividing it up into discrete categories.
  - It is a column chart of the counts in the various categories (with no gaps between the vertical bars).
• A histogram is great for showing the shape of a distribution: whether the distribution is symmetric or skewed in one direction.

Example 2.3 (Continued):
Baseball Salaries [Link] (slide 1 of 2)

• Objective: To see the shape of the salary distribution through a histogram.
• Solution: It is possible to create a histogram with Excel tools only, but it is a tedious process.
  - The resulting table of counts is usually called a frequency table.
  - The counts are called frequencies.
  - It is much easier to create a histogram with StatTools.
    - First, designate a StatTools data set.
    - Next, select Histogram from the Summary Graphs dropdown list.
    - In the dialog box, select the Salary variable and click OK.

Example 2.3 (Continued):
Baseball Salaries [Link] (slide 2 of 2)

Example 2.4:
Late or Lost [Link] (slide 1 of 2)

• Objective: To fine-tune a histogram for a variable with integer counts.
• Solution: Data set lists the number of bags that were either late or lost for 456 flights.
  - In the Histogram dialog box, request 9 bins and set the minimum and maximum to -0.5 and 8.5.
  - StatTools divides the range into 9 equal-length bins.
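The reason for the -0.5 and 8.5 limits is easier to see in code: 9 equal bins over that range each have width 1 and are centered on the integers 0 through 8. The function `histogram_counts` and the bag counts below are hypothetical; this is a sketch of the binning idea, not of how StatTools works internally.

```python
def histogram_counts(values, minimum, maximum, n_bins):
    """Count how many values fall in each of n_bins equal-length bins."""
    width = (maximum - minimum) / n_bins
    counts = [0] * n_bins
    for v in values:
        if minimum <= v <= maximum:
            # A value exactly at the maximum is placed in the last bin.
            index = min(int((v - minimum) / width), n_bins - 1)
            counts[index] += 1
    return counts

bags = [0, 1, 1, 2, 2, 2, 3, 5, 8]  # hypothetical late-or-lost counts per flight
counts = histogram_counts(bags, -0.5, 8.5, 9)

for value, count in enumerate(counts):
    print(value, count)  # bin i covers i - 0.5 to i + 0.5
```

With these limits, each bar of the histogram corresponds to exactly one integer value.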
Example 2.4:
Late or Lost [Link] (slide 2 of 2)

Box Plots
• A box plot (or box-whisker plot) is an alternative type of chart for showing the distribution of a variable.
  - The elements of a generic box plot are shown below:

Example 2.3 (Continued):
Baseball Salaries [Link]

• Objective: To illustrate the features of a box plot, particularly how it indicates skewness.
• Solution: In StatTools, select Box-Whisker Plot from the Summary Graphs dropdown list and fill in the dialog box.

Time Series Data
• Our main interest in time series variables is how they change over time, and this information is lost in traditional summary measures and in histograms or box plots.
• For time series data, a time series graph is used. This is a graph of the values of one or more time series, using time on the horizontal axis.
  - This is always the place to start a time series analysis.

Example 2.5:
Crime in [Link] (slide 1 of 3)

• Objective: To see how time series graphs help to detect trends in crime data.
• Solution: Data set contains annual data on violent and property crimes for the years 1960 to 2010.
  - In StatTools, designate a StatTools data set.
  - Then select Time Series Graph from the Time Series and Forecasting dropdown list and fill in the resulting dialog box.

Example 2.5:
Crime in [Link] (slide 2 of 3)

Total Violent and Property Crimes

Population Totals

Example 2.5:
Crime in [Link] (slide 3 of 3)

Violent and Property Crime Rates

Example 2.6:
DJIA Monthly [Link] (slide 1 of 2)

• Objective: To find useful ways to summarize the monthly Dow data.
• Solution: Data set contains monthly values of the Dow from 1950 through 2011.
  - Create summary measures and time series graphs for monthly values and percentage changes of the Dow.

Example 2.6:
DJIA Monthly [Link] (slide 2 of 2)

Outliers
• An outlier is a value or an entire observation (row) that lies well outside of the norm.
  - Some statisticians define an outlier as any value more than three standard deviations from the mean, but this is only a rule of thumb.
  - Even if values are not unusual by themselves, there still might be unusual combinations of values.
• When dealing with outliers, it is best to run the analyses two ways: with the outliers and without them.
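The three-standard-deviation rule of thumb, and the advice to run the analysis both ways, can be sketched as follows. The data are hypothetical, with one deliberately extreme value.

```python
import statistics

# Hypothetical data: twenty ordinary values and one extreme value.
values = [48, 49, 50, 51, 52] * 4 + [500]

mean = statistics.mean(values)
sd = statistics.stdev(values)

# Rule of thumb: flag values more than three standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) > 3 * sd]
kept = [v for v in values if abs(v - mean) <= 3 * sd]

# Run the analysis two ways: with the outliers and without them.
print(outliers)
print(statistics.mean(values), statistics.median(values))
print(statistics.mean(kept), statistics.median(kept))
```

Here the mean changes substantially when the extreme value is dropped, while the median does not change at all.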
Missing Values
• Most real data sets have gaps in the data.
  - There are two issues: how to detect these missing values and what to do about them.
  - The more important issue is what to do about them:
    - One option is to simply ignore them. Then you will have to be aware of how the software deals with missing values.
    - Another option is to fill in missing values with the average of nonmissing values, but this isn’t usually a very good option.
    - A third option is to examine the nonmissing values in the row of a missing value; these values might provide clues on what the missing value should be.
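One reason filling in with the average is usually a poor option can be seen in a small sketch: the mean is unchanged, but the variability is understated. Missing values are represented by `None` here, which is an assumption of this example; the data are hypothetical.

```python
import statistics

observed = [10, None, 14, 18, None, 22]  # None marks a missing value

# Option 1: ignore the missing values.
nonmissing = [v for v in observed if v is not None]

# Option 2: fill each gap with the average of the nonmissing values.
fill_value = statistics.mean(nonmissing)
filled = [fill_value if v is None else v for v in observed]

print(statistics.mean(nonmissing), statistics.stdev(nonmissing))
print(statistics.mean(filled), statistics.stdev(filled))
```

The filled-in values sit exactly at the mean, so they add observations without adding spread, and the standard deviation shrinks.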

Excel Tables for Filtering,
Sorting, and Summarizing

• Tables are a tool introduced in Excel 2007.
  - You now have the ability to designate a rectangular data set as a table and then employ a number of powerful tools for analyzing tables.
  - These tools include:
    - Filtering
    - Sorting
    - Summarizing

Example 2.7:
Catalog [Link] (slide 1 of 2)

• Objective: To illustrate Excel tables for analyzing the HyTex data.
• Solution: Data set contains data on 1000 customers of HyTex, a fictional direct marketing company.
  - Designate the data set as a table by selecting any cell in the data set and clicking the Table button on the Insert ribbon.
  - Use the dropdown arrows next to the variable names to filter in many different ways.

Example 2.7:
Catalog [Link] (slide 2 of 2)

Filtering
• Finding records that match particular criteria is called filtering.
  - One way to filter is to create an Excel table, which automatically provides dropdown arrows next to the field names that allow you to filter.
  - There are also three ways to filter on any rectangular data set with variable names:
    1. Use the Filter button from the Sort & Filter dropdown list on the Home ribbon.
    2. Use the Filter button from the Sort & Filter group on the Data ribbon.
    3. Right-click any cell in the data set and select Filter. You get several options, the most popular of which is Filter by Selected Cell’s Value.
Example 2.7 (Continued):
Catalog [Link] (slide 1 of 2)

• Objective: To investigate the types of filters that can be applied to the HyTex data.
• Solution: There is almost no limit to the filters you can apply, but here are a few possibilities:
  - Filter on one or more values in a field.
  - Filter on more than one field.
  - Filter on a continuous numerical field.
  - Top 10 and Above/Below Average filters.
  - Filter on a text field.
  - Filter on a date field.
  - Filter on color or icon.
  - Use a custom filter.
Example 2.7 (Continued):
Catalog [Link] (slide 2 of 2)

Results from a Typical Filter


BUSINESS ANALYTICS:
DATA ANALYSIS AND DECISION MAKING

Chapter 3
Finding Relationships among Variables

Introduction
• The primary interest in data analysis is usually in relationships between variables.
  - The most useful numerical summary measure is correlation.
  - The most useful graph is a scatterplot.
  - To break down a numerical variable by a categorical variable, it is useful to create side-by-side box plots.
  - Excel’s® pivot table breaks down one variable by others so that all sorts of relationships can be uncovered very quickly.
• The diagram in the file Data Analysis [Link] gives you the big picture of which analyses are appropriate for which data types and which tools are best for performing the various analyses.
Relationships Among
Categorical Variables

• The most meaningful way to examine relationships between two categorical variables is with counts and corresponding charts of the counts.
  - You can find counts of the categories of either variable separately, as well as counts of the joint categories of the two variables.
  - Corresponding percentages of totals and charts help tell the story.
• It is customary to display all such counts in a table called a crosstabs (for crosstabulations). This is also sometimes called a contingency table.
Example 3.1:
Smoking [Link] (slide 1 of 2)

• Objective: To use a crosstabs to explore the relationship between smoking and drinking.
• Solution: Data set lists the smoking and drinking habits of 8761 adults.
  - Categories have been coded “N,” “O,” “H,” “S,” and “D” for “Non,” “Occasional,” “Heavy,” “Smoker,” and “Drinker.”
Example 3.1:
Smoking [Link] (slide 2 of 2)

• To create the crosstabs, enter the category headings in Excel and use the COUNTIFS function to fill the table with counts of joint categories.
• Next, sum across rows and down columns to get totals.
• Then express the counts as percentages of row totals and as percentages of column totals.
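The same crosstabs logic can be sketched in Python. The records below are made up, and the two-letter codes (NS, OS, HS for smoking; ND, OD, HD for drinking) are an assumption about how the slide's single-letter codes combine.

```python
from collections import Counter

# Hypothetical records; one smoking code and one drinking code per adult.
smoking = ["NS", "NS", "OS", "HS", "NS", "HS"]
drinking = ["ND", "OD", "OD", "HD", "ND", "HD"]

# Counts of joint categories (the job COUNTIFS does in Excel).
joint = Counter(zip(smoking, drinking))

# Row and column totals.
row_totals = Counter(smoking)
col_totals = Counter(drinking)

# Each joint count as a fraction of its row total and of its column total.
pct_of_row = {pair: n / row_totals[pair[0]] for pair, n in joint.items()}
pct_of_col = {pair: n / col_totals[pair[1]] for pair, n in joint.items()}

print(joint[("NS", "ND")], row_totals["NS"], col_totals["ND"])
print(pct_of_row[("NS", "ND")], pct_of_col[("NS", "ND")])
```

In this made-up data, two of the three nonsmokers are nondrinkers (a row percentage), while all nondrinkers are nonsmokers (a column percentage). The two percentages answer different questions.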
Relationships Among Categorical Variables and a
Numerical Variable

• The comparison problem is one of the most important problems in data analysis. It occurs whenever you want to compare a numerical measure across two or more subpopulations.
  - Examples:
    - The subpopulations are males and females, and the numerical measure is salary.
    - The subpopulations are different regions of the country, and the numerical measure is the cost of living.
    - The subpopulations are different days of the week, and the numerical measure is the number of customers going to a particular fast-food chain.
Stacked and Unstacked Formats
• There are two possible data formats, stacked and unstacked.
  - The data are stacked if there are two “long” variables, such as Gender and Salary. The idea is that the male salaries are stacked in with the female salaries.
    - This is the format you will see in the vast majority of situations.
  - You will occasionally see data in unstacked format, when there are two “short” variables, such as Male Salary and Female Salary.
• StatTools is capable of dealing with either format and can convert from stacked to unstacked or vice versa.
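The two formats, and the conversion between them, can be sketched with plain Python structures. The salaries are hypothetical, and the list-of-pairs and dictionary representations are choices made for this illustration.

```python
import statistics

# Stacked: one "long" category variable and one "long" value variable.
stacked = [("Male", 52), ("Female", 61), ("Female", 48), ("Male", 70)]

# Stacked to unstacked: one "short" list of values per category.
unstacked = {}
for gender, salary in stacked:
    unstacked.setdefault(gender, []).append(salary)

# Unstacked back to stacked.
restacked = [(gender, salary) for gender, salaries in unstacked.items() for salary in salaries]

# The comparison problem: summarize the numerical variable for each subpopulation.
means = {gender: statistics.mean(salaries) for gender, salaries in unstacked.items()}

print(unstacked)
print(means)
```

The row order may change in a round trip, but the same category and value pairs are preserved.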
Stacked and Unstacked Data
Stacked Data Unstacked Data

Example 3.2:
Baseball Salaries 2011 [Link] (slide 1 of 2)
 Objective: To learn methods in StatTools for breaking down
baseball salaries by various categorical variables.
 Solution: Data set contains the same 2011 baseball data examined
previously, as well as several extra categorical variables.
 Create summary measures by selecting One-Variable Summary
from the Summary Statistics dropdown list.
 Next, click the Format button and choose Stacked. Then choose the
Cat variable you want to categorize by and the Val variable you
want to summarize.
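The same kind of breakdown, a numerical Val variable summarized separately for each level of a Cat variable, can be sketched in Python. This is a rough analogue and not StatTools output; the positions and salary figures (in thousands of dollars) are made up for the example.

```python
import statistics
from collections import defaultdict

# Hypothetical stacked data: (Cat, Val) = (position, salary in $1000s).
salaries = [
    ("Pitcher", 1200), ("Catcher", 800), ("Pitcher", 4500),
    ("Outfielder", 2000), ("Catcher", 1000), ("Outfielder", 6000),
    ("Pitcher", 900),
]

# Collect the Val values for each level of the Cat variable.
by_position = defaultdict(list)
for position, salary in salaries:
    by_position[position].append(salary)

# One set of summary measures per category.
summary = {
    position: {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }
    for position, values in by_position.items()
}

for position, measures in sorted(summary.items()):
    print(position, measures)
```

Comparing the rows of this table is the comparison problem in miniature: the same measure, side by side, for each subpopulation.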
Example 3.2:
Baseball Salaries 2011 [Link] (slide 2 of 2)
 Create side-by-side boxplots by selecting Box-Whisker Plot
from the Summary Graphs dropdown list and filling in the
resulting dialog box.
 Select the Stacked format so that you can choose a Cat
variable and a Val variable.
Relationships Among Numerical
Variables
 To study relationships among numerical variables,
a new type of chart, called a scatterplot, and two
new summary measures, correlation and
covariance, are used.
 These measures can be applied to any variables that
are displayed numerically.
 However, they are appropriate only for truly
numerical variables, not for categorical variables
that have been coded numerically.
Scatterplots
 A scatterplot is a scatter of points, where each
point denotes the values of an observation for two
selected variables.
 It is a graphical method for detecting relationships
between two numerical variables.
 The two variables are often labeled generically as X
and Y, so a scatterplot is sometimes called an X-Y
chart.
 The purpose of a scatterplot is to make a relationship
(or the lack of it) apparent.
Example 3.3:
[Link] (slide 1 of 2)
 Objective: To use scatterplots to search for relationships in the golf
data.
 Solution: Data set includes an observation (stats) for each of the
top 200 earners on the PGA Tour.
 In StatTools, designate a StatTools data set for a particular year.
 Next, select Scatterplot from the Summary Graphs dropdown list
and then select at least one X variable and at least one Y variable.
Trend Lines in Scatterplots
 Once you have a scatterplot, Excel enables you to
superimpose one of several trend lines on the
scatterplot.
 A trend line is a line or curve that “fits” the scatter as
well as possible.
 This could be a straight line, or it could be one of
several types of curves.
 To do this, right-click on any point in the chart,
select Add Trendline, and fill out the resulting
dialog box.
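For the straight-line case, the fitted trend line is the least squares line. A minimal Python sketch of that calculation follows; the data points and the function name `fit_line` are hypothetical, and the sketch covers only the linear option, not the curved trend lines.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations, and sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical scatterplot data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The printed equation plays the same role as the equation Excel can display on the chart next to the trend line.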
Scatterplot with Trend Line and Equation
Superimposed
Correlation and Covariance
(slide 1 of 4)
 Correlation and covariance measure the strength and
direction of a linear relationship between two
numerical variables.
 The relationship is “strong” if the points in a scatterplot
cluster tightly around some straight line.
 If this straight line rises from left to right, the relationship is
positive and the measures will be positive numbers.
 If it falls from left to right, the relationship is negative and the
measures will be negative numbers.
 The two numerical variables must be “paired” variables.
 They must have the same number of observations, and the
values for any observation should be naturally paired.
Correlation and Covariance
(slide 2 of 4)
 Covariance is essentially an average of products of
deviations from means.
 Excel has a built-in COVAR function, and StatTools
also calculates covariances automatically.
 Covariance has a serious limitation as a descriptive
measure because it is very sensitive to the units in
which X and Y are measured.
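The sensitivity to units can be seen in a short Python sketch. It computes covariance as the sum of products of deviations from means divided by n - 1; that denominator is an assumption here (the sample convention), and some tools divide by n instead. The data values are hypothetical.

```python
def covariance(xs, ys):
    """Sample covariance: average product of deviations from means (n - 1 denominator)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired data: X measured in thousands of dollars.
x_thousands = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# The same X values expressed in dollars instead.
x_dollars = [x * 1000 for x in x_thousands]

cov_thousands = covariance(x_thousands, y)
cov_dollars = covariance(x_dollars, y)

print(cov_thousands, cov_dollars)
```

Nothing about the relationship changed between the two calls, yet the covariance grew by a factor of 1000, which is why its magnitude is hard to interpret on its own.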
Correlation and Covariance
(slide 3 of 4)
 Correlation is a unitless quantity that is unaffected
by the measurement scale.
 The correlation is always between -1 and +1.
 The closer it is to either of these two extremes, the
closer the points in a scatterplot are to a straight line.
 Excel has a built-in CORREL function, and
StatTools also calculates correlations automatically.
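As a sketch of why correlation is unaffected by the measurement scale, the Python code below computes it as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations, which is equivalent to covariance divided by the product of the standard deviations. The data are hypothetical and the function name `correlation` is chosen only for this example.

```python
import math

def correlation(xs, ys):
    """Correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Any common denominator (n or n - 1) cancels, so it is omitted.
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data.
x_thousands = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x_thousands, y)
# Rescaling X (thousands of dollars to dollars) leaves the correlation unchanged.
r_rescaled = correlation([x * 1000 for x in x_thousands], y)

print(round(r, 4), round(r_rescaled, 4))
```

Because the units of X and Y appear in both the numerator and the denominator, they cancel, and the result always lies between -1 and +1.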
Correlation and Covariance
(slide 4 of 4)
 Three important points about scatterplots,
correlations, and covariances:
 A correlation is a single-number summary of a
scatterplot. It never conveys as much information as
the full scatterplot.
 You are usually on the lookout for large correlations,
those near -1 or +1.
 Do not even try to interpret covariances numerically
except possibly to check whether they are positive or
negative. For interpretive purposes, concentrate on
correlations.
Example 3.3 (Continued)
[Link] (slide 1 of 2)
 Objective: To use correlations to understand
relationships in the golf data.
 Solution: In StatTools, create a table of correlations by
selecting Correlation and Covariance from the
Summary Statistics dropdown list.
 Fill in the resulting dialog box and check Correlations.
Example 3.3 (Continued)
[Link] (slide 2 of 2)
 You can learn more about a correlation by creating
the corresponding scatterplot.
Pivot Tables
 The pivot table is an Excel tool that allows you to
break data down by categories.
 Sometimes pivot tables are used to display tables of
counts, often called crosstabs or contingency
tables.
 However, crosstabs typically list only counts,
whereas pivot tables can list counts, sums,
averages, and other summary measures.
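The idea behind a pivot table, breaking a value down by the categories of another variable and choosing a summary function, can be sketched in a few lines of Python. This is only an analogy to the Excel tool; the order records, the field names, and the helper `pivot` are hypothetical.

```python
from collections import defaultdict

# Hypothetical order records.
orders = [
    {"Region": "South", "Gender": "Female", "Total Cost": 120.0},
    {"Region": "South", "Gender": "Male", "Total Cost": 80.0},
    {"Region": "West", "Gender": "Female", "Total Cost": 200.0},
    {"Region": "South", "Gender": "Female", "Total Cost": 100.0},
]

def pivot(rows, row_field, value_field):
    """Summarize value_field by count, sum, and average for each row_field category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[row_field]].append(row[value_field])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "average": sum(values) / len(values),
        }
        for key, values in groups.items()
    }

by_region = pivot(orders, "Region", "Total Cost")
for region, measures in sorted(by_region.items()):
    print(region, measures)
```

A crosstab corresponds to keeping only the "count" entry; the other entries are what a pivot table adds.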
Example 3.4:
Elecmart [Link] (slide 1 of 2)
 Objective: To use pivot tables to break down the customer order
data by a number of categorical variables.
 Solution: The data set contains 400 customer orders placed over
several months at the Elecmart company.
 Create a pivot table by clicking the PivotTable button on the
Insert ribbon.
Hiding Categories (Filtering)
 You can filter out any items in a pivot table that you don’t want
to see.
 Click the Row Labels dropdown arrow of the active field and check
the items you want to filter on.
 A pivot table with hidden categories is shown below.
Sorting on Values or Categories
 It is easy to sort in a pivot table, either by the
numbers in the Values area or by the labels in a
Rows or Columns field.
 To sort by the numbers in the Values area, right-click
any number and select Sort.
 To sort on the labels of a Rows or Columns field, right-
click any of the categories and select Sort.
 You can also click the dropdown arrow for the field and get
the dialog box that allows both sorting and filtering.
Changing Locations of Fields (Pivoting)
 You can choose where to place variables in a pivot
table.
 For example, to place the Region variable in the
Columns area, drag the Region button from the Rows
area of the PivotTable Fields pane to the Columns area.
Changing Field Settings
 You can change various settings in the Field Settings dialog
box.
 To get to this dialog box:
 Click the Field Settings button on the Analyze/Options ribbon.
 OR right-click any of the pivot table cells and select the Field Settings item.
 The pivot table with Value Field Settings changed to Average is
shown below.
Pivot Charts
 It is easy to accompany pivot tables with pivot charts.
 These charts adapt automatically to the underlying pivot table.
 To create a pivot chart, click anywhere inside the pivot table,
select the PivotChart button on the Analyze/Options ribbon, and
select a chart type.
Multiple Variables in the Values Area
 More than a single variable can be placed in the
Values area.
 Also, a given variable in the Values area can be
summarized by more than one summarizing
function.
Summarizing by Count
 The variable in the Values area can be summarized by
the Count function.
 This is useful when you want to know, for example, how
many of the orders were placed by females in the South.
 Right-click any number in the pivot table, select Value
Field Settings, and select the Count function.
Grouping
 Categories in a Rows or Columns variable can be grouped.
 Suppose you want to summarize Sum of Total Cost by
Date.
 Starting with a blank pivot table, check both Date and Total
Cost in the PivotTable Fields pane.
 Then right-click any date and select Group.
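Grouping dates amounts to mapping each date to a coarser category, such as its month, and then summarizing within each group. Here is a minimal Python sketch of summing Total Cost by month; the dates and costs are hypothetical, and grouping by month is just one choice among the possible groupings.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (Date, Total Cost) pairs.
orders = [
    (date(2012, 3, 2), 100.0),
    (date(2012, 3, 15), 50.0),
    (date(2012, 4, 1), 75.0),
    (date(2012, 4, 20), 25.0),
    (date(2012, 5, 5), 10.0),
]

# Group each date by (year, month) and sum Total Cost within each group.
monthly_total = defaultdict(float)
for order_date, total_cost in orders:
    monthly_total[(order_date.year, order_date.month)] += total_cost

for (year, month), total in sorted(monthly_total.items()):
    print(f"{year}-{month:02d}: {total:.2f}")
```

Without grouping, each distinct date would be its own row, which is rarely a useful summary.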
Other Pivot Table Features
 Showing/hiding subtotals and grand totals (check the Layout options on the
Design ribbon)
 Dealing with blank rows, that is, categories with no data (right-click any
number, choose PivotTable Options, and check the options on the Layout &
Format tab)
 Displaying the data behind a given number in a pivot table (double-click any
number in the Values area to get a new worksheet)
 Formatting a pivot table with various styles (check the style options on the
Design ribbon)
 Moving or renaming pivot tables (check the PivotTable and Action groups on
the Analyze/Options ribbon)
 Refreshing pivot tables as the underlying data changes (check the Refresh
dropdown list on the Analyze/Options ribbon)
 Creating pivot table formulas for calculated fields or calculated items (check
the Formulas dropdown list on the Analyze/Options ribbon)
Example 3.5:
Lasagna [Link] (slide 1 of 2)
 Objective: To use pivot tables to explore which demographic
variables help to distinguish lasagna triers from nontriers.
 Solution: Data set contains data on over 800 potential
customers being tracked by a frozen lasagna company.
 Set up a pivot table that shows counts of triers and nontriers
for different categories of the variables.
Example 3.5:
Lasagna [Link] (slide 2 of 2)
[Figure: Pivot Table and Pivot Chart for Examining the Effect of Gender]
Slicers and Timelines
 In Excel 2010, Microsoft added slicers—lists of
the distinct values of any variable, which you can
then filter on.
 You add a slicer from the Analyze/Options ribbon
under PivotTable Tools.
 In Excel 2013, a Timeline feature was added. A
Timeline is like a slicer, but it is specifically for
filtering on a date variable.
Pivot Table with Slicers and a Timeline