Descriptive Data Mining
Chapter 4
Introduction
The increase in the use of data-mining techniques in business has been caused largely by three
events:
The explosion in the amount of data being produced and electronically tracked
The ability to electronically warehouse these data
The affordability of computer power to analyze the data
Data Preparation (Treatment of Missing Data, Identification of Outliers and Erroneous Data,
Variable Representation)
The data in a data set are often said to be “dirty” and “raw” before they have been
preprocessed
We need to put them into a form that is best suited for a data-mining algorithm
Data preparation makes heavy use of descriptive statistics and data visualization methods
Note: If the number of observations with missing values is small, throwing out these incomplete
observations may be a reasonable option.
If a variable is missing measurements for a large number of observations, removing this variable
from consideration may be an option.
Another option is to fill in missing values with estimates. Convenient choices include replacing the
missing entries for a variable with the variable’s mode, mean, or median.
Dealing with missing data requires an understanding of why the data are missing and of the impact of the missing data
If a missing value is a random occurrence, it is said to be missing completely at random (MCAR)
If the missing values are not completely random (i.e., they are correlated with the values of some other variables), they are said to be missing at random (MAR)
Data are missing not at random (MNAR) if the reason a value is missing is related to the value of the variable itself
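A minimal sketch of these missing-data options in Python using pandas; the DataFrame and the column names Income and Married are hypothetical:

import pandas as pd
import numpy as np

# Hypothetical data set with missing entries
df = pd.DataFrame({
    "Income": [52000, np.nan, 61000, 48000, np.nan],
    "Married": ["Y", "N", None, "Y", "Y"],
})

# Option 1: discard observations (rows) that contain any missing value
df_drop_rows = df.dropna()

# Option 2: discard a variable (column) that is missing for many observations
df_drop_col = df.drop(columns=["Married"])

# Option 3: impute missing entries with the variable's median (numeric)
# or mode (categorical)
df_imputed = df.copy()
df_imputed["Income"] = df_imputed["Income"].fillna(df_imputed["Income"].median())
df_imputed["Married"] = df_imputed["Married"].fillna(df_imputed["Married"].mode()[0])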
Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data-quality issues and outliers
Closer examination of outliers may reveal an error or a need for further investigation to
determine whether the observation is relevant to the current analysis
A conservative approach is to create two data sets, one with and one without outliers, and
then construct a model on both data sets
If a model’s implications depend on the inclusion or exclusion of outliers, then one should
spend additional time to track down the cause of the outliers
Example: Negative values for sales may result from a data entry error or may actually denote a
missing value.
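A rough sketch of this screening step, assuming a hypothetical Sales column; impossible values are flagged for investigation rather than silently deleted, and one data set with and one without the flagged outliers are kept:

import pandas as pd

sales = pd.DataFrame({"Sales": [120, 95, -4, 3000, 110, 98]})

# Flag impossible values (e.g., negative sales) for follow-up
suspect = sales[sales["Sales"] < 0]

# Flag statistical outliers with a simple z-score rule (|z| > 3)
z = (sales["Sales"] - sales["Sales"].mean()) / sales["Sales"].std()
with_outliers = sales                    # data set including outliers
without_outliers = sales[z.abs() <= 3]   # data set excluding outliers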
Variable Representation
In many data-mining applications, data may be recorded for so many variables that analyzing all of them is impractical
Dimension reduction: Process of removing variables from the analysis without losing any
crucial information
One way is to examine pairwise correlations to detect variables or groups of variables that
may supply similar information
Such variables can be aggregated or removed to allow more parsimonious model
development
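One way to carry out this correlation screening is sketched below with pandas; the DataFrame and its column names are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "Income": [52, 61, 48, 75, 66],
    "Spending": [50, 60, 47, 73, 64],   # nearly duplicates Income
    "Age": [34, 51, 29, 45, 62],
})

corr = df.corr()   # pairwise correlation matrix

# Report variable pairs whose correlation magnitude exceeds a threshold;
# one variable in each such pair is a candidate for removal or aggregation
threshold = 0.9
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if abs(corr.loc[col_i, col_j]) > threshold:
            print(col_i, col_j, round(corr.loc[col_i, col_j], 3))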
A critical part of data mining is determining how to represent the measurements of the
variables and which variables to consider
The treatment of categorical variables is particularly important
Often data sets contain variables that, considered separately, are not particularly insightful
but that, when combined as ratios, may represent important relationships
Note: A variable tabulating the dollars spent by a household on groceries may not be interesting
because this value may depend on the size of the household. Instead, considering the proportion of
total household spending on groceries may be more informative.
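A small illustration of this note, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({
    "GrocerySpend": [450, 900, 300],
    "TotalSpend": [2000, 6000, 1200],
})

# Replace raw grocery spending with its share of total household spending
df["GroceryShare"] = df["GrocerySpend"] / df["TotalSpend"]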
Cluster Analysis
Goal of clustering is to segment observations into similar groups based on observed variables
Can be employed during the data-preparation step to identify variables or observations that
can be aggregated or removed from consideration
Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation
Used to identify outliers
Clustering methods:
Bottom-up hierarchical clustering starts with each observation belonging to its own
cluster and then sequentially merges the most similar clusters to create a series of
nested clusters
k-means clustering assigns each observation to one of k clusters in a manner such that
the observations assigned to the same cluster are as similar as possible
Both methods depend on how the similarity of two observations is defined; hence, we have to measure similarity between observations
Euclidean distance: The most common method for measuring dissimilarity between observations when the observations include numeric (continuous) variables
Let observations u = (u_1, u_2, ..., u_q) and v = (v_1, v_2, ..., v_q) each comprise measurements of q variables
The Euclidean distance between observations u and v is:
d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}
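A minimal sketch of this distance calculation in Python, using two hypothetical observations measured on q = 3 variables:

import numpy as np

u = np.array([48.0, 52000.0, 2.0])   # hypothetical observation u
v = np.array([35.0, 61000.0, 1.0])   # hypothetical observation v

# Euclidean distance between u and v
d_uv = np.sqrt(np.sum((u - v) ** 2))
# Equivalent built-in: np.linalg.norm(u - v)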
Illustration:
KTC is a financial advising company that provides personalized financial advice to its clients
KTC would like to segment its customers into several groups (or clusters) so that customers within a group are similar, and customers in different groups are dissimilar, with respect to key characteristics
For each customer, KTC has an observation corresponding to a vector of measurements on seven customer variables, that is, (Age, Female, Income, Married, Children, Car Loan, Mortgage)
Note: Euclidean distance is highly influenced by the scale on which variables are measured.
Therefore, it is common to standardize the units of each variable j of each observation u;
Example: uj, the value of variable j in observation u, is replaced with its z-score, zj.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort
the Euclidean distance between observations.
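A brief sketch of standardizing each variable to z-scores before computing distances, so that variables measured on large scales (such as Income in dollars) do not dominate; the observations are hypothetical:

import numpy as np

# Hypothetical observations: columns are Age and Income
X = np.array([[48.0, 52000.0],
              [35.0, 61000.0],
              [61.0, 48000.0]])

# Replace each value with its z-score (sample standard deviation, ddof=1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Distance between the first two observations on the standardized scale
d_01 = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))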
When clustering observations solely on the basis of categorical variables encoded as 0–1, a
better measure of similarity between two observations can be achieved by counting the
number of variables with matching values
The simplest overlap measure is called the matching coefficient and is computed as:
(number of variables with matching values for observations u and v) / (total number of variables)
A weakness of the matching coefficient is that if two observations both have a 0 entry for a
categorical variable, this is counted as a sign of similarity between the two observations
To avoid overstating similarity due to the shared absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as:
(number of variables with matching nonzero values for observations u and v) / (total number of variables − number of variables with matching zero values for u and v)
Table 4.1: Comparison of Similarity Matrixes for Observations with Binary Variables
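A minimal sketch of both similarity measures for two hypothetical observations described by 0-1 variables:

import numpy as np

u = np.array([1, 0, 0, 1, 0])   # hypothetical binary observations
v = np.array([1, 0, 1, 1, 0])

q = len(u)
matches = np.sum(u == v)                 # variables with the same value
both_one = np.sum((u == 1) & (v == 1))   # matching nonzero (1-1) entries
both_zero = np.sum((u == 0) & (v == 0))  # matching zero (0-0) entries

matching_coefficient = matches / q
jaccard_coefficient = both_one / (q - both_zero)   # ignores 0-0 matches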
Hierarchical Clustering
Determines the similarity of two clusters by considering the similarity between the
observations composing either cluster
Starts with each observation in its own cluster and then iteratively combines the two
clusters that are the most similar into a single cluster
Given a way to measure similarity between observations, there are several clustering
method alternatives for comparing observations in two clusters to obtain a cluster similarity
measure
o Single linkage - The similarity between two clusters is defined by the similarity of the
pair of observations (one from each cluster) that are the most similar
o Complete linkage - This clustering method defines the similarity between two
clusters as the similarity of the pair of observations (one from each cluster) that are
the most different
o Group average linkage - Defines the similarity between two clusters to be the
average similarity computed over all pairs of observations between the two clusters
o Median linkage - Analogous to group average linkage, except that it uses the median of the similarities computed between all pairs of observations in the two clusters
Note: Single linkage will consider two clusters to be close if an observation in one of the clusters is
close to at least one observation in the other cluster.
Complete linkage will consider two clusters to be close if their most different pair of observations
are close. This method produces clusters such that all member observations of a cluster are
relatively close to each other.
Centroid linkage uses the averaging concept of cluster centroids to define between-cluster
similarity
Ward’s method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible
When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as: ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2
A dendrogram is a chart that depicts the set of nested clusters resulting at each step of
aggregation
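A minimal sketch of bottom-up hierarchical clustering and a dendrogram using scipy; the data matrix and the choice of complete linkage are hypothetical, and the observations are assumed to be standardized already:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical standardized observations on two variables
X = np.array([[0.2, 1.1], [0.3, 0.9], [2.5, 2.7],
              [2.4, 3.0], [5.1, 0.2], [5.3, 0.4]])

# method can be "single", "complete", "average", "median", "centroid", or "ward"
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree into three clusters and recover each observation's cluster label
labels = fcluster(Z, t=3, criterion="maxclust")

# Chart of the nested clusters formed at each step of aggregation
dendrogram(Z)
plt.show()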
k-Means Clustering
Given a value of k, the k-means algorithm randomly partitions the observations into k
clusters
After all observations have been assigned to a cluster, the resulting cluster centroids are
calculated
Using the updated cluster centroids, all observations are reassigned to the cluster with the
closest centroid
Note: The algorithm repeats this process (calculate cluster centroid, assign observation to cluster
with nearest centroid) until there is no change in the clusters or a specified maximum number of
iterations is reached.
One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should
exceed 1.0 for useful clusters.
Hierarchical clustering: Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters
k-means clustering: Suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations)
Note: Because Euclidean distance is the standard metric for k-means clustering, it is generally not as
appropriate for binary or ordinal data for which an “average” is not meaningful.
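A minimal sketch of k-means clustering with scikit-learn on standardized data; the observations, the value of k, and the random seed are hypothetical:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical observations: columns are Age and Income
X = np.array([[48, 52000], [35, 61000], [61, 48000],
              [29, 39000], [55, 90000], [42, 58000]], dtype=float)

# Standardize each variable to z-scores before clustering
Z = StandardScaler().fit_transform(X)

# Partition the observations into k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
labels = kmeans.labels_              # cluster assignment of each observation
centroids = kmeans.cluster_centers_  # centroid of each cluster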
Association Rules
Illustration: Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions.
Table 4.4 contains a small sample of data where each transaction comprises the items purchased by
a shopper in a single visit to a Hy-Vee.
An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter}”
meaning that “if a transaction includes bread and jelly it also includes peanut butter.”
The potential impact of an association rule is often governed by the number of transactions it may
affect, which is measured by computing the support count of the item set consisting of the union of
its antecedent and consequent.
Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 4.4, we see that the support count of {bread, jelly, peanut butter} is 2.
The confidence of a rule is the support count of the combined antecedent and consequent item set divided by the support count of the antecedent item set.
Note: This measure of confidence can be viewed as the conditional probability that the consequent item set occurs given that the antecedent item set occurs.
A high value of confidence suggests a rule in which the consequent is frequently true when the
antecedent is true, but a high value of confidence can be misleading.
For example, if the support of the consequent is high—that is, the item set corresponding to the
then part is very frequent—then the confidence of the association rule could be high even if there is
little or no association between the items.
The lift ratio of a rule is its confidence divided by the proportion of all transactions that contain the consequent item set. A lift ratio greater than 1 suggests that there is some usefulness to the rule and that it is better at identifying cases in which the consequent occurs than no rule at all.
For the data in Table 4.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5
and a lift ratio = 0.5/(4/10) = 1.25.
In other words, identifying a customer who purchased both bread and jelly as one who also
purchased peanut butter is 25 percent better than just guessing that a random customer purchased
peanut butter.
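A small sketch that reproduces these support, confidence, and lift calculations in Python; the ten transactions below are hypothetical stand-ins constructed to match the counts quoted from Table 4.4:

# Hypothetical transactions (each is the set of items bought in one visit)
transactions = [
    {"bread", "jelly", "peanut butter"}, {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"}, {"bread", "jelly", "milk"},
    {"bread", "milk"}, {"peanut butter", "milk"},
    {"fruit", "peanut butter"}, {"bread", "fruit"},
    {"jelly", "fruit"}, {"milk", "fruit"},
]

antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}

n = len(transactions)
support_rule = sum((antecedent | consequent) <= t for t in transactions)  # = 2
support_antecedent = sum(antecedent <= t for t in transactions)           # = 4
support_consequent = sum(consequent <= t for t in transactions)           # = 4

confidence = support_rule / support_antecedent   # 2/4 = 0.5
lift = confidence / (support_consequent / n)     # 0.5 / (4/10) = 1.25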
An association rule is ultimately judged on how actionable it is and how well it explains the
relationship between item sets
For example, Wal-Mart mined its transactional data to uncover strong evidence of the
association rule, “If a customer purchases a Barbie doll, then a customer also purchases a
candy bar”
An association rule is useful if it is well supported and explains an important previously unknown relationship
Note: The support of an association rule can generally be improved by basing it on less specific
antecedent and consequent item sets.