Descriptive Data Mining
Chapter 4
Introduction
The increase in the use of data-mining techniques in business has been caused largely by three
events:
The explosion in the amount of data being produced and electronically tracked
The ability to electronically warehouse these data
The affordability of computer power to analyze the data
Data Preparation (Treatment of Missing Data, Identification of Outliers and Erroneous Data,
Variable Representation)
The data in a data set are often said to be “dirty” and “raw” before they have been
preprocessed
We need to put them into a form that is best suited for a data-mining algorithm
Data preparation makes heavy use of descriptive statistics and data visualization methods
Note: If the number of observations with missing values is small, throwing out these incomplete
observations may be a reasonable option.
If a variable is missing measurements for a large number of observations, removing this variable
from consideration may be an option.
Another option is to fill in missing values with estimates. Convenient choices include replacing the
missing entries for a variable with the variable’s mode, mean, or median.
Dealing with missing data requires an understanding of why the data are missing and of the impact of the missing data
If a missing value is a random occurrence, it is said to be missing completely at random (MCAR)
If the missing values are not completely random (i.e., they are correlated with the values of some other variables), they are said to be missing at random (MAR)
Data are missing not at random (MNAR) if the reason a value is missing is related to the value of the variable itself
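A minimal sketch of these missing-data options in Python using pandas; the DataFrame and the column names Income and Married are hypothetical:

import pandas as pd
import numpy as np

# Hypothetical data set with missing entries
df = pd.DataFrame({
    "Income": [52000, np.nan, 61000, 48000, np.nan],
    "Married": ["Y", "N", None, "Y", "Y"],
})

# Option 1: discard observations (rows) that contain any missing value
df_drop_rows = df.dropna()

# Option 2: discard a variable (column) that is missing for many observations
df_drop_col = df.drop(columns=["Married"])

# Option 3: impute missing entries with the variable's median (numeric)
# or mode (categorical)
df_imputed = df.copy()
df_imputed["Income"] = df_imputed["Income"].fillna(df_imputed["Income"].median())
df_imputed["Married"] = df_imputed["Married"].fillna(df_imputed["Married"].mode()[0])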
Examining the variables in the data set by means of summary statistics, histograms, PivotTables, scatter plots, and other tools can uncover data-quality issues and outliers
Closer examination of outliers may reveal an error or a need for further investigation to
determine whether the observation is relevant to the current analysis
A conservative approach is to create two data sets, one with and one without outliers, and
then construct a model on both data sets
If a model’s implications depend on the inclusion or exclusion of outliers, then one should
spend additional time to track down the cause of the outliers
Example: Negative values for sales may result from a data entry error or may actually denote a
missing value.
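A rough sketch of this screening step, assuming a hypothetical Sales column; impossible values are flagged for investigation rather than silently deleted, and one data set with and one without the flagged outliers are kept:

import pandas as pd

sales = pd.DataFrame({"Sales": [120, 95, -4, 3000, 110, 98]})

# Flag impossible values (e.g., negative sales) for follow-up
suspect = sales[sales["Sales"] < 0]

# Flag statistical outliers with a simple z-score rule (|z| > 3)
z = (sales["Sales"] - sales["Sales"].mean()) / sales["Sales"].std()
with_outliers = sales                    # data set including outliers
without_outliers = sales[z.abs() <= 3]   # data set excluding outliers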
Variable Representation
In many data-mining applications, data may be recorded for so many variables that analyzing all of them is impractical
Dimension reduction: Process of removing variables from the analysis without losing any
crucial information
One way is to examine pairwise correlations to detect variables or groups of variables that
may supply similar information
Such variables can be aggregated or removed to allow more parsimonious model
development
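One way to carry out this correlation screening is sketched below with pandas; the DataFrame and its column names are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "Income": [52, 61, 48, 75, 66],
    "Spending": [50, 60, 47, 73, 64],   # nearly duplicates Income
    "Age": [34, 51, 29, 45, 62],
})

corr = df.corr()   # pairwise correlation matrix

# Report variable pairs whose correlation magnitude exceeds a threshold;
# one variable in each such pair is a candidate for removal or aggregation
threshold = 0.9
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if abs(corr.loc[col_i, col_j]) > threshold:
            print(col_i, col_j, round(corr.loc[col_i, col_j], 3))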
A critical part of data mining is determining how to represent the measurements of the
variables and which variables to consider
The treatment of categorical variables is particularly important
Often data sets contain variables that, considered separately, are not particularly insightful
but that, when combined as ratios, may represent important relationships
Note: A variable tabulating the dollars spent by a household on groceries may not be interesting
because this value may depend on the size of the household. Instead, considering the proportion of
total household spending on groceries may be more informative.
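A small illustration of this note, with hypothetical column names:

import pandas as pd

df = pd.DataFrame({
    "GrocerySpend": [450, 900, 300],
    "TotalSpend": [2000, 6000, 1200],
})

# Replace raw grocery spending with its share of total household spending
df["GroceryShare"] = df["GrocerySpend"] / df["TotalSpend"]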
Cluster Analysis
Goal of clustering is to segment observations into similar groups based on observed variables
Can be employed during the data-preparation step to identify variables or observations that
can be aggregated or removed from consideration
Commonly used in marketing to divide customers into different homogeneous groups; known as market segmentation
Used to identify outliers
Clustering methods:
Bottom-up hierarchical clustering starts with each observation belonging to its own
cluster and then sequentially merges the most similar clusters to create a series of
nested clusters
k-means clustering assigns each observation to one of k clusters in a manner such that
the observations assigned to the same cluster are as similar as possible
Both methods depend on how the similarity of two observations is defined; hence, we have to measure similarity between observations
Euclidean distance: The most common method for measuring dissimilarity between observations when the observations include numeric (continuous) variables
Let observations u = (u_1, u_2, ..., u_q) and v = (v_1, v_2, ..., v_q) each comprise measurements of q variables
The Euclidean distance between observations u and v is:
d(u, v) = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}
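A minimal sketch of this distance calculation in Python, using two hypothetical observations measured on q = 3 variables:

import numpy as np

u = np.array([48.0, 52000.0, 2.0])   # hypothetical observation u
v = np.array([35.0, 61000.0, 1.0])   # hypothetical observation v

# Euclidean distance between u and v
d_uv = np.sqrt(np.sum((u - v) ** 2))
# Equivalent built-in: np.linalg.norm(u - v)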
Illustration:
KTC is a financial advising company that provides personalized financial advice to its clients
KTC would like to segment its customers into several groups (or clusters) so that customers within a group are similar, and customers in different groups are dissimilar, with respect to key characteristics
For each customer, KTC has an observation corresponding to a vector of measurements on seven customer variables, that is, (Age, Female, Income, Married, Children, Car Loan, Mortgage)
Note: Euclidean distance is highly influenced by the scale on which variables are measured.
Therefore, it is common to standardize the units of each variable j of each observation u;
Example: uj, the value of variable j in observation u, is replaced with its z-score, zj.
The conversion to z-scores also makes it easier to identify outlier measurements, which can distort
the Euclidean distance between observations.
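A brief sketch of standardizing each variable to z-scores before computing distances, so that variables measured on large scales (such as Income in dollars) do not dominate; the observations are hypothetical:

import numpy as np

# Hypothetical observations: columns are Age and Income
X = np.array([[48.0, 52000.0],
              [35.0, 61000.0],
              [61.0, 48000.0]])

# Replace each value with its z-score (sample standard deviation, ddof=1)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Distance between the first two observations on the standardized scale
d_01 = np.sqrt(np.sum((Z[0] - Z[1]) ** 2))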
When clustering observations solely on the basis of categorical variables encoded as 0–1, a
better measure of similarity between two observations can be achieved by counting the
number of variables with matching values
The simplest overlap measure is called the matching coefficient and is computed as:
(number of variables with matching values for observations u and v) / (total number of variables)
A weakness of the matching coefficient is that if two observations both have a 0 entry for a
categorical variable, this is counted as a sign of similarity between the two observations
To avoid overstating similarity due to the shared absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as:
(number of variables with matching nonzero values for observations u and v) / (total number of variables − number of variables with matching zero values for u and v)
Table 4.1: Comparison of Similarity Matrixes for Observations with Binary Variables
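A minimal sketch of both similarity measures for two hypothetical observations described by 0-1 variables:

import numpy as np

u = np.array([1, 0, 0, 1, 0])   # hypothetical binary observations
v = np.array([1, 0, 1, 1, 0])

q = len(u)
matches = np.sum(u == v)                 # variables with the same value
both_one = np.sum((u == 1) & (v == 1))   # matching nonzero (1-1) entries
both_zero = np.sum((u == 0) & (v == 0))  # matching zero (0-0) entries

matching_coefficient = matches / q
jaccard_coefficient = both_one / (q - both_zero)   # ignores 0-0 matches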
Hierarchical Clustering
Determines the similarity of two clusters by considering the similarity between the
observations composing either cluster
Starts with each observation in its own cluster and then iteratively combines the two
clusters that are the most similar into a single cluster
Given a way to measure similarity between observations, there are several clustering
method alternatives for comparing observations in two clusters to obtain a cluster similarity
measure
o Single linkage - The similarity between two clusters is defined by the similarity of the
pair of observations (one from each cluster) that are the most similar
o Complete linkage - This clustering method defines the similarity between two
clusters as the similarity of the pair of observations (one from each cluster) that are
the most different
o Group average linkage - Defines the similarity between two clusters to be the
average similarity computed over all pairs of observations between the two clusters
o Median linkage - Analogous to group average linkage, except that it uses the median of the similarities computed between all pairs of observations in the two clusters
Note: Single linkage will consider two clusters to be close if an observation in one of the clusters is
close to at least one observation in the other cluster.
Complete linkage will consider two clusters to be close if their most different pair of observations
are close. This method produces clusters such that all member observations of a cluster are
relatively close to each other.
Centroid linkage uses the averaging concept of cluster centroids to define between-cluster
similarity
Ward’s method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible
When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as: ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2
A dendrogram is a chart that depicts the set of nested clusters resulting at each step of
aggregation
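A minimal sketch of bottom-up hierarchical clustering and a dendrogram using scipy; the data matrix and the choice of complete linkage are hypothetical, and the observations are assumed to be standardized already:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical standardized observations on two variables
X = np.array([[0.2, 1.1], [0.3, 0.9], [2.5, 2.7],
              [2.4, 3.0], [5.1, 0.2], [5.3, 0.4]])

# method can be "single", "complete", "average", "median", "centroid", or "ward"
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree into three clusters and recover each observation's cluster label
labels = fcluster(Z, t=3, criterion="maxclust")

# Chart of the nested clusters formed at each step of aggregation
dendrogram(Z)
plt.show()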
k-Means Clustering
Given a value of k, the k-means algorithm randomly partitions the observations into k
clusters
After all observations have been assigned to a cluster, the resulting cluster centroids are
calculated
Using the updated cluster centroids, all observations are reassigned to the cluster with the
closest centroid
Note: The algorithm repeats this process (calculate cluster centroid, assign observation to cluster
with nearest centroid) until there is no change in the clusters or a specified maximum number of
iterations is reached.
One rule of thumb is that the ratio of between-cluster distance to within-cluster distance should
exceed 1.0 for useful clusters.
Hierarchical clustering: Suitable when we have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters
k-means clustering: Suitable when you know how many clusters you want and you have a larger data set (e.g., larger than 500 observations)
Note: Because Euclidean distance is the standard metric for k-means clustering, it is generally not as
appropriate for binary or ordinal data for which an “average” is not meaningful.
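A minimal sketch of k-means clustering with scikit-learn on standardized data; the observations, the value of k, and the random seed are hypothetical:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical observations: columns are Age and Income
X = np.array([[48, 52000], [35, 61000], [61, 48000],
              [29, 39000], [55, 90000], [42, 58000]], dtype=float)

# Standardize each variable to z-scores before clustering
Z = StandardScaler().fit_transform(X)

# Partition the observations into k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
labels = kmeans.labels_              # cluster assignment of each observation
centroids = kmeans.cluster_centers_  # centroid of each cluster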
Association Rules
Illustration: Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions.
Table 4.4 contains a small sample of data where each transaction comprises the items purchased by
a shopper in a single visit to a Hy-Vee.
An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter}”
meaning that “if a transaction includes bread and jelly it also includes peanut butter.”
The potential impact of an association rule is often governed by the number of transactions it may
affect, which is measured by computing the support count of the item set consisting of the union of
its antecedent and consequent.
Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 4.4, we see that the support count of {bread, jelly, peanut butter} is 2.
The confidence of a rule is the support count of the combined antecedent and consequent item set divided by the support count of the antecedent item set.
Note: This measure of confidence can be viewed as the conditional probability that the consequent item set occurs given that the antecedent item set occurs.
A high value of confidence suggests a rule in which the consequent is frequently true when the
antecedent is true, but a high value of confidence can be misleading.
For example, if the support of the consequent is high—that is, the item set corresponding to the
then part is very frequent—then the confidence of the association rule could be high even if there is
little or no association between the items.
The lift ratio of a rule is its confidence divided by the proportion of all transactions that contain the consequent item set. A lift ratio greater than 1 suggests that there is some usefulness to the rule and that it is better at identifying cases in which the consequent occurs than no rule at all.
For the data in Table 4.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5
and a lift ratio = 0.5/(4/10) = 1.25.
In other words, identifying a customer who purchased both bread and jelly as one who also
purchased peanut butter is 25 percent better than just guessing that a random customer purchased
peanut butter.
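A small sketch that reproduces these support, confidence, and lift calculations in Python; the ten transactions below are hypothetical stand-ins constructed to match the counts quoted from Table 4.4:

# Hypothetical transactions (each is the set of items bought in one visit)
transactions = [
    {"bread", "jelly", "peanut butter"}, {"bread", "jelly", "peanut butter"},
    {"bread", "jelly"}, {"bread", "jelly", "milk"},
    {"bread", "milk"}, {"peanut butter", "milk"},
    {"fruit", "peanut butter"}, {"bread", "fruit"},
    {"jelly", "fruit"}, {"milk", "fruit"},
]

antecedent = {"bread", "jelly"}
consequent = {"peanut butter"}

n = len(transactions)
support_rule = sum((antecedent | consequent) <= t for t in transactions)  # = 2
support_antecedent = sum(antecedent <= t for t in transactions)           # = 4
support_consequent = sum(consequent <= t for t in transactions)           # = 4

confidence = support_rule / support_antecedent   # 2/4 = 0.5
lift = confidence / (support_consequent / n)     # 0.5 / (4/10) = 1.25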
An association rule is ultimately judged on how actionable it is and how well it explains the
relationship between item sets
For example, Wal-Mart mined its transactional data to uncover strong evidence of the
association rule, “If a customer purchases a Barbie doll, then a customer also purchases a
candy bar”
An association rule is useful if it is well supported and explains an important previously unknown relationship
Note: The support of an association rule can generally be improved by basing it on less specific
antecedent and consequent item sets.