Data Mining Concepts & Techniques Guide
Data Mining: Concepts and Techniques Unit 0
Data Mining: Concepts and Techniques Unit 1
[Figure: the knowledge discovery process, with stages labeled Data, Data Mining, Pattern Evaluation, and Knowledge]
1. Data Cleaning: To remove noise and inconsistent data. This involves handling missing
values, smoothing noisy data, identifying or removing outliers, and resolving
inconsistencies.
2. Data Integration: To combine data from multiple sources. This step involves integrating
metadata from different sources and resolving schema conflicts.
3. Data Selection: To retrieve data relevant to the analysis task from the database. Not all
data in a database is useful for every mining task.
4. Data Transformation: To transform and consolidate data into forms appropriate for
mining. This includes normalization, aggregation, and attribute construction.
5. Data Mining: An essential process where intelligent methods are applied to extract data
patterns. This is the core step.
6. Pattern Evaluation: To identify the truly interesting patterns representing knowledge,
based on interestingness measures (e.g., support, confidence).
7. Knowledge Presentation: Visualization and knowledge representation techniques are
used to present mined knowledge to users in an understandable format.
[Figure: architecture of a data mining system, including the Knowledge Base]
• Data Mining Engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern Evaluation Module: This component employs interestingness measures and
interacts with the data mining modules so as to focus the search toward interesting patterns.
• User Interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.
1. Descriptive Mining
These tasks characterize the general properties of the data in the database.
• Data Characterization: Summarization of general features of objects in a target class.
Example: Summarizing the characteristics of customers who spend more than $1000 a
year.
• Data Discrimination: Comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. Example:
Comparing customers who frequently buy computer products with those who rarely buy
such products.
• Association Analysis: Identifying frequent patterns, associations, and correlations among
sets of items or objects. Example: Market Basket Analysis (Beer ⇒ Diapers).
• Clustering: Analyzing data objects without consulting a known class label. The objects
are clustered or grouped based on the principle of maximizing the intraclass similarity
and minimizing the interclass similarity.
2. Predictive Mining
These tasks perform inference on the current data in order to make predictions.
• Classification: Deriving a model (or function) that describes and distinguishes data classes
or concepts. The model is used to predict the class label of objects for which the class label
is unknown. Example: Classifying bank loan applications as safe or risky.
• Prediction (Regression): Predicting missing or unavailable numerical data values rather
than class labels. Example: Predicting the salary of an employee based on experience.
• Time-Series Analysis: Analyzing time-series data to find trends, cycles, and seasonality to
predict future values.
• Outlier Analysis: Analyzing data objects that do not comply with the general behavior or
model of the data.
Data Preprocessing
Data preprocessing is a crucial step in data mining. Real-world data is often:
• Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data.
• Noisy: Containing errors or outliers.
• Inconsistent: Containing discrepancies in codes or names.
1. Data Cleaning
2. Data Integration
3. Data Transformation
• Z-Score Normalization:
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
(where \(\mu_A\) is the mean of attribute A and \(\sigma_A\) is its standard deviation)
• Decimal Scaling:
\[ v' = \frac{v}{10^j} \]
(where j is the smallest integer such that \(|v'| < 1\))
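The two normalization formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and sample values are hypothetical, the z-score uses the population standard deviation, and decimal scaling searches only non-negative values of j.

```python
from statistics import mean, pstdev

def z_score(values):
    """Return v' = (v - mean) / std for each v (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Return (scaled values, j) with v' = v / 10**j and every |v'| < 1.

    Assumption: j is searched upward from 0, so it is never negative.
    """
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical sample data, for illustration only.
z = z_score([20, 40, 60, 80, 100])
scaled, j = decimal_scaling([-986, 917])
```

After z-score normalization the values have mean 0 and standard deviation 1. For the second sample, the largest absolute value is 986, so j is 3 and both scaled values fall strictly between -1 and 1.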
4. Data Reduction
Obtains a reduced representation of the data set that is much smaller in volume but produces
the same (or almost the same) analytical results.
• Dimensionality Reduction: Wavelet transforms, Principal Component Analysis (PCA).
• Numerosity Reduction: Regression and Log-Linear Models, Histograms, Clustering,
Sampling.
• Data Compression: Lossless (e.g., string compression) vs. Lossy (e.g., JPEG, Wavelet).
Data Mining: Concepts and Techniques Unit 2
[Figure: classification workflow showing Training Data, Test Data, and a Feedback Loop]
The basic algorithm is a greedy algorithm that constructs decision trees in a top-down recursive
divide-and-conquer manner.
1. Start: The tree starts as a single node representing all training tuples.
2. Splitting: If the tuples are all of the same class, then the node becomes a leaf and is
labeled with that class. Otherwise, the algorithm uses an attribute selection measure to
determine the best attribute to split the tuples.
3. Recursion: The algorithm is applied recursively to the tuples on each branch.
4. Termination: The recursion stops when:
• All samples for a given node belong to the same class.
• There are no remaining attributes on which the samples may be further partitioned.
• There are no samples left.
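The top-down, recursive procedure described above can be sketched as follows. This is a simplified illustration under several assumptions: attributes are categorical, the attribute selection measure is information gain (one of several possible measures), and the function names and toy loan data are hypothetical. Because branches are created only for attribute values that actually occur, the "no samples left" case never arises here.

```python
from collections import Counter
from math import log2

def info(rows):
    """Entropy of the class labels (last column) in rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def partition(rows, attr):
    """Group rows by their value for the attribute at index attr."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return parts

def build_tree(rows, attrs):
    labels = [row[-1] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Termination: all tuples share one class, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority

    # Splitting: the lowest weighted entropy is the highest information gain.
    def weighted_info(attr):
        parts = partition(rows, attr).values()
        return sum(len(p) / len(rows) * info(p) for p in parts)

    best = min(attrs, key=weighted_info)
    remaining = [a for a in attrs if a != best]
    # Recursion: one branch per observed value of the chosen attribute.
    return {best: {value: build_tree(part, remaining)
                   for value, part in partition(rows, best).items()}}

def classify(tree, row, default=None):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        if row[attr] not in branches:
            return default
        tree = branches[row[attr]]
    return tree

# Hypothetical training tuples: (income, credit history, class label).
loans = [
    ("high", "good", "safe"),
    ("high", "poor", "safe"),
    ("low", "good", "safe"),
    ("low", "poor", "risky"),
]
tree = build_tree(loans, attrs=[0, 1])
```

On this toy data, `classify(tree, ("low", "poor"))` follows two splits before reaching the `"risky"` leaf, while any tuple with high income reaches a `"safe"` leaf after one split or two, depending on which attribute wins the tie at the root.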
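\[ \mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j) \]
\[ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \]
\[ \mathrm{Gini}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \mathrm{Gini}(D_j) \]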
The attribute that maximizes the reduction in impurity (or minimizes the Gini Index) is selected.
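As a rough worked example, the attribute selection measures above can be computed for a single attribute as shown below. The definitions of Info(D), Gain(A), SplitInfo, and Gini(D) are not given in this section, so the code assumes the standard ones: Info(D) is the entropy of the class distribution, Gain(A) = Info(D) - Info_A(D), SplitInfo_A(D) is the entropy of the partition sizes, and Gini(D) = 1 minus the sum of squared class proportions. The function names and toy tuples are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """Assumed definition: Info(D) = -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Assumed definition: Gini(D) = 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def selection_measures(rows, attr):
    """Compute Info_A, Gain, GainRatio and Gini_A for the attribute at index attr."""
    labels = [row[-1] for row in rows]
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row[-1])
    n = len(rows)
    weights = {value: len(part) / n for value, part in parts.items()}
    info_a = sum(weights[v] * info(part) for v, part in parts.items())
    gain = info(labels) - info_a
    split_info = -sum(w * log2(w) for w in weights.values())
    return {
        "info_a": info_a,
        "gain": gain,
        # A single-valued attribute has SplitInfo 0; report a ratio of 0 then.
        "gain_ratio": gain / split_info if split_info > 0 else 0.0,
        "gini_a": sum(weights[v] * gini(part) for v, part in parts.items()),
    }

# Hypothetical tuples: (attribute 0, attribute 1, class label).
rows = [("a", "x", "yes"), ("a", "y", "yes"), ("b", "x", "no"), ("b", "y", "no")]
perfect = selection_measures(rows, 0)        # attribute 0 separates the classes
uninformative = selection_measures(rows, 1)  # attribute 1 carries no class information
```

Attribute 0 yields pure partitions, so its gain is 1 bit and its Gini index is 0. Attribute 1 leaves each partition evenly mixed, so its gain is 0 and its Gini index is 0.5. Either measure would select attribute 0.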
Bayesian Classification
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities
such as the probability that a given tuple belongs to a particular class.
Bayes’ Theorem:
\[ P(H|X) = \frac{P(X|H)\,P(H)}{P(X)} \]
Where:
• X is a data tuple.
• H is some hypothesis (e.g., that tuple X belongs to class C).
• P(H|X) is the posterior probability of H conditioned on X.
• P(H) is the prior probability of H.
• P(X|H) is the likelihood: the probability of X conditioned on H.
• P(X) is the prior probability of X (the evidence).
Naïve Bayes Classifier: Assumes that the effect of an attribute value on a given class is
independent of the values of the other attributes (Class Conditional Independence).
\[ P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) \]
This assumption simplifies the computation significantly. Despite its simplicity, Naïve Bayes
is often competitive with more sophisticated classification methods.
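The product formula above can be turned into a small classifier for categorical attributes. The sketch below is illustrative only: the function names and the toy weather tuples are hypothetical, probabilities are estimated as plain relative frequencies, and no smoothing is applied, so a value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict

def train(rows):
    """rows: list of (attribute tuple, class label). Returns the count tables."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for attrs, label in rows:
        for k, value in enumerate(attrs):
            value_counts[(label, k)][value] += 1
    return class_counts, value_counts

def scores(model, attrs):
    """Return P(Ci) * product over k of P(xk | Ci) for every class Ci."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    result = {}
    for label, n_c in class_counts.items():
        score = n_c / total  # prior P(Ci)
        for k, value in enumerate(attrs):
            score *= value_counts[(label, k)][value] / n_c  # P(xk | Ci)
        result[label] = score
    return result

def predict(model, attrs):
    s = scores(model, attrs)
    return max(s, key=s.get)

# Hypothetical training tuples: ((outlook, temperature), play?).
data = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]
model = train(data)
```

For the tuple `("sunny", "hot")`, the score for class "no" is 2/5 × 2/2 × 1/2 = 0.2, while the score for "yes" is 3/5 × 1/3 × 1/3, roughly 0.067, so the sketch predicts "no". The scores are proportional to the posteriors because the shared denominator P(X) is omitted.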
Bayesian Belief Networks (BBN): Also known as Belief Networks, Bayes Nets, or Probabilistic
Directed Acyclic Graphical Models. BBNs allow class conditional independencies to be defined
between subsets of variables.
• Components: A Directed Acyclic Graph (DAG) and a set of Conditional Probability Tables
(CPT).
• DAG: Nodes represent variables, and arcs represent probabilistic dependencies.
• CPT: Each node has a CPT that describes the conditional probability distribution of the
node given its parents.
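To make the DAG and CPT components concrete, the sketch below encodes a three-node network and computes a joint probability as the product of each node's conditional probability given its parents. The network structure, the node names, and every probability value are invented for illustration and do not come from the text.

```python
from itertools import product

# Each node maps to (parents, CPT). A CPT key is a tuple of parent values and
# the entry is P(node = True | parents). All numbers are illustrative only.
network = {
    "Rain": ((), {(): 0.2}),
    "Sprinkler": (("Rain",), {(True,): 0.01, (False,): 0.4}),
    "WetGrass": (
        ("Sprinkler", "Rain"),
        {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0},
    ),
}

def joint_probability(network, assignment):
    """P(assignment) = product over nodes of P(node value | parent values)."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

# P(Rain = True, Sprinkler = False, WetGrass = True) = 0.2 * 0.99 * 0.8
p = joint_probability(network, {"Rain": True, "Sprinkler": False, "WetGrass": True})

# The joint probabilities over all assignments should sum to 1.
total = sum(
    joint_probability(network, dict(zip(network, values)))
    for values in product([True, False], repeat=len(network))
)
```

Each arc in the DAG shows up here as a parent entry, and each node needs only a table indexed by its own parents rather than by every other variable.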
[Figure: multilayer feed-forward neural network with input units I1 to I3, hidden units H1 to H4, and output units O1 and O2]
4. Backward Propagation: The error is propagated backward from the output layer toward the
input layer, and the weights are updated to reduce the error (gradient descent).
5. Iteration: Steps 2-4 are repeated for many epochs until the error stops decreasing or
falls below a chosen threshold.
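A compact sketch of this training loop is shown below. The earlier steps of the procedure are not included in this section, so the initialization, forward pass, and error computation here are assumptions based on the standard backpropagation algorithm: sigmoid units, squared error, one hidden layer, a single output unit, and per-sample weight updates. The function names, hyperparameters, and the OR training set are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_hidden=2, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; returns (predict function, error per epoch)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Assumed initialization: small random weights; the last weight is the bias.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        xb = list(x) + [1.0]  # inputs plus constant bias input
        h = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
        hb = h + [1.0]        # hidden outputs plus constant bias input
        out = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
        return xb, hb, out

    errors = []
    for _ in range(epochs):
        epoch_error = 0.0
        for x, target in samples:
            xb, hb, out = forward(x)
            epoch_error += 0.5 * (target - out) ** 2
            # Backward propagation: output delta first, then hidden deltas,
            # computed with the output weights before they are changed.
            delta_out = (target - out) * out * (1.0 - out)
            delta_hidden = [delta_out * w_out[j] * hb[j] * (1.0 - hb[j])
                            for j in range(n_hidden)]
            # Gradient-descent weight updates.
            for j in range(n_hidden + 1):
                w_out[j] += lr * delta_out * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += lr * delta_hidden[j] * xb[i]
        errors.append(epoch_error)
    return (lambda x: forward(x)[2]), errors

# Hypothetical training set: the logical OR function.
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
predict, errors = train(or_samples)
```

The `errors` list records the summed squared error of each epoch, which makes it easy to check that the error shrinks over training and to pick a stopping threshold.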
Data Mining: Concepts and Techniques Unit 3
Given a database of n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which
together satisfy the following requirements: (1) each group must contain at least one object,
and (2) each object must belong to exactly one group.
K-Means Algorithm: The most well-known partitioning method.
1. Initialization: Randomly select k objects as initial cluster centroids.
2. Assignment: Assign each object to the cluster with the nearest centroid (usually using
Euclidean distance).
3. Update: Calculate the new mean of each cluster. This becomes the new centroid.
4. Repeat: Repeat steps 2 and 3 until the centroids do not change (convergence).
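The four steps above map directly onto a short function. This is a minimal sketch assuming Euclidean distance and points given as tuples; the function name, the `seed` and `max_iter` parameters, and the sample points are hypothetical. An empty cluster simply keeps its previous centroid, which is one of several possible conventions.

```python
import math
import random

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly select k objects as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: the mean of each cluster becomes its new centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = k_means(points, k=2)
```

The result depends on the random initial centroids in general. On data this well separated, the two groups are recovered regardless of which two points are drawn first.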
[Figure: k-means illustration showing Cluster 1 and Cluster 2, Centroid 1 and Centroid 2, and the distance from an object to a centroid]
2. Hierarchical Methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects.
• Agglomerative (Bottom-Up): This approach starts with each object forming a separate
group. It successively merges the objects or groups that are close to one another, until all
of the groups are merged into one (the topmost level of the hierarchy), or until a
termination condition holds.
• Divisive (Top-Down): This approach starts with all of the objects in the same cluster. In
each successive iteration, a cluster is split into smaller clusters, until eventually each
object is in its own cluster, or until a termination condition holds.
The result is often represented as a tree structure called a Dendrogram.
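The agglomerative (bottom-up) strategy can be sketched as a loop that repeatedly merges the two closest clusters. The text does not say how closeness between groups is measured, so this sketch assumes single linkage (the minimum distance between any two members) with Euclidean distance. The function names, the `target_clusters` termination condition, and the sample points are hypothetical.

```python
import math

def single_link(a, b):
    """Assumed linkage: the smallest distance between a member of a and a member of b."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, target_clusters=1):
    # Start with each object forming its own cluster.
    clusters = [[p] for p in points]
    merges = []  # (cluster a, cluster b, distance): the data behind a dendrogram
    while len(clusters) > target_clusters:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j],
                       single_link(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical one-dimensional points (as 1-tuples).
points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = agglomerative(points, target_clusters=2)
```

The recorded merge distances are what a dendrogram draws as branch heights. This sketch recomputes all pairwise linkages on every pass, which is simple but slow for large inputs.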
3. Density-Based Methods
Most partitioning methods cluster objects based on the distance between objects. Such methods
can find only spherical-shaped clusters and encounter difficulty in discovering clusters of
arbitrary shapes. Density-based methods continue growing the given cluster as long as the
density (number of objects or data points) in the "neighborhood" exceeds some threshold.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
• ε (Epsilon): The maximum radius of the neighborhood.
• MinPts: The minimum number of points required to form a dense region.
• Core Point: A point with at least MinPts points within its ε-neighborhood.
• Border Point: A point within ε of a core point but with fewer than MinPts points in its
own ε-neighborhood.
• Noise: Points that are neither core nor border points.
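The core, border, and noise definitions above can be checked with a short labeling function. This sketch only classifies points and does not grow clusters, so it is not a full DBSCAN. It assumes Euclidean distance and counts a point as a member of its own neighborhood, which is a common convention but not stated in the text. The function name and sample points are hypothetical.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    def neighbors(p):
        # Assumption: the eps-neighborhood includes the point itself.
        return [q for q in points if math.dist(p, q) <= eps]

    core = {p for p in points if len(neighbors(p)) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors(p)):
            labels[p] = "border"  # near a core point, but not dense itself
        else:
            labels[p] = "noise"
    return labels

# Hypothetical 2D points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = classify_points(points, eps=1.0, min_pts=4)
```

With these settings only `(1, 1)` has four points within distance 1, so it is the single core point. Its three direct neighbors are border points, and `(5, 5)` is noise. The point `(0, 0)` is also noise, because neither of its neighbors is a core point.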
4. Grid-Based Methods
Grid-based methods quantize the object space into a finite number of cells that form a grid
structure. All of the clustering operations are performed on the grid structure.
• STING (Statistical Information Grid): Stores statistical information in each grid cell.
• CLIQUE: Integrates grid-based and density-based clustering.
• Advantage: Fast processing time, which is typically independent of the number of data
objects and dependent only on the number of cells in each dimension in the quantized
space.
5. Model-Based Methods
Model-based methods hypothesize a model for each of the clusters and find the best fit of the
data to the given model.
• EM (Expectation-Maximization): An extension of K-means. Instead of placing each object in
exactly one cluster, it assigns each object to every cluster with a weight representing
the probability of membership. It is commonly used to fit Gaussian Mixture Models.
• SOM (Self-Organizing Maps): A neural network-based method that maps high-dimensional
data onto a 2D grid while preserving topology.
Outlier Analysis
Outliers are data objects that deviate significantly from the rest of the data as if they were
generated by a different mechanism.
• Types of Outliers:
• Global Outliers (Point Anomalies): An object significantly deviates from the rest of
the dataset.
• Contextual Outliers (Conditional Anomalies): An object deviates significantly with
respect to a specific context (e.g., 30°C is normal in summer but an outlier in winter).
• Collective Outliers: A subset of data objects collectively deviates significantly from
the whole dataset, even if the individual data objects may not be outliers.
• Detection Methods:
• Statistical Approaches: Assume a distribution (e.g., Gaussian) and flag points that have
low probability under it.
• Distance-Based Approaches: Flag objects that have too few neighbors within a given
distance (e.g., using k-nearest-neighbor distances).
• Density-Based Approaches: Flag objects that lie in low-density regions relative to their
neighbors (e.g., LOF, the Local Outlier Factor).
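Two of the detection approaches above can be illustrated on one-dimensional data. The sketch below is a simplified illustration: the statistical test assumes roughly Gaussian data and flags values by z-score, and the distance-based test counts neighbors within a fixed radius. The function names, thresholds, and temperature readings are hypothetical. LOF is not shown.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold):
    """Statistical approach: flag values far from the mean in standard-deviation units."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance-based approach: flag values with too few other values within radius."""
    outliers = []
    for i, v in enumerate(values):
        count = sum(1 for j, w in enumerate(values)
                    if j != i and abs(v - w) <= radius)
        if count < min_neighbors:
            outliers.append(v)
    return outliers

# Hypothetical temperature readings with one suspicious value.
temps = [21, 22, 20, 23, 21, 22, 20, 21, 22, 60]
statistical = z_score_outliers(temps, threshold=2.5)
distance_based = distance_outliers(temps, radius=2, min_neighbors=2)
```

Both calls flag only the reading of 60. The threshold of 2.5 is deliberate: with the population standard deviation, no value among n readings can have a z-score above the square root of n - 1, which is exactly 3 here, so the customary cutoff of 3 could never fire on a sample this small.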
Data Mining: Concepts and Techniques Unit 4
Basic Concepts
• Frequent Pattern: A pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set.
• Market Basket Analysis: A typical example of frequent itemset mining. It analyzes
customer buying habits by finding associations between the different items that customers
place in their "shopping baskets."
• Itemset: A collection of one or more items. Example: {Milk, Bread, Diapers}.
• Support Count (σ): The number of transactions that contain a given itemset.
• Frequent Itemset: An itemset whose support is greater than or equal to a minimum
support threshold (min_sup).
• Association Rule: An implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, and I
is the set of all items.
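Rule Evaluation Measures:
• Support (s): The percentage of transactions in the database that contain both X and Y.
\[ \mathrm{Support}(X \cup Y) = P(X \cup Y) = \frac{\mathrm{count}(X \cup Y)}{\text{total count}} \]
• Confidence (c): The percentage of transactions containing X that also contain Y.
\[ \mathrm{Confidence}(X \Rightarrow Y) = P(Y|X) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)} \]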
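Support and confidence can be computed directly from a list of transactions. The sketch below assumes each transaction is a set of item names; the function names and the five sample baskets are hypothetical. Confidence is undefined when no transaction contains X, and this sketch does not guard against that case.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, x, y):
    """Confidence(X => Y) = Support(X union Y) / Support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical market baskets.
baskets = [
    {"Milk", "Bread"},
    {"Milk", "Diapers", "Beer"},
    {"Bread", "Diapers", "Beer"},
    {"Milk", "Bread", "Diapers"},
    {"Milk", "Bread", "Diapers", "Beer"},
]
s = support(baskets, {"Diapers", "Beer"})
c = confidence(baskets, {"Diapers"}, {"Beer"})
```

Three of the five baskets contain both Diapers and Beer, so the support is 3/5. Four baskets contain Diapers, so the confidence of Diapers ⇒ Beer is (3/5) / (4/5), or about 0.75.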
Concept: Compresses the database into a frequent-pattern tree (FP-tree), which retains the
itemset association information. It then divides the compressed database into a set of
conditional databases, each associated with one frequent item, and mines each such database
separately.
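The compression step described above can be sketched as follows. This covers only the construction of the FP-tree: the header table, node links, and the mining of conditional databases are left out. The class and function names, the tie-breaking rule (alphabetical order among items with equal support), and the sample transactions are assumptions made for illustration.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # First scan: count the support of each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_sup}
    root = Node(None)
    # Second scan: insert each transaction's frequent items in a fixed order.
    for t in transactions:
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-counts[item], item),  # support desc, then name
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1  # shared prefixes accumulate counts
    return root

# Hypothetical transactions.
transactions = [
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c"],
]
root = build_fp_tree(transactions, min_sup=2)
```

Because every transaction contains `b`, all five share a single `b:5` node under the root. The paths then branch into `a:3` and `c:2`, which is how the tree stores the database more compactly than the raw transaction list.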
[Figure: example FP-tree with a Root node and item nodes labeled f:4, c:3, a:3, m:2, c:1, and b:1]
Data Mining: Concepts and Techniques Unit 5
1. Web Mining
Web mining is the application of data mining techniques to discover patterns from the Web.
• Web Content Mining: Extraction of useful information from web page contents (text,
images, audio, video).
• Web Structure Mining: Discovery of the link structure of the web (hyperlinks). It involves
analyzing the node and connection structure of the web graph.
• Web Usage Mining: Extraction of interesting patterns from web access logs (e.g., user
browsing behavior, clickstreams).
• Temporal Data Mining: Mining data that changes over time (Time-Series).
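• Trend Analysis: Long-term movements in the general direction of the series.
• Cyclic Movements: Long-term oscillations around the trend, not necessarily of fixed period.
• Seasonal Movements: Variations that recur at fixed calendar intervals (e.g., every year).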
• Spatial Data Mining: Mining knowledge from large amounts of spatial data (maps, GPS,
remote sensing).
• Spatial Associations: "What features are near parks?"
• Spatial Clustering: Grouping locations.
• Spatial Classification: Classifying regions based on properties.
• Visual Data Mining: Uses visualization techniques to help users understand and explore
data. It integrates data mining and data visualization.
• Audio Mining: Extracting patterns from audio data (speech recognition, music
classification).