Data Mining Concepts & Techniques Guide

The document provides comprehensive study notes on Data Mining, covering concepts, techniques, and the KDD process. It details various components of data mining, including preprocessing, predictive modeling, and descriptive modeling, along with algorithms and methodologies. Additionally, it discusses emerging trends and issues in data mining, emphasizing the importance of knowledge extraction from large datasets.


Data Mining: Concepts and Techniques

Comprehensive Study Notes (Units 1 - 5)


[Link] - Computer Science and Engineering

Contents

1 Introduction to Data Mining & Preprocessing
1.1 Introduction to Data Mining
1.2 The KDD Process (Knowledge Discovery in Databases)
1.2.1 Detailed Steps of KDD
1.3 Data Mining Architecture
1.4 Data Mining Functionalities
1.4.1 Descriptive Mining
1.4.2 Predictive Mining
1.5 Major Issues in Data Mining
1.6 Data Preprocessing
1.6.1 Data Cleaning
1.6.2 Data Integration
1.6.3 Data Transformation
1.6.4 Data Reduction

2 Predictive Modeling (Classification)
2.1 Overview of Classification and Prediction
2.2 Decision Tree Induction
2.2.1 Algorithm Overview (ID3, C4.5, CART)
2.2.2 Attribute Selection Measures
2.3 Bayesian Classification
2.4 Artificial Neural Networks (Backpropagation)
2.5 Support Vector Machines (SVM)
2.6 Lazy vs. Eager Learners

3 Descriptive Modeling (Cluster Analysis)
3.1 Introduction to Cluster Analysis
3.2 Data Types in Clustering
3.3 Categorization of Clustering Methods
3.3.1 Partitioning Methods
3.3.2 Hierarchical Methods
3.3.3 Density-Based Methods
3.3.4 Grid-Based Methods
3.3.5 Model-Based Methods
3.4 Outlier Analysis

4 Discovering Patterns and Rules (Frequent Pattern Mining)
4.1 Basic Concepts
4.2 The Apriori Algorithm
4.3 FP-Growth Algorithm (Frequent Pattern Growth)
4.4 Advanced Pattern Mining

5 Data Mining Trends & Research Frontiers
5.1 Mining Complex Data Types
5.1.1 Web Mining
5.1.2 Temporal & Spatial Mining
5.1.3 Visual & Audio Mining
5.2 Ubiquitous and Invisible Data Mining
5.3 Social Impacts, Privacy, and Ethics
5.4 Trends in Data Mining


Unit 1: Introduction to Data Mining & Preprocessing

Introduction to Data Mining


Definition: Data mining is the computational process of discovering patterns in large data sets
involving methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems. It is an essential step in the Knowledge Discovery in Databases (KDD) process.
The term "Data Mining" is a misnomer, as the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. A more appropriate term would be "Knowledge Mining from Data," which unfortunately is somewhat long; "Data Mining" is the shorter term that has gained popularity.
Motivation: Why Data Mining? The explosive growth of data: from terabytes to petabytes.
• Data Collection and Availability: Automated data collection tools, database systems,
web, and computerized society.
• Major Sources of Data:
  • Business: Web, E-commerce, Transactions, Stocks.
  • Science: Remote Sensing, Bioinformatics, Scientific Simulation.
  • Society and Everyone: News, Digital Cameras, Social Media.
• The Problem: We are drowning in data, but starving for knowledge!
• The Solution: Data mining technology to automate the analysis of data and discover
hidden patterns.

The KDD Process (Knowledge Discovery in Databases)


Data mining is a step in the KDD process. The process consists of an iterative sequence of the
following steps:

Figure 1: Detailed KDD Process Workflow (Raw Data → Cleaning & Integration → Selection & Transformation → Data Mining → Pattern Evaluation → Knowledge)

Detailed Steps of KDD

1. Data Cleaning: To remove noise and inconsistent data. This involves handling missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
2. Data Integration: To combine data from multiple sources. This step involves integrating
metadata from different sources and resolving schema conflicts.


3. Data Selection: To retrieve data relevant to the analysis task from the database. Not all
data in a database is useful for every mining task.
4. Data Transformation: To transform and consolidate data into forms appropriate for mining. This includes normalization, aggregation, and attribute construction.
5. Data Mining: An essential process where intelligent methods are applied to extract data
patterns. This is the core step.
6. Pattern Evaluation: To identify the truly interesting patterns representing knowledge, based on interestingness measures (e.g., support, confidence).
7. Knowledge Presentation: Visualization and knowledge representation techniques are
used to present mined knowledge to users in an understandable format.

Data Mining Architecture


A typical data mining system has the following major components:

Figure 2: Detailed Architecture of a Data Mining System. From top to bottom: User Interface (GUI) → Pattern Evaluation Module → Data Mining Engine (consulting the Knowledge Base) → Database / Data Warehouse Server → Data Cleaning, Integration, Selection → Databases, WWW, Other Repositories.


• Database, Data Warehouse, WWW, or Other Information Repositories: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
• Database or Data Warehouse Server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge Base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.


• Data Mining Engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
• Pattern Evaluation Module: This component employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.
• User Interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.

Data Mining Functionalities


Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks.

1. Descriptive Mining

These tasks characterize the general properties of the data in the database.
• Data Characterization: Summarization of general features of objects in a target class.
Example: Summarizing the characteristics of customers who spend more than $1000 a
year.
• Data Discrimination: Comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes. Example:
Comparing customers who frequently buy computer products with those who rarely buy
such products.
• Association Analysis: Identifying frequent patterns, associations, and correlations among
sets of items or objects. Example: Market Basket Analysis (Beer ⇒ Diapers).
• Clustering: Analyzing data objects without consulting a known class label. The objects
are clustered or grouped based on the principle of maximizing the intraclass similarity
and minimizing the interclass similarity.

2. Predictive Mining

These tasks perform inference on the current data in order to make predictions.
• Classification: Deriving a model (or function) that describes and distinguishes data classes
or concepts. The model is used to predict the class label of objects for which the class label
is unknown. Example: Classifying bank loan applications as safe or risky.
• Prediction (Regression): Predicting missing or unavailable numerical data values rather
than class labels. Example: Predicting the salary of an employee based on experience.
• Time-Series Analysis: Analyzing time-series data to find trends, cycles, and seasonality to
predict future values.
• Outlier Analysis: Analyzing data objects that do not comply with the general behavior or
model of the data.

Major Issues in Data Mining


1. Mining Methodology and User Interaction Issues:
• Mining different kinds of knowledge in databases.


• Interactive mining of knowledge at multiple levels of abstraction.


• Incorporation of background knowledge.
• Data mining query languages and ad hoc data mining.
• Presentation and visualization of data mining results.
• Handling noisy or incomplete data.
• Pattern evaluation: the interestingness problem.
2. Performance Issues:
• Efficiency and scalability of data mining algorithms.
• Parallel, distributed, and incremental mining algorithms.
3. Issues Relating to the Diversity of Database Types:
• Handling of relational and complex types of data.
• Mining information from heterogeneous databases and global information systems.

Data Preprocessing
Data preprocessing is a crucial step in data mining. Real-world data is often:
• Incomplete: Lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data.
• Noisy: Containing errors or outliers.
• Inconsistent: Containing discrepancies in codes or names.

1. Data Cleaning

• Handling Missing Values:


• Ignore the tuple (mostly for classification).
• Fill in the missing value manually (tedious).
• Use a global constant (e.g., ”Unknown”).
• Use the attribute mean.
• Use the most probable value (inference-based).
• Handling Noisy Data (Binning):
• Equal-width (distance) partitioning: Divides the range into N intervals of equal size.
• Equal-depth (frequency) partitioning: Divides the range into N intervals, each containing approximately the same number of samples.
• Smoothing by bin means: Each value in a bin is replaced by the mean value of the
bin.
• Smoothing by bin boundaries: The minimum and maximum values in a given bin
are identified as the bin boundaries. Each bin value is then replaced by the closest
boundary value.
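The binning and smoothing techniques above can be sketched in a few lines of Python. The nine-price list follows the style of the classic textbook example; the function names are my own:

```python
def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (roughly) equal frequency."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins - 1)] + \
           [data[(n_bins - 1) * size:]]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (bin min or bin max)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]          # bins are already sorted
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))      # bin means 9.0, 22.0, 29.0
print(smooth_by_boundaries(bins)) # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```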


2. Data Integration

Merges data from multiple data stores.


• Schema Integration: The problem of integrating metadata from different sources.
• Entity Identification Problem: Matching equivalent entities from multiple databases (e.g.,
matching cust_id in one DB to cust_number in another).
• Detecting and Resolving Data Value Conflicts: For the same real-world entity, attribute
values from different sources may differ.
• Redundancy: An attribute may be redundant if it can be derived from another attribute.
Correlation analysis is used to detect redundancy.

3. Data Transformation

• Smoothing: Remove noise from data.


• Aggregation: Summarization, data cube construction.
• Generalization: Concept hierarchy climbing.
• Normalization: Scaled to fall within a small, specified range.
• Min-Max Normalization:

  v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

• Z-Score Normalization:

  v′ = (v − µ_A) / σ_A

  (where µ_A = mean of attribute A, σ_A = standard deviation)
• Decimal Scaling:

  v′ = v / 10^j

  (where j is the smallest integer such that max(|v′|) < 1)
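The three normalizations can be written directly from the formulas above. The attribute values are illustrative, and `z_score` uses the population standard deviation:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization: subtract the mean, divide by the std deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j for the smallest j that brings all |v'| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
print(min_max(incomes))                  # 73600 maps to 61600/86000 ≈ 0.716
print(decimal_scaling([-917, 13, 486]))  # j = 3 -> [-0.917, 0.013, 0.486]
```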

4. Data Reduction

Obtains a reduced representation of the data set that is much smaller in volume but produces
the same (or almost the same) analytical results.
• Dimensionality Reduction: Wavelet transforms, Principal Component Analysis (PCA).
• Numerosity Reduction: Regression and Log-Linear Models, Histograms, Clustering, Sam-
pling.
• Data Compression: Lossless (e.g., string compression) vs. Lossy (e.g., JPEG, Wavelet).


Unit 2: Predictive Modeling (Classification)

Overview of Classification and Prediction


Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. The purpose is to be able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is known).
Prediction models continuous-valued functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels.
Process Workflow:

Figure 3: Classification: Training and Testing Phases. Training data is fed to a classification algorithm to build a classifier (model); the model is then applied to test data to produce predictions and an accuracy estimate, with a feedback loop for refinement.

Decision Tree Induction


Decision tree induction is the learning of decision trees from class-labeled training tuples. A
decision tree is a flowchart-like tree structure.
• Internal Node: Denotes a test on an attribute.
• Branch: Represents an outcome of the test.
• Leaf Node: Holds a class label.

Algorithm Overview (ID3, C4.5, CART)

The basic algorithm is a greedy algorithm that constructs decision trees in a top-down recursive
divide-and-conquer manner.
1. Start: The tree starts as a single node representing all training tuples.
2. Splitting: If the tuples are all of the same class, then the node becomes a leaf and is
labeled with that class. Otherwise, the algorithm uses an attribute selection measure to
determine the best attribute to split the tuples.
3. Recursion: The algorithm recurs on each branch.
4. Termination: The recursion stops when:
• All samples for a given node belong to the same class.


• There are no remaining attributes on which the samples may be further partitioned.
• There are no samples left.

Attribute Selection Measures

Information Gain (ID3): Based on the concept of Entropy (measure of impurity).


Info(D) = − Σ_{i=1}^{m} p_i log2(p_i)

Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

Gain(A) = Info(D) − Info_A(D)

The attribute with the highest Gain is chosen as the splitting attribute.
Gain Ratio (C4.5): Normalizes Information Gain to overcome the bias towards attributes with
many values.
SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Gini Index (CART): Measures the impurity of D.


Gini(D) = 1 − Σ_{i=1}^{m} p_i²

Gini_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Gini(D_j)

The attribute that maximizes the reduction in impurity (or minimizes the Gini Index) is selected.
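Entropy, information gain, and the Gini index can be computed directly from the definitions. The toy weather-style tuples below are illustrative:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - sum_j (|Dj|/|D|) * Info(Dj)."""
    n = len(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - info_a

def gini(labels):
    """Gini(D) = 1 - sum p_i^2 over the class distribution."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# toy data: a single attribute (outlook) that separates the classes perfectly
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(entropy(labels))             # 1.0 (two equally likely classes)
print(info_gain(rows, labels, 0))  # 1.0 (a perfect split removes all impurity)
print(gini(labels))                # 0.5
```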

Bayesian Classification
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities
such as the probability that a given tuple belongs to a particular class.
Bayes’ Theorem:
P(H|X) = P(X|H) P(H) / P(X)
Where:
• X is a data tuple.
• H is some hypothesis (e.g., that tuple X belongs to class C).
• P (H|X) is the posterior probability of H conditioned on X.
• P (H) is the prior probability of H.
• P (X|H) is the likelihood.
• P (X) is the predictor prior probability.


Naïve Bayes Classifier: Assumes that the effect of an attribute value on a given class is independent of the values of the other attributes (Class Conditional Independence).
P(X|C_i) = Π_{k=1}^{n} P(x_k|C_i)

This assumption simplifies the computation significantly. Despite its simplicity, Naïve Bayes
often outperforms more sophisticated classification methods.
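A minimal categorical Naïve Bayes illustrating the independence assumption above. The weather-style tuples are invented, and a production version would add Laplace smoothing to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) counts from categorical training data."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)        # (class, attr_index) -> value counts
    for row, y in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(y, k)][v] += 1
    return class_counts, cond

def classify_nb(x, class_counts, cond):
    """Pick the class maximizing P(Ci) * prod_k P(xk|Ci)."""
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cc in class_counts.items():
        p = cc / n                     # prior P(Ci)
        for k, v in enumerate(x):
            p *= cond[(c, k)][v] / cc  # likelihood, independence assumed
        if p > best_p:
            best, best_p = c, p
    return best

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(classify_nb(("rain", "mild"), *model))   # "yes"
```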
Bayesian Belief Networks (BBN): Also known as Belief Networks, Bayes Nets, or Probabilistic
Directed Acyclic Graphical Models. BBNs allow class conditional independencies to be defined
between subsets of variables.
• Components: A Directed Acyclic Graph (DAG) and a set of Conditional Probability Tables
(CPT).
• DAG: Nodes represent variables, and arcs represent probabilistic dependencies.
• CPT: Each node has a CPT that describes the conditional probability distribution of the
node given its parents.

Artificial Neural Networks (Backpropagation)


Neural networks are a set of connected input/output units where each connection has a weight
associated with it. During the learning phase, the network learns by adjusting the weights so
as to be able to predict the correct class label of the input tuples.

Figure 4: Multilayer Feed-Forward Neural Network (inputs I1–I3, hidden units H1–H4, outputs O1–O2; every unit in one layer connects to every unit in the next).


Backpropagation Algorithm:
1. Initialize Weights: Weights are initialized to small random numbers.
2. Forward Propagation: The input tuples are fed into the input layer. The inputs pass
through the hidden layers to the output layer. At each node, a weighted sum is calculated
and passed through an activation function (e.g., Sigmoid).
3. Calculate Error: The predicted output is compared with the actual target value to calculate
the error.


4. Backward Propagation: The error is propagated backward from the output layer to the
input layer. The weights are updated to minimize the error (Gradient Descent).
5. Iteration: Steps 2-4 are repeated for many epochs until the error is minimized.
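The cycle above can be made concrete with the smallest possible network: a single sigmoid unit trained by the same forward/error/update steps. The AND training set, learning rate, and epoch count are illustrative choices, not from the notes:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

# toy training set: logical AND, learned by one sigmoid output unit
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, lr = [0.1, -0.1], 0.0, 0.5       # step 1: small initial weights

for epoch in range(10000):             # step 5: iterate for many epochs
    for x, target in data:
        # step 2: forward propagation (weighted sum through the activation)
        out = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        # step 3: error term; out * (1 - out) is the sigmoid derivative
        delta = (target - out) * out * (1 - out)
        # step 4: backward propagation / gradient-descent weight update
        w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
        b += lr * delta

preds = [round(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)) for x, _ in data]
print(preds)   # the unit learns AND: [0, 0, 0, 1]
```

A real multilayer network propagates the delta terms back through each hidden layer, but the update rule at every unit has this same shape.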

Support Vector Machines (SVM)


SVM is a supervised machine learning algorithm used for both classification and regression.
• Hyperplane: A decision boundary that separates different classes. In 2D, it’s a line; in 3D,
it’s a plane.
• Margin: The distance between the hyperplane and the nearest data points from either
class. SVM aims to maximize this margin.
• Support Vectors: The data points that lie closest to the decision surface (hyperplane).
They are the most difficult to classify and determine the position of the hyperplane.
• Kernel Trick: SVMs can efficiently perform non-linear classification using what is called
the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.

Lazy vs. Eager Learners


• Eager Learners: Construct a generalization (classification model) before receiving new
(test) tuples to classify. Examples: Decision Trees, Bayesian, Neural Networks, SVM.
• Lazy Learners: Simply store the training data (or only minor processing) and wait until a
test tuple is given. Examples: K-Nearest Neighbors (KNN), Case-Based Reasoning.
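A lazy learner in miniature: the hypothetical `knn_classify` below builds no model at training time; it simply keeps the tuples and does all of its work when a query arrives:

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training tuples nearest (Euclidean) to query."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]
print(knn_classify(train, (5.0, 4.8)))   # "B" (three nearest are all B)
print(knn_classify(train, (1.1, 1.0)))   # "A" (two A's outvote one B)
```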


Unit 3: Descriptive Modeling (Cluster Analysis)

Introduction to Cluster Analysis


Definition: Cluster analysis or clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters.
Clustering is Unsupervised Learning: Unlike classification, there are no predefined classes.
The learning process is unsupervised.
Applications:
• Marketing: Finding groups of customers with similar behavior given a large database of
customer data containing their properties and past buying records.
• Biology: Classification of plants and animals given their features.
• Libraries: Book ordering.
• City Planning: Identifying groups of houses according to their house type, value, and
geographical location.
• Earthquake Studies: Observed earthquake epicenters should be clustered along continent faults.

Data Types in Clustering


• Interval-Scaled Variables: Continuous measurements of a roughly linear scale (e.g., weight,
height, temperature).
• Binary Variables: A variable with only two states: 0 or 1.
• Symmetric: Both states are equally important (e.g., Gender).
• Asymmetric: One state is more important (e.g., Test positive for a disease).
• Nominal (Categorical) Variables: Generalization of the binary variable in that it can take
on more than two states (e.g., Color = Red, Yellow, Green).
• Ordinal Variables: Similar to nominal variables, but the values have a meaningful order
(e.g., Size = Small, Medium, Large).
• Ratio-Scaled Variables: Positive measurement on a nonlinear scale (e.g., exponential
scale).

Categorization of Clustering Methods


1. Partitioning Methods

Given a database of n objects, a partitioning method constructs k partitions of the data, where
each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which
together satisfy the following requirements: (1) each group must contain at least one object,
and (2) each object must belong to exactly one group.
K-Means Algorithm: The most well-known partitioning method.
1. Initialization: Randomly select k objects as initial cluster centroids.


2. Assignment: Assign each object to the cluster with the nearest centroid (usually using
Euclidean distance).
3. Update: Calculate the new mean of each cluster. This becomes the new centroid.
4. Repeat: Repeat steps 2 and 3 until the centroids do not change (convergence).

Figure 5: K-Means Clustering Visualization (two clusters, each represented by its centroid; objects are assigned to the centroid at minimum distance).
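The four K-Means steps can be sketched for one-dimensional data. The point values and the naive "first k points" initialisation are illustrative:

```python
def kmeans_1d(points, k=2, iters=20):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then recompute each centroid as its cluster mean, until convergence."""
    centroids = points[:k]                         # step 1: naive initialisation
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: assignment
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]   # step 3: update means
               for i, c in enumerate(clusters)]
        if new == centroids:                       # step 4: centroids unchanged
            break
        centroids = new
    return centroids, clusters

points = [1.0, 2.0, 1.5, 10.0, 11.0, 10.5]
centroids, clusters = kmeans_1d(points, k=2)
print(sorted(centroids))   # [1.5, 10.5]
```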


K-Medoids (PAM - Partitioning Around Medoids): Instead of taking the mean value of the
objects in a cluster as a reference point, we can pick actual objects to represent the clusters,
using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed
based on the principle of minimizing the sum of the dissimilarities between each object and its
corresponding reference point.

2. Hierarchical Methods

A hierarchical method creates a hierarchical decomposition of the given set of data objects.
• Agglomerative (Bottom-Up): This approach starts with each object forming a separate
group. It successively merges the objects or groups that are close to one another, until all
of the groups are merged into one (the topmost level of the hierarchy), or until a termination condition holds.
• Divisive (Top-Down): This approach starts with all of the objects in the same cluster. In
each successive iteration, a cluster is split up into smaller clusters, until eventually each
object is in one cluster, or until a termination condition holds.
The result is often represented as a tree structure called a Dendrogram.

3. Density-Based Methods

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of
arbitrary shapes. Density-based methods continue growing the given cluster as long as the
density (number of objects or data points) in the ”neighborhood” exceeds some threshold.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
  • ε (Epsilon): The maximum radius of the neighborhood.
  • MinPts: The minimum number of points required to form a dense region.
  • Core Point: A point with at least MinPts points within its ε-neighborhood.
  • Border Point: A point within ε of a core point, but itself having fewer than MinPts neighbors.
  • Noise: Points that are neither core nor border points.


• OPTICS: An extension of DBSCAN to handle varying densities.
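The core/border/noise definitions can be sketched directly (this labels points only; it is not the full cluster-growing algorithm, and the eps/min_pts values and points are illustrative):

```python
def label_points(points, eps=1.5, min_pts=3):
    """Classify each point as 'core', 'border', or 'noise' per the DBSCAN
    definitions. Assumes the points are distinct."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # eps-neighbourhood of each point (conventionally includes the point itself)
    nbhd = [[q for q in points if dist(p, q) <= eps] for p in points]
    labels = []
    for p, nbrs in zip(points, nbhd):
        if len(nbrs) >= min_pts:
            labels.append("core")
        elif any(len(nbhd[points.index(q)]) >= min_pts for q in nbrs):
            labels.append("border")    # within eps of some core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1.5), (8, 8)]
print(label_points(points))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```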

4. Grid-Based Methods

Grid-based methods quantize the object space into a finite number of cells that form a grid
structure. All of the clustering operations are performed on the grid structure.
• STING (Statistical Information Grid): Stores statistical information in each grid cell.
• CLIQUE: Integrates grid-based and density-based clustering.
• Advantage: Fast processing time, which is typically independent of the number of data
objects and dependent only on the number of cells in each dimension in the quantized
space.

5. Model-Based Methods

Model-based methods hypothesize a model for each of the clusters and find the best fit of the
data to the given model.
• EM (Expectation-Maximization): An extension of K-means. It assigns each object to a
cluster according to a weight representing the probability of membership. It uses Gaussian Mixture Models.
• SOM (Self-Organizing Maps): A neural network-based method that maps high-dimensional
data onto a 2D grid while preserving topology.

Outlier Analysis
Outliers are data objects that deviate significantly from the rest of the data, as if they were generated by a different mechanism.
• Types of Outliers:
• Global Outliers (Point Anomalies): An object significantly deviates from the rest of
the dataset.
• Contextual Outliers (Conditional Anomalies): An object deviates significantly with
respect to a specific context (e.g., 30°C is normal in summer but an outlier in winter).
• Collective Outliers: A subset of data objects collectively deviates significantly from
the whole dataset, even if the individual data objects may not be outliers.
• Detection Methods:
• Statistical Approaches: Assume a distribution (e.g., Gaussian) and find points with
low probability.
• Distance-Based Approaches: Objects that do not have enough neighbors (KNN).
• Density-Based Approaches: Objects in low-density regions (LOF: Local Outlier Factor).


Unit 4: Discovering Patterns and Rules (Frequent Pattern Mining)

Basic Concepts
• Frequent Pattern: A pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.
• Market Basket Analysis: A typical example of frequent itemset mining. It analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets."
• Itemset: A collection of one or more items. Example: {Milk, Bread, Diapers}.
• Support Count (σ): The frequency of occurrence of an itemset.
• Frequent Itemset: An itemset whose support is greater than or equal to a minimum support threshold (min_sup).
• Association Rule: An implication of the form X ⇒ Y , where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
Rule Evaluation Measures:
• Support (s): The percentage of transactions in the database that contain both X and Y.

  Support(X ⇒ Y) = P(X ∪ Y) = count(X ∪ Y) / total count

• Confidence (c): The percentage of transactions containing X that also contain Y.

  Confidence(X ⇒ Y) = P(Y|X) = Support(X ∪ Y) / Support(X)

The Apriori Algorithm


Apriori is a seminal algorithm for mining frequent itemsets for Boolean association rules. It
uses an iterative level-wise search: k-itemsets are used to explore (k + 1)-itemsets.
Apriori Property: All nonempty subsets of a frequent itemset must also be frequent. Conversely, if an itemset is infrequent, all its supersets will be infrequent.
Algorithm Steps:
1. Initialize: k = 1. Find frequent 1-itemsets (L1 ).
2. Join Step: Generate candidate itemsets Ck+1 by joining Lk with itself.
3. Prune Step: Remove candidates in Ck+1 containing subsets that are not frequent (using
the Apriori property).
4. Scan: Scan the database to determine the support of candidates in Ck+1 .
5. Filter: Retain candidates with support ≥ min_sup to form Lk+1 .
6. Repeat: Increment k and repeat until no new frequent itemsets are found.
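The level-wise loop above can be sketched in Python. The transactions and min_sup are illustrative, and min_sup here is an absolute count; a real implementation would count all candidates of a level in a single database scan:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining using the Apriori pruning property."""
    items = {frozenset([i]) for t in transactions for i in t}
    count = lambda s: sum(1 for t in transactions if s <= t)
    frequent = {s for s in items if count(s) >= min_sup}          # L1
    result = {s: count(s) for s in frequent}
    k = 1
    while frequent:
        # join step: build (k+1)-candidates from frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        # scan + filter: keep candidates meeting min_sup
        frequent = {c for c in candidates if count(c) >= min_sup}
        result.update({c: count(c) for c in frequent})
        k += 1
    return result

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
                {"A", "B", "C"}]
freq = apriori(transactions, min_sup=3)
print(sorted((sorted(s), c) for s, c in freq.items()))
# all 1- and 2-itemsets are frequent; {A,B,C} (count 2) is not
```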

FP-Growth Algorithm (Frequent Pattern Growth)


Apriori suffers from repeated database scans. FP-Growth mines frequent patterns without candidate generation.


Concept: Compresses the database into a frequent-pattern tree (FP-tree), which retains the itemset association information. It then divides the compressed database into a set of conditional databases, each associated with one frequent item, and mines each such database separately.

Figure 6: FP-Tree Structure Example (a root with one branch f:4 → c:3 → a:3 → m:2 and a second branch c:1 → b:1; each node stores an item and its count).


Algorithm Steps:
1. Scan DB once to find frequent 1-itemsets. Sort them in frequency descending order (F-List).
2. Scan DB again to construct the FP-Tree.
3. For each frequent item in F-List (starting from the end), construct its Conditional Pattern
Base and Conditional FP-Tree.
4. Recursively mine the trees.

Advanced Pattern Mining


• Closed Frequent Itemset: An itemset is closed if none of its immediate supersets has the
same support count. It provides a lossless compression of frequent patterns.
• Max Frequent Itemset: An itemset is max frequent if none of its immediate supersets is
frequent. It provides a lossy compression but is much more compact.
• Multilevel Association Rules: Mining rules at different levels of abstraction.
• Level 1: Buys(Computer) ⇒ Buys(Software)
• Level 2: Buys(Laptop) ⇒ Buys(Antivirus)
• Multidimensional Association Rules: Rules involving more than one dimension or predicate.
• Single-dimensional: Buys(Milk) ⇒ Buys(Bread)
• Multi-dimensional: Age(X, ”20-29”) ∧ Occupation(X, ”Student”) ⇒ Buys(X, ”Laptop”)


Unit 5: Data Mining Trends & Research Frontiers

Mining Complex Data Types


Traditional data mining focused on relational data. Modern applications involve complex data
types.

1. Web Mining

Web mining is the application of data mining techniques to discover patterns from the Web.

Web mining has three main branches: Web Content Mining, Web Structure Mining, and Web Usage Mining.

• Web Content Mining: Extraction of useful information from web page contents (text, images, audio, video).
• Web Structure Mining: Discovery of the link structure of the web (hyperlinks). It involves
analyzing the node and connection structure of the web graph.
• Web Usage Mining: Extraction of interesting patterns from web access logs (e.g., user
browsing behavior, clickstreams).

2. Temporal & Spatial Mining

• Temporal Data Mining: Mining data that changes over time (Time-Series).
• Trend Analysis: Long-term movements.
• Cyclic Movements: Recurring variations.
• Seasonal Movements: Periodic variations.
• Spatial Data Mining: Mining knowledge from large amounts of spatial data (maps, GPS,
remote sensing).
• Spatial Associations: ”What features are near parks?”
• Spatial Clustering: Grouping locations.
• Spatial Classification: Classifying regions based on properties.

3. Visual & Audio Mining

• Visual Data Mining: Uses visualization techniques to help users understand and explore
data. It integrates data mining and data visualization.
• Audio Mining: Extracting patterns from audio data (speech recognition, music classification).


Ubiquitous and Invisible Data Mining


• Ubiquitous Data Mining (UDM): Mining data on mobile devices, sensors, and embedded systems. It deals with the challenges of resource-constrained environments (battery, bandwidth, memory).
• Applications: Real-time health monitoring, Intelligent traffic systems.
• Invisible Data Mining: Data mining functionality is embedded seamlessly into daily applications. The user is often unaware that mining is taking place.
• Examples: Google Search results ranking, Amazon product recommendations, Email
Spam filters.

Social Impacts, Privacy, and Ethics


Data mining has profound effects on society, raising significant ethical and privacy concerns.
• Privacy Concerns: Data mining can reveal sensitive personal information (medical, financial, lifestyle) without explicit consent. Aggregation of data from multiple sources can lead to detailed personal profiles ("Big Brother").
• Data Security: Large databases are targets for hackers. Data breaches can expose sensitive information.
• Ethical Issues:
• Discrimination: Profiling can lead to bias (e.g., denying loans or insurance based on
ethnicity or location).
• Transparency: Users often don’t know how their data is being used or what algorithms are making decisions about them.

Trends in Data Mining


• Scalability: Developing algorithms that can handle massive datasets (Big Data) efficiently.
• Interactive Mining: User-guided mining processes where the user can refine the search
during the process.
• Distributed Mining: Mining data that is distributed across multiple locations or servers
without moving all data to a central location.
• Biological Data Mining: Analysis of DNA, protein sequences, and genomic data.
• Graph Mining: Analyzing complex network structures (social networks, chemical compounds).

