
Concept Description: Characterization and Comparison, What is Concept Description, Data Generalization by Attribute-Oriented Induction (AOI), AOI for Data Characterization, Efficient Implementation of AOI.
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Frequent Itemset Mining Methods: Apriori method, Generating Association Rules, Improving the Efficiency of Apriori, Pattern-Growth Approach for Mining Frequent Itemsets.
What is Concept Description
• Concept description is a fundamental descriptive data mining task
that involves summarizing and comparing data.
• This process can involve two main
tasks: characterization and comparison (or discrimination).

• Characterization: This involves summarizing a collection of data points into a concise and easily understandable form.
Example: If you have data on all students in a school, characterization
might involve summarizing information like average grades or most
common subjects.
• Example:
• Consider a retail company that wants to understand the purchasing
behavior of its customers.
• Data: Customer transactions for the last year.
• Characterization:
• Total number of customers: 10,000
• Average purchase value: $45
• Most frequently purchased items: Coffee, milk, bread
• Customer demographics: 60% female, 40% male, majority in the age group of
25-34
• This characterization helps the company understand its customer base
and their purchasing habits, allowing it to tailor marketing strategies or
product offerings.
Comparison (Discrimination): This involves comparing two or more
collections of data to identify differences between them.
Example:
• A university wants to compare the performance of students in two
different programs: Engineering and Arts.
• Data: Academic records for all students in both programs.
Comparison:
• Engineering Students:
• Average GPA: 3.2
• Common Courses: Mathematics, Physics, Computer Science
• Graduation Rate: 85%
• Arts Students:
• Average GPA: 3.5
• Common Courses: History, Literature, Sociology
• Graduation Rate: 90%
• This comparison reveals that Arts students tend to have slightly higher
GPAs and graduation rates compared to Engineering students.
Techniques and Tools for Concept Description
• Concept description utilizes various data mining techniques and tools to
perform characterization and comparison effectively:
• Statistical Methods: Mean, median, mode, standard deviation, and other
statistical measures provide quick insights into data distributions and
averages.
• Data Visualization: Charts, graphs, and plots (such as histograms, box plots,
and scatter plots) can visually summarize data and highlight key
characteristics or differences.
• Cluster Analysis: Groups data into clusters of similar items, which can then
be summarized to characterize different segments of data.
• Decision Trees: Used for discrimination tasks to understand the rules or
features that best separate different classes or categories.
• Practical Applications of Concept Description
• Market Basket Analysis: Characterize the purchasing behavior of
customers and compare different customer segments to improve
marketing strategies and promotions.
• Healthcare: Characterize patient profiles based on health records and
compare different patient groups to identify risk factors or treatment
outcomes.
• Education: Summarize student performance and compare different
cohorts to identify trends in educational outcomes and areas for
improvement.
Two Main Approaches:
• Attribute-Oriented Induction (AOI)
• Purpose: Generalizes data by abstracting from detailed attributes to higher-
level concepts.
• Example: You have detailed data on item sales (item_ID, name, price). Using
AOI, you might summarize sales data by item category (e.g., electronics,
clothing) rather than by individual items.
• Data Cube (OLAP)
• Purpose: Uses a multidimensional database to perform efficient data
summarization and aggregation.
• Example: A sales database might store data in a cube where you can drill down from yearly totals to monthly or daily sales.
Techniques for Concept Description:
Data Generalization:
• Purpose: Simplifies data by summarizing it at higher levels of
abstraction.
• Example: Instead of looking at individual sales transactions, you might look at total sales per month or by product category.
• Attribute-Oriented Induction
• Purpose: Generalizes data by either removing or generalizing attributes.
• Two Techniques:
• Attribute Removal
• Purpose: Remove attributes that have too many distinct values or that
can't be generalized.
• Example: If you have detailed student records including every class they
attended, you might remove the specific class information to summarize
performance by course type.
• Attribute Generalization
• Purpose: Use generalization operators to simplify detailed attributes into
broader categories.
• Example: Generalizing student ages from exact ages to age ranges (e.g.,
10-12 years old).
AOI for Data Characterization:
Attribute-Oriented Induction (AOI) is a method used in data mining to
generalize and summarize data, providing a high-level view of the data
by reducing the number of attributes and their distinct values.

This approach is particularly useful when working with large datasets where the goal is to derive meaningful patterns or knowledge from the data.
Basic principles of AOI:
A set of basic principles for attribute-oriented induction in relational databases is summarized as follows:
1. Data Focusing
• Principle: AOI begins by focusing on task-relevant data. This means
that only the data relevant to the specific task or analysis is
considered.
2. Attribute Removal
• Principle: Remove an attribute if it has a large set of distinct values
and either:
• There is no generalization operator available for that attribute.
• The higher-level concept of the attribute can be represented by other
attributes.
3. Attribute Generalization
Principle: If there is a large set of distinct values for an attribute and
there exists a generalization operator, use the operator to generalize
the attribute.

Example: If the Age attribute has many distinct values, it can be generalized into age groups (a minimal code sketch of this mapping follows the list below). For example:
Ages 18-25: Young Adults
Ages 26-35: Adults
Ages 36-50: Middle Aged
Ages 51+: Seniors
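Below is a minimal sketch of this generalization step in Python, assuming a small list of (name, age) records; the age_group helper is a hypothetical generalization operator implementing the age bins listed above.

```python
# Hypothetical generalization operator: map an exact age onto the
# broader age-group concepts listed above (assumed illustrative data).
def age_group(age):
    if 18 <= age <= 25:
        return "Young Adult"
    elif 26 <= age <= 35:
        return "Adult"
    elif 36 <= age <= 50:
        return "Middle Aged"
    else:
        return "Senior"

# Example relation: (name, exact age)
records = [("Ann", 19), ("Bob", 27), ("Cara", 42), ("Dev", 63)]

# Replace the detailed Age attribute with its generalized concept
generalized = [(name, age_group(age)) for name, age in records]
print(generalized)
# [('Ann', 'Young Adult'), ('Bob', 'Adult'), ('Cara', 'Middle Aged'), ('Dev', 'Senior')]
```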
4. Attribute Threshold Control
• Principle: To avoid over-generalization, thresholds are set for the
maximum number of distinct values an attribute can have before it is
considered for generalization. Typically, this threshold is set between
2 and 8.
Example: If we set a threshold of 4 for the number of distinct Age Group categories, attribute generalization would be limited to those categories.
Apriori Algorithm:
• The Apriori algorithm, introduced by R. Agrawal and R. Srikant in
1994, is a popular method for mining frequent itemsets and
discovering association rules.
• It uses an iterative, level-wise approach, where frequent 1-itemsets
are used to find frequent 2-itemsets, which in turn are used to find
frequent 3-itemsets, and so on.
Key Terms in Apriori:
Frequent Itemsets: A collection of items that appear together in a
transaction database with a frequency above a user-defined threshold
(minimum support).
Example:
Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}. In these two transactions, the items 2 and 3 appear together in both, so {2, 3} is a frequent itemset.
Candidate set:
A collection of item combinations that are being evaluated to
determine if they frequently occur together in a dataset.
These combinations are generated based on existing frequent itemsets
and are used to identify potential patterns or associations.
• Components of the Apriori Algorithm:
• Support indicates how frequently the rule occurs in the dataset.
1. Support(A→B) = freq(A,B) / N
• Confidence measures how often the rule has been found to be true.
2. Confidence(A→B) = freq(A,B) / freq(A)
• Lift tells how strong the association is.
3. Lift(A→B) = Confidence(A→B) / Support(B)
Steps of the Apriori Algorithm
1. Find Frequent 1-Itemsets:
Scan the transaction data to count the occurrence of each item.
Keep only those items that meet or exceed the minimum support threshold.
2. Generate Candidate 2-Itemsets:
Combine the frequent 1-itemsets to form all possible 2-itemsets.
3. Prune Candidates:
Eliminate candidate 2-itemsets that have any subset which is not frequent.
4. Count Support and Determine Frequent 2-Itemsets:
Scan the transaction data again to count the occurrence of each candidate 2-itemset.
Retain only those 2-itemsets that meet the minimum support threshold.
5. Repeat for Larger Itemsets:
Repeat steps 2 to 4 for 3-itemsets, 4-itemsets, and so on, until no more frequent itemsets
can be generated.
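The steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration with assumed transactions and an absolute minimum support count, not a production implementation.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
]
min_support = 2  # absolute support count (assumed)

def count_support(candidates):
    """Count each candidate and keep those meeting min_support."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

# Step 1: frequent 1-itemsets
items = {item for t in transactions for item in t}
frequent = count_support([frozenset([i]) for i in items])
all_frequent = dict(frequent)

k = 2
while frequent:
    # Step 2: join frequent (k-1)-itemsets into candidate k-itemsets
    prev = list(frequent)
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Step 3: prune candidates that have an infrequent (k-1)-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    # Step 4: count support and keep the frequent k-itemsets
    frequent = count_support(candidates)
    all_frequent.update(frequent)
    k += 1  # Step 5: repeat for larger itemsets

for itemset, support in sorted(all_frequent.items(), key=lambda x: -x[1]):
    print(set(itemset), support)
```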
Flow chart for the Apriori algorithm:
Apriori Algorithm Working
Example: Suppose we have the following dataset of transactions; from this dataset, we need to find the frequent itemsets and generate the association rules:
Step 1: Calculate the candidate set C1 and the frequent itemset L1.
Step 2: Candidate generation C2 and frequent itemset L2.
Step 3: Candidate generation C3 and frequent itemset L3.
Advantages of Apriori Algorithm
• The algorithm is easy to understand.
• The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages of Apriori Algorithm
• The Apriori algorithm is slow compared to other algorithms.
• The overall performance can be reduced because it scans the database multiple times.
• The time and space complexity of the Apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width of the database.
Improving the Efficiency of Apriori:
Apriori, while a classic algorithm for frequent itemset mining, may
struggle to handle large datasets due to its efficiency limitations.
Several techniques have been proposed to address these issues:
Hash-Based Technique:
Hash functions can be used to efficiently generate only those candidate
itemsets that are likely to be frequent, reducing the computational
overhead.
By hashing itemsets into buckets, we can quickly identify those that are
unlikely to be frequent and eliminate them from further consideration.
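A sketch of this hash-bucket idea (in the style of DHP) for candidate 2-itemsets; the transactions and number of buckets are illustrative. While scanning for 1-itemsets, every pair in each transaction is hashed into a bucket, and a pair whose bucket count stays below the minimum support cannot be frequent.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "butter"},
    {"milk", "butter", "bread"}, {"milk", "butter"},
]
min_support = 2
NUM_BUCKETS = 7  # illustrative choice

def bucket(pair):
    # Hash a sorted pair of items into one of NUM_BUCKETS buckets
    return hash(tuple(sorted(pair))) % NUM_BUCKETS

# Fill the bucket counts during the first (1-itemset) scan
bucket_counts = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

def may_be_frequent(pair):
    # A pair can only be frequent if its bucket count reaches min_support
    return bucket_counts[bucket(pair)] >= min_support

print(may_be_frequent(("bread", "milk")))  # True: its bucket count is at least 2
```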
Transaction reduction:
• Frequent Itemsets: If a transaction doesn't have any frequent k-
itemsets, it also won’t have frequent (k + 1)-itemsets.
• Transaction Reduction: You can ignore or remove such transactions
from future checks, as they don’t contribute to finding new frequent
itemsets.
• Efficiency: This reduces the number of transactions you need to scan
in later iterations, making the process faster.
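A minimal sketch of transaction reduction, assuming the transactions are sets of items and frequent_k holds the frequent k-itemsets found in the current pass.

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions containing at least one frequent k-itemset;
    the others cannot contribute any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]

# Illustrative use before the (k+1) pass: the {'eggs'} transaction is dropped
transactions = [{"bread", "milk"}, {"eggs"}, {"milk", "butter"}]
frequent_2 = {frozenset({"bread", "milk"}), frozenset({"milk", "butter"})}
print(reduce_transactions(transactions, frequent_2))
```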
Partitioning:
Partitioning is a technique that divides a large dataset into smaller,
manageable parts to improve the efficiency of frequent itemset mining.
Here's how it works:
Divide the dataset: The dataset is split into multiple partitions.
Find local frequent itemsets: Each partition is processed separately to
identify frequent itemsets within that partition.
Combine candidates: The frequent itemsets found in each partition are
combined to form a set of potential global frequent itemsets.
Find global frequent itemsets: A final scan of the entire dataset is done
to determine which of the potential candidates are actually frequent in
the whole dataset.
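A sketch of this two-phase idea; local_frequent_itemsets is a hypothetical helper standing in for any frequent-itemset miner (for example, the Apriori sketch above) that returns a set of frozensets, and the transactions are assumed to be sets of items.

```python
def partition(transactions, n_parts):
    """Split the transaction list into roughly equal-sized partitions."""
    size = max(1, len(transactions) // n_parts)
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]

def partitioned_mining(transactions, min_support_ratio, n_parts, local_frequent_itemsets):
    # Phase 1: locally frequent itemsets in each partition become global candidates
    candidates = set()
    for part in partition(transactions, n_parts):
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= local_frequent_itemsets(part, local_min)
    # Phase 2: one full scan keeps only the globally frequent candidates
    global_min = min_support_ratio * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```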
Sampling:
Sampling is a technique used to improve the efficiency of Apriori by
reducing the amount of data that needs to be processed. Instead of
mining frequent itemsets on the entire dataset, a random sample is taken
and analyzed.
Key Steps:
Random Sampling: A subset of transactions is selected randomly from
the original dataset.
Frequent Itemset Mining: Apriori is applied to the sample to find
frequent itemsets.
Verification: The frequent itemsets found in the sample are verified
against the entire dataset to ensure they are truly frequent.
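A sketch of the sample-then-verify idea; mine_frequent_itemsets is again a hypothetical stand-in for any frequent-itemset miner, and the slightly lowered sample threshold is an illustrative choice to reduce missed itemsets.

```python
import random

def sample_then_verify(transactions, min_support_ratio, sample_frac, mine_frequent_itemsets):
    # Step 1: random sample of the transactions
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    # Step 2: mine the sample with a slightly lowered support threshold
    sample_min = max(1, int(0.9 * min_support_ratio * len(sample)))
    candidates = mine_frequent_itemsets(sample, sample_min)
    # Step 3: verify the candidates against the entire dataset
    global_min = min_support_ratio * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```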
Dynamic itemset counting:
• Database in Blocks: Split the database into blocks with start points.
• Add Candidates Anytime: New itemsets can be added at any start
point, not just before a full scan like in Apriori.
• Count-so-Far: Use the count so far as a minimum estimate. If it meets
the support threshold, add it to the list of frequent itemsets.
• Fewer Scans: This method often needs fewer database scans than
Apriori to find all frequent itemsets.
Frequent Pattern Growth Algorithm:
The two primary drawbacks of the Apriori Algorithm are:
• At each step, candidate sets have to be built.
• To build the candidate sets, the algorithm has to repeatedly scan the
database.
What is FP Growth Algorithm?
• The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy.
• The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
FP-Tree
• The frequent-pattern tree (FP-tree) is a compact data structure that
stores quantitative information about frequent patterns in a database.
• Each transaction is read and then mapped onto a path in the FP-tree.
This is done until all transactions have been read.
FP-Growth Algorithm Steps
1. Compute Item Frequency: Count occurrences of each item.
2. Filter by Minimum Support: Remove infrequent items.
3. Sort by Frequency: Order items by descending frequency.
4. Create Ordered-Item Sets: Reorder transactions based on sorted frequent
items.
5. Build FP-Tree: Insert ordered-item sets into the FP-tree.
6. Generate Conditional Pattern Bases: Extract prefix paths for each item.
7. Construct Conditional FP-Trees: Build FP-trees from conditional pattern bases.
8. Mine Frequent Patterns: Recursively find frequent patterns from conditional
FP-trees.
9. Generate Association Rules: Create and validate association rules from
frequent patterns.
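A sketch of steps 1-5 (counting, filtering, ordering, and FP-tree insertion); the recursive mining of steps 6-9 is omitted. The transactions are assumed for illustration and are chosen so the frequent-item counts match the L shown in the example that follows.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Steps 1-2: item frequencies, filtered by minimum support
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    for t in transactions:
        # Steps 3-4: keep frequent items, ordered by descending frequency
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        # Step 5: insert the ordered-item set as a path in the FP-tree
        node = root
        for item in ordered:
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1
            node = child
    return root, frequent

# Assumed transactions (min_support = 3)
transactions = [list("EKMNOY"), list("DEKNOY"), list("AEKM"), list("CKMUY"), list("CEIKO")]
tree, frequent = build_fp_tree(transactions, 3)
print(frequent)  # {'E': 4, 'K': 5, 'M': 3, 'O': 3, 'Y': 3}
```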
Example:
Step 1: Compute Item Frequency: Count occurrences of each item.
Step 2: Filter by Minimum Support: Remove infrequent items.
Step 3: Sort by Frequency: Order items by descending frequency.
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Step 4: Create Ordered-Item Sets: Reorder transactions based on sorted frequent items.
Step 5: Build FP-Tree: Insert ordered-item sets into the FP-tree.
Step 6: Generate Conditional Pattern Bases: Extract prefix paths for each item.
Step 7: Construct Conditional FP-Trees: Build FP-trees from the conditional pattern bases.
Step 8: Mine Frequent Patterns: Recursively find frequent patterns from the conditional FP-trees.
Advantages of FP Growth Algorithm
• This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions for each iteration.
• The pairing of items is not done in this algorithm, making it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent
patterns.
Disadvantages of FP-Growth Algorithm
• Building the FP-tree may be expensive.
• The algorithm may not fit in the shared memory when the database is
large.
