
Concept Description: Characterization and Comparison, What is Concept Description, Data Generalization by Attribute-Oriented Induction (AOI), AOI for Data Characterization, Efficient Implementation of AOI.
Mining Frequent Patterns, Associations and Correlations: Basic Concepts, Frequent Itemset Mining Methods: Apriori method, Generating Association Rules, Improving the Efficiency of Apriori, Pattern-Growth Approach for Mining Frequent Itemsets.
What is Concept Description
• Concept description is a fundamental descriptive data mining task
that involves summarizing and comparing data.
• This process can involve two main
tasks: characterization and comparison (or discrimination).

• Characterization: This involves summarizing a collection of data points into a concise and easily understandable form.
Example: If you have data on all students in a school, characterization
might involve summarizing information like average grades or most
common subjects.
• Example:
• Consider a retail company that wants to understand the purchasing
behavior of its customers.
• Data: Customer transactions for the last year.
• Characterization:
• Total number of customers: 10,000
• Average purchase value: $45
• Most frequently purchased items: Coffee, milk, bread
• Customer demographics: 60% female, 40% male, majority in the age group of
25-34
• This characterization helps the company understand its customer base
and their purchasing habits, allowing it to tailor marketing strategies or
product offerings.
Comparison (Discrimination): This involves comparing two or more
collections of data to identify differences between them.
Example:
• A university wants to compare the performance of students in two
different programs: Engineering and Arts.
• Data: Academic records for all students in both programs.
Comparison:
• Engineering Students:
• Average GPA: 3.2
• Common Courses: Mathematics, Physics, Computer Science
• Graduation Rate: 85%
• Arts Students:
• Average GPA: 3.5
• Common Courses: History, Literature, Sociology
• Graduation Rate: 90%
• This comparison reveals that Arts students tend to have slightly higher
GPAs and graduation rates compared to Engineering students.
Techniques and Tools for Concept Description
• Concept description utilizes various data mining techniques and tools to
perform characterization and comparison effectively:
• Statistical Methods: Mean, median, mode, standard deviation, and other
statistical measures provide quick insights into data distributions and
averages.
• Data Visualization: Charts, graphs, and plots (such as histograms, box plots,
and scatter plots) can visually summarize data and highlight key
characteristics or differences.
• Cluster Analysis: Groups data into clusters of similar items, which can then
be summarized to characterize different segments of data.
• Decision Trees: Used for discrimination tasks to understand the rules or
features that best separate different classes or categories.
• Practical Applications of Concept Description
• Market Basket Analysis: Characterize the purchasing behavior of
customers and compare different customer segments to improve
marketing strategies and promotions.
• Healthcare: Characterize patient profiles based on health records and
compare different patient groups to identify risk factors or treatment
outcomes.
• Education: Summarize student performance and compare different
cohorts to identify trends in educational outcomes and areas for
improvement.
Two Main Approaches:
• Attribute-Oriented Induction (AOI)
• Purpose: Generalizes data by abstracting from detailed attributes to higher-
level concepts.
• Example: You have detailed data on item sales (item_ID, name, price). Using
AOI, you might summarize sales data by item category (e.g., electronics,
clothing) rather than by individual items.
• Data Cube (OLAP)
• Purpose: Uses a multidimensional database to perform efficient data
summarization and aggregation.
• Example: A sales database might store data in a cube where you can drill down from yearly totals to monthly or daily sales.
Techniques for Concept Description:
Data Generalization:
• Purpose: Simplifies data by summarizing it at higher levels of
abstraction.
• Example: Instead of looking at individual sales transactions, you might look at total sales per month or by product category.
• Attribute-Oriented Induction
• Purpose: Generalizes data by either removing or generalizing attributes.
• Two Techniques:
• Attribute Removal
• Purpose: Remove attributes that have too many distinct values or that
can't be generalized.
• Example: If you have detailed student records including every class they
attended, you might remove the specific class information to summarize
performance by course type.
• Attribute Generalization
• Purpose: Use generalization operators to simplify detailed attributes into
broader categories.
• Example: Generalizing student ages from exact ages to age ranges (e.g.,
10-12 years old).
AOI for Data Characterization:
Attribute-Oriented Induction (AOI) is a method used in data mining to
generalize and summarize data, providing a high-level view of the data
by reducing the number of attributes and their distinct values.

This approach is particularly useful when working with large datasets where the goal is to derive meaningful patterns or knowledge from the data.
Basic principles of AOI:
A set of basic principles for attribute-oriented induction in relational databases is summarized as follows:
1. Data Focusing
• Principle: AOI begins by focusing on task-relevant data. This means
that only the data relevant to the specific task or analysis is
considered.
2. Attribute Removal
• Principle: Remove an attribute if it has a large set of distinct values
and either:
• There is no generalization operator available for that attribute.
• The higher-level concept of the attribute can be represented by other
attributes.
3. Attribute Generalization
Principle: If there is a large set of distinct values for an attribute and
there exists a generalization operator, use the operator to generalize
the attribute.

Example: If the Age attribute has many distinct values, it can be generalized into age groups (a minimal code sketch of this mapping follows the list below). For example:
Ages 18-25: Young Adults
Ages 26-35: Adults
Ages 36-50: Middle Aged
Ages 51+: Seniors
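Below is a minimal sketch of this generalization step in Python, assuming a small list of (name, age) records; the age_group helper is a hypothetical generalization operator implementing the age bins listed above.

```python
# Hypothetical generalization operator: map an exact age onto the
# broader age-group concepts listed above (assumed illustrative data).
def age_group(age):
    if 18 <= age <= 25:
        return "Young Adult"
    elif 26 <= age <= 35:
        return "Adult"
    elif 36 <= age <= 50:
        return "Middle Aged"
    else:
        return "Senior"

# Example relation: (name, exact age)
records = [("Ann", 19), ("Bob", 27), ("Cara", 42), ("Dev", 63)]

# Replace the detailed Age attribute with its generalized concept
generalized = [(name, age_group(age)) for name, age in records]
print(generalized)
# [('Ann', 'Young Adult'), ('Bob', 'Adult'), ('Cara', 'Middle Aged'), ('Dev', 'Senior')]
```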
4. Attribute Threshold Control
• Principle: To avoid over-generalization, thresholds are set for the
maximum number of distinct values an attribute can have before it is
considered for generalization. Typically, this threshold is set between
2 and 8.
Example: If we set a threshold of 4 for the number of distinct Age Group categories, attribute generalization would be limited to those categories.
Apriori Algorithm:
• The Apriori algorithm, introduced by R. Agrawal and R. Srikant in
1994, is a popular method for mining frequent itemsets and
discovering association rules.
• It uses an iterative, level-wise approach, where frequent 1-itemsets
are used to find frequent 2-itemsets, which in turn are used to find
frequent 3-itemsets, and so on.
Key Terms in Apriori:
Frequent Itemsets: A collection of items that appear together in a
transaction database with a frequency above a user-defined threshold
(minimum support).
Example:
Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}. In these two transactions, the items 2 and 3 appear together in both, so {2, 3} is a frequent itemset.
Candidate set:
A collection of item combinations that are being evaluated to
determine if they frequently occur together in a dataset.
These combinations are generated based on existing frequent itemsets
and are used to identify potential patterns or associations.
• Components of the Apriori Algorithm:
• Support indicates how frequently the rule occurs in the dataset.
1. Support(A→B) = freq(A,B) / N
• Confidence measures how often the rule has been found to be true.
2. Confidence(A→B) = freq(A,B) / freq(A)
• Lift tells how strong the association is.
3. Lift(A→B) = Confidence(A→B) / Support(B)
Steps of the Apriori Algorithm
1. Find Frequent 1-Itemsets:
Scan the transaction data to count the occurrence of each item.
Keep only those items that meet or exceed the minimum support threshold.
2. Generate Candidate 2-Itemsets:
Combine the frequent 1-itemsets to form all possible 2-itemsets.
3. Prune Candidates:
Eliminate candidate 2-itemsets that have any subset which is not frequent.
4. Count Support and Determine Frequent 2-Itemsets:
Scan the transaction data again to count the occurrence of each candidate 2-itemset.
Retain only those 2-itemsets that meet the minimum support threshold.
5. Repeat for Larger Itemsets:
Repeat steps 2 to 4 for 3-itemsets, 4-itemsets, and so on, until no more frequent itemsets
can be generated.
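The steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration with assumed transactions and an absolute minimum support count, not a production implementation.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
]
min_support = 2  # absolute support count (assumed)

def count_support(candidates):
    """Count each candidate and keep those meeting min_support."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

# Step 1: frequent 1-itemsets
items = {item for t in transactions for item in t}
frequent = count_support([frozenset([i]) for i in items])
all_frequent = dict(frequent)

k = 2
while frequent:
    # Step 2: join frequent (k-1)-itemsets into candidate k-itemsets
    prev = list(frequent)
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    # Step 3: prune candidates that have an infrequent (k-1)-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    # Step 4: count support and keep the frequent k-itemsets
    frequent = count_support(candidates)
    all_frequent.update(frequent)
    k += 1  # Step 5: repeat for larger itemsets

for itemset, support in sorted(all_frequent.items(), key=lambda x: -x[1]):
    print(set(itemset), support)
```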
Flow chart for the Apriori algorithm:
Apriori Algorithm Working
Example: Suppose we have the following dataset of transactions; from this dataset, we need to find the frequent itemsets and generate the association rules:
Step 1: Calculate the candidate set C1 and the frequent itemset L1.
Step 2: Candidate generation C2 and frequent itemset L2.
Step 3: Candidate generation C3 and frequent itemset L3.
Advantages of Apriori Algorithm
• The algorithm is easy to understand.
• The join and prune steps of the algorithm can be easily implemented on large datasets.
Disadvantages of Apriori Algorithm
• The Apriori algorithm is slow compared to other algorithms.
• The overall performance can be reduced because it scans the database multiple times.
• The time and space complexity of the Apriori algorithm is O(2^D), which is very high. Here D represents the horizontal width of the database.
Improving the Efficiency of Apriori:
Apriori, while a classic algorithm for frequent itemset mining, may
struggle to handle large datasets due to its efficiency limitations.
Several techniques have been proposed to address these issues:
Hash-Based Technique:
Hash functions can be used to efficiently generate only those candidate
itemsets that are likely to be frequent, reducing the computational
overhead.
By hashing itemsets into buckets, we can quickly identify those that are
unlikely to be frequent and eliminate them from further consideration.
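A sketch of this hash-bucket idea (in the style of DHP) for candidate 2-itemsets; the transactions and number of buckets are illustrative. While scanning for 1-itemsets, every pair in each transaction is hashed into a bucket, and a pair whose bucket count stays below the minimum support cannot be frequent.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "butter"},
    {"milk", "butter", "bread"}, {"milk", "butter"},
]
min_support = 2
NUM_BUCKETS = 7  # illustrative choice

def bucket(pair):
    # Hash a sorted pair of items into one of NUM_BUCKETS buckets
    return hash(tuple(sorted(pair))) % NUM_BUCKETS

# Fill the bucket counts during the first (1-itemset) scan
bucket_counts = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

def may_be_frequent(pair):
    # A pair can only be frequent if its bucket count reaches min_support
    return bucket_counts[bucket(pair)] >= min_support

print(may_be_frequent(("bread", "milk")))  # True: its bucket count is at least 2
```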
Transaction reduction:
• Frequent Itemsets: If a transaction doesn't have any frequent k-
itemsets, it also won’t have frequent (k + 1)-itemsets.
• Transaction Reduction: You can ignore or remove such transactions
from future checks, as they don’t contribute to finding new frequent
itemsets.
• Efficiency: This reduces the number of transactions you need to scan
in later iterations, making the process faster.
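A minimal sketch of transaction reduction, assuming the transactions are sets of items and frequent_k holds the frequent k-itemsets found in the current pass.

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions containing at least one frequent k-itemset;
    the others cannot contribute any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]

# Illustrative use before the (k+1) pass: the {'eggs'} transaction is dropped
transactions = [{"bread", "milk"}, {"eggs"}, {"milk", "butter"}]
frequent_2 = {frozenset({"bread", "milk"}), frozenset({"milk", "butter"})}
print(reduce_transactions(transactions, frequent_2))
```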
Partitioning:
Partitioning is a technique that divides a large dataset into smaller,
manageable parts to improve the efficiency of frequent itemset mining.
Here's how it works:
Divide the dataset: The dataset is split into multiple partitions.
Find local frequent itemsets: Each partition is processed separately to
identify frequent itemsets within that partition.
Combine candidates: The frequent itemsets found in each partition are
combined to form a set of potential global frequent itemsets.
Find global frequent itemsets: A final scan of the entire dataset is done
to determine which of the potential candidates are actually frequent in
the whole dataset.
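A sketch of this two-phase idea; local_frequent_itemsets is a hypothetical helper standing in for any frequent-itemset miner (for example, the Apriori sketch above) that returns a set of frozensets, and the transactions are assumed to be sets of items.

```python
def partition(transactions, n_parts):
    """Split the transaction list into roughly equal-sized partitions."""
    size = max(1, len(transactions) // n_parts)
    return [transactions[i:i + size] for i in range(0, len(transactions), size)]

def partitioned_mining(transactions, min_support_ratio, n_parts, local_frequent_itemsets):
    # Phase 1: locally frequent itemsets in each partition become global candidates
    candidates = set()
    for part in partition(transactions, n_parts):
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= local_frequent_itemsets(part, local_min)
    # Phase 2: one full scan keeps only the globally frequent candidates
    global_min = min_support_ratio * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```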
Sampling:
Sampling is a technique used to improve the efficiency of Apriori by
reducing the amount of data that needs to be processed. Instead of
mining frequent itemsets on the entire dataset, a random sample is taken
and analyzed.
Key Steps:
Random Sampling: A subset of transactions is selected randomly from
the original dataset.
Frequent Itemset Mining: Apriori is applied to the sample to find
frequent itemsets.
Verification: The frequent itemsets found in the sample are verified
against the entire dataset to ensure they are truly frequent.
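A sketch of the sample-then-verify idea; mine_frequent_itemsets is again a hypothetical stand-in for any frequent-itemset miner, and the slightly lowered sample threshold is an illustrative choice to reduce missed itemsets.

```python
import random

def sample_then_verify(transactions, min_support_ratio, sample_frac, mine_frequent_itemsets):
    # Step 1: random sample of the transactions
    sample = random.sample(transactions, max(1, int(sample_frac * len(transactions))))
    # Step 2: mine the sample with a slightly lowered support threshold
    sample_min = max(1, int(0.9 * min_support_ratio * len(sample)))
    candidates = mine_frequent_itemsets(sample, sample_min)
    # Step 3: verify the candidates against the entire dataset
    global_min = min_support_ratio * len(transactions)
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= global_min}
```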
Dynamic itemset counting:
• Database in Blocks: Split the database into blocks with start points.
• Add Candidates Anytime: New itemsets can be added at any start
point, not just before a full scan like in Apriori.
• Count-so-Far: Use the count so far as a minimum estimate. If it meets
the support threshold, add it to the list of frequent itemsets.
• Fewer Scans: This method often needs fewer database scans than
Apriori to find all frequent itemsets.
Frequent Pattern Growth Algorithm:
The two primary drawbacks of the Apriori Algorithm are:
• At each step, candidate sets have to be built.
• To build the candidate sets, the algorithm has to repeatedly scan the
database.
What is FP Growth Algorithm?
• The FP-Growth Algorithm is an alternative way to find frequent itemsets without candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy.
• The core of this method is the use of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.
FP-Tree
• The frequent-pattern tree (FP-tree) is a compact data structure that
stores quantitative information about frequent patterns in a database.
• Each transaction is read and then mapped onto a path in the FP-tree.
This is done until all transactions have been read.
FP-Growth Algorithm Steps
1. Compute Item Frequency: Count occurrences of each item.
2. Filter by Minimum Support: Remove infrequent items.
3. Sort by Frequency: Order items by descending frequency.
4. Create Ordered-Item Sets: Reorder transactions based on sorted frequent
items.
5. Build FP-Tree: Insert ordered-item sets into the FP-tree.
6. Generate Conditional Pattern Bases: Extract prefix paths for each item.
7. Construct Conditional FP-Trees: Build FP-trees from conditional pattern bases.
8. Mine Frequent Patterns: Recursively find frequent patterns from conditional
FP-trees.
9. Generate Association Rules: Create and validate association rules from
frequent patterns.
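A sketch of steps 1-5 (counting, filtering, ordering, and FP-tree insertion); the recursive mining of steps 6-9 is omitted. The transactions are assumed for illustration and are chosen so the frequent-item counts match the L shown in the example that follows.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Steps 1-2: item frequencies, filtered by minimum support
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    for t in transactions:
        # Steps 3-4: keep frequent items, ordered by descending frequency
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        # Step 5: insert the ordered-item set as a path in the FP-tree
        node = root
        for item in ordered:
            child = node.children.setdefault(item, FPNode(item, node))
            child.count += 1
            node = child
    return root, frequent

# Assumed transactions (min_support = 3)
transactions = [list("EKMNOY"), list("DEKNOY"), list("AEKM"), list("CKMUY"), list("CEIKO")]
tree, frequent = build_fp_tree(transactions, 3)
print(frequent)  # {'E': 4, 'K': 5, 'M': 3, 'O': 3, 'Y': 3}
```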
Example:
Step 1: Compute Item Frequency: Count occurrences of each item.
Step 2: Filter by Minimum Support: Remove infrequent items.
Step 3: Sort by Frequency: Order items by descending frequency.
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Step 4: Create Ordered-Item Sets: Reorder transactions based on sorted frequent items.
Step 5: Build FP-Tree: Insert ordered-item sets into the FP-tree.
Step 6: Generate Conditional Pattern Bases: Extract prefix paths for each item.
Step 7: Construct Conditional FP-Trees: Build FP-trees from the conditional pattern bases.
Step 8: Mine Frequent Patterns: Recursively find frequent patterns from the conditional FP-trees.
Advantages of FP Growth Algorithm
• This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions for each iteration.
• The pairing of items is not done in this algorithm, making it faster.
• The database is stored in a compact version in memory.
• It is efficient and scalable for mining both long and short frequent
patterns.
Disadvantages of FP-Growth Algorithm
• Building the FP-tree may be expensive.
• The algorithm may not fit in the shared memory when the database is
large.
