Data Mining
References:
S. Brin, R. Motwani, J.D. Ullman, S. Tsur, "Dynamic Itemset Counting and Implication Rules for Market Basket Data", SIGMOD Record, Volume 26, Number 2, New York, June 1997, pp. 255-264.
Su, Yibin, Dynamic Itemset Counting and Implication Rules for Market Basket Data: Project Final Report, CS831, April 2000.
Introduction
Dynamic Itemset Counting (DIC) is an alternative to Apriori Itemset Generation. Itemsets are dynamically added and deleted as transactions are read. The algorithm relies on the fact that for an itemset to be frequent, all of its subsets must also be frequent, so we only examine those itemsets whose subsets are all frequent.
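The subset property can be sketched in Java as follows. This is a minimal illustration and not code taken from dic.java: the class name SubsetCheck and the method allSubsetsFrequent are hypothetical, and itemsets are assumed to be sets of integer item IDs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SubsetCheck {
    // Returns true if every subset obtained by removing exactly one item
    // from candidate is in the collection of known frequent itemsets.
    // Checking these immediate subsets is enough when the collection is
    // itself closed under taking subsets.
    static boolean allSubsetsFrequent(Set<Integer> candidate, Set<Set<Integer>> frequent) {
        for (Integer item : candidate) {
            Set<Integer> subset = new HashSet<>(candidate);
            subset.remove(item);
            // The empty itemset is treated as always frequent.
            if (!subset.isEmpty() && !frequent.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Set<Integer>> frequent = new HashSet<>();
        frequent.add(new HashSet<>(Arrays.asList(1)));
        frequent.add(new HashSet<>(Arrays.asList(2)));
        // {1, 2} is worth examining: both {1} and {2} are frequent.
        System.out.println(allSubsetsFrequent(new HashSet<>(Arrays.asList(1, 2)), frequent));
        // {1, 3} is skipped: {3} is not known to be frequent.
        System.out.println(allSubsetsFrequent(new HashSet<>(Arrays.asList(1, 3)), frequent));
    }
}
```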
Train analogy: There are stations every M transactions. The passengers are itemsets. Itemsets can get on at any stop as long as they get off at the same stop in the next pass around the database. Only itemsets on the train are counted when they occur in transactions. At the very beginning we can start counting 1-itemsets, at the first station we can start counting some of the 2-itemsets. At the second station we can start counting 3-itemsets as well as any more 2-itemsets that can be counted and so on.
Solid box: confirmed frequent itemset - an itemset we have finished counting and whose count meets or exceeds the support threshold minsupp
Solid circle: confirmed infrequent itemset - an itemset we have finished counting and whose count is below minsupp
Dashed box: suspected frequent itemset - an itemset we are still counting whose count already meets or exceeds minsupp
Dashed circle: suspected infrequent itemset - an itemset we are still counting whose count is still below minsupp
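One way to model the four markings is a small Java enum with the two transitions the algorithm needs. This is only a sketch: the names ItemsetMarks, Mark, promote and finish are hypothetical and are not taken from dic.java.

```java
class ItemsetMarks {
    enum Mark {
        DASHED_CIRCLE, DASHED_SQUARE, SOLID_CIRCLE, SOLID_SQUARE;

        // Dashed itemsets are still being counted.
        boolean isDashed() { return this == DASHED_CIRCLE || this == DASHED_SQUARE; }

        // Squares (boxes) are frequent or suspected frequent.
        boolean isSquare() { return this == DASHED_SQUARE || this == SOLID_SQUARE; }

        // The count reached minsupp while still counting: circle becomes square.
        Mark promote() { return this == DASHED_CIRCLE ? DASHED_SQUARE : this; }

        // The itemset has been counted through all transactions: dashed becomes solid.
        Mark finish() {
            if (this == DASHED_CIRCLE) return SOLID_CIRCLE;
            if (this == DASHED_SQUARE) return SOLID_SQUARE;
            return this;
        }
    }

    public static void main(String[] args) {
        Mark m = Mark.DASHED_CIRCLE;
        m = m.promote();
        m = m.finish();
        System.out.println(m); // prints SOLID_SQUARE
    }
}
```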
DIC Algorithm
Algorithm:
1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles. Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
   1. Read M transactions (if we reach the end of the transaction file, continue from the beginning). For each transaction, increment the respective counters for the itemsets that appear in the transaction and are marked with dashes.
   2. If a dashed circle's count reaches minsupp, turn it into a dashed square. If any immediate superset of it has all of its subsets as solid or dashed squares, add a new counter for it and make it a dashed circle.
   3. Once a dashed itemset has been counted through all the transactions, make it solid and stop counting it.
Itemset lattices: An itemset lattice contains all of the possible itemsets for a transaction database. Each itemset in the lattice points to all of its supersets. When represented graphically, an itemset lattice can help us to understand the concepts behind the DIC algorithm.
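A lattice over a small number of items can be sketched with bitmasks, where bit i stands for the i-th item. The sketch below lists, for each itemset over A, B and C, its immediate supersets (the itemsets with exactly one more item, which are the links usually drawn in the diagram). The class ItemsetLattice and its methods are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ItemsetLattice {
    // Immediate supersets of an itemset: add one item that is not yet present.
    static List<Integer> immediateSupersets(int itemset, int numItems) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < numItems; i++) {
            int bit = 1 << i;
            if ((itemset & bit) == 0) {
                result.add(itemset | bit);
            }
        }
        return result;
    }

    // Render a bitmask as letters, e.g. 3 (binary 011) becomes "AB".
    static String label(int itemset) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 26; i++) {
            if ((itemset & (1 << i)) != 0) {
                sb.append((char) ('A' + i));
            }
        }
        return sb.length() == 0 ? "{}" : sb.toString();
    }

    public static void main(String[] args) {
        int numItems = 3;
        for (int s = 0; s < (1 << numItems); s++) {
            StringBuilder line = new StringBuilder(label(s)).append(" ->");
            for (int sup : immediateSupersets(s, numItems)) {
                line.append(' ').append(label(sup));
            }
            System.out.println(line);
        }
    }
}
```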
Transaction Database
Counters: A = 0, B = 0, C = 0. The empty itemset is marked with a solid box. All 1-itemsets are marked with dashed circles.
Counters: A = 2, B = 1, C = 0, AB = 0. We change A and B to dashed boxes because their counters have reached minsup (1), and add a counter for AB because both of its subsets are boxes.
Counters: A = 2, B = 2, C = 1, AB = 0, AC = 0, BC = 0. C changes to a box because its counter has reached minsup. A, B and C have been counted all the way through, so we stop counting them and make their boxes solid. Add counters for AC and BC because their subsets are all boxes.
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 0. AB has been counted all the way through and its counter satisfies minsup, so we change it to a solid box. BC changes to a dashed box.
Counters: A = 2, B = 2, C = 1, AB = 1, AC = 0, BC = 1. AC and BC are counted all the way through. We do not count ABC because one of its subsets is a circle. There are no dashed itemsets left, so the algorithm is done.
Implementation
Go to the DIC Implementation page to see a working implementation in Java.
Operations:
1. add new itemsets
2. maintain a counter for every itemset
3. manage itemset states from dashed to solid and from circle to square when itemsets become large
4. determine which new itemsets should be added because they could potentially be large
Pseudocode Algorithm:
SS = ∅ ; // solid square (frequent)
SC = ∅ ; // solid circle (infrequent)
DS = ∅ ; // dashed square (suspected frequent)
DC = { all 1-itemsets } ; // dashed circle (suspected infrequent)
while (DS != ∅) or (DC != ∅) do begin
    read M transactions from database into T
    forall transactions t ∈ T do begin
        // increment the respective counters of the itemsets marked with dash
        for each itemset c in DS or DC do
            if ( c ⊆ t ) then c.counter++ ;
        for each itemset c in DC do
            if ( c.counter ≥ threshold ) then begin
                move c from DC to DS ;
                if ( any immediate superset sc of c has all of its subsets in SS or DS ) then
                    add a new itemset sc in DC ;
            end
    end
    for each itemset c in DS do
        if ( c has been counted through all transactions ) then move it into SS ;
    for each itemset c in DC do
        if ( c has been counted through all transactions ) then move it into SC ;
end
Answer = { c ∈ SS } ;
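The pseudocode can be turned into a compact, runnable Java sketch. This is an independent illustration, not the dic.java program described in this document. It assumes itemsets and transactions are bitmasks (bit i set means item i is present), that the database is non-empty and M is at least 1, and that promotions are checked at each station (after every M transactions). The names DicSketch, Entry, run and allSubsetsSquare are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DicSketch {
    // Bookkeeping for one itemset that has a counter.
    static class Entry {
        int count;             // occurrences seen so far
        int seen;              // transactions examined since counting began
        boolean square;        // true once count >= threshold
        boolean dashed = true; // true while the itemset is still being counted
    }

    // Returns each frequent itemset (bitmask) with its support count.
    static Map<Integer, Integer> run(int[] db, int numItems, int threshold, int m) {
        Map<Integer, Entry> lattice = new HashMap<>();
        for (int i = 0; i < numItems; i++) lattice.put(1 << i, new Entry());
        int pos = 0;
        boolean anyDashed = numItems > 0;
        while (anyDashed) {
            // Read M transactions, wrapping around at the end of the database.
            for (int k = 0; k < m; k++) {
                int t = db[pos];
                pos = (pos + 1) % db.length;
                for (Map.Entry<Integer, Entry> e : lattice.entrySet()) {
                    Entry en = e.getValue();
                    if (!en.dashed || en.seen == db.length) continue;
                    en.seen++;
                    if ((e.getKey() & t) == e.getKey()) en.count++;
                }
            }
            // Station: turn dashed circles that reached the threshold into squares.
            List<Integer> promoted = new ArrayList<>();
            for (Map.Entry<Integer, Entry> e : lattice.entrySet()) {
                Entry en = e.getValue();
                if (en.dashed && !en.square && en.count >= threshold) {
                    en.square = true;
                    promoted.add(e.getKey());
                }
            }
            // Start counting immediate supersets whose subsets are all squares.
            for (int c : promoted) {
                for (int i = 0; i < numItems; i++) {
                    int sc = c | (1 << i);
                    if (sc != c && !lattice.containsKey(sc) && allSubsetsSquare(sc, numItems, lattice)) {
                        lattice.put(sc, new Entry());
                    }
                }
            }
            // Itemsets counted through the whole database become solid.
            anyDashed = false;
            for (Entry en : lattice.values()) {
                if (en.dashed && en.seen == db.length) en.dashed = false;
                anyDashed |= en.dashed;
            }
        }
        Map<Integer, Integer> frequent = new TreeMap<>();
        for (Map.Entry<Integer, Entry> e : lattice.entrySet()) {
            if (e.getValue().square) frequent.put(e.getKey(), e.getValue().count);
        }
        return frequent;
    }

    // True if every subset of sc with one item removed is a square.
    static boolean allSubsetsSquare(int sc, int numItems, Map<Integer, Entry> lattice) {
        for (int i = 0; i < numItems; i++) {
            int bit = 1 << i;
            if ((sc & bit) == 0) continue;
            int sub = sc & ~bit;
            if (sub == 0) continue; // the empty itemset is always a solid square
            Entry en = lattice.get(sub);
            if (en == null || !en.square) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Five-transaction example database, with A = bit 0 through E = bit 4.
        int[] db = {0b00111, 0b11111, 0b01101, 0b11101, 0b01111};
        // Threshold 2 corresponds to 40% of 5 transactions; M = 5.
        System.out.println(run(db, 5, 2, 5).size());
    }
}
```

Because every itemset is counted over exactly one full pass of the database, the final counts are exact regardless of where counting started, so the set of frequent itemsets should not depend on the choice of M.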
DIC Implementation
Note: The DIC implementation given here may not produce accurate output for small databases (fewer than 100 transactions). To get accurate output for these databases we need to choose step M > 4.
Download the following files:
1. dic.java: The DIC algorithm.
2. config.txt: Consists of four lines.
   1. Number of items
   2. Number of transactions
   3. Minimum support, e.g. 20 represents 20% minsupp
   4. Size of step M for the DIC algorithm. This line is ignored by the Apriori algorithm
3. transa.txt: Contains the transaction database as an n x m table, with n rows and m columns. Each row represents a transaction. Columns are separated by a space and represent items. A 1 indicates that an item is present in the transaction and a 0 indicates that it is not. The sample file has 10000 lines (transactions) with values for 8 items on each line.
Compile the .java file:
Any warning messages about deprecated APIs can be ignored. If you get the following message, you forgot the -deprecation flag:
Note: dic.java uses a deprecated API. Recompile with "-deprecation" for details.
Change config.txt and transa.txt to represent the database and criteria to be tested.
Run the program:
hercules[2]% java dic
Example
We use the database example from Apriori Itemset Generation. The minsupp is 40%.
TID  A  B  C  D  E
T1   1  1  1  0  0
T2   1  1  1  1  1
T3   1  0  1  1  0
T4   1  0  1  1  1
T5   1  1  1  1  0
Transa.txt contains a row for each of the five transactions and a column for each of the five items.
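To make the file layout concrete, the sketch below parses rows in the transa.txt format and counts the support of an itemset. The rows are the five example transactions written out as string literals; no file is read. The class TransaRows and its methods are hypothetical and are not part of dic.java. Items are addressed by 0-based column index here.

```java
import java.util.ArrayList;
import java.util.List;

class TransaRows {
    // Parse one space-separated row of 0/1 flags into a boolean array.
    static boolean[] parseRow(String line) {
        String[] cells = line.trim().split("\\s+");
        boolean[] present = new boolean[cells.length];
        for (int i = 0; i < cells.length; i++) {
            present[i] = cells[i].equals("1");
        }
        return present;
    }

    // Count the transactions that contain every item in items.
    static int support(List<boolean[]> rows, int... items) {
        int count = 0;
        for (boolean[] row : rows) {
            boolean all = true;
            for (int item : items) {
                all &= row[item];
            }
            if (all) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {"1 1 1 0 0", "1 1 1 1 1", "1 0 1 1 0", "1 0 1 1 1", "1 1 1 1 0"};
        List<boolean[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(parseRow(line));
        }
        System.out.println(support(rows, 0, 2)); // A and C together: prints 5
        System.out.println(support(rows, 1, 4)); // B and E together: prints 1
    }
}
```

With minsupp at 40% of five transactions, an itemset needs a count of at least 2, so B and E together (count 1) are not frequent.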
Config.txt: Here we use 5 as the size of step M for the DIC algorithm.
5
5
40
5
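The percentage in config.txt has to be turned into a transaction count before it can be compared with a counter. One plausible conversion, rounding up, is sketched below; this is an assumption for illustration, and dic.java may round differently. The names SupportThreshold and minCount are hypothetical.

```java
class SupportThreshold {
    // Smallest transaction count that is at least percent% of numTransactions.
    static int minCount(int percent, int numTransactions) {
        return (int) (((long) percent * numTransactions + 99) / 100);
    }

    public static void main(String[] args) {
        System.out.println(minCount(40, 5));     // prints 2
        System.out.println(minCount(20, 10000)); // prints 2000
    }
}
```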
Output:
Algorithm apriori starting now.....
Press 'C' to change the default configuration and transaction files or any other key to continue.
Input configuration: 5 items, 5 transactions, minsup = 40%
Frequent 1-itemsets: [1, 2, 3, 4, 5]
Frequent 2-itemsets: [1 2, 1 3, 1 4, 1 5, 2 3, 2 4, 3 4, 3 5, 4 5]
Frequent 3-itemsets: [1 2 3, 1 2 4, 1 3 4, 1 3 5, 1 4 5, 2 3 4, 3 4 5]
Frequent 4-itemsets: [1 2 3 4, 1 3 4 5]
Execution time is: 0 seconds.
hercules[68]%
Execution of dic.java
We get the same results as we did earlier when we did the Apriori algorithm by hand.