Comparison of Two Association Rule Mining Algorithms
Belgin Ergen, Izmir Institute of Technology
After constructing the tree, the mining proceeds as follows. Start from each frequent length-1 pattern, construct its conditional pattern base, then construct its conditional FP-tree and perform mining recursively on such a tree. The support of a candidate (conditional) itemset is counted by traversing the tree: the sum of the count values at the nodes of the least frequent item gives the support value. The frequent pattern generation process is demonstrated in Figure 1c.

3.1.2 Matrix Apriori

Matrix Apriori [9] is similar to FP-Growth in the database scan step. However, the data structure built for Matrix Apriori is a matrix representing frequent items (MFI) and a vector holding the support of candidates (STE). The search for frequent patterns is executed on these two structures, which are easier to build and use than the FP-tree.

In Figure 2, the Matrix Apriori algorithm is demonstrated. The example database is the same database used in the previous section, and the minimum support value is again 2 (50%). First, a database scan to determine frequent items is executed and a frequent items list is obtained; the list is in descending order of support (see Figure 2a). Following this, a second scan of the database is executed, during which MFI and STE are built as follows. Each transaction is read; an item that is in the frequent items list is represented as 1, and otherwise as 0. This pattern is added as a row to the MFI matrix and its occurrence is set to 1 in the STE vector. While reading the remaining transactions, if a transaction's pattern is already included in MFI, its occurrence in STE is incremented; otherwise it is added to MFI and its occurrence in STE is set to 1. After all transactions are read, the MFI matrix is modified to speed up the frequent pattern search: for each column of MFI, beginning from the first row, the value of a cell is set to the number of the next row in which the item is 1. If there is no 1 in the remaining rows, the value of the cell is set to 1, which means that down to the bottom of the matrix no row contains this item (see Figure 2b).

After constructing the MFI matrix, finding patterns is simple. Beginning from the least frequent item, create candidate itemsets and count their support values. The support value of an itemset is the sum of the STE entries whose indices are the rows of MFI that include all items of the candidate itemset. The frequent itemsets found can be seen in Figure 2c.
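The MFI/STE construction and the support-counting rule of Matrix Apriori can be sketched as follows. This is an illustrative Python reimplementation, not the paper's Pascal code; the function names and the tiny example database in the test are our own, and the column-linking modification of Figure 2b is omitted (it only accelerates the row scans that `support` performs naively here).

```python
from typing import Dict, List

def build_mfi_ste(transactions: List[set], min_support: int):
    """Build the MFI matrix (one 0/1 row per distinct transaction
    pattern over the frequent items) and the STE vector (how many
    transactions produced each row)."""
    counts: Dict[str, int] = {}
    for t in transactions:                       # first scan: count items
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # frequent items in descending support order (ties broken by name)
    freq = sorted((i for i, c in counts.items() if c >= min_support),
                  key=lambda i: (-counts[i], i))

    mfi: List[List[int]] = []
    ste: List[int] = []
    for t in transactions:                       # second scan
        row = [1 if item in t else 0 for item in freq]
        if not any(row):
            continue                             # no frequent item at all
        if row in mfi:
            ste[mfi.index(row)] += 1             # pattern seen before
        else:
            mfi.append(row)                      # new pattern
            ste.append(1)
    return freq, mfi, ste

def support(candidate: List[str], freq, mfi, ste) -> int:
    """Support of a candidate itemset: the sum of the STE entries of
    every MFI row that contains all items of the candidate."""
    cols = [freq.index(item) for item in candidate]
    return sum(s for row, s in zip(mfi, ste) if all(row[c] for c in cols))
```

For example, with the (hypothetical) transactions {A,B}, {A,C}, {A,B}, {D} and a minimum support of 2, `freq` is `["A", "B"]`, MFI is `[[1, 1], [1, 0]]` with STE `[2, 1]`, and `support(["A", "B"], ...)` returns 2.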
It will be beneficial to give a short comparison of the given algorithms with an example showing their execution. The first scans of both algorithms are carried out in the same way: frequent items are found and listed in order. During the second scan, FP-Growth adds transactions to the tree structure and Matrix Apriori to the matrix structure. Adding a transaction to the tree structure needs less checking than adding it to the matrix structure. For example, consider the 2nd and 3rd transactions. The second transaction is added as a branch to the tree and as a row to the matrix, but the addition of the third transaction shows the difference. For the tree structure, we need to check only the branch that has the same prefix as our transaction, so adding a new branch at node E is enough. For the matrix structure, on the other hand, we need to check all items of the rows: if we find the same pattern, we increment the related entry of STE; otherwise we keep scanning the matrix, and if the pattern is not found, a new row is added. Building the matrix thus needs more checking and time; however, the matrix structure is easier to manage than the tree structure.

Finding patterns requires producing and checking candidate itemsets in both algorithms. This structure is called the conditional pattern base in FP-Growth; it has no specific name in Matrix Apriori. Counting support values is easy to handle in Matrix Apriori, whereas in FP-Growth traversing the tree is complex.

3.2 Implementation

In this section, we give brief information about the implementation of the algorithms. The algorithms explained in the previous section are coded as understood from the related papers [8, 9]. For both algorithms, the dataset file is read to obtain the number of transactions, the number of items and the item names; a temporary file is created, and the data mining process is carried out on this file. In this paper, the term database refers to this temporary file.
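The prefix-sharing behaviour described above (only the branch with the same prefix as the incoming transaction needs checking) can be sketched as follows. This is a minimal illustration in Python, not the authors' Pascal implementation; the item labels in the example are ours.

```python
class FPNode:
    """Minimal FP-tree node: an item label, a count, and child nodes."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}          # item -> FPNode

def insert_transaction(root: FPNode, items):
    """Insert a transaction whose items are already filtered to the
    frequent ones and sorted in frequency order. Only the branch sharing
    the transaction's prefix is followed; a new branch starts where the
    shared prefix ends."""
    node = root
    for item in items:
        if item not in node.children:
            node.children[item] = FPNode(item)   # branch point
        node = node.children[item]
        node.count += 1
    return root
```

Inserting [E, A] and then [E, B] reuses the node for E (its count becomes 2) and branches only below it, whereas the matrix structure would compare the second transaction against every stored row.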
3.2.1 FP-Growth

The implementation of FP-Growth is divided into three steps.
Step 1: The database is read and the count of each item is found. According to the minimum support threshold, frequent items are selected and sorted.
Step 2: The FP-tree is initialized. From the frequent items, a node list is created which will be connected to the nodes of the tree. After initialization, the database is read again; this time, if an item in a transaction is frequent, it is added to the tree structure.
Step 3: Beginning from the least frequent item, a frequent pattern finder procedure is called recursively. The support counts of the patterns are found, and the patterns are displayed if they are frequent.

3.2.2 Matrix Apriori

The implementation of Matrix Apriori is divided into four steps.
Step 1: This is carried out in the same way as step 1 of FP-Growth.
Step 2: MFI is initialized: according to the frequent items list, the first row of MFI is created. After initialization, the database is read again. Each transaction is converted to an array whose length is that of one MFI row. If MFI already contains a pattern identical to the transaction's array, its occurrence is incremented; otherwise a new row is added to MFI.
Step 3: MFI is modified. This modification speeds up the pattern search and support counting process.
Step 4: Similarly to FP-Growth, beginning from the least frequent item, a procedure is called recursively and the support values of the patterns are counted.

The implementation steps of the algorithms are explained above. Step 1 of both implementations is identical. In step 2, the procedures used for reading the database are the same in both algorithm codes; building the data structures differs, and the additional step of modifying MFI is needed for Matrix Apriori. The candidate generation procedures of the two algorithms are equivalent, but support counting is clearly different.

4. Performance Evaluation

In this section, we compare the Matrix Apriori and FP-Growth algorithms based on the publications discussed in the previous chapter. Both algorithms are coded using the Lazarus IDE (02.96.2), which uses the Pascal programming language. The ARtool (1.1.2) dataset generator is used for our synthetic datasets. Two case studies analyzing the algorithms are carried out step by step using two synthetic datasets, generated in order i) to see their performance on datasets having different characteristics, and ii) to understand the causes of performance differences in different phases. In order to keep the system state similar for all test runs, we ensured that all background jobs consuming system resources were inactive. It was also verified that repeated test runs give close results.

4.1 Simulation Environment

The test runs are performed on a computer with a 2.4 GHz dual-core processor and 3 GB of memory. At each run, both programs report the following about the data mining process:
- time cost of the first scan of the database,
- number of frequent items found at the first scan of the database,
- time cost of the second scan of the database and of building the data structure,
- time cost of finding frequent itemsets,
- number of frequent itemsets found after the mining process,
- total time cost of the whole data mining process.

The procedures for the first database scan are the same for both algorithms, so its time cost is identical. In our case studies, we call the first phase the first scan of the database together with the second scan performed to build the specific data structure; the second phase is traversing the data structures created in the first phase in order to find the frequent itemsets.

Although real-life data has different characteristics from synthetically generated data, as mentioned in [15], we used synthetic data since its parameters are easy to control. In [16], the drawbacks of using real-world and synthetic data and a comparison of some dataset generators are given. Our aim was to have datasets with different characteristics representing the needs of different domains.

The synthetic databases are generated using the ARtool software [17]. ARtool generates a database according to parameters such as the number of items, number of transactions, average size of transactions, number of patterns and average size of patterns. Two datasets are generated, varying the parameters number of items and average size of patterns, in order to obtain dataset characteristics of different domains: one dataset is characterized by long patterns and low diversity of items, the other by short patterns and high diversity of items. These differences affect the size of the algorithms' specific data structures and hence the run times.

In the following subsections, the performance analysis of the algorithms for the two case studies is given. For the generated datasets, we aimed to observe how changing the minimum support affects the performance of the algorithms. The algorithms are compared for six minimum support values in the range of 15% to 2.5%.
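The per-phase time costs listed above can be collected with a small harness along the following lines. This is a generic sketch using Python's `time.perf_counter`, not the instrumentation actually used in the Lazarus/Pascal programs; `first_scan` is a hypothetical stand-in for a real mining phase.

```python
import time

def timed_ms(fn, *args):
    """Run one phase and return its result together with the elapsed
    wall-clock time in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

# Example: timing a stand-in first phase on a toy database.
def first_scan(db):
    # placeholder for the real phase (item counting, structure building)
    return sorted(set(db))

items, t1 = timed_ms(first_scan, [3, 1, 2, 1])
# items holds the phase result; t1 is its time cost in milliseconds
```

Summing the per-phase values obtained this way gives the total time cost reported for the whole data mining process.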
4.2 Case1: Database of Long Patterns with Low Diversity of Items

A database is generated to have long patterns and low diversity of items, with number of items = 10000, number of transactions = 30000, average size of transactions = 20 and average size of patterns = 10. The number of frequent items is given in Figure 3a and the number of frequent itemsets in Figure 3b as the minimum support value is varied. It is clear that decreasing the minimum support increases the number of frequent items from 16 to 240 and the number of frequent itemsets from 1014 to 198048.

[Figure 3: item count (a) and itemset count (b) versus minimum support (%)]

The total performance of Matrix Apriori and FP-Growth is demonstrated in Figure 4. It is seen that their performance is identical for minimum support values above 7.5%. On the other hand, below a 7.5% minimum support value Matrix Apriori performs clearly better, such that at the 2.5% threshold it is 230% faster.

[Figure 4. Total performance for Case1: time (ms) versus minimum support (%)]

The reason for FP-Growth's falling behind in total performance can be understood by looking at the performance of the individual phases of evaluation. The first phase performances of the algorithms, demonstrated in Figure 5a, show that building the matrix data structure of Matrix Apriori needs 20% to 177% more time than building the tree data structure of FP-Growth. The first phase of Matrix Apriori follows a pattern similar to the number of frequent items demonstrated in Figure 3a.

The second phase of evaluation is finding the frequent itemsets. As displayed in Figure 5b, Matrix Apriori is faster at minimum support values below 10%, although at the 10% threshold FP-Growth is 20% faster.

[Figure 5. (a) First phase performance for Case1, (b) Second phase performance for Case1: time (ms) versus minimum support (%)]

4.3 Case2: Database of Short Patterns with High Diversity of Items

A database is generated for short patterns and high diversity of items using the parameters number of items = 30000, number of transactions = 30000, average size of transactions = 20 and average size of patterns = 5. The change in the counts of frequent items and itemsets is given in Figures 6a and 6b respectively. The number of frequent items found changes from 58 to 127, and the number of frequent itemsets found from 254 to 71553, with decreasing minimum support values.

[Figure 6. (a) Number of frequent items, (b) Number of frequent itemsets for Case2]

The total performance of both algorithms is given in Figure 7. Increasing the minimum support decreases the runtime of both algorithms. For minimum support values of 12.5% and 15%, FP-Growth performed faster by up to 56%; however, for lower minimum support values Matrix Apriori performed better, by up to 150%.

[Figure 7. Total performance for Case2: time (ms) versus minimum support (%)]

The first phase performance of the algorithms is demonstrated in Figure 7a. FP-Growth is observed to have the better first phase performance.

[Chart: time (ms) versus minimum support (%) for Matrix Apriori and FP-Growth]

… faster, leading to the better total performance of Matrix Apriori.

Our second case study is performed on a database of short patterns with high diversity of items. It is seen that at 12.5%-15% minimum support values the performances of both algorithms are close; however, below the 12.5% value the performance gap between the algorithms grows in favor of Matrix Apriori. It is seen that having more items and a shorter average pattern length caused both algorithms to have higher runtimes than in the first case study: at 15%, in the first case study 1014 itemsets are found in 1031-1078 ms, whereas in the second case study 254 itemsets are found in 12172-19030 ms. In addition, for all threshold values the first phase runtimes are higher in the second case study.

Common points in both case studies are: i) Matrix Apriori is faster in the itemset finding phase than FP-Growth and slower in the data structure building phase; ii) for threshold values below 10%, Matrix Apriori is more efficient, by up to 230%; iii) the first phase performance of Matrix Apriori is correlated with the number of frequent items; iv) the second phase performance of FP-Growth is correlated with the number of frequent itemsets.