Associative Classification Approaches: Review and Comparison
Journal of Information & Knowledge Management, Vol. 13, No. 3 (2014) 1450027 (30 pages)
© World Scientific Publishing Co.
DOI: 10.1142/S0219649214500270
Neda Abdelhamid
Computing and Informatics Department
De Montfort University, Leicester, UK
[email protected]
Fadi Thabtah
Ebusiness Department
Canadian University of Dubai, Dubai, UAE
[email protected]
Abstract. Associative classification (AC) is a promising data mining approach that integrates classification and association rule discovery to build classification models (classifiers). In the last decade, several AC algorithms have been proposed, such as Classification based Association (CBA), Classification based on Predicted Association Rule (CPAR), Multi-class Classification using Association Rule (MCAR), Live and Let Live (L3) and others. These algorithms use different procedures for rule learning, rule sorting, rule pruning, classifier building and class allocation for test cases. This paper sheds light on and critically compares common AC algorithms with reference to the above-mentioned procedures. Moreover, data representation formats in AC mining are discussed along with potential new research directions.

Keywords: Associative classification; classification; data mining; rule learning; rule sorting; pruning; prediction.

1. Introduction

Association rule discovery and classification are closely related data mining tasks, with the exception that association rule discovery finds relationships among attribute values in a database whereas classification's goal is to allocate class labels to unseen data, known as the test data set, as correctly as possible. The joining of association rule and classification came to the surface as a promising research discipline named associative classification (AC) during the year 1998 in a paper titled "Integrating classification and association rule" (Liu et al., 1998). In AC mining, the training phase is about searching for hidden knowledge primarily using association rule algorithms, and then a classifier is constructed after sorting the knowledge and pruning useless and redundant rules. Many research studies, including (Yin and Han, 2003; Thabtah et al., 2005; Li et al., 2008; Ye et al., 2008; Niu et al., 2009; Thabtah et al., 2010; Baralis and Garza, 2012; Abdelhamid et al., 2012a; Zhu et al., 2012; Jabbar et al., 2013; Taiwiah and Sheng, 2013), revealed that AC methods usually extract better classifiers with reference to error rate than other classification data mining approaches like decision trees (Quinlan, 1993) and rule induction (Jensen and Cohen, 2000).

Normally, an AC algorithm operates in three main phases. During the first phase, it looks for hidden correlations among the attribute values and the class attribute values in the training data set and generates them as "Class Association Rules" (CARs) in "IF-THEN" format (Thabtah et al., 2010). After the complete set of CARs is found, ranking and pruning procedures (phase 2) start operating, where the ranking procedure sorts rules according to certain thresholds such as confidence and support (Li et al., 2008). Further, during pruning, contradicting and duplicated rules are discarded from the complete set of CARs. The output of phase 2 is the set of CARs which represents the classifier. Lastly, the derived classifier gets tested on a new independent data set to measure its effectiveness in forecasting the class of unseen test cases. The output of the last phase is the accuracy or error rate of the classifier.
Research studies, for instance (Veloso et al., 2007; Wang et al., 2011), have shown that AC has two distinguishing features over other traditional classification approaches. The first one is that it produces very simple knowledge (rules) that can be easily interpreted and manually updated by the end-user. Secondly, this approach often finds additional useful hidden knowledge missed by other classification algorithms, and therefore the error rate of the resulting classifier is minimised. The main reason behind producing the additional knowledge is that AC utilises association rule methods in the training phase (Liu et al., 1998; Thabtah et al., 2004), where all possible relationships among the attribute values in the training data set and the class attribute are found and extracted. Though, in some cases the possible number of derived rules may become excessive (Li et al., 2001; Al-Maqaleh, …

The main goal of this paper is to survey and compare the state-of-the-art AC techniques with reference to the different procedures employed during the algorithm's lifecycle, i.e. data formats, training phase, building the classifier, rule ranking, prediction, etc. This may enable other researchers to spot possible issues and research directions in this field for further improvement.

The rest of the paper is structured as follows: the AC problem, its solution scheme, the different data representation models and its main advantages and disadvantages are discussed in Sec. 2. Section 3 is devoted to the different learning strategies employed in AC. Rule sorting and its associated procedures are surveyed in Sec. 4, and Sec. 5 highlights the different methods employed to build the classifier and to prune unnecessary rules. Section 6 reviews the different prediction methods in AC and possible new …
Definition 4. A ruleitem r is of the form ⟨antecedent, c⟩, where antecedent is an AttributeValueSet and c ∈ C is a class value.

Definition 5. The actual occurrence (actoccr) of a ruleitem r in D is the number of cases in D that match r's antecedent.

Definition 6. The support (supp) of a ruleitem r is the number of cases in D that match r's antecedent and belong to the class c.

Definition 7. A ruleitem r passes the minsupp if supp(r)/|D| ≥ minsupp. Such a ruleitem is said to be a frequent ruleitem.

Definition 8. The ruleitem's confidence is the frequency of the attribute value together with its related class in the training data set, divided by the frequency of that attribute value in the training data. So a ruleitem r passes the minconf if supp(r)/actoccr(r) ≥ minconf.

Definition 9. A rule is represented as: antecedent → c, where antecedent is an AttributeValueSet and the consequent is a class.
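To make Definitions 5-8 concrete, the following is a minimal Python sketch, not taken from any of the surveyed implementations, that computes actoccr, support and confidence of a ruleitem over a toy training set; the data set, attribute names and helper names are illustrative only.

```python
# Toy training data in horizontal format: (attribute-value dict, class) pairs.
# This is an illustrative data set, not Table 1 from the paper.
D = [
    ({"Att1": "a1", "Att2": "b1"}, "c2"),
    ({"Att1": "a1", "Att2": "b1"}, "c2"),
    ({"Att1": "a1", "Att2": "b2"}, "c1"),
    ({"Att1": "a2", "Att2": "b2"}, "c1"),
    ({"Att1": "a2", "Att2": "b1"}, "c2"),
]

def actoccr(antecedent, data):
    """Definition 5: number of cases whose attribute values match the antecedent."""
    return sum(all(case.get(a) == v for a, v in antecedent.items()) for case, _ in data)

def supp(antecedent, c, data):
    """Definition 6: number of cases matching the antecedent that also belong to class c."""
    return sum(all(case.get(a) == v for a, v in antecedent.items()) and cls == c
               for case, cls in data)

def is_frequent(antecedent, c, data, minsupp):
    """Definition 7: supp(r)/|D| >= minsupp."""
    return supp(antecedent, c, data) / len(data) >= minsupp

def confidence(antecedent, c, data):
    """Definition 8: supp(r)/actoccr(r)."""
    return supp(antecedent, c, data) / actoccr(antecedent, data)

r = ({"Att1": "a1"}, "c2")                 # ruleitem <{(Att1, a1)}, c2>
print(supp(*r, D), confidence(*r, D))      # -> 2 and 0.666...
```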
2.2. Solution strategy

As mentioned earlier, the majority of AC algorithms operate in three steps: step one involves rule discovery and production; in step two, a classifier is built from the rules discovered in step one; and lastly, the classifier is evaluated on test cases in step three. To explain the discovery of rules and the building of the classifier, consider the training data set shown in Table 1, which represents two attributes (Att1, Att2) and the class attribute (Class). Assume that the minsupp and minconf have been set to 30% and 50%, respectively. A typical AC algorithm such as MCAR (Thabtah et al., 2005) firstly discovers all frequent ruleitems which hold enough support (Table 2). Once all frequent ruleitems are found, MCAR transforms the subset of them which hold enough confidence into candidate rules. The rows within Table 2 whose confidence passes the minconf are the candidate rules, and from those the classifier is derived. A rule is considered part of the classifier if it covers a certain number of cases in the training data set. So, a subset of the discovered candidate rules is chosen to form the classifier, which in turn is evaluated against an independent data set to obtain its effectiveness.

Table 2. Frequent ruleitems discovered from Table 1.

Frequent attribute value | Support (%) | Confidence (%)
⟨a1⟩, c2 | 40 | 57.10
⟨a1⟩, c1 | 30 | 42.85
⟨b1⟩, c2 | 30 | 60
⟨b2⟩, c1 | 40 | 80
⟨a1, b1⟩, c2 | 30 | 100
⟨a1, b2⟩, c1 | 30 | 75

2.3. Advantages of AC approach

AC is a data mining research topic that has been extensively studied in the last decade and applied in different application domains including text categorisation (Abumansour et al., 2010), bioinformatics (Clare and King, 2001), website security (Ye et al., 2008) and others. The high applicability of this classification approach is mainly due to several advantages it offers, such as the simplicity of the output, the high predictive accuracy of the classifier and the end-user maintenance of the classifier, where rules can be easily sorted, added and removed. In this section,
we shed the light on the main advantages and disadvantages of AC mining and highlight its main differences with rule-based classification approaches such as rule induction, covering and …

Table 3(a). The general differences between AC and association rule mining.

… usually happens when the minsupp is set to a very small value or the input data set is highly correlated. Another important advantage of AC is the simple …
… it looks for the rule that has the highest expected accuracy and produces it, and continues discovering rules until that subset becomes empty. The rules derived in this way are considered local since they were derived from subsets of the training data set and not the whole set, and the learning strategy is indeed greedy since the algorithm is searching for the rule with the largest expected accuracy after testing all attribute values in a certain subset. On the contrary, AC explores the complete training data set once, aiming to build a global classifier (Thabtah et al., 2004). Precisely, it finds the set of CARs from the complete training data set.

Moreover, other classification approaches such as rule induction also derive local classifiers. The derived rules are local since, when a rule is found, all cases associated with it in the training data set are removed and the process continues until a stopping condition is met, e.g. the rule discovered has an unacceptable error rate (Thabtah et al., 2005). Moreover, searching for rules in these algorithms is exhaustive since, for instance, "Incremental Reduced Error Pruning" (IREP) chooses the rules based on Foil-gain (Quinlan and Cameron-Jones, 1993). In other words, the rule with the highest Foil-gain has a higher rank in the final classifier. Unlike covering and rule induction approaches in classification that require exhaustive search to build local classifiers, AC searches the whole training data set aiming to build a global classifier.

Lastly, decision trees such as C4.5 (Quinlan, 1993) and C5 (Quinlan, 1998) derive the classifier as a tree where each path from the root to a leaf represents a rule. In this context, one cannot add or update the tree without having a large impact on the nodes and leaves within it. Alternatively, if the end-user wishes to insert a new rule into a classifier produced by an AC algorithm, he can do that in a straightforward manner without affecting the rule set, whereas if the same process is applied on a decision tree, this necessitates reshaping the complete tree to reflect the changes that happened. Table 3(b) depicts the general differences between AC and other rule based classification approaches with reference to learning methodologies, classifier output format and other criteria.

2.4. Data representation in AC

2.4.1. Horizontal versus vertical

Before the dissemination of the MMAC algorithm (Thabtah et al., 2004), there was only one data representation in AC, adopted from association rule mining and called horizontal (Liu et al., 1998). In the horizontal data format, the training data set consists of a number of cases or rows in which each row has a number followed by the list of attribute values. Table 1, displayed earlier, is an example of the horizontal data format. The authors of MMAC introduced the vertical data format in AC, where the training data set gets converted into a table similar to Table 4, in which each attribute value is represented by its locations (row numbers) in the training data set. This representation is highly effective, particularly in computing the support for each attribute value. Therefore, contrary to the horizontal data format, which is often associated with computational costs such as the time required for merging disjoint ruleitems and ruleitem support calculation, the discovery of frequent ruleitems in the vertical data format is accomplished by simple intersections of the disjoint attribute values' locations.
Table 3(b). Certain general differences for some rule-based classification approaches.

Approach | Learning strategy | Rule ranking | Pruning | Output
AC | Association rule mining | Confidence, support, rules generated first (rule length) | Database coverage, lazy pruning | Rules
Decision tree | Entropy & information gain | No ranking | Backward and forward pruning, e.g. pessimistic error | Trees/Rules
Covering | Exhaustive search based greedy | No ranking | Some algorithms use rule's accuracy and others no pruning | Rules
Rule induction | Greedy | Certain mathematical measure, e.g. (foil-gain) | Reduced error pruning, incremental REP | Rules
Table 4. Vertical data layout of the training data: each attribute value (and class) is listed with the row numbers where it occurs, e.g. (Attr1, a1) → {1, 2, 4, 6, 8, 9, 10} and (Attr2, b1) → {1, 2, 3, 5, 6}.
For example, the determination of frequent 2-ruleitems is based on intersecting disjoint frequent 1-ruleitem locations. So, for the candidate 2-ruleitem ⟨(Attr1, a1), (Attr2, b1), c2⟩ in Table 4, its frequency is determined by intersecting the locations of the ruleitems ⟨(Attr1, a1)⟩ and ⟨(Attr2, b1)⟩. In other words, the set (1, 2, 4, 6, 8, 9, 10) is intersected with the set (1, 2, 3, 5, 6), and the result of the intersection (1, 2, 6) denotes the row numbers in the training data in which the new candidate ruleitem ⟨(Attr1, a1), (Attr2, b1), c2⟩ has appeared. Then, by locating the row numbers of the class c2, we simply find that the size of this candidate 2-ruleitem's location set, i.e. 3, denotes its support count. If the support count is larger than the minsupp, then this candidate 2-ruleitem becomes frequent; otherwise it is discarded.

… RowId: the line number (row id) of the first occurrence of an item in the original data set. Once the original data is represented in ItemId format, all intermediate data generated in the algorithm keeps the same representation. This makes the iterative …
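Returning to the intersection example in Sec. 2.4.1, the short sketch below shows how the vertical (tid-list) format turns 2-ruleitem support counting into simple set intersections. The location sets for (Attr1, a1) and (Attr2, b1) are the ones quoted above; the class tid-list is a hypothetical completion, since the full Table 4 is not reproduced here.

```python
rows_a1 = {1, 2, 4, 6, 8, 9, 10}   # locations of (Attr1, a1) from the example above
rows_b1 = {1, 2, 3, 5, 6}          # locations of (Attr2, b1)
rows_c2 = {1, 2, 3, 6, 7}          # hypothetical locations of class c2 (must contain 1, 2, 6)

candidate_rows = rows_a1 & rows_b1               # -> {1, 2, 6}: rows where both values co-occur
support_count = len(candidate_rows & rows_c2)    # -> 3: rows that also carry class c2

minsupp_count = 3
is_frequent = support_count >= minsupp_count     # the 2-ruleitem <(a1, b1), c2> is frequent
```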
Table 5.1. Initial data in line space.

Line:Label | Attributes
0:0 | (0)0 (1)0 (2)0
1:0 | (0)1 (1)0 (2)0
2:2 | (0)1 (1)2 (2)0
3:3 | (0)1 (1)2 (2)0
4:2 | (0)0 (1)0 (2)4
5:3 | (0)0 (1)2 (2)4
6:3 | (0)1 (1)2 (2)4
7:3 | (0)1 (1)0 (2)7
8:3 | (0)0 (1)0 (2)4

Table 5.2. Initial data in item space.

Attribute | Line:Label

3.1. CBA based approaches

Apriori is an association rule discovery algorithm that was proposed by Agrawal and Srikant (1994); its name is based on the fact that it uses prior knowledge of frequent itemsets. A frequent itemset is an itemset that has a frequency in the input database above the user minsupp threshold. The complete set of frequent itemsets is utilised to produce the association rules; more precisely, any frequent itemset in the form X → Y that holds enough confidence becomes a rule. In Apriori, the discovery of frequent itemsets is implemented in a level-wise fashion, where in each iteration a complete database scan is compulsory to generate the new candidate itemsets from the frequent itemsets already found in the previous iteration. Apriori uses the "downward-closure" property to minimise the number of candidate itemsets joined, which usually consumes time and memory resources.
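The level-wise search and the downward-closure pruning described above can be sketched as follows; this is a generic Apriori-style outline in Python, not the code of CBA or any of the cited systems.

```python
from itertools import combinations

def apriori(transactions, minsupp_count):
    """Level-wise frequent itemset mining with downward-closure candidate pruning."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: frequent single items (one database scan).
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= minsupp_count}
    level, k = set(frequent), 2
    while level:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        next_level = set()
        for cand in candidates:                      # one full scan per level
            c = sum(1 for t in transactions if cand <= t)
            if c >= minsupp_count:
                next_level.add(cand)
                frequent[cand] = c
        level, k = next_level, k + 1
    return frequent
```

In a CBA-style AC setting, each transaction would hold the attribute values of a training case together with its class, and frequent itemsets that contain a class value become candidate CARs once they also pass the minconf.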
Lastly, the MAC algorithm (Abdelhamid et al., 2012a) has enhanced both the pruning and prediction phases of CBA and added one tie-breaking condition in the rule ranking.

Currently, there are several AC algorithms that use CBA's style during the learning step to find frequent ruleitems and generate the CARs, including CBA (2) (Liu et al., 2001), ARC-BC (Antonie and Zaïane, 2002), NegativeRules (Antonie and Zaïane, 2004), lazy associative (Baralis et al., 2004), CAAR (Xu et al., 2004), Entropy associative (Su et al., 2008) and ACN (Kundu et al., 2007, 2008). These algorithms have improved upon CBA in one or more of its main steps including rule learning, sorting, pruning or prediction. For example, ARC-BC has been applied on unstructured textual data collections, and …

… closed itemsets, instead of having to enumerate many possible subsets.

A few years ago, Li et al. (2008) extended Charm to handle classification benchmarks in an AC algorithm called ACCF. In particular, ACCF employed the concept of closed itemsets from Charm to cut down the number of CARs produced so that decision makers can control the classifier and edit the rules. Experimental results against 18 different data sets from the UCI data repository (Merz and Murphy, 1996) showed that ACCF produced slightly better classifiers with respect to accuracy as well as size than CBA.

3.3. Combinatorial mathematics

One recent AC approach for mining CARs which is based …
3.4. Imbalanced class distribution based approach

The classes in some classification data sets are unevenly distributed. This may result in the production of a very small number of rules, and in some cases no rules at all, for the low frequency class and numerous rules for the high frequency class(es) (Arunasalam and Chawla, 2006). This problem normally happens because of the minsupp threshold, which controls the rule discovery step: if we set it to a value larger than a certain class frequency, there will be no rules representing that class in the classifier, and several strong rules will simply be ignored during the rule discovery step. Therefore, researchers have investigated the possibility of utilising multiple supports (Liu et al., 2001; Baralis et al., 2004) or other measures such as Complement Class Support (CCS) (Arunasalam and Chawla, 2006) that may overcome the class imbalance issues in classification benchmarks.

One possible solution to the class imbalance problem is to prevent the minsupp threshold from taking any role in the rule generation and to use new measures such as CCS, which primarily takes into account positively correlated rules, as shown in the equation below:

CCS(R: A → C) = Support(A ∪ ¬C) / Support(¬C),   (1)

where A is the conjunction of the attribute values in R's body and ¬C represents the complement of class C. The learning approach of Arunasalam and Chawla (2006) only looks for strong correlation between the rule antecedent (rule body) and consequent (class), meaning rules that have low CCS are produced and other rules with high CCS are discarded. Experimentation against eight data sets from the UCI repository showed that the CCS based algorithm slightly outperformed CBA with respect to one error rate measure, and the results also revealed that the CCS based algorithm performed well on imbalanced data sets when it comes to predictive accuracy.
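The following small function illustrates Eq. (1) on the same (attribute-value dict, class) case format used in the earlier sketches; it is an illustration of the measure, not Arunasalam and Chawla's implementation.

```python
def ccs(antecedent, c, data):
    """Complement Class Support, Eq. (1): Support(A U not-C) / Support(not-C)."""
    complement = [case for case, cls in data if cls != c]     # cases of the complement class
    if not complement:
        return 0.0
    covered = sum(all(case.get(a) == v for a, v in antecedent.items())
                  for case in complement)                      # complement cases matched by A
    return covered / len(complement)

# A rule A -> C with a low CCS value rarely fires on the complement of C, i.e. its body
# is positively correlated with C, so it is kept; high-CCS rules are discarded.
```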
Another possible solution to the class imbalance problem is the enhancement performed on the CBA algorithm by Liu et al. (2001), which considers the frequency of class labels in the input data set and assigns each class a different support value. In other words, the original minsupp value is distributed to each class according to the class frequency in the input data set. So, a low frequency class gets a low minsupp to guarantee the production of rules for it. An evaluation study against 34 data sets from the UCI repository showed that, on average, the error rate of CBA (2) is lower than that of the CBA and C4.5 algorithms.

Baralis et al. (2004) proposed a related multiple supports approach that looks at the current rules generated for all class labels in iteration N in order to amend the support value for class labels that have no rule representation by lowering their support. Therefore, ensuring that rules appear for most of the class labels in the training data set is a must.

3.5. TID-list intersection based approach

To reduce the number of passes over the input database in horizontal mining algorithms, the Eclat algorithm was presented by Zaki et al. (1997). It requires only a single database scan, addressing the question of whether all frequent itemsets can be derived in a single pass. Eclat introduced the concept of vertical database representation in association rule mining (Table 4), where frequent itemsets are obtained by applying simple tid-list intersections, without the need for complex data structures. A tid-list of an item is the set of locations (row numbers) in which this item has appeared in the training data set. In 2003, a variation of the Eclat algorithm, called dEclat, was proposed by Zaki and Gouda (2003). The dEclat algorithm uses a newer layout called diffset, which stores the differences in the transaction identifiers (tids) of a candidate itemset from its generating frequent itemsets. This considerably reduces the size of the memory required to store the tids. The diffset approach avoids storing the complete tids of each itemset; rather, the difference between the class and its member itemsets is stored. Two itemsets share the same class if they share a common prefix. A class represents the items that the prefix can be extended with to obtain a new class. For instance, for a class of itemsets with prefix x, [x] = {a1, a2, a3, a4}, one can perform the intersection of xai with all xaj with j > i to get the new classes. From [x], we can obtain the classes [xa1] = {a2, a3, a4}, [xa2] = {a3, a4} and [xa3] = {a4}.

In AC mining, the MCAR (Thabtah et al., 2005) and MMAC (Thabtah et al., 2004) algorithms modified the tid-list intersection learning used in association rule mining to handle classification benchmarks. We will explain the learning strategy of MMAC in the multi-label classification section (Sec. 3.9) since it is a multiple label algorithm. MCAR consists of two main phases: rule generation and classifier building. In the first phase, the training data set is scanned once to discover frequent 1-ruleitems, and then MCAR combines the ruleitems generated to produce candidate ruleitems involving more attributes. Any ruleitem with support and confidence larger than minsupp and minconf, respectively, is created as a candidate rule. In the
second phase, the rules created are used to build a classifier by considering their effectiveness on the training data set.

The frequent ruleitem discovery method of MCAR scans the training data set to count the frequencies of 1-ruleitems, from which it determines those that hold enough support. During the scan, frequent 1-ruleitems are determined, and their occurrences in the training data (rowIds) are stored inside an array in a vertical format along with the classes and their frequencies, and any ruleitem that fails to pass the support threshold is discarded. MCAR finds frequent ruleitems of size t by appending disjoint frequent itemsets of size t − 1 and intersecting their rowIds in the training data set. The result of this simple intersection gives a set of rowIds where both itemsets occur together in the training data. This set, along with the class array holding the class label frequencies derived during the first scan, can be used to compute the support and confidence of the new ruleitem resulting from the intersection. Experimentation on real scheduling data collections as well as the UCI data repository showed that MCAR outperformed CBA and other classic classification algorithms such as RIPPER and C4.5 with respect to accuracy.

In Tang and Liao (2007), a vertical AC algorithm called CACA was proposed. It scans the training data set, stores the data in a vertical data format like MCAR, counts the frequency of every attribute value and arranges the attributes in descending order according to their frequencies. Any attribute which fails to satisfy the minsupp is removed in this step. For the remaining attribute values, CACA intersects the attributes' locations to cut down the search space of frequent patterns. For each attribute value in a class group that passes the minconf, a rule is inserted into an Ordered Rule Tree (OR-Tree) as a path from the root node, and its support, confidence and class are stored at the last node in the path. Limited experimental results suggested that CACA performs better with reference to accuracy and computation time than MCAR on a sample of the UCI data sets.

3.6. Causal and EP approach

The majority of AC algorithms employ minsupp and minconf, which are mainly statistical correlation parameters, to discover the rules. The minsupp is used to capture frequent attribute values (items) and the minconf is employed to identify the strong rules from the set of frequent attribute values. A different AC approach based on the idea of causality and EP has been proposed by Yu et al. (2011) and Dong et al. (1999). Most of the current AC algorithms determine the correlation between rule antecedent (attribute value) and consequent (class) based on support and confidence parameters. However, correlation is not a causal relationship; it only reveals a statistical association between a set of objects in an implication, e.g. X → Y. If we discover a causal relationship between the rule antecedent and consequent, one can reveal consequential factors with reference to class labels in the data set. Therefore, unlike current AC algorithms which produce a large search space for frequent ruleitems during the rule discovery, the use of causality and EP in AC mining can minimise the search space of the candidate ruleitems by only keeping ruleitems that have a causal impact on the class (Yu et al., 2011). In other words, when CARs are discovered, the only attribute values considered in the CARs are those that belong to this causal attribute value space instead of the combinations of all attribute values. This significantly minimises the demand on resources, including training time and memory, in the rule discovery step.

The first algorithm which employed EP was proposed by Dong et al. (1999) and is called Classification based on Aggregating Emerging Patterns (CAEP). An EP is an attribute value which has a support that changes from one data set to another, with a change rate larger than a constant ρ. The support ratio between two data sets for a given attribute value is called the growth-rate, which can be computed as follows:

GrowthRate(att) = Support_d'(att) / Support_d(att),   (2)

where att is the attribute value, and d and d' are the data sets between which the attribute value's support has changed. Given a minsupp threshold and a growth-rate, the algorithm finds EPs that survive ρ, also known as ρ-attribute values. In mining EPs, the input data set is first divided into parts based on the class labels, and a production of all ρ-attribute values from one part to another is implemented (Dong et al., 1999).

Experimental studies (Dong et al., 1999) showed that EP-based AC algorithms generate competitive classifiers with respect to classification rate if compared to CBA, CMAR, CPAR and C4.5.
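A minimal sketch of the growth-rate test of Eq. (2) is given below; the case representation (sets of attribute values) and the threshold name rho are illustrative, not taken from CAEP's implementation.

```python
def growth_rate(att, d, d_prime):
    """Eq. (2): Support_{d'}(att) / Support_d(att) for a single attribute value."""
    supp_d = sum(att in case for case in d) / len(d)
    supp_d_prime = sum(att in case for case in d_prime) / len(d_prime)
    if supp_d == 0:
        return float("inf") if supp_d_prime > 0 else 0.0
    return supp_d_prime / supp_d

def emerging_patterns(attribute_values, d, d_prime, rho):
    """Attribute values whose support grows from d to d' by more than the constant rho."""
    return [att for att in attribute_values if growth_rate(att, d, d_prime) > rho]

# In EP mining the training data is first split into one part per class label, and the
# growth rate is evaluated between the parts, as described above.
```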
3.7. CMAR and lazy based approaches

Han et al. (2000) presented an association rule discovery method called Frequent Pattern Growth (FP-Growth) that converts the transactional database into a condensed frequent pattern tree (FP-tree) in which each transaction corresponds to one path in the tree containing the frequent items in that transaction. Therefore, the new representation of the input database (FP-tree) can be seen as
practical since the frequent itemsets in each transaction are known by the tree, and the FP-tree is usually smaller in size than the complete input database because of the item sharing among frequent itemsets. In addition, the number of iterations over the input database necessary to build the FP-tree is just two, rather than N as in Apriori, where N equals the size of the largest frequent itemset. Once the algorithm constructs the FP-tree, a pattern growth heuristic kicks in to produce the rules from the FP-tree. For each frequent pattern X, the heuristic uses links in the tree to derive other available patterns co-occurring with X, and then the FP-growth algorithm concatenates X with the other patterns extracted from the FP-tree.

In AC mining, a modified version of FP-growth has been successfully implemented by a number of algorithms, including the Malware detection AC (Ye et al., 2008), L3G …

In general, most of the AC algorithms that employ the CMAR learning strategy take the common attribute values contained in the rules into consideration. This indeed reduces the memory usage as well as the search time for frequent ruleitems if compared with CBA-like algorithms such as CBA, CBA(2) and LCA. Experimental studies (Li et al., 2001; Ye et al., 2008) on the UCI data repository and a Malware security data collection demonstrated that CMAR-like algorithms produce higher quality classifiers than CBA-based algorithms and they may save more memory storage (Li et al., 2001; Thabtah and Cowling, 2007). Nevertheless, one major deficiency of CMAR-like algorithms is that the CR-tree may not fit in the main memory in cases when the input data is dense and huge in size.
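The FP-tree construction step described in Sec. 3.7 (two database scans, one path per transaction, shared prefixes) can be sketched as follows; the class and method names are illustrative, and this is only the tree-building phase, not the FP-growth mining itself.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # attribute value (or item) stored at this node
        self.count = 0          # number of transactions sharing this prefix
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(transactions, minsupp_count):
    # Scan 1: count item frequencies and keep only the frequent ones.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    frequent = {i for i, c in freq.items() if c >= minsupp_count}

    # Scan 2: insert every transaction as one path; items are ordered by descending
    # frequency so that transactions sharing frequent items also share tree prefixes.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root
```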
… and |N| negative training cases that correspond to R: (A1, v1) → c.

Foil-gain(A1, v1) = |P| ( log(|P| / (|P| + |N|)) − log(|P'| / (|P'| + |N'|)) ).   (3)

It is clear that FOIL always looks for the attribute value with the largest FOIL-gain in order to add it into the rule. Though, there could be more than one attribute value with a similar FOIL-gain, which makes the selection of just one attribute value questionable. This can also lead to deterioration in the classification accuracy during the prediction step, since a limited number of rules are often extracted by FOIL. Another problem associated with the FOIL learning fashion is that the rules are derived from parts of the training data set and not from the complete set, which makes them local rules and not global ones.
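A sketch of Eq. (3) in Python is shown below, where the "after" counts refer to the rule body extended with (A1, v1) and the "before" counts to the body without it; this follows the standard FOIL/CPAR formulation rather than any particular implementation.

```python
from math import log2

def foil_gain(p_after, n_after, p_before, n_before):
    """Eq. (3): |P| * ( log(|P|/(|P|+|N|)) - log(|P'|/(|P'|+|N'|)) ), with P, N the
    positive/negative cases covered after adding (A1, v1) and P', N' those before."""
    if p_after == 0:
        return 0.0
    return p_after * (log2(p_after / (p_after + n_after))
                      - log2(p_before / (p_before + n_before)))

# Example: a literal that narrows coverage from 10 positive / 10 negative cases
# down to 6 positive / 1 negative has foil_gain(6, 1, 10, 10) ~ 4.67.
```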
In 2003, Yin and Han (2003) proposed an AC algorithm called CPAR that enhances FOIL rule learning: once a rule such as R is constructed, CPAR does not discard the positive cases associated with R; instead, the weights of these cases are lowered by a multiplying factor. This enhancement guarantees the production of more rules, as a training case is allowed to be covered by multiple rules instead of a single one, and consequently the classification accuracy is improved as well. Moreover, CPAR finds all attribute values with the largest FOIL-gain rather than just one as in FOIL, so it can add multiple attribute values into the rules and thus build rules simultaneously.

Furthermore, the search for the attribute value with the largest FOIL-gain can be exhaustive and requires storage resources (e.g. main memory), especially when the number of available attributes in the training data set is large. In this context, CPAR employs an efficient data structure to keep all necessary data about the rule, such as the positive and the negative cases, before adding the attribute value into the rule antecedent and after adding it into the rule. It has been shown that CPAR is highly competitive with reference to predictive accuracy with other AC algorithms such as CBA and traditional classification algorithms such as RIPPER and C4.5 against the UCI data collection.

The different steps in AC mining have been studied by Chen et al. (2005) in order to come up with a new algorithm that can take advantage of the previous studies. The outcome was an algorithm that learns the rules using the FOIL-gain measure, and then discards detailed rules and weakly correlated rules similarly to the CMAR algorithm with minor modifications. Evaluation using ten UCI data sets and known AC algorithms including CBA, CMAR and CPAR showed that Chen et al.'s (2005) algorithm is competitive with these algorithms and, in particular, it slightly outperformed CMAR and CBA on the considered data sets.

3.9. Repetitive learning and multiple labels approach

The majority of current AC algorithms extract single label classifiers in which the consequent of the rules contains only one class (Taiwiah and Sheng, 2013). In the search for rules in the training data set, these algorithms only consider the largest frequency class associated with the attribute value and produce it in the potential rule consequent. However, an attribute value may be associated with multiple class labels with similar frequencies, making the extraction of just one class in the rule highly undesirable and questionable. This is because these class labels comprise important and useful knowledge for the decision maker, and producing all of them is a definite advantage.

The first AC algorithm that considers the production of multiple labels in the rule consequent is MMAC (Thabtah et al., 2004). This algorithm proposed a recursive learning phase that combines local classifiers derived during a number of iterations into a multiple label global classifier. For a given training data set T, MMAC operates similarly to the MCAR algorithm in the training step and extracts the first single label classifier in iteration one. Then all training cases associated with the derived rules are discarded, and the remaining unclassified cases in the original training data set comprise a new data set T1. In the next iteration, the algorithm finds all rules from T1, builds another single label classifier, removes all cases in T1 which are associated with the generated rules, and so forth. The result is n classifiers, which MMAC merges to form a multi-label classifier. One distinguishing feature of MMAC, besides discovering additional knowledge often missed by other AC approaches, is that it can extract multi-label classifiers not only from multiple label data sets but also from single label ones.

A closely related multi-label AC algorithm called Ranked Multilabel Rule (RMR) (Thabtah and Cowling, 2007) solved the problem of rule overlapping and class ranking. This algorithm proposed a post training heuristic that adjusts the position of the class labels in each of the rules inside the classifier. More details on this algorithm are given in Sec. 3.10. Another multi-label AC classification algorithm, called Correlated Lazy Associative Classifier (CLAC) (Veloso et al., 2007), adopts lazy classification and delays the reasoning process until a
test case is given. Similar to MMAC and RMR, CLAC allows the presence of multiple classes in the consequent of the rules. Unlike binary classification, which does not consider the correlation among classes, CLAC takes into account class relationships and the overlapping of the training data with these classes. The learning strategy used by CLAC assigns a weight consisting of the confidence and support values of the rule(s) having the class and belonging to the test case; then the class labels applicable to the test case are sorted by their weights. CLAC then gives the test case the class with the largest weight, considers the test case a new feature, and iteratively assigns new class labels to the test case until no more labels can be found. Furthermore, this learning method deals with small disjuncts (rules that cover a limited number of training data cases), whose removal may reduce classification accuracy according to Veloso et al. (2007).

Empirical evaluations (Thabtah et al., 2004; Thabtah and Cowling, 2007; Veloso et al., 2011) revealed that multi-label AC algorithms construct additional useful rules that improve the classification accuracy of the resulting classifiers if compared with single label AC algorithms such as CBA, CPAR and MCAR.

3.10. Semi incremental and post training approaches

The majority of AC algorithms use the classification rules discovered from the training data set to construct the classifier, which in turn is applied to predict the class of unseen test data. Though, in circumstances where there are limited input data or the input data gets frequently updated, there should be a mechanism that can take into consideration (1) the new update(s) on the source data and (2) the classified resources (rules and the test data). Moreover, the problem of correlation between the class and the training cases may result in generating rules associated with the wrong class, since these rules overlap in the training cases. Precisely, the rule discovery strategies employed by current AC algorithms are normally adopted from association rule mining, in which these algorithms allow a training case to be covered by multiple rules. So when a rule is derived, other potential lower ranked rules may still be able to cover the derived rule's training cases, and thus the classes associated with many rules learned during the learning step are not the most accurate ones.

Consider Table 6, which contains two attributes and the class attribute. Assume that r1: a ∧ b → c1 and r2: b → c1 are generated from Table 6, and r1 has a higher rank than r2. In a current AC algorithm such as CBA, when r1 is generated, its training cases will be deleted, i.e. rows (1, 2, 3). The deletion of r1's training cases impacts other candidate rules that share these cases, such as r2. Therefore, after r1 is inserted into the classifier, class c1 of rule r2 would not be the largest frequency class anymore, since some training cases of r2 are removed when r1 is produced. In fact, when r1 is derived, a new class of r2 becomes the largest frequency class, e.g. c2, because it has the largest representation among the remaining r2 rows in the training data set. This rule overlapping problem is called the "fittest class problem" (Thabtah and Cowling, 2007).

Table 6. Partial training data adopted from Thabtah and Cowling (2007).

Row Id | Att1 | Att2 | Class
1 | a | b | c1
2 | a | b | c1
3 | a | b | c1
4 | e | b | c1
5 | d | b | c1
6 | — | b | c2
7 | — | b | c2
8 | — | b | c2
9 | e | f | c3
… | — | — | —

The RMR algorithm proposed a post training heuristic that adjusts the position (rank) of the class labels in the rules, taking into consideration the rules' overlapping in the training cases. This heuristic operates as follows: starting with the top ranked rule, it iterates over the training data set removing all training cases applicable to the rule. Then, the support and confidence of the lower ranked rules decrease, since they share training examples with the selected rule. This may result in adjusting the class label position(s) in the lower ranked rules, and the largest frequency class for some of these rules may not be the fittest class any more. The process is repeated until all training data cases are removed or the algorithm has iterated over all rules. This post training process is similar to the covering approach in classification in that it allows a training case to be covered by just a single rule in the classifier, solving an important deficiency inherited from association rule mining to AC, which allows a training case to be covered by multiple rules.
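The post training heuristic described above can be paraphrased in code as follows; the rule representation and helper logic are illustrative assumptions, not RMR's actual implementation.

```python
def post_training_adjust(ranked_rules, training_data):
    """Re-rank each rule's class labels using only the training cases that are still
    uncovered when the rule's turn comes, then remove the cases it covers (so every
    case ends up covered by a single rule, as in the covering approach)."""
    remaining = list(training_data)              # (attribute-value dict, class) pairs
    for rule in ranked_rules:                    # rules assumed sorted from highest rank
        if not remaining:
            break
        covered = [case for case in remaining
                   if all(case[0].get(a) == v for a, v in rule["antecedent"].items())]
        counts = {}
        for _, cls in covered:
            counts[cls] = counts.get(cls, 0) + 1
        if counts:                               # the fittest class on the remaining data
            rule["classes"] = sorted(counts, key=counts.get, reverse=True)
        remaining = [case for case in remaining if case not in covered]
    return ranked_rules
```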
Moreover, Wang et al. (2011) proposed an AC algorithm called Adapting Associative Classification (ADA) that constructs rules from both the input training data set and the classified resources, such as the training data set, the current classification rules and the test cases. This means the classifier is amended on the fly after the classified resources reach a certain amount. The authors used a co-training method (Mei et al., 2006) to accomplish the task
of updating the classifier by refining the newly discovered knowledge from the existing classification rules. The co-training method used in ADA has been adopted from the semi-supervised learning of pattern context, where the labelled training documents are exercised to figure out the class labels of the unlabelled cases. More details can be found in Mei et al. (2006). Overall, ADA can be considered a semi-incremental AC algorithm since only a few training cases or a user's set of frequent patterns (keywords) are necessary to build the classifier instead of the complete training cases. Then, the classified cases as well as the classification rules are employed to update the classifier by adding or removing rules.

An empirical study (Thabtah and Cowling, 2007) on multi-class and multi-label data sets from the UCI repository as well as scheduling data showed that removing the overlapping among the rules in the classifier by the RMR algorithm outperformed the MMAC algorithm with respect to classification accuracy. Moreover, limited experimentations on four data sets from the UCI data repository have been performed using the ADA, CBA, CMAR and C4.5 algorithms by Wang et al. (2011). The results showed similarity in the classification accuracy performance of the AC algorithms and superiority over the decision tree approach (C4.5).

3.11. Distributed MapReduce approach

MapReduce is an emerging model, yet not much research on simulating the performance of MapReduce clusters has been done. To the best of our knowledge, MRPerf (Wang et al., 2009) and Mumak (Apache JIRA, 2009) are the only simulators targeting the MapReduce framework. Recently, MapReduce has been adopted by many search enterprises such as Yahoo, Google and Amazon to enable building petabyte data centres comprising hundreds of thousands of nodes. These data centres are built of low cost hardware with a software infrastructure that allows parallel processing analysis of the stored data. The MapReduce model provides a software infrastructure to simplify writing applications that can access and process this massive data. However, the cluster setup to get optimum performance is not a trivial problem. It needs configuration of tens of setup parameters and dynamic job parameters which affect every task execution.

The MapReduce programming paradigm has recently been employed in data mining research because of its ability to perform parallel processing, particularly during the learning step and when the input data size is massive. For instance, Zhao et al. (2009) implemented the known clustering algorithm K-means utilising the MapReduce paradigm. The results showed that the MapReduce implementation of K-means reduces the runtime of the algorithm by 30%. Dhok and Varma (2010) developed a scheduler algorithm that uses pattern classification for the task assignment in the MapReduce framework. The developed scheduling algorithm was able to cut down the response time of some workloads by a considerable amount as compared to the original scheduler. The decision tree C4.5 classification data mining algorithm (Wu et al., 2009) was implemented using the MapReduce framework to enforce parallel and distributed classification. After experimentation, the results revealed that an increase in the number of nodes positively impacts the classification modelling.

In AC mining, a new algorithm called MapReduce Multiclass Classification based Association Rule (MRMCAR), which can be seen as a generalised version of the MCAR algorithm that is distributable on the MapReduce framework, was proposed by Thabtah and Hammoud (2013). It consists of four main steps, where each step may demand one or more MapReduce jobs:

• Step One (Initialising): Representing the input data set in a suitable format for the MapReduce framework, i.e. ItemId = (ColumnId) RowId.
• Step Two (Rule Discovery): This step includes finding frequent ruleitems, rule extraction and rule pruning.
• Step Three (Constructing the classification model): This step involves selecting high confidence and representative rules from the set of candidate rules extracted in Step (2) to represent the classification model.
• Step Four (Predicting test cases): In this step, the MRMCAR algorithm utilises a hybrid method consisting of single and multiple rule prediction methods.

In the learning phase, MRMCAR maps each row in the data set to a unique integer that represents the line number where the row occurs in the data set. Every frequent item id (ItemId) consists of two parts: column ids and RowId, i.e. ItemId = (column ids) RowId. Once the original data is represented in ItemId format, all intermediate data generated in the algorithm keeps the same representation. This makes the iterative process of finding frequent ruleitems simpler throughout the algorithm.

Frequent ruleitem discovery in MRMCAR works by repeating the transformation of the input data between the Line-space and the Frequent-item space until all frequent ruleitems are discovered. Data transformation from a Line-space to a Frequent-space is performed using the MapReduce methods "ToFrequent.Mapper" and "ToFrequent.Reducer". The input for the "ToFrequent.Mapper" method is <line: label, list of ItemId>, and the
output is <ItemId, (Line: label)>, which then gets inputted to the "ToFrequent.Reducer"; this method outputs <ItemId, FrequentItem>. On the other hand, transforming the data from a Frequent-space into a Line-space is performed using the methods "ToLine.Mapper" and "ToLine.Reducer". The "ToLine.Mapper" gets <ItemId, FrequentItem> as an input and produces <Line Number:Label, ItemId> as an output, which in turn gets inputted to the "ToLine.Reducer"; this method collects the ItemId entries for a certain line and outputs <line: label, list of ItemId> (Line-space).

To describe the learning style of MRMCAR, we revisit Table 5 and assume that the last attribute is the class attribute and the minsupp is 2 (support count). The MRMCAR algorithm initially transforms the data into Line-space as shown in Table 5.1, and applies the "ToFrequent.Mapper" and "ToFrequent.Reducer" methods to map the input data to entries in the Frequent-space. In this way, for each item in the Line-space, the "ToFrequent.Mapper" method is invoked to emit a list of <ItemId, (Line, Label)>:

(line 0) <0:0, (0)0, (1)0, (2)0> => ToFrequentItem.Mapper => <(0)0, (0:0)>, <(1)0, (0:0)>, <(2)0, (0:0)>
(line 1) <1:0, (0)1, (1)0, (2)0> => ToFrequentItem.Mapper => <(0)1, (1:0)>, <(1)0, (1:0)>, <(2)0, (1:0)>
... etc.

Then, the output results from the Mapper are sorted and introduced to the Reducer grouped by the key value. For instance, for attribute values (keywords) "a" and "c", the data offered to the Reducer are as follows:

<(0)0, 0:0>, <(0)0, 4:2>, <(0)0, 5:3>, <(0)0, 8:3> => ToFrequentItem.Reducer => <(0)0, [0:0, 4:2, 5:3, 8:3]>

For these particular attribute values, it is obvious that (0)0 and (0)1 are frequent ruleitems with support values 2/9 and 3/9, respectively. It should be noted that in the rule discovery step, while determining the frequent ruleitems, MRMCAR considers the attribute value occurrence with its largest frequency class, and for this reason (0)0 and (0)1 are marked as frequent with class label "3" since they appear in the training data set with it more than with the rest of the class labels (label "3" corresponds to "R" in the original data set). This is the preliminary label choice attached to each ruleitem. Now we have frequent itemsets of size 1 (1-ruleitems).
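The Line-space to Frequent-space pass just illustrated can be simulated with ordinary Python functions standing in for the Mapper and Reducer; the row data are those of Table 5.1, while the function names and the output record layout are assumptions made for this sketch.

```python
from collections import defaultdict

def to_frequent_mapper(line_no, label, item_ids):
    """Emit <ItemId, (line, label)> for every ItemId on one Line-space row."""
    return [(item_id, (line_no, label)) for item_id in item_ids]

def to_frequent_reducer(item_id, occurrences, minsupp_count):
    """Group one key's occurrences by class label; keep the ItemId only if its
    largest-frequency class reaches the support count."""
    by_label = defaultdict(list)
    for line_no, label in occurrences:
        by_label[label].append(line_no)
    best_label, lines = max(by_label.items(), key=lambda kv: len(kv[1]))
    if len(lines) >= minsupp_count:
        return item_id, {"label": best_label, "support": len(lines), "lines": dict(by_label)}
    return None

# Line-space rows of Table 5.1: (line, label, ItemIds of that row).
rows = [(0, 0, ["(0)0", "(1)0", "(2)0"]), (1, 0, ["(0)1", "(1)0", "(2)0"]),
        (2, 2, ["(0)1", "(1)2", "(2)0"]), (3, 3, ["(0)1", "(1)2", "(2)0"]),
        (4, 2, ["(0)0", "(1)0", "(2)4"]), (5, 3, ["(0)0", "(1)2", "(2)4"]),
        (6, 3, ["(0)1", "(1)2", "(2)4"]), (7, 3, ["(0)1", "(1)0", "(2)7"]),
        (8, 3, ["(0)0", "(1)0", "(2)4"])]

grouped = defaultdict(list)
for line_no, label, item_ids in rows:
    for key, value in to_frequent_mapper(line_no, label, item_ids):
        grouped[key].append(value)

frequent = {}
for key, values in grouped.items():
    result = to_frequent_reducer(key, values, minsupp_count=2)
    if result is not None:
        frequent[result[0]] = result[1]
# e.g. frequent["(0)0"] has label 3 with support 2, and frequent["(0)1"] label 3 with support 3.
```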
As shown previously, in each frequent ruleitem, lines of the same class value are grouped together. Once the frequent ruleitems of size 1 are determined, only their occurrences are transformed back into the Line-space data format using the MapReduce methods "ToLineItem.Mapper" and "ToLineItem.Reducer". So for the ruleitems ⟨"a", r⟩ and ⟨"b", r⟩, which are frequent, their Line-space representations are:

(0)0 {sup=2, conf=0.500, 0:[0] 2:[4] 3:[5, 8]} => ToLineMapper => …

The sample outputs are sorted and grouped by the line number and then offered to the "ToLine.Reducer", which only accumulates the ItemIds and outputs them to Line-space. So the lines would be similar to the previous line set of Table 5.1, excluding any attribute value which was discarded during the generation of frequent ruleitems. If no ItemIds were emitted for a certain line, then this line is dropped from the Line-space. In the next iteration, the algorithm simply finds frequent ruleitems of size N by appending frequent ruleitems of size N − 1. Particularly, for each two disjoint ItemIds in a single line within the Line-space, the algorithm checks the possibility of joining them into one ItemId.

3.12. Genetic algorithm (GA) approach

When the training data set contains numerical attributes or the application domain produces continuous data type attributes, AC algorithms tend to preprocess the input data using discretisation techniques in order to map the continuous attribute to a set of finite possible values. In addition, most of the current AC algorithms are unable to discover the correlations among the numerical attributes in application data like stock trading or any relevant data with the same features. To be more specific, the data in the stock trading application contain continuous attributes such as quantities and prices for stocks sold over time, and several technical indicators can be discovered from the data to be later used by domain experts in order to discover trading signals (Chien and Chen, 2010). In fact, the technical indicators can be used in the rule antecedent, and selling or buying are the class labels of the rule.

An AC algorithm called GA-ACR that adopts a GA search strategy to build classifiers was proposed by Chien and Chen (2010). GA is a common search strategy in Artificial Intelligence (AI) based on Darwinian natural selection and mutation in biological reproduction. Normally, a GA method starts with an initial population of objects, and it tests the fitness of the objects in the population until a stopping criterion is met. During testing, it performs selection, crossover and mutation operations on the objects. In the GA algorithm, the input is different continuous attributes, some of which are technical indicators (the relevant difference between two items). Then the algorithm discovers the relation sets among the items in the form of a relation ⟨item, operator, item⟩, in which there are three different item types (constant, technical indicator, attribute) and the operators are restricted to (<, >). A conjunction of the relation sets is the rule antecedent.

The GA algorithm cuts down the search space by providing a relation pruning method that indicates which pairs of items can be compared for which attributes in a relation. During the rule discovery, a rule is encoded in a multi-level structure and represented as a chromosome. The first level contains the number of items encoded, and the value of the gene corresponds to the relation type of the item. The algorithm produces the genes for the first level and then the second level and considers discarding irrelevant relations. It should be noted that only the first level genes are applied in the crossover to prevent producing useless rules, though mutation is applied to genes in the first and second levels. All rules produced must pass the minsupp and the minconf thresholds, and are then sorted according to the CBA (Liu et al., 1998) sorting procedure. Limited experimentation on a stock data collection gathered from ten different companies has been carried out with reference to accuracy. The results pointed out that the GA-ACR algorithm outperforms a simple data distribution algorithm. No comparisons of the GA algorithm with other AC algorithms were conducted in order to …
Table 7. Summary of learning approaches in AC.

Learning methodology | Common AC algorithms
CBA (Apriori candidate generation) | CAAR, CBA, negative rules (ARC-AC), CARGBA, CAN, etc.
CMAR (FP-growth approach) | CMAR, L3, L3G
CPAR (greedy) | (Yin and Han, 2003)
Closed itemset (Charm) | ACCF
Emerging patterns | ICEP, ADT
Multiple support | CCS, CBA(2)
MCAR (TId-list intersection approach) | MCAR, MAC, CACA
Multiple labels | MMAC, RMR
Test data training | Calibrated AC, ADA
Distributed MapReduce | MR-ARM, MR-MCAR
Genetic algorithm | GA-ACR

… to be informative if it has a gain above a certain threshold. In this section, we highlight the different rule sorting procedures in AC.

4.1. Confidence, support and rule cardinality procedure

The first rule sorting procedure in AC was introduced by Liu et al. (1998) and is based on the rule's confidence, support and the number of attributes in the rule's antecedent. This procedure is displayed in Fig. 1. Using this rule preference procedure has derived good quality classifiers with respect to accuracy according to some empirical studies (Liu et al., 2001), though the number of rules with similar confidence and support values is still massive. Consider for example two data sets ("Auto" and "Glass") …
These kinds of rules are named specific rules. In fact, lazy algorithms try to hold almost all knowledge discovered, even if redundancy exists among the rules, aiming to maximise the predictive power of the final classifiers. Unlike the CBA rule ranking procedure, the L3 ranking procedure (Fig. 2) mainly prefers specific rules over general ones in order to give the specific rules a higher chance in the prediction step, since they are often more accurate than general rules. In the prediction phase, when the specific rules are unable to assign a class to the test case, then general rules with a smaller number of attributes in their antecedent are considered.
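As a concrete illustration of the ranking criteria discussed so far, the snippet below sorts candidate rules with a CBA-like key (higher confidence, then higher support, then the rule generated first) and an L3-like key that instead prefers the more specific, longer rule; the rule dictionary fields are assumptions made for the sketch, and other algorithms swap in further tie breakers such as class distribution.

```python
rules = [
    {"confidence": 0.80, "support": 0.40, "length": 2, "generated_order": 2},
    {"confidence": 0.80, "support": 0.40, "length": 1, "generated_order": 1},
    {"confidence": 0.90, "support": 0.20, "length": 1, "generated_order": 3},
]

def cba_style_key(rule):
    # Descending confidence, descending support, then the earlier-generated rule wins.
    return (-rule["confidence"], -rule["support"], rule["generated_order"])

def l3_style_key(rule):
    # L3-like preference: as above, but specific (longer-antecedent) rules rank higher.
    return (-rule["confidence"], -rule["support"], -rule["length"])

ranked = sorted(rules, key=cba_style_key)
# -> the 0.90-confidence rule first, then the two 0.80 rules ordered by generation order.
```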
4.3. Information gain

IG is a mathematical measure mainly used in decision trees to decide which attribute goes into the root, and it represents the expected amount of information required to determine which class should be given to a new unclassified case. In other words, it measures how well a given attribute divides the training data cases into classes. The attribute with the highest information gain is chosen. In order to define IG, one first has to measure the amount of information in an attribute using entropy. Given a set of training data cases D of c classes,

Entropy(D) = −Σ_c P_c log2 P_c,   (4)

where P_c is the probability that a case in D belongs to class c. The IG of a set of data cases on attribute A is defined as

Gain(D, A) = Entropy(D) − Σ_a (|D_a| / |D|) Entropy(D_a),   (5)

where the sum is over each value a of all possible values of attribute A, D_a is the subset of D for which attribute A has value a, |D_a| is the number of data cases in D_a, and |D| is the number of data cases in D.
Given a set of training data cases D of c classes, X m
Ci Ci
X GD ¼ D log D ; ð6Þ
EntropyðDÞ ¼ Pc log2 Pc ; ð4Þ i¼1
where Pc is the probability that D belongs to class c. The where jCi j represents the number of data cases which be-
IG of a set of data cases on attribute A is de¯ned as long to class Ci .
The IG of the rule antecedent (Gcond ) is de¯ned as
GainðD; AÞ
X N1 N11 N11 N12 N12
¼ EntropyðDÞ ððjDa j=jDjÞ EntropyðDa ÞÞ; ð5Þ Gcond ¼ log log ; ð7Þ
jDj N1 N1 N1 N1
where the sum is over each value a of all possible values of
where
attribute A, Da ¼ subset of D for which attribute A has
value a; jDa j ¼ number of data cases in Da , jDj ¼ number N1 ¼ jDj
SupportðRÞ
;
of data cases in D. ConfidenceðRÞ
N_{11} = |D| \cdot Support(R)  and  N_{12} = N_1 - N_{11}.   (8)

Finally, the training cases that do not match the rule antecedent are also considered as:

G_{\bar{Cond}} = -\frac{N_2}{|D|} \sum_{i=1}^{m} \frac{|C_i|}{N_2} \log \frac{|C_i|}{N_2},  where  N_2 = |D| - N_1.   (9)

So the rule r is said to be informative if r has support and confidence greater than minsupp and minconf, as well as Gain(r) > 0. After the rules are discovered, the ranking procedure is invoked, where rules with larger gain are placed at a higher rank. In cases when two or more rules have a similar gain, the algorithm evaluates the confidence, support and rule antecedent length, similar to the CBA rule preference procedure.
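The following sketch turns Eqs. (6)-(8) into code under the assumption that a rule's support and confidence are given as fractions of the training set: it computes G_D from the class counts, G_Cond from N1, N11 and N12, and ranks rules by the resulting gain. Function and variable names are ours, not from Su et al. (2008), and the complement term of Eq. (9) is omitted for brevity.

```python
import math

def g_d(class_counts):
    """Eq. (6): information of the training set from per-class case counts."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in class_counts.values() if c)

def g_cond(n_cases, support, confidence):
    """Eqs. (7)-(8): information of the cases covered by the rule antecedent."""
    n1 = n_cases * support / confidence          # cases matching the antecedent
    n11 = n_cases * support                      # ... that also carry the rule class
    n12 = n1 - n11
    h = 0.0
    for part in (n11, n12):
        if part > 0:
            h -= (part / n1) * math.log2(part / n1)
    return (n1 / n_cases) * h

def gain(rule, class_counts, n_cases):
    """Gain(r) = G_D - G_Cond; a rule counts as informative when this is positive."""
    return g_d(class_counts) - g_cond(n_cases, rule["support"], rule["confidence"])

rules = [{"support": 0.10, "confidence": 0.90}, {"support": 0.20, "confidence": 0.55}]
counts = {"yes": 60, "no": 40}
ranked = sorted(rules, key=lambda r: gain(r, counts, 100), reverse=True)
```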
Lastly, it is worth mentioning that Lan et al. (2006) utilised the dilated Chi-square measure for rule sorting instead of the confidence and support thresholds. So, after the rules are found, the learning algorithm evaluates the dilated Chi-square for each rule and places the rules with high values first.

4.4. Discussion on rule ranking

Rule sorting is considered a pre-processing phase in AC mining which impacts (1) the classifier building process and (2) test case prediction. As a matter of fact, without rule sorting, the algorithm will not be able to easily choose the rules that can be employed in the prediction step. Rule preference has been defined differently by AC algorithms. CBA and its successors considered confidence and support the main criteria for rule preference, and MCAR adds upon CBA the class distribution of the rules if two or more rules have identical confidence and support. On the other hand, unlike CBA and MCAR, the L3 algorithm prefers specific rules over general ones, since a specific rule may subsume multiple general rules. Lastly, recent algorithms consider information theory based measures such as IG as the basis for rule preference. An experimental study (Thabtah et al., 2005) revealed that using confidence, support and rule antecedent cardinality in rule ranking is an effective approach. Though, recent studies (Abdelhamid et al., 2012b) and the example discussed in Sec. 4.1 showed that imposing more tie-breaking conditions besides confidence and support may reduce the chance of randomisation in ranking, which consequently limits the use of the default class later on in the prediction step. The employment of mathematical measures such as Entropy and IG seems promising towards improving the process of sorting the rules. Finally, approaches that favour specific rules may sometimes gain a slight improvement in accuracy; however, they suffer from holding a large number of rules, many of which are never used, and thus consume memory as well as training time. Table 8 depicts the general ranking models used in AC mining.

Table 8. General ranking models used in AC mining.

Rule ranking criteria: Common AC algorithms
Support, confidence, rules generated first: CBA, CBA(2), negative rules (ARC-AC), CARGBA, ACCF, CAAR, CMAR, etc.
Support, confidence, rule cardinality (longest rule): L3, L3G
Support, confidence, rule cardinality (shortest rule), rule class distribution (dominant class): MMAC, MCAR
Support, confidence, rule cardinality (shortest rule), rule class distribution (minority class): MAC
Information gain: AC-IG

5. Building the Classifier and Rule Pruning

Once the complete set of rules is found in the training phase and then ranked, the AC algorithm has to decide how to choose a subset of highly effective rules to represent the classifier. There are different ways used in AC to build the classifier; for instance, CBA utilises database coverage rule pruning, where rules that correctly cover a certain number of training cases are marked as accurate rules and the remaining rules are discarded. The L3 and L3G algorithms employ lazy pruning that stores primary and secondary rules in the classifier. Moreover, Thabtah et al. (2010) proposed different rule pruning methods based on exact rule matching and partial rule matching of the rule body and the training case. Lastly, a pruning method that does not consider the similarity of the evaluated rule class and the training case class was developed by Abdelhamid et al. (2012a). This section discusses the different procedures applied in selecting the
classifier rules in AC mining. Furthermore, different mathematical rule pruning methods, including Pessimistic Error Estimation, Chi-Square testing and others, are surveyed in this section.

5.1. Full and partial match rule pruning

Definition 10. A rule is said to fully match a training case if all the attribute values in the rule body are contained in the training case.

Definition 11. A rule is said to partially match a training case if at least one of the attribute values in the rule body is contained in the training case.

Different rule pruning methods are discussed in this section, primarily those that consider partial or full matching between the selected rule and the training case. In particular, database coverage (Liu et al., 1998), High Precedence (HP) and High Precedence Classify Correctly (HCP) (Abumansour et al., 2010) are surveyed. The database coverage method considers a rule significant if its body fully matches the training case attribute values and the rule class is similar to that of the training case. Whereas FMP is similar to the database coverage but it abandons the class similarity condition. The HCP considers a rule significant if its body partially matches any of the training cases and the rule class is identical to that of the training case. Finally, the HP signifies a rule if its body partially matches any of the training cases without checking the class value.

5.1.1. Database coverage

The database coverage is the first pruning method in AC that has been applied by CBA to select the classifier. This method is simple and effective; it evaluates the complete set of discovered rules against the training data set, aiming to keep only highly effective and accurate rules. Figure 3 depicts the database coverage method, in which, for each rule starting with the highest ranked rule, all training cases covered by the rule are marked for deletion and the rule gets inserted into the classifier. In cases where a rule cannot cover a training case (the rule body does not match any training case attribute values), then the rule is discarded.

The database coverage method terminates when the training data set becomes empty or if there are no more rules to be evaluated. In that case, the remaining uncovered training cases are used to generate the default class rule, which represents the largest frequency class in the remaining unclassified cases in the training data set. It should be noted that the default class rule is fired during the prediction step in cases when there is no classifier rule applicable to the test case.

Input: The complete set of discovered rules R (sorted) and the training data set D
For each rule ri in R do
    Mark all applicable cases in D that match ri's body
    If ri correctly classifies a case in D
        Insert ri into the classifier
        Discard all cases in D covered by ri
    end if
    If ri covers no cases in D
        Delete ri
    end if
end
If D is not empty
    Generate a default rule for the largest frequency class in D
    Mark the least error rule in R as a cutoff rule
end if

Fig. 3. The database coverage method.
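A compact Python rendering of the database coverage procedure of Fig. 3, together with the full/partial match tests of Definitions 10 and 11. The data layout (cases as dicts of attribute values plus a class label, rules as (body, class) pairs) is an assumption made for illustration; this is a sketch of the idea rather than CBA's actual implementation.

```python
def full_match(body, case_attrs):
    """Definition 10: every attribute value in the rule body occurs in the case."""
    return all(case_attrs.get(attr) == val for attr, val in body.items())

def partial_match(body, case_attrs):
    """Definition 11: at least one attribute value in the rule body occurs in the case."""
    return any(case_attrs.get(attr) == val for attr, val in body.items())

def database_coverage(sorted_rules, training):
    """Keep a rule only if it correctly classifies at least one still-uncovered case."""
    remaining = list(training)          # each case: {"attrs": {...}, "class": label}
    classifier = []
    for body, klass in sorted_rules:    # rules are assumed already ranked
        covered = [c for c in remaining if full_match(body, c["attrs"])]
        if any(c["class"] == klass for c in covered):
            classifier.append((body, klass))
            remaining = [c for c in remaining if c not in covered]
        if not remaining:
            break
    default_class = None
    if remaining:                       # majority class of the uncovered cases
        labels = [c["class"] for c in remaining]
        default_class = max(set(labels), key=labels.count)
    return classifier, default_class
```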
A closely related method to the database coverage was proposed by Abdelhamid et al. (2012a). In this method, a rule is inserted into the classifier if its body fully matches the training case, without having to share an identical class with the training case. Once a rule is evaluated, all training cases covered by it are removed, and the process continues until all rules are evaluated or the training data set becomes empty. After CBA was proposed, several AC algorithms have successfully employed database coverage like methods in building the classifier, e.g. CBA(2), ARC-BC, CAAR, ACN and ACCF.

5.1.2. High classify pruning method (HCP)

Many rules found in the training step cannot be used to forecast test cases, and thus some discovered rules are deleted. This rule evaluation method, the High classify pruning method (HCP) (Abumansour et al., 2010) (Fig. 4), goes over the complete set of rules after ranking and applies each rule against the training data set. If the rule covers (partially matches) a training case and has a common class with that training case, it will be inserted into the classifier and all training cases covered by the rule are removed. The method repeats the same process for each remaining rule until the training data set becomes empty, and it considers the rules within the classifier during the prediction step.

The distinct difference between this method and the database coverage is that a rule is added into the classifier if it partially covers at least one training case, whereas in
the database coverage, a rule body must fully match the training case in order to be part of the classifier.

Fig. 4. The HCP method:
R' = sort(R)
For each rule ri in R' do
    Find all applicable training cases in T that partially match ri's condition
    If ri correctly classifies a training case in T
        Insert the rule at the end of Cl
        Remove all training cases in T covered by ri
    end if
    If ri cannot correctly cover any training case in T
        Remove ri from R
    end if
end for

5.1.3. HP method
The HP method (Abumansour et al., 2010) (Fig. 5) allows a rule to be inserted into the classifier if its body partially matches the training case, regardless of the class similarity between the rule class and that of the training case. So, once rules are extracted and ranked, this method iterates over the rules starting with the highest sorted one; all training cases covered by the selected rule are discarded and the rule is inserted into the classifier. Any rule that does not cover a training case is removed. The loop terminates when either the training data set is empty or all rules are tested.

Fig. 5. The HP method:
R' = sort(R)
For each rule ri in R' do
    Find all applicable training cases in T that partially match ri's condition
    Insert the rule at the end of Cl
    Remove all training cases in T covered by ri
    If ri cannot correctly cover any training case in T
        Remove ri from R
    end if
end for

The difference between the HP and HCP methods is that in the HP, a rule gets inserted into the classifier if it partially covers at least one training case, regardless of whether it classifies that case correctly or not. On the other hand, in the HCP, a rule must classify a training case correctly in order to be considered in the classifier.

5.1.4. Lazy methods

Lazy AC scholars (Baralis et al., 2004) believed that pruning should be limited to rules that incorrectly cover the training cases during building the classifier. This is since these rules are the only ones that lead to misclassification on the training data set, and therefore they are the only ones that should be discarded. Unlike database coverage based methods, which prune any rule that does not cover a training case, lazy AC algorithms store these rules in a compact set, aiming to use them during the prediction step.

The lazy pruning occurs when the complete set of rules is discovered and ranked in descending order, in which
longer rules (those with more attribute values) are favoured over general rules. For each rule, if the selected rule correctly covers a training case (has a common class with that training case), it will be inserted into the primary rule set, and all of its corresponding training cases will be deleted. Whereas, if a higher ranked rule correctly covers the currently selected rule's training case(s), the selected rule will be inserted into the secondary rule set. Lastly, if the selected rule does not correctly cover any training case, it will be removed. The process is repeated until all discovered rules are tested or the training data set becomes empty. At that time, the output of this lazy pruning will be two rule sets: a primary set which holds all rules that correctly cover a training case, and a secondary set which contains rules that have never been used during the pruning since some higher ranked rules have covered their training cases.

The main distinguishing difference between the database coverage and lazy pruning is that the secondary rule set, which is held in main memory by the lazy methods, is completely removed during building the classifier by the database coverage. In other words, the classifier resulting from CBA based algorithms which employ the database coverage pruning does not contain the secondary rule set of the lazy pruning, and thus it is often smaller in size. This is indeed an advantage, especially in applications that necessitate a concise set of rules so the end user can easily control and maintain the classifier.

Empirical studies (Baralis et al., 2004) against a large number of UCI data sets revealed that using lazy algorithms such as L3 and L3G sometimes decreases the error rate more than CBA like algorithms. Though, the large classifiers derived by lazy algorithms and the main memory usage cost limit their use.
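A minimal sketch of one reading of the lazy split into primary and secondary rule sets described above, assuming ranked rules are given as (body, class) pairs and cases as attribute dicts with a class label; names and the data layout are illustrative only.

```python
def _full_match(body, case_attrs):
    return all(case_attrs.get(a) == v for a, v in body.items())

def lazy_pruning(ranked_rules, training):
    """Split ranked rules into primary/secondary sets; only harmful rules are dropped."""
    remaining = list(training)              # cases: {"attrs": {...}, "class": label}
    primary, secondary = [], []
    for body, klass in ranked_rules:
        covered = [c for c in remaining if _full_match(body, c["attrs"])]
        if any(c["class"] == klass for c in covered):
            primary.append((body, klass))   # correctly covers a still-unclaimed case
            remaining = [c for c in remaining if c not in covered]
        elif not covered:
            secondary.append((body, klass)) # its cases were claimed by higher ranked rules
        # rules that cover remaining cases only incorrectly are discarded
    return primary, secondary
```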
5.2. Long rules pruning

A rule filtering method that discards long rules (specific rules) whose confidence values are not larger than those of their subset rules (general rules) was proposed by Li et al. (2001). This rule pruning method eliminates rule redundancy, since many of the discovered rules have common attribute values in their antecedents. As a result, the classifier may contain redundant rules, and this becomes obvious particularly when the classifier size is large. The first algorithm that uses the long rules pruning was CMAR, in which, when a rule is about to be inserted in the classifier, a test is issued to check whether the rule can be removed or whether any of the existing rules may be deleted. There are some AC methods that employ this type of pruning, including ARC-BC and negative rules.
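A sketch of this kind of redundancy check, under the assumption that a candidate rule is dropped when an already accepted, more general rule (its antecedent is a subset of the candidate's) predicts the same class with at least the same confidence; the rule layout is illustrative only.

```python
def is_redundant(candidate, kept_rules):
    """candidate/kept rules: (antecedent_frozenset, class_label, confidence)."""
    body, klass, conf = candidate
    for k_body, k_class, k_conf in kept_rules:
        if k_class == klass and k_body <= body and k_conf >= conf:
            return True          # a more general, at-least-as-confident rule exists
    return False

def prune_long_rules(ranked_rules):
    kept = []
    for rule in ranked_rules:     # assumed sorted by rank (best first)
        if not is_redundant(rule, kept):
            kept.append(rule)
    return kept
```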
5.3. Mathematical based pruning

5.3.1. Pessimistic error estimation

Pessimistic error estimation is mainly used in data mining within decision trees (Quinlan, 1993) in order to decide whether to replace a sub-tree with a leaf node or to keep the sub-tree unchanged. The method of replacing a sub-tree with a leaf is called sub-tree replacement, and the error is computed using the pessimistic measure on the training data set. To clarify, the probability of an error at a node v is

q(v) = \frac{N_v - N_{v,c} + 0.5}{N_v},   (10)

where N_v is the number of training cases at node v and N_{v,c} is the number of training cases belonging to the largest frequency class at node v.

The error rate at sub-tree T is

q(T) = \frac{\sum_{l \in leafs(T)} (N_l - N_{l,c} + 0.5)}{\sum_{l \in leafs(T)} N_l}.   (11)

The sub-tree T is pruned (replaced by a leaf) if q(v) <= q(T).

The pessimistic error estimation has been exploited successfully in decision tree algorithms including C4.5 and See5. In AC mining, the first algorithm which employed pessimistic error pruning is CBA. For a rule R, CBA removes one of the attribute values in its antecedent to make a new rule R', then it compares the estimated error of R' with that of R. If the expected error of R' is smaller than that of R, then the original rule R gets replaced with the new rule R'.
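A small numeric rendering of Eqs. (10) and (11), with a helper that mimics the CBA-style check of whether dropping an antecedent item (yielding R') lowers the pessimistic error; the counting interface is an assumption made for illustration.

```python
def leaf_error(n_cases, n_majority):
    """Eq. (10): pessimistic error estimate at a single node/leaf."""
    return (n_cases - n_majority + 0.5) / n_cases

def subtree_error(leaves):
    """Eq. (11): leaves is a list of (n_cases, n_majority) pairs."""
    num = sum(n - m + 0.5 for n, m in leaves)
    den = sum(n for n, _ in leaves)
    return num / den

def should_generalise(rule_counts, pruned_counts):
    """CBA-style test: replace R by R' if R' has a smaller estimated error.
    Each argument is (covered_cases, correctly_classified_cases)."""
    err_r = leaf_error(*rule_counts)
    err_r_prime = leaf_error(*pruned_counts)
    return err_r_prime < err_r

# Example: collapse a sub-tree whose leaves are jointly worse than a single node.
print(leaf_error(20, 14) <= subtree_error([(12, 9), (8, 5)]))  # True -> prune the sub-tree
```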
5.3.2. Chi-square testing

The chi-square test (\chi^2) is normally applied to decide whether there is a significant difference between the observed frequencies and the expected frequencies in one or more categories. It is a well-known statistical hypothesis test for discrete data that examines the relationship between two objects in order to decide whether they are correlated (Witten and Frank, 2002). The evaluation using \chi^2 for a group of objects to decide their independence or correlation is given as:

\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},   (12)

where O_i is the observed frequencies and E_i is the expected frequencies.
If the observed frequencies and the expected frequencies are remarkably different, the assumption that they are related is rejected.

The first AC algorithm that employed a weighted version of \chi^2 is CMAR. It evaluates the correlation between the antecedent and the consequent of the rule and removes rules that are negatively correlated. A rule R: Antecedent -> c is removed if the class c is not positively correlated with the antecedent. In other words, if the result of the correlation exceeds a certain threshold, this indicates a positive correlation and R will be kept. Otherwise, R will be deleted since negative correlation exists in R. To clarify, for R, let Support(c) denote the number of training cases associated with class c and Support(Antecedent) denote the number of training cases associated with R's antecedent. Also let |T| denote the size of the training data set. The weighted chi-square of R, denoted Max \chi^2, is defined as:

Max \chi^2 = \left( \min\{Support(Antecedent), Support(c)\} - \frac{Support(Antecedent) \cdot Support(c)}{|T|} \right)^2 |T| \, u,   (13)

where

u = \frac{1}{Support(Antecedent) \cdot Support(c)} + \frac{1}{Support(Antecedent) \cdot (|T| - Support(c))} + \frac{1}{(|T| - Support(Antecedent)) \cdot Support(c)} + \frac{1}{(|T| - Support(Antecedent)) \cdot (|T| - Support(c))}.
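Eq. (13) translates directly into code; the sketch below computes Max \chi^2 from absolute support counts and keeps a rule only when the value clears a chosen significance threshold (the threshold value and function names are illustrative, not CMAR's actual constants).

```python
def max_chi_square(sup_antecedent, sup_class, n_cases):
    """Weighted chi-square (Eq. 13) from absolute support counts."""
    e = sup_antecedent * sup_class / n_cases
    u = (1.0 / (sup_antecedent * sup_class)
         + 1.0 / (sup_antecedent * (n_cases - sup_class))
         + 1.0 / ((n_cases - sup_antecedent) * sup_class)
         + 1.0 / ((n_cases - sup_antecedent) * (n_cases - sup_class)))
    return (min(sup_antecedent, sup_class) - e) ** 2 * n_cases * u

def keep_rule(sup_antecedent, sup_class, n_cases, threshold=3.84):
    """Retain the rule only if the weighted chi-square exceeds the threshold."""
    return max_chi_square(sup_antecedent, sup_class, n_cases) > threshold

print(keep_rule(sup_antecedent=40, sup_class=50, n_cases=200))  # True for this example
```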
Table 9. Summary of rule pruning models in AC.

Pruning method: AC algorithms
Database coverage: CBA, CBA(2), CAAR, negative rules (ARC-AC), CARGBA, CAN, CMAR, etc.
Redundant rules: CMAR, CAEP
CPAR: CPAR
Chi-square: CMAR
Pessimistic error: CBA, CBA(2), ADT, SARC
Lazy pruning: L3, L3G
Partial matching: MAC, MMAC
HCP, HP: MCAR

test cases. This step is called class prediction or forecasting. There are several different methods for class allocation in AC, some of which employ the highest ranked rule in the classifier (Liu et al., 1998; Thabtah and Cowling, 2007) and others use multiple rules (Li et al., 2001; Thabtah et al., 2011; Abdelhamid et al., 2012a). In this section we discuss the different prediction methods employed by current AC algorithms.

6.1. One rule class forecasting

The basic idea of the one rule prediction (Fig. 6) was introduced in the CBA algorithm. This method works as follows: once the classifier is constructed and the rules within it are sorted in descending manner according to confidence and support thresholds, and a test case is about to be forecasted, CBA iterates over the rules in the classifier and assigns to the test case the class associated with the highest sorted rule that matches the test case body. In cases when no rule matches the test case body, CBA takes on the default class and assigns it to the test case.

Fig. 6. One rule class forecasting. Input: classifier (R), test data set (Ts), array Tr; Output: error rate Pe.

After the dissemination of the CBA algorithm, a number of other AC algorithms have employed its prediction method (Baralis et al., 2002; Thabtah et al., 2005; Tang and Liao, 2007; Li et al., 2008; Kundu et al., 2008; Niu et al., 2009).
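A sketch of the one-rule forecasting loop just described: walk the ranked classifier, return the class of the first rule whose body fully matches the test case, and fall back to the default class otherwise. The data layout mirrors the earlier sketches and is an assumption, not CBA's actual code.

```python
def predict_one_rule(classifier, default_class, test_attrs):
    """classifier: ranked list of (body_dict, class_label) pairs."""
    for body, klass in classifier:
        if all(test_attrs.get(a) == v for a, v in body.items()):
            return klass            # highest ranked matching rule wins
    return default_class            # no rule applies -> default class

def error_rate(classifier, default_class, test_cases):
    """Fraction of test cases whose predicted class differs from the true one."""
    wrong = sum(predict_one_rule(classifier, default_class, c["attrs"]) != c["class"]
                for c in test_cases)
    return wrong / len(test_cases)
```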
6.2. Predictive confidence forecasting

A rule's confidence is the main criterion for choosing the right classifier rule to use for test case prediction. However,

6.3.1. Dominant class and highest confidence method(s)

Two closely related prediction methods that use multiple rules to forecast test cases were proposed by Thabtah et al. (2011). The first method is called "Dominant Class", which marks all rules in the classifier that are applicable to the test case, then divides them into groups according to class labels, and assigns the test case the class of the group which contains the largest number of rules, as shown in Fig. 7. In cases where no rule is applicable to the test case, the default class will be used.

The second prediction method is called "Highest Group Confidence", which works similarly to the "Dominant Class" method in the way of marking and dividing the applicable rules into groups based on the classes. However,
CPAR divides them into groups according to the classes, and calculates the average expected accuracy for each group. Finally, it assigns the test case t the class with the largest average expected accuracy value. The expected accuracy for each rule R is obtained as follows:

Laplace(R) = \frac{p_c(R) + 1}{p_{tot}(R) + p},   (14)

where

p is the number of classes in the training data set,
p_{tot}(R) is the number of training cases matching R's antecedent,
p_c(R) is the number of training cases covered by R that belong to class c.

Table 10. Summary of class forecasting methods in AC.

Method name: Common algorithms
One rule full matching with class similarity: CBA, CBA(2), ADT, CAAR, L3G, L3, etc.
Multiple rules label based on weighted chi-square: CMAR
Multiple label based Laplace expected accuracy: CPAR
Aggregated rules scores: CAEP
One rule full matching without class similarity: MAC
Dominant factor multiple label: Negative rules (ARC-AC), ARC-BC
Group of rules full matching without class similarity: Enhanced MAC
Highest group confidence: Modified MCAR
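The sketch below combines Eq. (14) with the group-based prediction idea of Sec. 6.3.1: applicable rules are grouped by class and the test case receives the class of either the largest group (dominant class) or the group with the best average score. Names and the scoring choice are illustrative assumptions rather than the published implementations.

```python
from collections import defaultdict

def laplace(p_c, p_tot, n_classes):
    """Eq. (14): Laplace expected accuracy of a rule."""
    return (p_c + 1) / (p_tot + n_classes)

def group_predict(applicable_rules, default_class, mode="dominant"):
    """applicable_rules: list of (class_label, score) for rules matching the test case."""
    if not applicable_rules:
        return default_class
    groups = defaultdict(list)
    for klass, score in applicable_rules:
        groups[klass].append(score)
    if mode == "dominant":                      # class of the largest group of rules
        return max(groups, key=lambda k: len(groups[k]))
    # otherwise: class of the group with the best average score (e.g. confidence or Laplace)
    return max(groups, key=lambda k: sum(groups[k]) / len(groups[k]))

rules = [("yes", laplace(8, 10, 2)), ("no", laplace(5, 9, 2)), ("yes", laplace(3, 6, 2))]
print(group_predict(rules, default_class="yes"))  # -> "yes"
```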
have successfully applied in optimization, online security and data mining is the Artificial Immune System (AIS). As a matter of fact, AIS has been utilised for the classification problem in the last decade and has produced competitive performance in accuracy rate. Examples of known classification algorithms that are based on AIS are clonal selection and negative selection (Do et al., 2009). We believe that AIS can be used in AC especially to minimise the search space for rules by reducing the number of candidate rules. Hereunder, two attempts at using AIS within AC are outlined.

There have been some initial attempts to adapt the learning methodology of NIS, especially the clonal selection, in the AC context; these have resulted in an algorithm named artificial immune system-associative classification (AIS-AC) (Do et al., 2009). The AIS-AC algorithm was proposed in 2005 and extended in 2009 and follows the evolutionary process of reducing the search space of the candidate rules by keeping just highly predictive rules. This process is accomplished by extracting frequent 1-ruleitems after passing over the initial training data set, and generating the possible candidate ruleitems at iteration N from the results derived at iteration N - 1, and so forth. The minsupp and minconf are utilised as sharp lines to discriminate among ruleitems at each iteration. Further, two new parameters are introduced, named clonal rate and Max generation. The clonal rate (defined below) denotes the rate at which items in the candidate rules at a given generation are extended, and is proportional to the rule confidence:

Clonal\ rate = \frac{n \cdot Clonal\_rate}{\sum_{i=1}^{n} conf(r_i)},   (15)

where n is the number of rules at the current iteration and Clonal_rate is a predefined user parameter. Once the candidate rules are extracted, they are tested on the training data, keeping only those that cover one or more training example(s). The algorithm terminates once the complete training data set is covered or the Max generation condition has been met (often set to 10). The candidate rules that have training data coverage are kept in the classifier. The AIS-AC algorithm applies the rules in the classifier on the test data in a similar way to the CBA prediction method.

Recently, another AIS based AC algorithm called AC-CS was proposed by Elsayed et al. (2012). This algorithm follows the same track as the previously described AIS-AC and uses the same strategies in deriving the rules and classifying test data. One simple difference between AC-CS and AIS-AC is that AC-CS builds the candidate rules in generations per class rather than all at once, and then merges each class rule set before evaluating the complete set of rules on the training data to determine the classifier. Empirical evaluations using a limited number of UCI data sets indicated that the proposed AIS algorithms (Elsayed et al., 2012; Do et al., 2009) are highly competitive in accuracy and execution time to the "Predictive Apriori" algorithm, which is a simplified version of CBA that primarily uses the Apriori algorithm for extracting the rules without pruning.

7.2. Calibration

Accuracy is one of the main metrics used in classification algorithms in data mining to favour an algorithm over others for certain data sets. In fact, most classification problems, such as credit card scoring, website classification, weather forecasting, etc., use accuracy or its complement (the error rate) as the main evaluation metric to distinguish among classification algorithms. Though, certain applications like cost-sensitive classification, Information Retrieval ranking in search engines and text categorisation for digital libraries may require additional information besides classification accuracy, such as class membership probabilities per test case. So in the calibrated AC approach, the derived rules per test data are used to describe the training data set, and these rules are utilised to compute the class membership probabilities. When the rules are accurate, calibrated AC algorithms assume that the estimated class membership probabilities are also accurate and can be generalised.

There are many classic rule based and non-rule based approaches in classification that have employed calibration. Examples are SVM, decision trees, and statistical and probabilistic approaches (Witten and Frank, 2002). In AC, one calibrated approach has been proposed (Veloso et al., 2011). We believe that calibration is an important issue that should be studied extensively in AC, simply since initial results revealed good predictive performance compared to other current algorithms. Furthermore, for multi-label classification, including the class membership probabilities is much more useful than for single label classification, for two reasons. Firstly, in multi-label classification the input data instance may belong to several classes, and therefore we can assign weights or class memberships, in particular when classes overlap in the training data. Thus, the decision maker can easily distinguish to which class the input data belongs, or can merge multiple classes together to come up with a new class label. Secondly, some of the rules in the classifier will be connected to a set of classes, and therefore calibration can assist in prioritising these classes (ranking).
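As a rough illustration of the calibration idea, the sketch below derives class membership probabilities for a test case from the confidences of the rules that apply to it, normalising per class. This is only one plausible way to realise the idea and is not taken from Veloso et al. (2011).

```python
from collections import defaultdict

def class_membership_probabilities(applicable_rules):
    """applicable_rules: list of (class_label, confidence) for rules matching a test case.
    Returns a probability distribution over the class labels."""
    totals = defaultdict(float)
    for klass, conf in applicable_rules:
        totals[klass] += conf
    norm = sum(totals.values())
    if norm == 0:
        return {}
    return {klass: weight / norm for klass, weight in totals.items()}

print(class_membership_probabilities([("spam", 0.9), ("spam", 0.7), ("ham", 0.6)]))
# {'spam': 0.727..., 'ham': 0.272...}
```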
misleading in some cases, especially since the rule with the largest confidence is chosen to predict the test case in the test data set. So, instead of computing the confidence from the training data set as most AC methods do, the test data should be considered in favouring rules during the prediction phase. Therefore, the authors proposed a measure of rule goodness called "predictive confidence", which is based on statistical information in the test data set (the frequencies of the test cases applicable to a rule). The new predictive confidence based AC approach is called AC-S. This approach is required to calculate each rule R's "confidence decrease", i.e. the drop between the confidence of R computed on the training data and that computed on the test data, in order to estimate the predictive confidence for each rule before predicting test cases.

The AC-S algorithm depends on several parameters that must be known at the time of prediction and for each test case before the algorithm chooses the most applicable rule for the test case. Precisely, the support and confidence for each candidate rule must be computed from both the training and testing data sets so that AC-S is able to estimate the predictive accuracy of each rule. This is indeed time consuming and can be a burden in circumstances where the training data set is highly correlated. Further, it is impractical to estimate the support and confidence for each rule in the testing data set in advance, since we do not know which rule will be used for prediction. Yet, we can utilise the test data during the prediction step to narrow down candidate rules. This can be seen as a new research path for enhancing the current "predictive confidence" approach. A comparison between AC-S and other known AC algorithms such as CBA, CBA(2) and CMAR was conducted against some UCI data sets. The accuracy results showed that AC-S is competitive to CBA, though the CBA(2) and CMAR algorithms derived higher quality classifiers than AC-S.

since these algorithms avoid repeatedly scanning the original databases and employ efficient search methods based on TID intersections to figure out frequent ruleitems. Moreover, cutting down the number of candidate rules seems to be a necessity for the success and applicability of AC algorithms in real applications. Recent studies revealed new attempts to develop rule pruning methods, particularly pruning based on mathematical formulas like IG, besides database coverage like pruning ("partial matching"). These provide promising research directions to accomplish this task. Furthermore, calibrated AC like the CLAC algorithm prunes by minimising the search space for candidate rules when it comes to classifying test data. Finally, despite the computation time of class allocation procedures that are based on groups of rules, algorithms that employ this type of prediction, such as CMAR, CPAR and MAC, are more accurate than single rule based procedures (CBA, MCAR). It is the firm belief of the authors that, due to the rapid advances in hardware technology and storage like cloud services infrastructure, processing large amounts of data is no longer a huge setback, since services such as processors can be rented directly from cloud service providers. Thus, new AC research areas like distributed AC become feasible in this era.

In the near future, we intend to develop a new AC algorithm for structured and unstructured textual documents that generates not only single label classifiers but also multi-label ones.

References

Abdelhamid, N, A Ayesh and F Thabtah (2013). Associative classification mining for website phishing classification. In Proc. ICAI '2013, pp. 687–695, USA.
Abdelhamid, N, A Ayesh, F Thabtah, S Ahmadi and W Hadi (2012a). MAC: A multiclass associative classification algorithm. Journal of Information and Knowledge Management, 11(2), 1250011-1–1250011-10.
Abdelhamid, N, A Ayesh and F Thabtah (2012b). An experimental study of three different rule ranking formulas in associative classification mining. In Proc. 7th Int. Conf. for Internet Technology and Secured Transactions (ICITST-2012).
Abumansour, H, W Hadi, L McCluskey and F Thabtah (2010). Associative text categorisation rules pruning method. In Proc. Linguistic and Cognitive Approaches to Dialog Agents Symposium (LaCATODA-10), Rzepka, R (ed.), at the AISB 2010 convention, pp. 39–44, April 2010, UK.
Aburrous, M, A Hossain, K Dahal and F Thabtah (2010). Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Systems with Applications: An International Journal, 7913–7921.
Al-Maqaleh, B (2013). Discovering interesting association rules: A multi-objective genetic algorithm approach. International Journal of Applied Information Systems, 5(3), 47–52.
Apache JIRA (2009). Mumak Hadoop MapReduce Simulator, https://issues.apache.org/jira/browse/MAPREDUCE-728.
Agrawal, R and R Srikant (1994). Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pp. 487–499.
Antonie, M and O Zaïane (2004). An associative classifier based on positive and negative rules. In Proc. 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 64–69, Paris, France.
Antonie, M and O Zaïane (2002). Text document categorization by term association. In Proc. IEEE Int. Conf. Data Mining, pp. 19–26, Maebashi City, Japan.
Arunasalam, B and S Chawla (2006). CCCS: A top-down associative classifier for imbalanced class distribution. In KDD 2006, pp. 517–522.
Baralis, E and P Garza (2012). I-prune: Item selection for associative classification. International Journal of Intelligent Systems, 27(3), 279–299.
Baralis, E, S Chiusano and P Graza (2004). On support thresholds in associative classification. In Proc. 2004 ACM Symp. Applied Computing, pp. 553–558, Nicosia, Cyprus.
Baralis, E and P Torino (2002). A lazy approach to pruning classification rules. In Proc. 2002 IEEE ICDM'02, p. 35.
Cendrowska, J (1987). Prism: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.
Cerf, L, D Gay, N Selmaoui and F Boulicaut (2008). A parameter-free associative classification method. In DaWaK 2008, LNCS 5182, Song, I-Y, J Eder and TM Nguyen (eds.), pp. 293–304.
Chen, J, J Yin and J Huang (2005). Mining correlated rules for associative classification. In Proc. ADMA, pp. 130–140.
Chien, Y and Y Chen (2010). Mining associative classification rules with stock trading data — A GA-based method. Knowledge-Based Systems, 23, 605–614.
Clare, A and R King (2001). Knowledge discovery in multi-label phenotype data. In Proc. PKDD '01, De Raedt, L and A Siebes (eds.), Vol. 2168, Lecture Notes in Artificial Intelligence, pp. 42–53.
Dhok, J and V Varma (2010). Using pattern classification for task assignment in MapReduce. In Proc. 10th IEEE/ACM Int. Conf. CCGrid 2010, Melbourne, Australia.
Do, TD, SC Hui, ACM Fong and B Fong (2009). Associative classification with artificial immune system. IEEE Transactions on Evolutionary Computation, 13, 217–228.
Do, TD, SC Hui and ACM Fong (2005). Associative classification with prediction confidence. In Proc. 2005 Int. Conf. Machine Learning and Cybernetics, Vol. 4, pp. 1993–1998.
Dong, G, X Zhang, L Wong and J Li (1999). CAEP: Classification by aggregating emerging patterns. In DS'99, pp. 30–42.
Duda, R and P Hart (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.
Elsayed, S, S Rajasekaran and R Ammar (2012). AC-CS: An immune-inspired associative classification algorithm. In ICARIS, pp. 139–151.
Han, J, TY Lin, J Li and N Cercone (2007). Constructing associative classifiers from decision tables. In Proc. Int. Conf. Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing (RSFDGrC), pp. 305–313.
Han, J, J Pei and Y Yin (2000). Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data, pp. 1–12.
Jabbar, MA, BL Deekshatulu and P Chandra (2013). Knowledge discovery using associative classification for heart disease prediction. Advances in Intelligent Systems and Computing, 182, 29–39.
Jabez, C (2011). A statistical approach for associative classification. European Journal of Scientific Research, 58(2), 140–147.
Jensen, D and P Cohen (2000). Multiple comparisons in induction algorithms. Machine Learning, 38(3), 309–338.
Kundu, G, M Islam, S Munir and M Bari (2008). ACN: An associative classifier with negative rules. In 11th IEEE Int. Conf. Computational Science and Engineering, pp. 369–375.
Kundu, G, S Munir, M Md. Islam and K Murase (2007). A novel algorithm for associative classification. In Proc. Int. Conf. Neural Information Processing (ICONIP), pp. 453–459.
Yu, K, X Wu, W Ding and H Wang (2011). Causal associative classification. In Proc. 11th IEEE Int. Conf. Data Mining (ICDM '11), December 11–14, 2011, Vancouver, Canada, pp. 914–923.
Lan, Y, D Janssens, G Chen and G Wets (2006). Improving associative classification by incorporating novel interestingness measures. Expert Systems with Applications, 31(1), 184–192.
Li, X, D Qin and C Yu (2008). ACCF: Associative classification based on closed frequent itemsets. In Proc. Fifth Int. Conf. Fuzzy Systems and Knowledge Discovery (FSKD), pp. 380–384.
Li, W, J Han and J Pei (2001). CMAR: Accurate and efficient classification based on multiple-class association rule. In Proc. IEEE Int. Conf. Data Mining (ICDM).
In Proc. Knowledge Acquisition and Modeling Workshop (KAM Workshop), pp. 1060–1063.
Tang, Z and Q Liao (2007). A new class based associative classification algorithm. In IMECS 2007, pp. 685–689.
Taiwiah, CA and V Sheng (2013). A study on multi-label classification. In Advances in Data Mining: Applications and Theoretical Aspects, Lecture Notes in Computer Science, Vol. 7987, pp. 137–150.
Thabtah, F and S Hammoud (2013). MR-ARM: A MapReduce association rule mining. Parallel Processing Letters, 23, 1350012.
Thabtah, F, W Hadi, N Abdelhamid and A Issa (2011). Prediction phase in associative classification. International Journal of Software Engineering and Knowledge Engineering, 21(6), 855–876.
Thabtah, F, Q Mahmood, L McCluskey and H Abdeljaber (2010). A new classification based on association
Xu, X, G Han and H Min (2004). A novel algorithm for associative classification of image blocks. In Proc. Fourth IEEE Int. Conf. Computer and Information Technology, pp. 46–51.
Ye, Y, Q Jiang and W Zhuang (2008). Associative classification and post-processing techniques used for malware detection. In Proc. 2nd Int. Conf. Anti-Counterfeiting, Security and Identification (ASID), pp. 276–279.
Yin, X and J Han (2003). CPAR: Classification based on predictive association rules. In Proc. SIAM Int. Conf. Data Mining (SDM), pp. 369–376.
Zaki, M and K Gouda (2003). Fast vertical mining using diffsets. In Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 326–335.
Zaki, M and CJ Hsiao (2002). CHARM: An efficient algorithm for closed itemset mining. In Proc. 2002 SIAM Int. Conf. Data Mining (SDM'02), pp. 457–473.
Zaki, M, S Parthasarathy, M Ogihara and W Li (1997). New algorithms for fast discovery of association rules. In Proc. 3rd KDD Conf., pp. 283–286.
Zhao, Z, H Ma and Q He (2009). Parallel k-means clustering based on MapReduce. In Cloud Computing, Lecture Notes in Computer Science, Vol. 5931, p. 674. Springer-Verlag, Berlin Heidelberg.
Zhu, Y, W Luo, G Chen and J Ou (2012). A multi-label classification method based on associative rules. Journal of Computational Information Systems, 8(2), 791–799.