

Journal of Information & Knowledge Management, Vol. 13, No. 3 (2014) 1450027 (30 pages)
.c World Scienti¯c Publishing Co.
#
DOI: 10.1142/S0219649214500270

Associative Classification Approaches: Review and Comparison

Neda Abdelhamid
Computing and Informatics Department
De Montfort University, Leicester, UK
[email protected]
Fadi Thabtah
Ebusiness Department
Canadian University of Dubai, Dubai, UAE

[email protected]

Published 18 September 2014

Abstract. Associative classification (AC) is a promising data mining approach that integrates classification and association rule discovery to build classification models (classifiers). In the last decade, several AC algorithms have been proposed such as Classification based Association (CBA), Classification based on Predicted Association Rule (CPAR), Multi-class Classification using Association Rule (MCAR), Live and Let Live (L3) and others. These algorithms use different procedures for rule learning, rule sorting, rule pruning, classifier building and class allocation for test cases. This paper sheds light on and critically compares common AC algorithms with reference to the above-mentioned procedures. Moreover, data representation formats in AC mining are discussed along with potential new research directions.

Keywords: Associative classification; classification; data mining; rule learning; rule sorting; pruning; prediction.

1. Introduction

Association rule discovery and classification are closely related data mining tasks, with the exception that association rule discovery finds relationships among attribute values in a database whereas classification's goal is allocating class labels to unseen data, known as the test data set, as correctly as possible. The joining of association rule and classification came to the surface as a promising research discipline named associative classification (AC) in 1998 in a paper titled "Integrating classification and association rule" (Liu et al., 1998). In AC mining, the training phase is about searching for hidden knowledge primarily using association rule algorithms, and then a classifier is constructed after sorting the knowledge and pruning useless and redundant rules. Many research studies including (Yin and Han, 2003; Thabtah et al., 2005; Li et al., 2008; Ye et al., 2008; Niu et al., 2009; Thabtah et al., 2010; Baralis and Garza, 2012; Abdelhamid et al., 2012a; Zhu et al., 2012; Jabbar et al., 2013; Taiwiah and Sheng, 2013) revealed that AC methods usually extract better classifiers with reference to error rate than other classification data mining approaches like decision tree (Quinlan, 1993) and rule induction (Jensen and Cohen, 2000).

Normally, an AC algorithm operates in three main phases. During the first phase, it looks for hidden correlations among the attribute values and the class attribute values in the training data set and generates them as "Class Association Rules" (CARs) in "IF-THEN" format (Thabtah et al., 2010). After the complete set of CARs is found, ranking and pruning procedures (phase 2) start operating, where the ranking procedure sorts the rules according to certain thresholds such as confidence and support (Li et al., 2008). Further, during pruning, contradicting and duplicating rules are discarded from the complete set of CARs. The output of phase 2 is the set of CARs which represents the classifier. Lastly, the derived classifier gets tested on a new independent data set to measure its effectiveness in forecasting the class of unseen test cases. The output of the last phase is the accuracy or error rate of the classifier.


Research studies, for instance (Veloso et al., 2007; Wang et al., 2011), have shown that AC has two distinguishing features over other traditional classification approaches. The first one is that it produces very simple knowledge (rules) that can be easily interpreted and manually updated by the end-user. Secondly, this approach often finds additional useful hidden knowledge missed by other classification algorithms, and therefore the error rate of the resulting classifier is minimised. The main reason behind producing the additional knowledge is that AC utilises association rule methods in the training phase (Liu et al., 1998; Thabtah et al., 2004) where all possible relationships among the attribute values in the training data set and the class attribute are found and extracted. Though, in some cases the possible numbers of derived rules may become excessive (Li et al., 2001; Al-Maqaleh, 2013).

There are a number of AC algorithms that have been proposed in the last decade including CBA (Liu et al., 1998), Classification based on Multiple Association Rule (CMAR) (Li et al., 2001), Association Rule Classification-Associative Classification (ARC-AC) (Antonie and Zaïane, 2002), CPAR (Yin and Han, 2003), Class Associative Association Rule (CAAR) (Xu et al., 2004), negative-rules (Antonie and Zaïane, 2004), L3 (Baralis and Torino, 2002), Multiclass Multilabel Associative Classification (MMAC) (Thabtah et al., 2004), MCAR (Thabtah et al., 2005), Class based Associative Classification Approach (CACA) (Tang and Liao, 2007), Fitcare (Cerf et al., 2008), Associative Classification based on Closed Frequent Items Set (ACCF) (Li et al., 2008), Associative Classification with Negative Rules (CAN) (Kundu et al., 2008), Cluster Based Association Rule (CBAR) (Niu et al., 2009), Looking at the Class Association (LCA) (Thabtah et al., 2010), Multiclass Associative Classification (MAC) (Abdelhamid et al., 2012a) and others. These algorithms employ different methodologies for knowledge discovery, rule sorting, rule pruning and forecasting of test cases. In this paper, the problem of AC is investigated and the different strategies employed in each step by the various AC algorithms are compared. Also, advantages and disadvantages of AC and its main differences with other rule based classification approaches are discussed.

Despite the applicability of AC in different real applications, reviews in this research domain are rare. In fact, AC has been successfully applied in domain applications like website phishing detection (Abdelhamid et al., 2013), automatic text classification (Abumansour et al., 2010), credit card scoring (Li et al., 2001), email classification (Aburrous et al., 2010) and others. The primary goal of this paper is to survey and compare the state-of-the-art AC techniques with reference to the different procedures employed during the algorithm's lifecycle, i.e. data formats, training phase, building the classifier, rule ranking, prediction, etc. This may enable other researchers to spot possible issues and research directions in this field for further improvement.

The rest of the paper is structured as follows: the AC problem, its solution scheme, the different data representation models and its main advantages and disadvantages are discussed in Sec. 2. Section 3 is devoted to the different learning strategies employed in AC. Rule sorting and its associated procedures are surveyed in Sec. 4, and Sec. 5 highlights the different methods employed to build the classifier and to prune unnecessary rules. Section 6 reviews the different prediction methods in AC, and possible new research directions are discussed in Sec. 7. Finally, conclusions and further research works are given in Sec. 8.

2. Associative Classification Mining

2.1. The problem

We follow Abdelhamid et al. (2012a) in the definition of the AC problem in data mining. Given a training data set D, which has n distinct attributes A1, A2, ..., An, and C is a list of classes. The number of cases in D is denoted |D|. An attribute may be categorical (where each attribute takes a value from a known set of possible values) or continuous, where each attribute takes a value from an infinite set, e.g. real or integer. For categorical attributes, all possible values are mapped to a set of positive integers. In the case of continuous attributes, any discretisation method can be applied. The goal is to construct a classifier from D, e.g. Cl: A → C, which can forecast the class of test cases, where A is the set of attribute values and C is the set of classes.

The majority of AC algorithms mainly depend on a threshold inputted by the user called minimum support (minsupp). This threshold is used to separate ruleitems (Definition 4) that are statistically fit and have a large frequency in the training data set (frequent ruleitems) from others that have low frequency (infrequent ruleitems). Therefore, the AC algorithm must compute the ruleitem's support to decide its survival by comparing its support (Definition 6) in the training data set with the minsupp threshold.

Any attribute value plus its class that passes minsupp is known as a frequent ruleitem, and when the frequent ruleitem belongs to a single attribute, it is said to be a frequent 1-ruleitem.


Another important threshold in AC is the minimum confidence (minconf). For each frequent ruleitem discovered, a typical AC algorithm computes its confidence (Definition 8) to decide whether it can be converted into a candidate rule. Hereunder are the main definitions related to the AC problem:

Definition 1. An AttributeValue is an attribute name Ai and its value ai, denoted (Ai, ai).

Definition 2. The jth row or training case in D is a list of attribute values (Aj1, aj1), ..., (Ajk, ajk), plus a class denoted by cj.

Definition 3. An AttributeValueSet is a set of disjoint attribute values contained in a training case, denoted <(Ai1, ai1), ..., (Aik, aik)>.

Definition 4. A ruleitem r is of the form <antecedent, c>, where antecedent is an AttributeValueSet and c ∈ C is a class.

Definition 5. The actual occurrence (actoccr) of a ruleitem r in D is the number of cases in D that match r's antecedent.

Definition 6. The support (supp) of a ruleitem r is the number of cases in D that match r's antecedent and belong to the class c.

Definition 7. A ruleitem r passes the minsupp if supp(r)/|D| ≥ minsupp. Such a ruleitem is said to be a frequent ruleitem.

Definition 8. The ruleitem's confidence is the frequency of the attribute value together with its related class in the training data set divided by the frequency of that attribute value in the training data. So a ruleitem r passes the minconf if supp(r)/actoccr(r) ≥ minconf.

Definition 9. A rule is represented as: antecedent → c, where antecedent is an AttributeValueSet and the consequent is a class.
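To make Definitions 5-8 concrete, the short Python sketch below (illustrative names only, not taken from any published AC implementation) computes actoccr, support and confidence for a ruleitem over the ten training cases of Table 1 in Sec. 2.2 and tests them against minsupp and minconf:

# Each training case: (attribute-value dictionary, class label), mirroring Table 1.
D = [
    ({"Att1": "a1", "Att2": "b1"}, "c2"), ({"Att1": "a1", "Att2": "b1"}, "c2"),
    ({"Att1": "a2", "Att2": "b1"}, "c1"), ({"Att1": "a1", "Att2": "b2"}, "c1"),
    ({"Att1": "a3", "Att2": "b1"}, "c1"), ({"Att1": "a1", "Att2": "b1"}, "c2"),
    ({"Att1": "a2", "Att2": "b2"}, "c1"), ({"Att1": "a1", "Att2": "b2"}, "c1"),
    ({"Att1": "a1", "Att2": "b2"}, "c1"), ({"Att1": "a1", "Att2": "b2"}, "c2"),
]

def evaluate_ruleitem(antecedent, c, data, minsupp, minconf):
    """antecedent is an AttributeValueSet, e.g. {"Att1": "a1", "Att2": "b1"}."""
    # Definition 5: cases whose attribute values match the antecedent.
    actoccr = sum(all(case.get(a) == v for a, v in antecedent.items())
                  for case, _ in data)
    # Definition 6: matching cases that also carry class c.
    supp = sum(all(case.get(a) == v for a, v in antecedent.items()) and cls == c
               for case, cls in data)
    support = supp / len(data)                        # ratio used in Definition 7
    confidence = supp / actoccr if actoccr else 0.0   # ratio used in Definition 8
    return support >= minsupp and confidence >= minconf, support, confidence

# The ruleitem <(Att1, a1), (Att2, b1)>, c2 holds 30% support and 100% confidence.
print(evaluate_ruleitem({"Att1": "a1", "Att2": "b1"}, "c2", D, 0.30, 0.50))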
2.2. Solution strategy

As mentioned earlier, the majority of AC algorithms operate in three steps: step one involves rule discovery and production; in step two, a classifier is built from the rules discovered in step one; and lastly the classifier is evaluated on test cases in step three. To explain the discovery of rules and the building of the classifier, consider the training data set shown in Table 1, which contains two attributes (Att1 and Att2) and the class attribute (Class). Assume that the minsupp and minconf have been set to 30% and 50%, respectively. A typical AC algorithm such as MCAR (Thabtah et al., 2005) firstly discovers all frequent ruleitems which hold enough support (Table 2). Once all frequent ruleitems are found, MCAR transforms the subset of them which hold enough confidence into candidate rules. The rows within Table 2 whose confidence passes the minconf are the candidate rules, and from those the classifier is derived. A rule is considered part of the classifier if it covers a certain number of cases in the training data set. So, a subset of the discovered candidate rules is chosen to form the classifier, which in turn is evaluated against an independent data set to obtain its effectiveness.

Table 1. Training data set.

Row number | Att1 | Att2 | Class
1 | a1 | b1 | c2
2 | a1 | b1 | c2
3 | a2 | b1 | c1
4 | a1 | b2 | c1
5 | a3 | b1 | c1
6 | a1 | b1 | c2
7 | a2 | b2 | c1
8 | a1 | b2 | c1
9 | a1 | b2 | c1
10 | a1 | b2 | c2

Table 2. Frequent items derived by MCAR from Table 1.

Frequent attribute value | Support (%) | Confidence (%)
<a1>, c2 | 40 | 57.10
<a1>, c1 | 30 | 42.85
<b1>, c2 | 30 | 60
<b2>, c1 | 40 | 80
<a1, b1>, c2 | 30 | 100
<a1, b2>, c1 | 30 | 75

2.3. Advantages of AC approach

AC is a data mining research topic that has been extensively studied in the last decade and applied in different application domains including text categorisation (Abumansour et al., 2010), bioinformatics (Clare and King, 2001), website security (Ye et al., 2008) and others. The high applicability of this classification approach is mainly due to several advantages it offers, such as the simplicity of the output, the high predictive accuracy of the classifier and the end-user maintenance of the classifier, where rules can be easily sorted, added and removed.


Table 3(a). The general differences between AC and association rule mining.

Approach | AC | Association rule
Goal | To predict the class in the test data set | To discover hidden relationships among items
Learning | Supervised | Unsupervised
Class involvement | Yes | No
Rule ranking | Essential | No rule ranking
Rule pruning | Essential | No rule pruning except the basic minsupp and minconf pruning
Prediction step | The main step (essential) | No prediction is involved
Discretisation of continuous attributes | Essential | Not applicable

In this section, we shed light on the main advantages and disadvantages of AC mining and highlight its main differences with rule based classification such as rule induction, covering and decision trees.

Some scholars consider AC a special case of association rule mining since it produces only the correlations among attribute values and the class attribute in a data set, whereas association rule mining discovers all correlations among attribute values, treating the class as any other attribute. For instance, Liu et al. (1998) and Liu et al. (2001) applied the Apriori algorithm on classification benchmarks and kept only the rules whose consequent contains the class value, and simply ignored the remaining rules. These algorithms filter out rules not having the class values in their consequent. Other scholars consider AC a standalone research topic in classification which at early research stages employed association rule in the rule discovery step and then added upon the classifier construction and class assignment steps. Latterly, AC evolved and used new methodologies for rule discovery other than association rule, such as Emerging Patterns (EPs) (Yu et al., 2011), Information Gain (IG) (Su et al., 2008), etc. Nevertheless, both sides agree that AC has its own distinguishing characteristics. Table 3(a) depicts the general differences between AC and association rule.

One of the primary advantages of AC is its ability to discover additional hidden knowledge that other classification approaches are unable to find. This additional knowledge has proved to enhance the classification accuracy of the outputted classifier if compared with traditional classification approaches according to several experimental studies, i.e. (Yin and Han, 2003; Yu et al., 2011; Wang et al., 2011; Elsayed et al., 2012). Though, the additional knowledge may contain redundant or conflicting rules which, if no appropriate pruning is invoked, can cause a larger problem called the exponential growth of rules (Thabtah et al., 2011; Thabtah and Hammoud, 2013). This problem usually happens when the minsupp is set to a very small value or the input data set is highly correlated.

Another important advantage of AC is the simple chunks of knowledge output represented as "If-Then" rules. This surely enables the decision maker to easily understand and maintain the classifier. Consider for instance a medical diagnosis system, where symptoms such as coughing, high temperature, blocked sinus, etc. may relate to different types of illness ("cold", "flu", etc.) and are stored in a data set. When a new patient is going to be diagnosed by the physician, the physician utilises the medical diagnosis system to derive the correlation among the patient attributes (age, gender, medical history, etc.), the patient's current symptoms and the types of illness (class attribute). It would be advantageous if the correlation in the medical diagnosis system is outputted in simple rules the physician is able to use in order to come up quickly with the right diagnosis. This classifier also enables the physician to select the right set of rules matching the patient's symptoms and, using these with his own medical experience, he can come up with the appropriate diagnosis. Overall, the physician is not interested in a probability, a black box or a complex decision tree since he does not have time, nor is he interested in breaking up the complexity of the output. This example, though limited, shows that different types of end-user can manipulate and comprehend the outcome of AC.

2.3.1. Main differences between AC and rule based classification

There are differences between AC and rule based classification approaches mainly in the way rules are found. In covering classification such as the Prism algorithm (Cendrowska, 1987), rules are derived locally and in a greedy way in which Prism splits the training data set into subsets with respect to class values.


Then for each subset, it looks for the rule that has the highest expected accuracy and produces it, and continues discovering rules until that subset becomes empty. The rules derived in this way are considered local since they were derived from subsets of the training data set and not the whole set, and the learning strategy is indeed greedy since the algorithm is searching for the largest expected accuracy rule after testing all attribute values in a certain subset. On the contrary, AC explores the complete training data set once, aiming to build a global classifier (Thabtah et al., 2004). Precisely, it finds the set of CARs from the complete training data set.

Moreover, other classification approaches such as rule induction also derive local classifiers. The derived rules are local since, when a rule is found, all cases associated with it in the training data set are removed and the process continues until a stopping condition is met, e.g. the rule discovered has an unacceptable error rate (Thabtah et al., 2005). Moreover, searching for rules in these algorithms is exhaustive since, for instance, "Incremental Reduced Error Pruning" (IREP) chooses the rules based on FOIL-gain (Quinlan and Cameron-Jones, 1993). In other words, the rule with the highest FOIL-gain has a higher rank in the final classifier. Unlike covering and rule induction approaches in classification that require exhaustive search to build local classifiers, AC searches the whole training data set aiming to build a global classifier.

Lastly, decision trees such as C4.5 (Quinlan, 1993) and C5 (Quinlan, 1998) derive the classifier as a tree where each path from the root to a leaf represents a rule. In this context, one cannot add or update the tree without having a large impact on nodes and leaves within it. Alternatively, if the end-user wishes to insert a new rule in a classifier produced by an AC algorithm, he can do that in a straightforward manner without affecting the rule set. Whereas if the same process is applied on a decision tree, this necessitates reshaping the complete tree to reflect the changes that happened. Table 3(b) depicts the general differences between AC and other rule based classification approaches with reference to learning methodologies, classifier output format and other criteria.

2.4. Data representation in AC

2.4.1. Horizontal versus vertical

Before the dissemination of the MMAC algorithm (Thabtah et al., 2004), there was only one data representation in AC, adopted from association rule mining and called horizontal (Liu et al., 1998). In the horizontal data format, the training data set consists of a number of cases or rows in which each row has a number followed by the list of attribute values. Table 1, displayed earlier, is an example of the horizontal data format. The authors of MMAC introduced the vertical data format in AC, where the training data set gets converted into a table similar to Table 4, in which each attribute value is represented by its locations (row numbers) in the training data set. This representation is highly effective, particularly in computing the support for each attribute value. Therefore, contrary to the horizontal data format, which is often associated with computational costs such as the time required for merging disjoint ruleitems and for ruleitem support calculation, the discovery of frequent ruleitems in the vertical data format is accomplished by simple intersections of the disjoint attribute value locations.

Table 3(b). Certain general differences for some rule-based classification approaches.

Classification approach name | Rule discovery methodology | Ranking methodology | Pruning procedure | Output format
AC | Association rule mining | Confidence, support, rules generated first (rule length) | Database coverage, lazy pruning | Rules
Decision tree | Entropy & information gain | No ranking | Backward and forward pruning, e.g. pessimistic error | Trees/Rules
Covering | Exhaustive search based greedy | No ranking | Some algorithms use rule's accuracy and others no pruning | Rules
Rule induction | Greedy | Certain mathematical measure, e.g. (foil-gain) | Reduced error pruning, incremental REP | Rules


Table 4. Vertical data representation of Table 1.

(Attr1, a1): 1, 2, 4, 6, 8, 9, 10
(Attr1, a2): 3, 7
(Attr1, a3): 5
(Attr2, b1): 1, 2, 3, 5, 6
(Attr2, b2): 4, 7, 8, 9, 10
(Class, c1): 3, 4, 5, 7, 8, 9
(Class, c2): 1, 2, 6, 10

For example, the determination of frequent 2-ruleitems is based on intersecting the locations of disjoint frequent 1-ruleitems. So for the candidate 2-ruleitem <(Attr1, a1), (Attr2, b1), c2> in Table 4, its frequency is determined by intersecting the locations of the ruleitems <(Attr1, a1)> and <(Attr2, b1)>. In other words, the set (1, 2, 4, 6, 8, 9, 10) is intersected with the set (1, 2, 3, 5, 6), and the result of the intersection (1, 2, 6) denotes the row numbers in the training data in which the new candidate ruleitem <(Attr1, a1), (Attr2, b1), c2> has appeared. Then, by locating the row numbers of the class c2, we simply find out that the size of this set, i.e. 3, denotes the support count. If the support count is larger than the minsupp then this candidate 2-ruleitem will become frequent, otherwise it will be discarded.
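The intersection just described can be written directly in Python using the rowId sets of Table 4; the variable names below are illustrative and are not taken from the MMAC or MCAR source code:

# Vertical (tid-list) layout of Table 1: each attribute value and class maps to
# the set of row numbers in which it occurs (Table 4).
locations = {
    ("Attr1", "a1"): {1, 2, 4, 6, 8, 9, 10},
    ("Attr2", "b1"): {1, 2, 3, 5, 6},
    ("Class", "c2"): {1, 2, 6, 10},
}

# Candidate 2-ruleitem <(Attr1, a1), (Attr2, b1)>: intersect the two tid-lists.
antecedent_rows = locations[("Attr1", "a1")] & locations[("Attr2", "b1")]  # {1, 2, 6}

# Support count for the ruleitem with class c2 = intersection rows labelled c2.
support_count = len(antecedent_rows & locations[("Class", "c2")])          # 3

minsupp_count = 3  # e.g. 30% of the 10 training cases
print("frequent" if support_count >= minsupp_count else "discarded")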


2.4.2. Rough set data representation

A recent hybrid AC approach that combines rough set theory, association rule and the covering classification approach has been developed in Han et al. (2007). This algorithm employs rough set theory, which is a knowledge discovery technique that normally discards redundant and noisy attributes from training data sets in the data representation stage to simplify the mining process. In other words, the rough set theory algorithm assumes that the training data set is a decision table that consists of attribute values and the class attribute. A result of the decision table is the subset of the attribute values and the class attribute that represents the whole table. In the hybrid AC algorithm, a rough set algorithm is used to select the set of attribute-value pairs that can represent the complete training data set in a decision table context, thus reducing the search space for finding knowledge.

2.4.3. Line and item spaces data representation

Recently, a novel data format based on switching between the horizontal and vertical data representations interchangeably during the training phase (line and item spaces) was proposed by Thabtah and Hamoud (2013). Precisely, MR-ARM maps each case in the training data set to a unique integer value. This value is the number of the line where the case occurs in the data set, and it is noted as RowId. It will be part of the ID of corresponding rules or frequent ruleitems that first appeared in the data set at this line. Every frequent item id (ItemId) consists of two parts, the column ids and the RowId: ItemId = (Column ids) RowId.

Column Ids: the ids of the attributes in the original data set which compose an item.

RowId: the line number (row id) of the first occurrence of an item in the original data set.

Once the original data is represented in ItemId format, then all intermediate data generated in the algorithm will keep the same representation. This makes the iterative process of finding frequent ruleitems simpler throughout the algorithm's operations. One more benefit of such a data representation is to reduce the amount of data to be communicated between the nodes running the algorithm in the distributed implementation. Here is an example of how to initialise the training data set of Table 5.

MR-ARM uses two data structure formats to represent the intermediate data used in the algorithm: line space and item space. An example of the line space data format is the data set initialised in Table 5.1, where the data set is represented as a collection of lines. Each line has the format:

Line : class(label); (columnIds 0)rowId 0; ...; (columnIds n)rowId n
Line : label; list of item ids

This is a horizontal representation of the data. The other data representation used is the vertical representation or "item space" format (see Table 5.2). Frequent item is a data structure which maps the classes with the corresponding lines for an ItemId; an ItemId is a set of occurrence lines with their classes. As shown later (Sec. 3.11), this simple data format allows ruleitems of higher degrees to be represented in the same way.

Table 5. Data set.

TID | Attributes | Class
0 | A B C | M
1 | C B C | M
2 | C D C | P
3 | C D C | R
4 | A B A | P
5 | A D A | R
6 | C D A | R
7 | C B D | R
8 | A B A | R
set which compose an item.


Table 5.1. Initial data in line space.

Line:Label | Attributes
0:0 | (0)0 (1)0 (2)0
1:0 | (0)1 (1)0 (2)0
2:2 | (0)1 (1)2 (2)0
3:3 | (0)1 (1)2 (2)0
4:2 | (0)0 (1)0 (2)4
5:3 | (0)0 (1)2 (2)4
6:3 | (0)1 (1)2 (2)4
7:3 | (0)1 (1)0 (2)7
8:3 | (0)0 (1)0 (2)4

Table 5.2. Initial data in item space.

Attribute | Line:Label
(0)0 | 0:0, 4:2, 8:3
(0)1 | 1:0, 2:2, 3:3, 6:3, 7:3
(1)0 | 0:0, 1:0, 4:2, 7:3, 8:3
(1)2 | 2:2, 3:3, 5:3, 6:3
(2)0 | 0:0, 1:0, 2:2, 3:3
(2)4 | 4:2, 5:3, 6:3, 8:3
(2)7 | 7:3
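As a reading aid, the short Python sketch below pivots the line-space rows of Table 5.1 into the item-space mapping of Table 5.2; the structure and names are illustrative and are not part of the MR-ARM implementation.

from collections import defaultdict

# Line space (Table 5.1): line -> (class label, list of item ids), where an item id
# pairs a column id with the rowId of the item's first occurrence.
line_space = {
    0: (0, [(0, 0), (1, 0), (2, 0)]),
    1: (0, [(0, 1), (1, 0), (2, 0)]),
    2: (2, [(0, 1), (1, 2), (2, 0)]),
    3: (3, [(0, 1), (1, 2), (2, 0)]),
    4: (2, [(0, 0), (1, 0), (2, 4)]),
    5: (3, [(0, 0), (1, 2), (2, 4)]),
    6: (3, [(0, 1), (1, 2), (2, 4)]),
    7: (3, [(0, 1), (1, 0), (2, 7)]),
    8: (3, [(0, 0), (1, 0), (2, 4)]),
}

# Item space (Table 5.2): item id -> list of (line, class label) occurrences.
item_space = defaultdict(list)
for line, (label, items) in line_space.items():
    for item in items:
        item_space[item].append((line, label))

print(item_space[(0, 0)])  # [(0, 0), (4, 2), (8, 3)], i.e. lines 0:0, 4:2 and 8:3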
3. Learning Approaches in Associative Classification

The first step in AC mining is about discovering and generating the CARs; therefore we can decompose it into two sub-steps: (1) the discovery of frequent ruleitems, and (2) the rule generation. Many scholars (Li et al., 2001; Zaki and Gouda, 2003; Thabtah et al., 2010) consider this step the most challenging one since it requires significant search and computation and may necessitate multiple training data set scans. For instance, the CBA (Liu et al., 1998) algorithms scan the input data n times, where n denotes the number of iterations required to find the complete set of frequent ruleitems. Generally, there are different learning methodologies in AC, many of which are adopted from association rule discovery, such as Apriori level-wise search (Agrawal and Srikant, 1994), frequent pattern growth (Han et al., 2000), tid-list intersections (Zaki and Gouda, 2003), frequent closed itemsets (Zaki and Hsiao, 2002) and others. Further, there are other learning approaches that are standalone, such as IG based and statistical ones. In this section the different learning approaches in AC are surveyed in detail.

3.1. CBA based approaches

Apriori is an association rule discovery algorithm that has been proposed by Agrawal and Srikant (1994), and its name is based on the fact that it uses prior knowledge of frequent itemsets. A frequent itemset is an item that has a frequency in the input database above the user minsupp threshold. The complete set of frequent itemsets is utilised to produce the association rules, and more precisely any frequent itemset in the form X → Y that holds enough confidence becomes a rule. In Apriori, the discovery of frequent itemsets is implemented in a level-wise fashion, where in each iteration a complete database scan is compulsory to generate the new candidate itemsets from the frequent itemsets already found in the previous iteration. Apriori uses the "downward-closure" property to minimise the search space of the candidate itemsets by cutting down their size during each iteration.

One of the first research studies that showed the utilisation of Apriori in solving classification benchmarks is CBA. This algorithm implements the Apriori candidate generation function to find and produce the frequent ruleitems. The main difference between an itemset and a ruleitem is that the ruleitem consists of the attribute values plus the class value (<attributes, values>, class), whereas the itemset may be looked at as just an attribute value by itself. Once CBA finds the complete set of frequent ruleitems, the subset of them which pass the minconf threshold is converted into CARs.

Since CBA employs Apriori in its learning step, it has inherited some of Apriori's deficiencies, especially the repetitive data set scans and the exponential growth of rules (Li et al., 2001). In particular, since Apriori tests all correlations among the items in the transactional database in the learning step in order to find the rules, the expected numbers of candidate itemsets are often massive. This definitely leads to the generation of large numbers of association rules, and in some cases, especially with very low minsupp, the numbers of rules are in the order of tens or hundreds of thousands, which consequently limits their use in practical applications. So, after the dissemination of CBA, several AC algorithms have been proposed to overcome some of CBA's deficiencies that have been inherited from Apriori. For instance, CBA (2) was disseminated to overcome the problem of not generating CARs for minority class labels in the training data set (the class balancing issue) (Liu et al., 2001). Further, the CMAR algorithm was developed to improve the searching for frequent ruleitems, and it introduced a compact data structure to achieve this goal.


Moreover, the LCA algorithm (Thabtah et al., 2010) was developed to minimise the number of candidate itemsets joining, which usually consumes time and memory resources. Lastly, the MAC algorithm (Abdelhamid et al., 2012a) has enhanced both the pruning and prediction phases of CBA and added one tie breaking condition in the rule ranking.

Currently, there are several AC algorithms that use CBA's style during the learning step to find frequent ruleitems and generate the CARs, including CBA (2) (Liu et al., 2001), ARC-BC (Antonie and Zaïane, 2002), NegativeRules (Antonie and Zaïane, 2004), lazy associative (Baralis et al., 2004), CAAR (Xu et al., 2004), Entropy associative (Su et al., 2008) and ACN (Kundu et al., 2007, 2008). These algorithms have improved upon CBA in one or more of its main steps including rule learning, sorting, pruning or prediction. For example, ARC-BC has been applied on unstructured textual data collections, and lazy AC algorithms such as L3G (Baralis et al., 2004) have enhanced the accuracy of CBA by producing more knowledge. Lastly, ACN and negative rules have discussed the issue of deriving not only positive knowledge but also knowledge with negation in the antecedent or consequent part of the rule. More precisely, ACN was proposed to mine a relatively large set of negative association rules and then uses both positive and negative rules to build a classifier. A positive rule is of the form X ⇒ Y where X and Y are sets of items and X ∩ Y = ∅. A negative rule is of the form X ⇒ Y where, in addition to being sets of items, X or Y will contain at least one negated item.
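To make the level-wise behaviour concrete, the sketch below outlines a CBA-style search for CARs: candidate k-ruleitems are joined from frequent (k-1)-ruleitems, their supports are counted with one data scan per level, and confident frequent ruleitems become rules. It is a deliberate simplification for illustration (the join and pruning steps are reduced to their bare essentials) and not a reimplementation of the published CBA code.

from itertools import combinations

def cba_like_cars(data, minsupp, minconf):
    """data: list of (frozenset of (attribute, value) pairs, class label)."""
    n = len(data)

    def scan(candidates):
        # One pass over the training data per level, as in Apriori/CBA.
        occ = {c: 0 for c in candidates}    # antecedent occurrences (actoccr)
        cls = {c: {} for c in candidates}   # per-class counts (supp)
        for case, label in data:
            for cand in candidates:
                if cand <= case:
                    occ[cand] += 1
                    cls[cand][label] = cls[cand].get(label, 0) + 1
        return occ, cls

    level = {frozenset([av]) for case, _ in data for av in case}  # candidate 1-ruleitems
    cars = []
    while level:
        occ, cls = scan(level)
        survivors = set()
        for cand in level:
            for label, s in cls[cand].items():
                if s / n >= minsupp:               # frequent ruleitem
                    survivors.add(cand)
                    if s / occ[cand] >= minconf:   # confident, so keep it as a CAR
                        cars.append((sorted(cand), label, s / n, s / occ[cand]))
        # Join step (simplified): merge frequent antecedents that differ by one value.
        level = {a | b for a, b in combinations(survivors, 2) if len(a | b) == len(a) + 1}
    return cars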
3.2. Charm based approach

A closely related approach to the Apriori learning style that reduces the number of candidate itemsets and improves the searching for frequent itemsets, called closed itemsets, was proposed by Zaki and Hsiao (2002) and Li et al. (2008). An itemset is said to be closed if none of its immediate supersets has the same support value as that of the itemset. For instance, if {ice, juice, crisp} is an itemset with a support value of 5, and all of its supersets have support values less than 5, then {ice, juice, crisp} becomes a closed itemset. Normally, closed itemsets are able to answer common inquiries like "is a particular itemset frequent?" and, if so, "what is its support value in the input database?". One of the common algorithms for mining closed itemsets is Charm (Zaki and Hsiao, 2002). Charm usually explores the itemset and the transactional database spaces rather than only the itemset space as Apriori does. Moreover, it introduced an efficient candidate searching method that skips many levels of the data structure (itemset tree) to quickly discover the frequent closed itemsets, instead of having to enumerate many possible subsets.

A few years ago, Li et al. (2008) extended Charm to handle classification benchmarks in an AC algorithm called ACCF. In particular, ACCF employed the closed itemset concept of Charm to cut down the number of CARs produced, so that decision makers can control the classifier and edit the rules. Experimental results against 18 different data sets from the UCI data repository (Merz and Murphy, 1996) showed that ACCF produced slightly better classifiers with respect to accuracy as well as size than CBA.

3.3. Combinatorial mathematics

One recent AC approach for mining CARs which is based on the theory of combinatorial mathematics was proposed by Pal and Jain (2010). The basic idea behind this algorithm comes from generating all possible combinations of attribute values in the input data set, which is represented as a bitmap, and then counting the occurrences of each element within the produced combinations. A combination is just an unordered set of a unique size consisting of a number of elements (attribute values). To clarify the concept of generating the possible combinations of elements for a set S, let's assume that S = (X, Y, Z). The possible number of combinations for S can be computed as 2^|S|, in this case 2^3 = 8, shown as (∅, X, Y, Z, XY, XZ, YZ, XYZ). Now, the authors have enumerated each element using a binary representation, so element "X" is represented as 100 and element "XYZ" is represented as 111. The algorithm works in two steps, where in step (1) it computes the support value for each combination to generate the candidate ruleitems (attribute value, class value) and then in step (2) it builds the classifier by converting any ruleitem having a confidence value larger than the minconf into a CAR.

This simple rule learning strategy based on combinatorial mathematics is not novel since association rule mining algorithms such as Apriori are also based on a binary representation of the items within the transactional database and use an efficient pruning method based on the downward closure property to reduce the search space for rules. The AC algorithm presented by Pal and Jain (2010) has been tested only on one single data set from the UCI repository called "TicTac", which limits its use in application domains. Lastly, the efficiency of such an algorithm was not evaluated, especially on highly correlated classification data sets where we expect the number of attribute value combinations to be numerous.
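The binary enumeration described above can be sketched with bit masks in a few lines of Python; this is a toy illustration of the idea rather than Pal and Jain's implementation.

S = ["X", "Y", "Z"]

# Every integer from 0 to 2**len(S) - 1 encodes one combination: bit i set means S[i]
# is present, so 0b100 denotes "X" and 0b111 denotes "XYZ".
combos = []
for mask in range(2 ** len(S)):
    members = [S[i] for i in range(len(S)) if mask & (1 << (len(S) - 1 - i))]
    combos.append("".join(members) or "{}")

print(combos)  # ['{}', 'Z', 'Y', 'YZ', 'X', 'XZ', 'XY', 'XYZ']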


3.4. Imbalanced class distribution based approach

The classes in some classification data sets are unevenly distributed. This may result in the production of a very small number of rules, and in some cases no rules at all, for the low frequency class and numerous rules for the high frequency class(es) (Arunasalam and Chawla, 2006). This problem normally happens because of the minsupp threshold, which controls the rule discovery step: if we set it to a value larger than a certain class frequency, there will be no rule representation for that class in the classifier, and several strong rules will be simply ignored during the rule discovery step. Therefore, researchers have investigated the possibility of utilising multiple supports (Liu et al., 2001; Baralis et al., 2004) or other measures such as Complement Class Support (CCS) (Arunasalam and Chawla, 2006) that may overcome the class imbalance issues in classification benchmarks.

One possible solution to the class imbalance problem is to prevent the minsupp threshold from taking any role in the rule generation and to use new measures such as CCS that primarily take into account positively correlated rules, as shown in the equation below:

CCS for a rule R (A → C) = Support(A ∪ C̄) / Support(C̄),    (1)

where A is the conjunction of the attribute values in R's body and C̄ represents the complement of class C. The learning approach of Arunasalam and Chawla (2006) only looks for strong correlation between the rule antecedent (rule body) and consequent (class), meaning rules that have low CCS are produced and other rules with high CCS are discarded. Experimentations against eight data sets from the UCI repository showed that the CCS based algorithm slightly outperformed CBA with respect to one error rate measure, and the results also revealed that the CCS based algorithm performed well on imbalanced data sets when it comes to predictive accuracy.
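Equation (1) is a simple ratio and can be computed with the same (attribute-value dictionary, class label) case layout used in the earlier sketches; the filtering threshold max_ccs below is a hypothetical parameter chosen for illustration, not a value from Arunasalam and Chawla (2006).

def ccs(antecedent, c, data):
    """Eq. (1): Support(A with the complement of C) / Support(complement of C)."""
    complement = [case for case, label in data if label != c]
    if not complement:
        return 0.0
    hits = sum(all(case.get(a) == v for a, v in antecedent.items()) for case in complement)
    return hits / len(complement)

def positively_correlated(antecedent, c, data, max_ccs=0.2):
    # A rule whose body rarely occurs with the complement class (low CCS) is kept.
    return ccs(antecedent, c, data) <= max_ccs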
Another possible solution to the class imbalance problem is the enhancement performed on the CBA algorithm by Liu et al. (2001), which considers the frequency of class labels in the input data set and assigns each class a different support value. In other words, the original minsupp value is distributed to each class according to the class frequency in the input data set. So, a low frequency class gets a low minsupp to guarantee the production of rules for it. An evaluation study against 34 data sets from the UCI repository showed that, on average, the error rate of CBA (2) is lower than that of the CBA and C4.5 algorithms.

Baralis et al. (2004) proposed a related multiple supports approach that looks at the current rules generated for all class labels in iteration N in order to amend the support value for class labels that have no rule representation by lowering their support. Therefore, ensuring rule appearance for most of the class labels in the training data set is a must.

3.5. TID-list intersection based approach

To reduce the number of passes over the input database in horizontal mining algorithms, the Eclat algorithm has been presented by Zaki et al. (1997), which requires only a single database scan, addressing the question of whether all frequent itemsets can be derived in a single pass. Eclat introduced the concept of vertical database representation in association rule (Table 4), where frequent itemsets are obtained by applying simple tid-list intersections, without the need for complex data structures. A tid-list of an item is the set of locations (row numbers) in which this item has appeared in the training data set. In 2003, a variation of the Eclat algorithm, called dEclat, was proposed by Zaki and Gouda (2003). The dEclat algorithm uses a newer layout called diffset, which stores the differences in the transaction identifiers (tids) of a candidate itemset from its generating frequent itemsets. This considerably reduces the size of the memory required to store the tids. The diffset approach avoids storing the complete tids of each itemset; rather the difference between the class and its member itemsets is stored. Two itemsets share the same class if they share a common prefix. A class represents items that the prefix can be extended with to obtain a new class. For instance, for a class of itemsets with prefix x, [x] = {a1, a2, a3, a4}, one can perform the intersection of xai with all xaj with j > i to get the new classes. From [x], we can obtain classes [xa1] = {a2, a3, a4}, [xa2] = {a3, a4}, [xa3] = {a4}.

In AC mining, the MCAR (Thabtah et al., 2005) and MMAC (Thabtah et al., 2004) algorithms modified the tid-list intersection learning used in association rule to handle classification benchmarks. We will explain the learning strategy of MMAC in the multi-label classification section (Sec. 3.9) since it is a multiple label algorithm. MCAR consists of two main phases: rules generation and a classifier builder. In the first phase, the training data set is scanned once to discover frequent 1-ruleitems, and then MCAR combines the ruleitems generated to produce candidate ruleitems involving more attributes. Any ruleitem with support and confidence larger than minsupp and minconf, respectively, is created as a candidate rule.


In the second phase, the rules created are used to build a classifier by considering their effectiveness on the training data set.

The frequent ruleitems discovery method of MCAR scans the training data set to count the frequencies of 1-ruleitems, from which it determines those that hold enough support. During the scan, frequent 1-ruleitems are determined, and their occurrences in the training data (rowIds) are stored inside an array in a vertical format along with the classes and their frequencies, and any ruleitem that fails to pass the support threshold is discarded. MCAR finds frequent ruleitems of size t by appending disjoint frequent itemsets of size t - 1 and intersecting their rowIds in the training data set. The result of this simple intersection gives a set of rowIds where both itemsets occur together in the training data. This set, along with the class array holding the class label frequencies derived during the first scan, can be used to compute the support and confidence of the new ruleitem resulting from the intersection. Experimentations on real scheduling data collections as well as the UCI data repository showed that MCAR outperformed CBA and other classic classification algorithms such as RIPPER and C4.5 with respect to accuracy.

In Tang and Liao (2007), a vertical AC algorithm called CACA was proposed. It scans the training data set, stores the data in a vertical format like MCAR, counts the frequency of every attribute value and arranges the attributes in descending order according to their frequencies. Any attribute value which fails to satisfy the minsupp is removed in this step. For the remaining attribute values, CACA intersects the attribute locations to cut down the search space of frequent patterns. Each attribute value in a class group that passes the minconf is inserted in an Ordered Rule Tree (OR-Tree) as a path from the root node, and its support, confidence and class are stored at the last node in the path. Limited experimental results suggested that CACA performs better with reference to accuracy and computation time than MCAR on a sample of the UCI data sets.

3.6. Causal and EP approach

The majority of AC algorithms employ minsupp and minconf, which are mainly statistical correlation parameters, to discover the rules. The minsupp is used to capture frequent attribute values (items) and the minconf is used to show the strong rules from the set of frequent attribute values. A different AC approach based on the idea of causality and EPs has been proposed by Yu et al. (2011) and Dong et al. (1999). Most of the current AC algorithms determine the correlation between the rule antecedent (attribute value) and consequent (class) based on the support and confidence parameters. Though correlation is not a causal thing; it only reveals a statistical association between a set of objects in an implication, e.g. X → Y. If we discover a causal correlation between the rule antecedent and consequent, one can reveal consequential factors with reference to class labels in the data set. Therefore, unlike current AC algorithms which produce a large search space for frequent ruleitems during the rule discovery, the use of causality and EPs in AC mining can minimise the search space of the candidate ruleitems by only keeping ruleitems that have a causal impact on the class (Yu et al., 2011). In other words, when CARs are discovered, the only attribute values considered in the CARs are those that belong to this causal attribute value space instead of the combinations of all attribute values. This significantly minimises the demand on resources, including training time and memory, in the rule discovery step.

The first algorithm which employed EPs was proposed by Dong et al. (1999) and is called Classification based on Aggregating Emerging Patterns (CAEP). An EP is an attribute value whose support changes from one data set to another with a change rate larger than a given constant threshold. The support rate between two data sets for a given attribute value is called the growth-rate, which can be computed as follows:

growth-rate(att) = Support_d'(att) / Support_d(att),    (2)

where att is the attribute value and d and d' are the data sets between which the attribute value's support has changed. Given a minsupp threshold and a growth-rate threshold, the algorithm finds the EPs whose growth-rate survives that threshold. In mining EPs, the input data set is first divided into parts based on the class labels, and a production of all such surviving attribute values from one part to another is implemented (Dong et al., 1999).

Experimental studies (Dong et al., 1999) showed that EP based AC algorithms generate competitive classifiers with respect to classification rate if compared to CBA, CMAR, CPAR and C4.5.

3.7. CMAR and lazy based approaches

Han et al. (2000) presented an association rule discovery method called Frequent Pattern Growth (FP-Growth) that converts the transactional database into a condensed frequent pattern tree (FP-tree) in which each transaction corresponds to one path in the tree containing the frequent items in that transaction.


Therefore, the new representation of the input database (the FP-tree) can be seen as practical since the frequent itemsets in each transaction are known by the tree, and the FP-tree is usually smaller in size than the complete input database because of the item sharing among frequent itemsets. In addition, the number of iterations over the input database necessary to build the FP-tree is just two, rather than N as in Apriori, where N equals the size of the largest frequent itemset. Once the algorithm constructs the FP-tree, a pattern growth heuristic kicks in to produce the rules from the FP-tree. For each frequent pattern X, the heuristic uses links in the tree to derive other available patterns co-occurring with X, and then the FP-growth algorithm concatenates X with the other patterns extracted from the FP-tree.

In AC mining, a modified version of FP-growth has been successfully implemented by a number of algorithms including a malware detection AC algorithm (Ye et al., 2008), L3G (Baralis et al., 2004), L3 (Baralis and Torino, 2002) and CMAR (Li et al., 2001). Particularly, the first AC algorithm that employed FP-growth is CMAR, which saves the rules in a prefix tree data structure known as a CR-tree. The CR-tree holds the rules in descending order according to the rule body support value in the training data set (the frequency of the attribute values in the antecedent of the rule). Once a rule is extracted, it is inserted into the CR-tree as a path from the root, and its support, confidence and associated class are saved at the last node in the path. When a new rule is about to be inserted into the tree and that rule contains common attribute values with another already existing rule in the tree, the path of the existing rule is extended to reflect the addition of the new rule.
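As a rough illustration of this prefix-sharing idea (not the actual CMAR implementation, whose node layout and maintenance are more involved), each rule can be stored as a path in a small tree, with its support, confidence and class kept at the last node of the path:

class CRNode:
    def __init__(self):
        self.children = {}     # attribute value -> child node
        self.rule_info = None  # (support, confidence, class) if a rule ends here

def insert_rule(root, antecedent, support, confidence, label):
    """antecedent: attribute values already sorted by descending frequency."""
    node = root
    for item in antecedent:
        node = node.children.setdefault(item, CRNode())  # reuse or extend the path
    node.rule_info = (support, confidence, label)

root = CRNode()
insert_rule(root, [("Att1", "a1"), ("Att2", "b1")], 0.30, 1.00, "c2")
insert_rule(root, [("Att1", "a1"), ("Att2", "b2")], 0.30, 0.75, "c1")
# Both rules share the ("Att1", "a1") node, so the common prefix is stored only once.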
In 2002, an AC algorithm called L3 employed the CMAR learning strategy in rule generation, though this algorithm adds to CMAR the concept of lazy pruning. The lazy pruning approach is discussed in Sec. 4.2. Recently, Ye et al. (2008) have evaluated the applicability of AC on the malware security benchmark problem. Malware is a general term that corresponds to all kinds of unwanted software like trojans, spyware, viruses and others. Since the detection of malware is challenging, especially from large data sets, the authors have adopted CMAR in order to improve the performance involving the search for correlations between the security features and the class attribute. Experimentations using 33,695 Windows PE (portable executable) files, of which 11,507 are recognised as benign executables while 22,188 are malicious executables, have been used to evaluate the algorithm. The results revealed that this algorithm usually achieves the highest detection of malware if compared to decision tree (Quinlan, 1993).

In general, most of the AC algorithms that employ the CMAR learning strategy take the common attribute values contained in the rules into consideration. This indeed reduces the memory usage as well as the searching time for frequent ruleitems if compared with CBA-like algorithms such as CBA, CBA(2) and LCA. Experimental studies (Li et al., 2001; Ye et al., 2008) on the UCI data repository and a malware security data collection demonstrated that CMAR-like algorithms produce higher quality classifiers than CBA-based algorithms and they may save more memory storage (Li et al., 2001; Thabtah and Cowling, 2007). Nevertheless, one major deficiency of CMAR-like algorithms is that the CR-tree may not fit in the main memory in cases when the input data is dense and huge in size.

3.8. Greedy based approach

A learning strategy called first order inductive learner (FOIL) that produces rules for each class in the training data set was produced by Quinlan and Cameron-Jones (1993). FOIL learns the rules locally in a greedy fashion and according to a measure called FOIL-gain. The algorithm generates the rules as follows: for each available class L, it splits the training data into two subsets, one that contains all cases associated with L (positive cases) and one that holds all other cases associated with the rest of the class labels (negative cases). Then FOIL initiates an empty rule (e.g. if empty then L) and iterates over the available attribute values to compute the FOIL-gain for each attribute value belonging to L; it selects the attribute value with the largest FOIL-gain and adds it to the rule antecedent. The same process is repeated until the constructed rule length reaches a certain value or the negative case set becomes empty. Once the rule is constructed, all associated positive cases that belong to the attribute value and class L are removed. FOIL continues building rules for class L until all positive cases are covered (removed); once that occurs it considers another class and repeats the same process until all class labels are considered.

The key to success in the FOIL learning strategy is the FOIL-gain measure, which assesses the information gained for a particular rule after adding an attribute value to that rule. The FOIL-gain measure for a certain attribute value (A1, v1) can be calculated using the class information in the training data set. So, for class label L, the positive cases associated with it are denoted |P'| and the negative cases of L are denoted |N'|.


Once (A1, v1) is added by FOIL into a rule R, there will be |P| positive and |N| negative training cases that correspond to R: (A1, v1) → c.

FOIL-gain(A1, v1) = |P| × (log(|P| / (|P| + |N|)) − log(|P'| / (|P'| + |N'|))).    (3)

It is clear that FOIL always looks for the largest FOIL-gain attribute value in order to add it into the rule. Though, there could be more than one attribute value with a similar FOIL-gain, which makes the selection of just one attribute value questionable. This can also lead to deterioration in the classification accuracy during the prediction step since a limited number of rules are often extracted by FOIL. Another problem associated with the FOIL learning fashion is that the rules are derived from parts of the training data set and not from the complete set, which makes them local rules and not global ones.
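Equation (3) translates directly into code; the counts in the example below are invented purely for illustration, and the natural logarithm is used (the base only rescales the gain and does not change the ranking of attribute values).

import math

def foil_gain(p, n, p0, n0):
    """Eq. (3): gain of adding an attribute value to a rule.
    p0, n0: positive/negative cases covered before adding it; p, n: after."""
    return p * (math.log(p / (p + n)) - math.log(p0 / (p0 + n0)))

# Before adding (A1, v1) the rule covers 50 positive and 50 negative cases;
# afterwards it covers 30 positive and 5 negative cases.
print(foil_gain(30, 5, 50, 50))  # about 16.2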
In 2003, Yin and Han (2003) proposed an AC algorithm called CPAR that enhances FOIL rule learning: once a rule such as R is constructed, CPAR does not discard the positive cases associated with R; instead, the weights of these cases are lowered by a multiplying factor. This enhancement guarantees the production of more rules, as a training case is allowed to be covered by multiple rules instead of a single one, and consequently the classification accuracy gets improved as well. Moreover, CPAR finds all attribute values with the largest FOIL-gain rather than just one as in FOIL, so it can add multiple attribute values into the rules and thus build rules simultaneously.

Furthermore, the searching process for the attribute value with the largest FOIL-gain can be exhaustive and requires storage resources (e.g. main memory), especially when the available number of attributes in the training data set is large. In this context, CPAR employs an efficient data structure to keep all necessary data about the rule, such as the positive and the negative cases, before adding the attribute value into the rule antecedent and after adding it into the rule. It has been shown that CPAR is highly competitive with reference to predictive accuracy to other AC algorithms such as CBA and to traditional classification algorithms such as RIPPER and C4.5 against the UCI data collection.

The different steps in AC mining have been studied by Chen et al. (2005) in order to come up with a new algorithm that can take advantage of the previous studies. The outcome was an algorithm that learns the rules using the FOIL-gain measure, and then discards detailed rules and weakly correlated rules similar to the CMAR algorithm with minor modifications. Evaluation using ten UCI data sets and known AC algorithms including CBA, CMAR and CPAR showed that Chen et al.'s (2005) algorithm is competitive to these algorithms and in particular it slightly outperformed CMAR and CBA on the considered data sets.

3.9. Repetitive learning and multiple labels approach

The majority of current AC algorithms extract single label classifiers in which the consequent of the rules contains only one class (Taiwiah and Sheng, 2013). In the searching process for rules in the training data set, these algorithms only consider the largest frequency class associated with the attribute value and produce it in the potential rule consequent. However, an attribute value may associate with multiple class labels with similar frequencies, making the extraction of just one class in the rule highly undesirable and questionable. This is since these class labels comprise important and useful knowledge to the decision maker and producing all of them is a definite advantage.

The first AC algorithm that considers the production of multiple labels in the rule consequent is MMAC (Thabtah et al., 2004). This algorithm proposed a recursive learning phase that combines local classifiers derived during a number of iterations into a multiple label global classifier. For a given training data set T, MMAC operates similar to the MCAR algorithm in the training step and extracts the first single label classifier in iteration one. Then all training cases associated with the derived rules are discarded, and the remaining unclassified cases in the original training data set comprise a new data set T1. In the next iteration, the algorithm finds all rules from T1, builds another single label classifier, removes all cases in T1 which are associated with the generated rules, and so forth. The result is n classifiers which MMAC merges to form a multi-label classifier. One distinguishing feature of MMAC, besides discovering additional knowledge often missed by other AC approaches, is that it can extract multi-label classifiers not only from multiple label data sets but also from single label ones.
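The recursive part of this learning style can be pictured with the short Python sketch below; it only illustrates the iterate-remove-merge idea described above, and the two helper functions passed in (a single-label rule learner and a rule-coverage test) are assumptions rather than MMAC's actual procedures.

def learn_multi_label(training_data, learn_single_label_rules, covers):
    # Learn a single-label rule set, drop the cases it covers, repeat on the
    # remainder, and finally merge the per-iteration rule sets by antecedent.
    iterations = []
    remaining = list(training_data)
    while remaining:
        rules = learn_single_label_rules(remaining)      # rules: list of (antecedent, label)
        if not rules:
            break
        iterations.append(rules)
        remaining = [case for case in remaining
                     if not any(covers(rule, case) for rule in rules)]
    merged = {}
    for rules in iterations:
        for antecedent, label in rules:                  # antecedent must be hashable
            merged.setdefault(antecedent, []).append(label)
    return merged                                        # antecedent -> ordered list of labels

The merged dictionary corresponds to the multi-label classifier: an antecedent that was learned with different classes in different iterations ends up with an ordered list of labels.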
A closely related multi-label AC algorithm called Ranked Multilabel Rule (RMR) (Thabtah and Cowling, 2007) solved the problem of rule overlapping and class ranking. This algorithm proposed a post-training heuristic that adjusts the position of the class labels in each of the rules inside the classifier. More details on this algorithm are given in Sec. 3.10. Another multiple label AC classification algorithm called Correlated Lazy Associative Classifier (CLAC) (Veloso et al., 2007) adopts lazy classification and delays the reasoning process until a
test case is given. Similar to MMAC and RMR, CLAC allows the presence of multiple classes in the consequent of the rules. Unlike binary classification, which does not consider the correlation among classes, CLAC takes into account class relationships and training data overlapping with these classes. The learning strategy used by CLAC assigns a weight consisting of the confidence and support value of the rule(s) having the class and belonging to the test case; the class labels applicable to the test case are then sorted by their weights. CLAC then gives the test case the class with the largest weight, considers the test case a new feature, and iteratively assigns new class labels to the test case until no more labels can be found. Furthermore, this learning method deals with the small disjuncts (rules that cover a limited number of training data), whose removal may reduce classification accuracy according to Veloso et al. (2007).

Empirical evaluations (Thabtah et al., 2004; Thabtah and Cowling, 2007; Veloso et al., 2011) revealed that multi-label AC algorithms construct additional useful rules that improve the classification accuracy of the resulting classifiers if compared with single label AC algorithms such as CBA, CPAR and MCAR.

3.10. Semi incremental and post training approaches

The majority of AC algorithms use the classification rules discovered from the training data set for constructing the classifier, which in turn is applied to predict the class of unseen test data. Though, in circumstances where there are limited input data or the input data gets frequently updated, there should be a mechanism that can take into consideration the new update(s) on the source data and the classified resources (rules and the test data). Moreover, the problem of correlation between the class and the training cases may result in generating rules associated with the wrong class since these rules overlap in the training cases. Precisely, the rule discovery strategies employed by current AC algorithms are normally adopted from association rule mining, in which a training case is allowed to be covered by multiple rules. So when a rule is derived, other potential lower ranked rules may still be able to cover the derived rule's training cases, and thus the classes associated with many rules learned during the learning step are not the most accurate ones.

Table 6. Partial training data adopted from Thabtah and Cowling (2007).

Row Id | Att1 | Att2 | Class
1      | a    | b    | c1
2      | a    | b    | c1
3      | a    | b    | c1
4      | e    | b    | c1
5      | d    | b    | c1
6      | —    | b    | c2
7      | —    | b    | c2
8      | —    | b    | c2
9      | e    | f    | c3
.      | —    | —    | —

Consider Table 6, which contains two attributes and the class attribute. Assume that r1: a ∧ b → c1 and r2: b → c1 are generated from Table 6, and r1 has a higher rank than r2. In current AC algorithms such as CBA, when r1 is generated, its training cases will be deleted, i.e. rows (1, 2, 3). The deletion of r1's training cases impacts other candidate rules that share these cases, such as r2. Therefore, after r1 is inserted into the classifier, class c1 of rule r2 would not be the largest frequency class anymore since some training cases of r2 are removed when r1 was produced. In fact, when r1 was derived, a new class of r2 becomes the largest frequency class, e.g. c2, because it has the largest representation among the remaining r2 rows in the training data set. This rule overlapping problem is called the "fittest class problem" (Thabtah and Cowling, 2007).

The RMR algorithm proposed a post training heuristic that adjusts the position (rank) of the class labels in the rules taking into consideration the rules overlapping in the training cases. This heuristic operates as follows: starting with the top ranked rule, it iterates over the training data set removing all training cases applicable to the rule. Then, the support and confidence of the lower ranked rules decrease since they share training examples with the selected rule. This may result in adjusting the class label position(s) in the lower ranked rules, and the largest frequency class for some of these rules may not be the fittest class any more. The process is repeated until all training data cases are removed or the algorithm has iterated over all rules. This post training process is similar to the covering approach in classification in that it allows a training case to be covered by just a single rule in the classifier, solving an important deficiency inherited from association rule mining in AC, namely that a training case may be covered by multiple rules.

Moreover, Wang et al. (2011) proposed an AC algorithm called Adapting Associative Classification (ADA) that constructs rules from both the input training data set as well as the classified resources such as the training data set, current classification rules and test cases. Meaning the classifier is amended on the fly after the classified resources reach a certain amount. The authors have used a co-training method (Mei et al., 2006) to accomplish the task
of updating the classifier by refining the newly discovered knowledge from the existing classification rules. The co-training method used in ADA has been adopted from the semi-supervised learning of pattern context, where the labelled training documents are exercised to figure out the class labels of the unlabelled cases. More details can be found in Mei et al. (2006). Overall, ADA can be considered a semi-incremental AC algorithm since only a few training cases or a user's set of frequent patterns (keywords) are necessary to build the classifier instead of the complete training cases. Then, the classified cases as well as the classification rules are employed to update the classifier by adding or removing rules.

An empirical study (Thabtah and Cowling, 2007) on multi-class and multi-label data sets from the UCI repository as well as scheduling data showed that removing the overlapping among the rules in the classifier by the RMR algorithm outperformed the MMAC algorithm with respect to classification accuracy. Moreover, limited experimentations on four data sets from the UCI data repository have been performed using the ADA, CBA, CMAR and C4.5 algorithms by Wang et al. (2011). The results showed similarity in the classification accuracy performance of the AC algorithms and superiority over the decision tree approach (C4.5).

3.11. Distributed MapReduce approach

MapReduce is an emerging model, yet not much research on simulating the performance of MapReduce clusters has been done. To the best of our knowledge, MRPerf (Wang et al., 2009) and Mumak (Apache JIRA, 2009) are the only simulators targeting the MapReduce framework. Recently, MapReduce has been adopted by many search enterprises such as Yahoo, Google and Amazon to enable building petabyte data centres comprising hundreds of thousands of nodes. These data centres are of low cost hardware and with a software infrastructure to allow parallel processing analysis of the stored data. The MapReduce model provides a software infrastructure to simplify writing applications that can access and process this massive data. However, the cluster setup to get optimum performance is not a trivial problem. It needs configuration of tens of setup parameters and dynamic job parameters which affect every task execution.

The MapReduce programming paradigm has recently been employed in data mining research because of its ability to perform parallel processing, particularly during the learning step and when the input data size is massive. For instance, Zhao et al. (2009) have implemented the known clustering algorithm K-means utilising the MapReduce paradigm. The results showed that the MapReduce implementation of K-means reduces the runtime of the algorithm by 30%. Dhok and Varma (2010) developed a scheduler algorithm that uses pattern classification for task assignment in the MapReduce framework. The developed scheduling algorithm was able to cut down the response time of some workloads by a considerable amount as compared to the original scheduler. The decision tree C4.5 classification data mining algorithm (Wu et al., 2009) was implemented using the MapReduce framework to enforce parallel and distributed classification. After experimentations, the results revealed that an increase in the number of nodes positively impacts the classification modelling.

In AC mining, a new algorithm called MapReduce Multiclass Classification based Association Rule (MRMCAR), which is based on a recent work (Thabtah and Hammoud, 2013), can be seen as a generalised version of the MCAR algorithm that is distributable on the MapReduce framework. It consists of four main steps, where each step may demand one or more MapReduce jobs:

• Step One (Initialising): Representing the input data set in a suitable format for the MapReduce framework, i.e. ItemId = (ColumnId) RowId.
• Step Two (Rule Discovery): This step includes finding frequent ruleitems, rule extraction and rule pruning.
• Step Three (Constructing the classification model): This step involves selecting high confidence and representative rules from the set of candidate rules extracted in Step (2) to represent the classification model.
• Step Four (Predicting test cases): In this step, the MRMCAR algorithm utilises a hybrid method consisting of single and multiple rules prediction methods.

In the learning phase, MRMCAR maps each row in the data set to a unique integer that represents the number of the line where the row occurs in the data set. Every frequent item id (ItemId) consists of two parts: column ids and RowId, i.e. ItemId = (column ids) RowId. Once the original data is represented in ItemId format, all intermediate data generated in the algorithm keeps the same representation. This makes the iterative process of finding frequent ruleitems simpler throughout the algorithm.

Frequent ruleitem discovery in MRMCAR works by repeating the transformation of the input data between the Line-space and the Frequent-item space until all frequent ruleitems are discovered. Data transformation from a Line-space to a Frequent-space is performed using the MapReduce methods "ToFrequent.Mapper" and "ToFrequent.Reducer". The input for the "ToFrequent.Mapper" method is <line: label, list of ItemId>, and the
output is <ItemId, (Line: label)>, which then gets inputted to the "ToFrequent.Reducer", and this method outputs <ItemId, FrequentItem>. On the other hand, transforming the data from a Frequent-space into a Line-space is performed using the methods "ToLine.Mapper" and "ToLine.Reducer". The "ToLine.Mapper" gets <ItemId, FrequentItem> as an input and produces <LineNumber:Label, ItemId> as an output, which in turn gets inputted to the "ToLine.Reducer"; this method collects the ItemId entries for a certain line and outputs <line: label, list of ItemId> (Line-space).

To describe the learning style of MRMCAR, we revisit Table 5 and assume that the last attribute is the class attribute and the minsupp is 2 (support count). The MRMCAR algorithm initially transforms the data into Line-space as shown in Table 5.1, and applies the "ToFrequent.Mapper" and "ToFrequent.Reducer" methods to map the input data to entries in the Frequent-space. In this way, and for each item in the Line-space, the "ToFrequent.Mapper" method is invoked to emit a list of <ItemId, (Line, Label)>:

(line 0) <0:0, (0)0, (1)0, (2)0> => ToFrequentItem.Mapper => <(0)0, (0:0)>, <(1)0, (0:0)>, <(2)0, (0:0)>
(line 1) <1:0, (0)1, (1)0, (2)0> => ToFrequentItem.Mapper => <(0)1, (1:0)>, <(1)0, (1:0)>, <(2)0, (1:0)>
... etc.

Then, the output results from the Mapper are sorted and introduced to the Reducer grouped by the key value. For instance, for attribute values (keywords) "a" and "c", the data offered to the Reducer are as follows:

<(0)0, 0:0>, <(0)0, 4:2>, <(0)0, 5:3>, <(0)0, 8:3> => ToFrequentItem.Reducer => <(0)0, [0:0, 4:2, 5:3, 8:3]>
......... => ToFrequentItem.Reducer => <(0)1, [1:0, 2:2, 3:3, 6:3, 7:3]>

For these particular attribute values, it is obvious that (0)0 and (0)1 are frequent ruleitems with support values 2/9 and 3/9, respectively. It should be noted that in the rule discovery step, while determining the frequent ruleitems, MRMCAR considers the attribute value occurrence with its largest frequency class, and for this reason (0)0 and (0)1 are marked as frequent with class label "3" since they appear in the training data set with it more than with the rest of the class labels (label "3" corresponds to "R" in the original data set). This is the preliminary label choice attached to this ruleitem. Now we have a frequent item set of size 1 (1-ruleitems):

(0)0 { sup=2, conf=0.500, 0:[0] 2:[4] 3:[5, 8] }
(0)1 { sup=3, conf=0.600, 0:[1] 2:[2] 3:[3, 6, 7] }
(1)0 { sup=2, conf=0.400, 0:[0, 1] 2:[4] 3:[7, 8] }
(1)2 { sup=3, conf=0.750, 2:[2] 3:[3, 5, 6] }
(2)0 { sup=2, conf=0.500, 0:[0, 1] 2:[2] 3:[3] }
(2)4 { sup=3, conf=0.750, 2:[4] 3:[5, 6, 8] }
As shown previously, in each frequent ruleitem, lines of the same class value are grouped together. Once the frequent ruleitems of size 1 are determined, only their occurrences are transformed into the Line-space data format using the MapReduce methods "ToLineItem.Mapper" and "ToLineItem.Reducer". So for the frequent ruleitems <"a", r> and <"b", r>, their Line-space representations are:

(0)0 { sup=2, conf=0.500, 0:[0] 2:[4] 3:[5, 8] } => ToLineMapper => <0:0, (0)0>, <4:2, (0)0>, <5:3, (0)0>, <8:3, (0)0>
(0)1 { sup=3, conf=0.600, 0:[1] 2:[2] 3:[3, 6, 7] } => ToLineMapper => <1:0, (0)1>, <2:2, (0)1>, <3:3, (0)1>, <6:3, (0)1>, <7:3, (0)1>

The sample outputs are sorted and grouped by the line number and then offered to the "ToLine.Reducer", which only accumulates the ItemIds and outputs them to Line-space. So the lines would be similar to the previous line set of Table 5.1, excluding any attribute value which was discarded during the generation of frequent ruleitems. If no ItemIds were emitted for a certain line, then this line is dropped from the Line-space. In the next iteration, the algorithm simply finds frequent ruleitems of size N by appending frequent ruleitems of size N − 1. Particularly, for each two disjoint ItemIds in a single line within the Line-space, the algorithm checks the possibility of joining them into one ItemId.
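Outside an actual Hadoop cluster, the Line-space to Frequent-space step described above can be imitated with a small Python sketch; the function names, the in-memory grouping and the support counting below are illustrative assumptions and do not reproduce MRMCAR's code.

from collections import defaultdict

def to_frequent_mapper(line_no, label, item_ids):
    # <line: label, list of ItemId>  ->  list of <ItemId, (line, label)>
    return [(item_id, (line_no, label)) for item_id in item_ids]

def to_frequent_reducer(item_id, occurrences, minsupp):
    # Group the (line, label) pairs of one ItemId by class label and keep the
    # ruleitem only if its largest-frequency class reaches the support count.
    by_label = defaultdict(list)
    for line_no, label in occurrences:
        by_label[label].append(line_no)
    best_label, lines = max(by_label.items(), key=lambda kv: len(kv[1]))
    if len(lines) < minsupp:
        return None
    return item_id, {"label": best_label, "support": len(lines), "lines": dict(by_label)}

# Toy run over the first two lines of the example above (minsupp lowered to 1).
mapped = (to_frequent_mapper(0, 0, ["(0)0", "(1)0", "(2)0"])
          + to_frequent_mapper(1, 0, ["(0)1", "(1)0", "(2)0"]))
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)
for item_id, occurrences in grouped.items():
    print(to_frequent_reducer(item_id, occurrences, minsupp=1))

In a real MapReduce job the shuffle phase performs the grouping done here by the small dictionary, which is the point of expressing the transformation as a mapper/reducer pair.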
3.12. Genetic algorithm (GA) approach

When the training data set contains numerical attributes or the application domain produces continuous data type attributes, AC algorithms tend to preprocess the input data using discretisation techniques in order to map the continuous attribute to a set of finite possible values. In addition, most of the current AC algorithms are unable to discover the correlations among the numerical attributes in application data like stock trading or any relevant data with the same features. To be more specific, the data in the stock trading application contain continuous attributes such as quantities and prices for stocks sold over time, and several technical indicators can be discovered from the data to be later used by domain experts in order to discover trading signals (Chien and Chen, 2010). In fact, the technical indicators can be used in the rule antecedent, and selling or buying are the class labels of the rule.

An AC algorithm called GA-ACR that adopts a GA search strategy to build classifiers was proposed by Chien and Chen (2010). GA is a common searching strategy in Artificial Intelligence (AI) based on Darwinian natural selection and mutation in biological reproduction. Normally, a GA method starts with an initial population of objects, and it tests the fitness of the objects in the population until a stopping criterion is met. During testing, it performs selection, crossover and mutation operations on objects. In the GA algorithm, the input is different continuous attributes, some of which are technical indicators (the relevant difference between two items). Then the algorithm discovers the relation sets among the items in the form of a relation <item, operator, item> in which there are three different kinds of items (constant, technical indicator, attribute) and the operators are restricted to (<, >). A conjunction of the relation sets is the rule antecedent.

The GA algorithm cuts down the search space by providing a relation pruning method that indicates which pairs of items can be compared for which attributes in a relation. During the rule discovery, a rule is encoded in a multi-level structure and represented as a chromosome. The first level contains the number of items encoded, and the value of the gene corresponds to the relation type of the item. The algorithm produces the genes for the first level and then the second level and considers discarding irrelevant relations. It should be noted that only the first level genes are applied in the crossover to prevent producing useless rules, though mutation is applied to genes in the first and second levels. All rules produced must pass the minsupp and minconf thresholds, and are then sorted according to the CBA (Liu et al., 1998) sorting procedure.

Limited experimentations on a stock data collection gathered from ten different companies have been carried out with reference to accuracy. The results pointed out that the GA-ACR algorithm outperforms a simple data distribution algorithm. No comparisons of the GA algorithm and other AC algorithms are conducted in order to
generalise the performance of the algorithm. Table 7 displays the general learning methodologies in AC mining.

Table 7. Summary of learning approaches in AC.

Learning methodology | Common AC algorithms
CBA (Apriori candidate generation) | CAAR, CBA, negative rules (ARC-AC), CARGBA, CAN, etc.
CMAR (FP-growth approach) | CMAR, L3, L3G
CPAR (Greedy) | CPAR (Yin and Han, 2003)
Closed itemset (Charm) | ACCF
Emerging patterns | ICEP, ADT
Multiple support | CCS, CBA(2)
MCAR (TId list intersection approach) | MCAR, MAC, CACA
Multiple labels | MMAC, RMR
Test data training | Calibrated AC, ADA
Distributed MapReduce | MR-ARM, MR-MCAR
Genetic algorithm | GA-ACR

4. Rule Ranking Procedures

Classification algorithms are able to generalise their performance on test data cases by inductive biases since they have implicit assumptions of favouring one rule over another. For instance, decision tree algorithms like C4.5 have a clear bias in their searching for the best attribute decision node, which is the attribute selection method based on IG. Moreover, these algorithms prefer smaller effective sub-trees over complex ones by using backward pruning. Probabilistic classification algorithms like Naïve Bayes (Duda and Hart, 1973) compute the probability for each class in the training data set using the joint probabilities of attribute values for a data case. An inductive bias in the Naïve Bayes algorithm stands for the assumption that the conditional probability of a data case given a class is independent of the probabilities of other data cases given the same class (Liu et al., 2002).

In AC, an algorithm uses rule ranking to distinguish rules in that it gives high confidence and support rules a higher rank. This is crucial since rules with a higher rank are usually tested first during the prediction of test cases, and the resulting classifier accuracy depends heavily on them. There are several different criteria in AC when sorting rules. For instance, CBA based algorithms consider the rule's confidence and support as the main criteria for rule favouring, and MCAR adds on that the class distribution condition when two or more rules have similar confidence and support values. Further, Su et al. (2008) have employed IG in rule preference, in which a rule is said to be informative if it has a gain above a certain threshold. In this section, we highlight different rule sorting procedures in AC.

4.1. Confidence, support and rule cardinality procedure

The first rule sorting procedure in AC was introduced by Liu et al. (1998) and is based on the rule's confidence, support and the number of attributes in the rule's antecedent. This procedure is displayed in Fig. 1. Using this rule preference procedure has derived good quality classifiers with respect to accuracy according to some empirical studies (Liu et al., 2001), though the number of rules with similar confidence and support values is still massive. Consider for example two data sets ("Auto" and "Glass") from the UCI data repository. Assume that the minsupp and minconf are set to 2% and 40%, respectively. If we apply a common AC algorithm such as MCAR, the numbers of discovered rules with identical confidence from the "Auto" and "Glass" data sets are 2660 and 759, respectively, without rule pruning. When we apply the rule's confidence and support as tie breaking conditions, we end up with 2492 and 624 rules with similar confidence and support values. This example, even if limited, shows clear and direct evidence that there is a great number of rules that have common confidence and support, and thus additional tie breaking conditions are needed to minimise the chance of arbitrary rule choices.

There are a number of AC algorithms that employ the rule sorting procedure shown in Fig. 1, including MAC, CBA(2), CARGBA, ACCF, CAAR and others. In 2005, the MCAR algorithm added the rule's class distribution in the training data set as a tie breaking condition beside the rule confidence, support and antecedent length. In particular, if two rules have identical confidence, support and antecedent length, MCAR favours the rule which is associated with the class that has the larger frequency in the training data set. On the other hand, the MAC algorithm proposed a rule ranking method that favours rules associated with low frequency classes since these classes have a small number of rules. Experimental tests (Abdelhamid et al., 2012b; Thabtah and Cowling, 2007) on different data sets from the UCI data repository showed that this rule ranking procedure positively impacts the classifiers produced in regards to accuracy and reduces random rule selection during ranking.

4.2. Lazy ranking procedure

Lazy AC algorithms such as L3 often prefer rules that hold a large number of attribute values in their antecedent.
Given two rules, R1 and R2, R1 precedes R2 if:
1. The confidence of R1 is larger than that of R2.
2. The confidences of R1 and R2 are identical, but the support of R1 is larger than that of R2.
3. The confidence and support of R1 and R2 are identical, but R1 contains fewer attributes in its antecedent than R2.

Fig. 1. CBA rule sorting procedure.

Given two rules, R1 and R2, R1 precedes R2 if:
1. The confidence of R1 is larger than that of R2.
2. The confidences of R1 and R2 are identical, but the support of R1 is larger than that of R2.
3. The confidence and support values of R1 and R2 are identical, but R1 contains more attributes in its antecedent than R2.

Fig. 2. L3 rule ranking method.
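Both orderings amount to sorting on a composite key; the Python sketch below is only an illustration of that reading (the rule representation is assumed), with the single difference between the CBA and L3 procedures visible in the sign of the antecedent-length term.

def cba_sort(rules):
    # Higher confidence first, then higher support, then the shorter antecedent (Fig. 1).
    return sorted(rules, key=lambda r: (-r["confidence"], -r["support"], len(r["antecedent"])))

def l3_sort(rules):
    # Identical, except that more specific rules (longer antecedents) come first (Fig. 2).
    return sorted(rules, key=lambda r: (-r["confidence"], -r["support"], -len(r["antecedent"])))

rules = [{"confidence": 0.8, "support": 0.2, "antecedent": ("a",)},
         {"confidence": 0.8, "support": 0.2, "antecedent": ("a", "b")}]
print(cba_sort(rules)[0]["antecedent"])   # ('a',)     the general rule wins the tie
print(l3_sort(rules)[0]["antecedent"])    # ('a', 'b') the specific rule wins the tie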
These kinds of rules are named specific rules. In fact, lazy algorithms try to hold almost all of the discovered knowledge, even if redundancy exists among the rules, aiming to maximise the predictive power of the final classifiers. Unlike the CBA rule ranking procedure, the L3 ranking procedure (Fig. 2) mainly prefers specific rules over general ones in order to give the specific rules a higher chance in the prediction step, since they are often more accurate than general rules. In the prediction phase, when the specific rules are unable to assign a class to the test case, general rules with a smaller number of attributes in their antecedent are considered.

4.3. Information gain

IG is a mathematical measure mainly used in decision trees to decide which attribute goes into a root; it represents the expected amount of information required to determine which class should be given to a new unclassified case. In other words, it measures how well a given attribute divides the training data cases into classes. The attribute with the highest information is chosen. In order to define IG, one first has to measure the amount of information in an attribute using Entropy. Given a set of training data cases D of c classes,

Entropy(D) = − Σc Pc log2(Pc),    (4)

where Pc is the probability that a case in D belongs to class c. The IG of a set of data cases on attribute A is defined as

Gain(D, A) = Entropy(D) − Σa (|Da|/|D|) × Entropy(Da),    (5)

where the sum is over each value a of all possible values of attribute A, Da = subset of D for which attribute A has value a, |Da| = number of data cases in Da, and |D| = number of data cases in D.

Decision tree algorithms such as C4.5 and C5 (Quinlan, 1998) compute IG to assess which attribute goes into a decision node. The algorithm selects a root attribute from the ones available in the training data set. As mentioned earlier, the choice is very important since it affects the distribution of the available classes, and thus it is vital to select the best candidate as a root. C4.5 makes the selection of the root based on the most informative attribute, and the process of selecting an attribute is repeated recursively at the so-called child nodes of the root, excluding the attributes that have been chosen before, until the remaining training data cases cannot be split any more. At that point, a decision tree is derived where each node corresponds to an attribute and each arc to a possible value of that attribute. Each path from the root node to any given leaf in the tree corresponds to a rule.

An AC method which utilises IG for rule sorting was disseminated by Su et al. (2008). Specifically, the IG of the rule r: Cond → C is defined as Gain(r) = GD − Gcond − G¬cond, where GD represents the IG of the training data set D and is defined as

GD = − Σ(i=1..m) (|Ci|/|D|) log(|Ci|/|D|),    (6)

where |Ci| represents the number of data cases which belong to class Ci.

The IG of the rule antecedent (Gcond) is defined as

Gcond = (N1/|D|) × ( −(N11/N1) log(N11/N1) − (N12/N1) log(N12/N1) ),    (7)

where

N1 = |D| × Support(R) / Confidence(R),
N11 = |D| × Support(R)  and  N12 = N1 − N11.    (8)

Finally, the training cases that do not match the rule antecedent are also considered as

G¬cond = (N2/|D|) × ( − Σ(i=1..m) (|Ci|/N2) log(|Ci|/N2) ),  where N2 = |D| − N1.    (9)

So the rule r is said to be informative if r has support and confidence greater than the minsupp and minconf as well as Gain(r) > 0. After the rules are discovered, the ranking procedure will be invoked, where rules with larger gain are placed at a higher rank. In cases when two or more rules have similar gain, the algorithm evaluates the confidence, support and rule antecedent length similar to the CBA rule preference procedure.
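A rough Python sketch of this gain-based ranking is given below; it follows the notation of Eqs. (6)-(9) but is only an illustration under the assumption that all quantities are available as counts, and it is not Su et al.'s implementation.

import math

def entropy_term(counts, total):
    # -sum (c/total) log(c/total), skipping empty cells
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def rule_gain(class_counts, support_count, confidence):
    # Gain(r) = G_D - G_cond - G_notcond, with N1, N11, N12 and N2 as in Eqs. (6)-(9).
    d = sum(class_counts)
    g_d = entropy_term(class_counts, d)
    n1 = support_count / confidence            # cases matching the rule antecedent
    n11 = support_count                        # ... that also carry the rule's class
    n12 = n1 - n11
    g_cond = (n1 / d) * entropy_term([n11, n12], n1)
    n2 = d - n1                                # cases not matching the antecedent
    g_notcond = (n2 / d) * entropy_term(class_counts, n2) if n2 > 0 else 0.0
    return g_d - g_cond - g_notcond

rules = [{"support": 20, "confidence": 0.8, "classes": [60, 40]},
         {"support": 15, "confidence": 0.5, "classes": [60, 40]}]
rules.sort(key=lambda r: rule_gain(r["classes"], r["support"], r["confidence"]), reverse=True)
print([round(rule_gain(r["classes"], r["support"], r["confidence"]), 3) for r in rules])

Rules are then simply ordered by the resulting gain, with confidence, support and antecedent length used only to break ties, as described above.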
Lastly, it is worth mentioning that Lan et al. (2006) have utilised the dilated Chi-square measure for rule sorting instead of the confidence and support thresholds. So, after rules are found, the learning algorithm evaluates the dilated Chi-square for each rule, and places the rules with high values first.

4.4. Discussion on rule ranking

Rule sorting is considered a pre-processing phase in AC mining which impacts (1) the classifier building process and (2) test case prediction. As a matter of fact, without rule sorting, the algorithm will not be able to easily choose the rules that can be employed in the prediction step. Rule preference has been defined differently by AC algorithms. CBA and its successors considered confidence and support the main criteria for rule preference, and MCAR adds upon CBA the class distribution of the rules if two or more rules have identical confidence and support. On the other hand, unlike CBA and MCAR, the L3 algorithm prefers specific rules over general ones since a specific rule contains multiple general rules. Lastly, recent algorithms consider information theory based measures such as IG as the base for rule preference. An experimental study (Thabtah et al., 2005) revealed that using confidence, support and rule antecedent cardinality in rule ranking is an effective approach. Though, recent studies (Abdelhamid et al., 2012b) and the example discussed in Sec. 4.1 showed that imposing more tie breaking conditions besides confidence and support may reduce the chance of randomisation in ranking, which consequently limits the use of the default class later on in the prediction step. The employment of mathematical measures such as Entropy and IG seems to be promising towards improving the process of sorting the rules. Finally, approaches that favour specific rules may sometimes gain a slight improvement in accuracy; however, they suffer from holding a large number of rules, many of which are never used, and thus they consume memory as well as training time. Table 8 depicts the general ranking models used in AC mining.

Table 8. Ranking models of AC algorithms.

Ranking models | AC algorithms
Support, confidence, rules generated first | CBA, CBA(2), Negative rules, ARC-AC, CARGBA, ACCF, CAAR, CMAR, etc.
Support, confidence, rule cardinality (longest rule) | L3, L3G
Support, confidence, rule cardinality (shortest rule), rule class distribution (dominant class) | MMAC, MCAR
Support, confidence, rule cardinality (shortest rule), rule class distribution (minority class) | MAC
Information gain | AC-IG

5. Building the Classifier and Rule Pruning

Once the complete set of rules is found in the training phase and then ranked, the AC algorithm has to decide the way it should choose a subset of highly effective rules to represent the classifier. There are different ways used in AC to build the classifier; for instance, CBA utilises the database coverage rule pruning where rules that correctly cover a certain number of training cases are marked as accurate rules and the remaining rules get discarded. The L3 and L3G algorithms employ lazy pruning that stores primary and secondary rules in the classifier. Moreover, Thabtah et al. (2010) proposed different rule pruning methods based on exact rule matching and partial rule matching of the rule body and the training case. Lastly, a pruning method that does not consider the similarity of the evaluated rule class and the training case class was developed by Abdelhamid et al. (2012a). This section discusses the different procedures applied in selecting the
classifier rules in AC mining. Furthermore, different mathematical rule pruning methods including Pessimistic Error Estimation, Chi-Square testing and others are surveyed in this section.

5.1. Full and partial match rule pruning

Definition 10. A rule is said to fully match a training case if the attribute values in the rule body are all contained in the training case.

Definition 11. A rule is said to partially match a training case if at least one of the attribute values in the rule body is contained in the training case.

Different rule pruning methods are discussed in this section, primarily those that consider partial or full matching between the selected rule and the training case. In particular, database coverage (Liu et al., 1998), High Precedence (HP) and High Precedence Classify Correctly (HCP) (Abumansour et al., 2010) are surveyed. The database coverage method considers a rule significant if its body fully matches the training case attribute values and the rule class is similar to that of the training case, whereas FMP is similar to the database coverage but abandons the class similarity condition. The HCP considers a rule significant if its body partially matches any of the training cases and the rule class is identical to that of the training case. Finally, the HP signifies a rule if its body partially matches any of the training cases without checking the class value.

5.1.1. Database coverage

The database coverage is the first pruning method in AC that has been applied, by CBA, to select the classifier. This method is simple and effective, and it evaluates the complete set of discovered rules against the training data set aiming to keep only highly effective and accurate rules. Figure 3 depicts the database coverage method in which, for each rule starting with the highest ranked rule, all training cases covered by the rule are marked for deletion and the rule gets inserted into the classifier. In cases where a rule cannot cover a training case (the rule body does not match any training case attribute values), the rule is discarded.

The database coverage method terminates when the training data set becomes empty or when there are no more rules to be evaluated. In that case, the remaining uncovered training cases are used to generate the default class rule, which represents the largest frequency class in the remaining unclassified cases in the training data set. It should be noted that the default class rule is fired during the prediction step in cases when there is no classifier rule applicable to the test case.

Input: The complete set of discovered rules R sorted, and the training data set D
1  For each rule ri in R do
2    Mark all applicable cases in D that match ri's body
3    If ri correctly classifies a case in D
4      Insert ri into the classifier
5      Discard all cases in D covered by ri
6    end if
7    If ri covers no cases in D
8      Delete ri
9    end if
10 end
11 If D is not empty
12   Generate a default rule for the largest frequency class in D
13   Mark the least error rule in R as a cutoff rule
14 end if

Fig. 3. The database coverage method.
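A compact Python rendering of the same idea is shown below; it is a simplified sketch only (the rule and case representations are assumed, and the cutoff-rule bookkeeping of line 13 in Fig. 3 is omitted).

def database_coverage(sorted_rules, training_data):
    # sorted_rules: list of (antecedent_set, label); training_data: list of
    # {"items": set_of_attribute_values, "class": label} dictionaries.
    classifier, remaining = [], list(training_data)
    for antecedent, label in sorted_rules:
        covered = [case for case in remaining if antecedent <= case["items"]]
        if any(case["class"] == label for case in covered):
            classifier.append((antecedent, label))
            remaining = [case for case in remaining if case not in covered]
        if not remaining:
            break
    default_class = None
    if remaining:
        # default rule: the largest frequency class among the uncovered cases
        labels = [case["class"] for case in remaining]
        default_class = max(set(labels), key=labels.count)
    return classifier, default_class

The same loop structure carries over to the HCP and HP variants discussed next; only the coverage test and the class-similarity condition change.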
A closely related method to the database coverage was proposed by Abdelhamid et al. (2012a). In this method, a rule is inserted into the classifier if its body fully matches the training case without having an identical class to the training case class. Once a rule is evaluated, all training cases covered by it are removed and the process continues until all rules are evaluated or the training data set becomes empty. After the proposal of CBA, several AC algorithms have successfully employed database coverage like methods in building the classifier, i.e. CBA(2), ARC-BC, CAAR, ACN and ACCF.

5.1.2. High classify pruning method (HCP)

Many rules found in the training step cannot be used to forecast test cases, and thus some discovered rules are deleted. The High classify pruning method (HCP) (Abumansour et al., 2010) (Fig. 4) goes over the complete set of rules after ranking and applies each rule against the training data set. If the rule covers (partially matches) a training case and has a common class to that of the training case, it will be inserted into the classifier and all training cases covered by the rule are removed. The method repeats the same process for each remaining rule until the training data set becomes empty, and it considers the rules within the classifier during the prediction step.

The distinct difference between this method and the database coverage is that a rule is added into the classifier if it partially covers at least one training case, whereas in
the database coverage, a rule body must fully match the training case in order to be part of the classifier.

Input: Given a set of generated rules R, and training data set T
Output: classifier (Cl)
1  R' = sort(R);
2  For each rule ri in R' Do
3    Find all applicable training cases in T that partially match ri's condition
4    If ri correctly classifies a training case in T
5      Insert the rule at the end of Cl
6      Remove all training cases in T covered by ri
7    end if
8    If ri cannot correctly cover any training case in T
9      Remove ri from R
10   end if
11 end for

Fig. 4. HCP rule evaluation method.

5.1.3. HP method

The HP method (Abumansour et al., 2010) (Fig. 5) allows a rule to be inserted into the classifier if its body partially matches the training case regardless of the class similarity between the rule class and that of the training case. So, once rules are extracted and ranked, this method iterates over the rules starting with the highest sorted one; all training cases covered by the selected rule are discarded and the rule is inserted into the classifier. Any rule that does not cover a training case is removed. The loop terminates when either the training data set is empty or all rules are tested.

Input: Given a set of generated rules R, and training data set T
Output: classifier (Cl)
1  R' = sort(R);
2  For each rule ri in R' Do
3    Find all applicable training cases in T that partially match ri's condition
4    Insert the rule at the end of Cl
5    Remove all training cases in T covered by ri
6    If ri cannot correctly cover any training case in T
7      Remove ri from R
8    end if
9  end for

Fig. 5. HP rule evaluation method.

The difference between the HP and HCP methods is that in the HP, a rule gets inserted into the classifier if it partially covers at least one training case regardless of whether it classifies that case correctly or not. On the other hand, in the HCP, a rule must classify a training case correctly in order to be considered in the classifier.

5.1.4. Lazy methods

Lazy AC scholars (Baralis et al., 2004) believed that pruning should be limited to rules that incorrectly cover the training cases during building the classifier. This is since these rules are the only ones that lead to misclassification on the training data set, and therefore they are the only ones that should be discarded. Unlike database coverage based methods, which prune any rule that does not cover a training case, lazy AC algorithms store these rules in a compact set aiming to use them during the prediction step.

The lazy pruning occurs when the complete set of rules are discovered and ranked in descending order in which
longer rules (those with more attribute values) are favoured over general rules. For each rule, if the selected rule correctly covers a training case (has a common class to that of the training case), it will be inserted into the primary rule set, and all of its corresponding training cases will be deleted. Whereas, if a higher ranked rule correctly covers the currently selected rule's training case(s), the selected rule will be inserted into the secondary rule set. Lastly, if the selected rule does not correctly cover any training case, it will be removed. The process is repeated until all discovered rules are tested or the training data set becomes empty. At that time, the output of this lazy pruning will be two rule sets: a primary set which holds all rules that correctly cover a training case, and a secondary set which contains rules that have never been used during the pruning since some higher ranked rules have covered their training cases.

The main distinguishing difference between the database coverage and lazy pruning is that the secondary rule set, which is held in main memory by the lazy methods, is completely removed during building the classifier by the database coverage. In other words, the classifier resulting from CBA based algorithms which employ the database coverage pruning does not contain the secondary rule set of the lazy pruning, and thus it is often smaller in size. This is indeed an advantage especially in applications that necessitate a concise set of rules so the end user can easily control and maintain the classifier.

Empirical studies (Baralis et al., 2004) against a large number of UCI data sets revealed that using lazy algorithms such as L3 and L3G sometimes decreases the error rate more than CBA like algorithms. Though, the large classifiers derived by lazy algorithms and the main memory usage cost limit their use.

5.2. Long rules pruning

A rule filtering method that discards long rules (specific rules) that have confidence values larger than their subset (general rules) was proposed by Li et al. (2001). This rule pruning method eliminates rule redundancy since many of the discovered rules have common attribute values in their antecedents. As a result, the classifier may contain redundant rules and this becomes obvious particularly when the classifier size is large. The first algorithm that uses the long rules pruning was CMAR, in which when a rule is about to be inserted in the classifier, a test is issued to check whether the rule can be removed or any of the existing rules may be deleted. There are some AC methods that employ this type of pruning, including ARC-BC and negative rules.

5.3. Mathematical based pruning

5.3.1. Pessimistic error estimation

Pessimistic error estimation is mainly used in data mining within decision trees (Quinlan, 1993) in order to decide whether to replace a sub-tree with a leaf node or to keep the sub-tree unchanged. The method of replacing a sub-tree with a leaf is called sub-tree replacement, and the error is computed using the pessimistic measure on the training data set. To clarify, the probability of an error at a node v is

q(v) = (Nv − Nv,c + 0.5) / Nv,    (10)

where
Nv is the number of training cases at node v,
Nv,c is the number of training cases belonging to the largest frequency class at node v.

The error rate at a sub-tree T is

q(T) = Σ(l in leafs(T)) (Nl − Nl,c + 0.5) / Σ(l in leafs(T)) Nl.    (11)

The sub-tree T is pruned if q(v) ≤ q(T).

The pessimistic error estimation has been exploited successfully in decision tree algorithms including C4.5 and See5. In AC mining, the first algorithm which employed pessimistic error pruning is CBA. For a rule R, CBA removes one of the attribute values in its antecedent to make a new rule R', then it compares the estimated error of R' with that of R. If the expected error of R' is smaller than that of R, then the original rule R gets replaced with the new rule R'.
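The comparison in Eqs. (10) and (11) is a small amount of arithmetic; the fragment below is an illustrative Python sketch (the node and leaf statistics are made up) of how the two estimates are computed and compared.

def node_error(n_cases, n_majority):
    # q(v) = (Nv - Nv,c + 0.5) / Nv, Eq. (10)
    return (n_cases - n_majority + 0.5) / n_cases

def subtree_error(leaves):
    # q(T) over (Nl, Nl,c) pairs for the leaves of the sub-tree, Eq. (11)
    return sum(n - c + 0.5 for n, c in leaves) / sum(n for n, _ in leaves)

# A node with 20 cases, 17 of them in the majority class, and three leaves below it.
leaves = [(8, 7), (7, 6), (5, 4)]
q_v, q_t = node_error(20, 17), subtree_error(leaves)
print(q_v, q_t, "replace sub-tree" if q_v <= q_t else "keep sub-tree")

In the rule-pruning analogue used by CBA, the same comparison is made between the estimated error of a rule R and that of the shortened rule R'.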
5.3.2. Chi-square testing

The chi-square test (χ2) is normally applied to decide whether there is a significant difference between the observed frequencies and the expected frequencies in one or more categories. It is defined as a known discrete data hypothesis test in mathematics that assesses the relationship between two objects in order to decide whether they are correlated (Witten and Frank, 2002). The evaluation using χ2 for a group of objects to decide their independence or correlation is given as:

χ2 = Σ(i=1..n) (Oi − Ei)2 / Ei,    (12)

where
Oi is the observed frequencies,
Ei is the expected frequencies.
If the observed frequencies and the expected frequencies are remarkably different, the assumption that they are related is declined.

The first AC algorithm that employed a weighted version of χ2 is CMAR. It evaluates the correlation between the antecedent and the consequent of the rule and removes rules that are negatively correlated. A rule R: Antecedent → c is removed if the class c is not positively correlated with the antecedent. In other words, if the result of the correlation exceeds a certain threshold, this indicates a positive correlation and R will be kept. Otherwise, R will be deleted since negative correlation exists in R. To clarify, for R, let Support(c) denote the number of training cases associated with class c and Support(Antecedent) denote the number of training cases associated with R's antecedent. Also let |T| denote the size of the training data set. The weighted chi-square of R, denoted Max χ2, is defined as:

Max χ2 = ( min{Support(Antecedent), Support(c)} − Support(Antecedent) × Support(c) / |T| )2 × |T| × u,    (13)

where

u = 1 / (Support(Antecedent) × Support(c))
  + 1 / (Support(Antecedent) × (|T| − Support(c)))
  + 1 / ((|T| − Support(Antecedent)) × Support(c))
  + 1 / ((|T| − Support(Antecedent)) × (|T| − Support(c))).
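Expressed in code, Eq. (13) is a one-line computation once the two support counts and |T| are known; the Python fragment below is an illustrative sketch with made-up counts.

def max_chi_square(sup_antecedent, sup_class, t):
    # Upper bound of the chi-square between a rule's antecedent and its class, Eq. (13).
    e = sup_antecedent * sup_class / t
    u = (1 / (sup_antecedent * sup_class)
         + 1 / (sup_antecedent * (t - sup_class))
         + 1 / ((t - sup_antecedent) * sup_class)
         + 1 / ((t - sup_antecedent) * (t - sup_class)))
    return (min(sup_antecedent, sup_class) - e) ** 2 * t * u

# A rule whose antecedent covers 30 of 100 training cases, with 40 cases of its class.
print(max_chi_square(30, 40, 100))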
A recently developed AC algorithm called Statistical Associative Rule Classification (SARC) (Jabez, 2011) has employed chi-square in the rule pruning step while learning the rules, in which any potential rules that are negatively correlated according to chi-square are discarded. The test of rule significance in this algorithm is performed after the rule has already passed the confidence and support tests. This exhaustive search procedure cuts down the size of the derived classifier, according to a study by Jabez (2011), if compared with CBA on 8 UCI data sets. Table 9 shows the pruning methods used in AC.

Table 9. Summary of rule pruning models in AC.

Pruning method | AC algorithms
Database coverage | CBA, CBA(2), CAAR, Negative rules (ARC-AC), CARGBA, CAN, CMAR, etc.
Redundant rules | CMAR, CAEP
CPAR | CPAR
Chi-square | CMAR
Pessimistic error | CBA, CBA(2), ADT, SARC
Lazy pruning | L3, L3G
Partial matching | MAC, MMAC
HCP, HP | MCAR

6. Class Forecasting Methods

The last step in the life cycle of any classification data mining algorithm is to allocate the appropriate class to test cases. This step is called class prediction or forecasting. There are several different methods for class allocation in AC, some of which employ the highest ranked rule in the classifier (Liu et al., 1998; Thabtah and Cowling, 2007) and others which use multiple rules (Li et al., 2001; Thabtah et al., 2011; Abdelhamid et al., 2012a). In this section we discuss the different prediction methods employed by the current AC algorithms.

6.1. One rule class forecasting

The basic idea of the one rule prediction (Fig. 6) was introduced in the CBA algorithm. This method works as follows: once the classifier is constructed and the rules within it are sorted in descending manner according to the confidence and support thresholds, and a test case is about to
be forecasted, CBA iterates over the rules in the classifier and assigns the class associated with the highest sorted rule that matches the test case body to the test case. In cases when no rule matches the test case body, CBA takes on the default class and assigns it to the test case.

Input: Classifier (R), test data set (Ts), array Tr
Output: error rate Pe
Given a test data set (Ts), the classification process works as follows:
1  For each test case ts Do
2    For each rule r in the set of ranked rules R Do
3      Find all applicable rules that match ts body and store them in Tr
4      If Tr is not empty Do
5        If there exists a rule r that fully matches ts condition
6          assign r's class to ts
7        end if
8      else assign the default class to ts
9      end if
10     empty Tr
11   end
12 end
13 compute the total number of errors of Ts;

Fig. 6. CBA prediction method.

After the dissemination of the CBA algorithm, a number of other AC algorithms have employed its prediction method (Baralis et al., 2002; Thabtah et al., 2005; Tang and Liao, 2007; Li et al., 2008; Kundu et al., 2008 and Niu et al., 2009).

6.2. Predictive confidence forecasting

A rule's confidence is the main criterion for choosing the right classifier rule to use for test case prediction. However, Do et al. (2005) argued that rule confidence computed from the training data alone is not enough to discriminate among rules in the classifier. Therefore, there should be another criterion for rule selection in prediction beside the confidence value, such as the predictive confidence calculated from the test data set for each rule in the classifier. The predictive confidence represents the average prediction accuracy of the rule when forecasting test data cases. For instance, for a rule (R): ListOfItems → l, assume A is the set of test cases matching R's body and belonging to class label l, and B is the set of test cases matching only R's body. Now, when R is applied on the test data set, R will correctly predict the (A) test cases with a prediction accuracy of (A/B), which is simply the confidence value of (R) on the test data set. This is the definition of the predictive accuracy of the rule that has been implemented in a recent AC algorithm named AC-S (Do et al., 2005). This measure is employed to select the right rules for prediction instead of the confidence value computed from the training data set. Empirical experiments showed that the AC-S algorithm is very competitive to common AC algorithms like CBA and CMAR.

6.3. Group of rules class forecasting

The single-rule prediction methods described earlier work fine especially when there is just one rule applicable to the test case. However, in circumstances when more than one rule with close confidence values is applicable to the test case, the decisions of such methods are questionable since the selection of a single rule to make the class assignment is inappropriate. Thus, using the group of rules that match the test case for class prediction in these circumstances is more appropriate. In this subsection, the different multiple rules prediction methods are discussed.

6.3.1. Dominant class and highest confidence method(s)

Two closely related prediction methods that use multiple rules to forecast test cases were proposed by Thabtah et al. (2011). The first method is called "Dominant Class", which marks all rules in the classifier that are applicable to the test case, then divides them into groups according to class labels, and assigns the test case the class of the group which contains the largest number of rules, as shown in Fig. 7. In cases where no rule is applicable to the test case, the default class will be used.

The second prediction method is called "Highest Group Confidence", which works similar to the "Dominant Class" method in the way of marking and dividing the applicable rules into groups based on the classes. However, the "Highest Group Confidence" computes the average confidence value for each group and assigns the class of the highest average group confidence to the test case. In cases where no rule matches the test case, the default class will be fired.

Input: Classifier (R), test data set (Ts), array Tr
Output: error rate Pe
Given a test data set (Ts), the classification process works as follows:
1  For each test case ts Do
2    Assign = false
3    For each rule r in the set of ranked rules R Do
4      Find all applicable rules that match ts body and store them in Tr
5      If Tr is not empty Do
6        If there exists a rule r that matches any ts condition
7          countperclass += 1
8        end if
9      else assign the default class to ts and Assign = true
10     end if
11   end
12   If Assign = false then assign the dominant class count to ts
13   empty Tr
14 end
15 compute the total number of errors of Ts;

Fig. 7. Dominant class prediction method.
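Both group-based decisions reduce to a small amount of bookkeeping over the applicable rules; the following Python sketch (the rule representation and the mode names are assumptions for illustration) contrasts the two choices.

from collections import defaultdict

def predict_by_group(applicable_rules, mode="dominant"):
    # Group the rules applicable to a test case by class, then pick either the
    # largest group ("dominant") or the group with the highest average confidence.
    groups = defaultdict(list)
    for rule in applicable_rules:               # rule = {"class": ..., "confidence": ...}
        groups[rule["class"]].append(rule["confidence"])
    if not groups:
        return None                             # caller falls back to the default class
    if mode == "dominant":
        return max(groups, key=lambda c: len(groups[c]))
    return max(groups, key=lambda c: sum(groups[c]) / len(groups[c]))

rules = [{"class": "c1", "confidence": 0.90}, {"class": "c2", "confidence": 0.95},
         {"class": "c1", "confidence": 0.60}]
print(predict_by_group(rules, "dominant"))            # c1: the largest group
print(predict_by_group(rules, "highest-confidence"))  # c2: the best average confidence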
6.3.2. CPAR class forecasting method

The CPAR algorithm is the first AC technique that used Laplace Accuracy to assign the class to the test cases during prediction. Once all rules are found and ranked, and a test case (t) is about to be predicted, CPAR iterates over the rule set and marks all rules in the classifier that may cover t. If more than one rule is applicable to t,
CPAR divides them into groups according to the classes, and calculates the average expected accuracy for each group. Finally, it assigns t the class with the largest average expected accuracy value. The expected accuracy for each rule (R) is obtained as follows:

Laplace(R) = (pc(R) + 1) / (ptot(R) + p),    (14)

where
p is the number of classes in the training data set,
ptot(R) is the number of training cases matching R's antecedent,
pc(R) is the number of training cases covered by R that belong to class c.
by FLINDERS UNIVERSITY LIBRARY on 01/10/15. For personal use only.

Laplace accuracy has been successfully used by CPAR


J. Info. Know. Mgmt. 2014.13. Downloaded from www.worldscientific.com

algorithm to ensure that the largest rule(s) accuracy


contribute in class assignment for test cases, which tests (Li et al., 2001) showed that classi¯cation procedures
therefore positively a®ect the classi¯cation accuracy. that employ a group of correlated rules for prediction
Fitcar (Cerf et al., 2008) is another AC algorithm that slightly improve the prediction rate when compared to
employed the prediction procedure of CPAR which is other methods.
based on multiple rules. Empirical evaluation using dif-
ferent UCI data sets revealed that CPAR achieves slightly 6.4. Discussion on class forecasting
higher classi¯cation accuracy than CBA and decision methods
trees.
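As an illustration of Eq. (14) and the group-averaging step described above, the following sketch computes the Laplace expected accuracy of each applicable rule and assigns the class whose group of rules has the highest average value. It is a simplified illustration of the described procedure, not CPAR's published code; the rule counts are assumed to be supplied by the caller.

    from collections import defaultdict

    def laplace_accuracy(p_c, p_tot, num_classes):
        # Eq. (14): expected accuracy of a single rule.
        return (p_c + 1) / (p_tot + num_classes)

    def cpar_style_predict(applicable_rules, num_classes):
        # applicable_rules: list of (label, p_c, p_tot) triples for the rules
        # that cover the test case; the counts come from the training data.
        scores = defaultdict(list)
        for label, p_c, p_tot in applicable_rules:
            scores[label].append(laplace_accuracy(p_c, p_tot, num_classes))
        # Class whose group of rules has the largest average expected accuracy.
        return max(scores, key=lambda lbl: sum(scores[lbl]) / len(scores[lbl]))

    # Hypothetical example: two rules predict "yes" and one predicts "no"
    # in a two-class problem; the "yes" group has the higher average.
    rules = [("yes", 18, 20), ("yes", 9, 12), ("no", 7, 10)]
    print(cpar_style_predict(rules, num_classes=2))  # prints "yes"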
6.3.3. CMAR class forecasting method

The first AC algorithm that employed weighted chi-square (max χ²) is CMAR. It chooses all rules applicable to the test case and evaluates their correlations. The correlation measures the strength of the rules based on the support and class frequency in the training data set. The CMAR class assignment method works as follows: given a test case t and the ranked rules R in the classifier, the subset of rules Rc that may cover t is selected by the algorithm. If all rules in Rc have an identical class, then that class will be given to t. However, if the rules in Rc have different classes, CMAR divides them into groups based on the classes and computes the strength of each group. The strength of each group is computed using the support and the correlation (max χ²) between the rules in a group (Sec. 5.3.2 gives details on max χ²). Lastly, CMAR allocates the class of the group with the largest strength to t.
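A minimal sketch of this grouped decision is given below. It assumes a helper weighted_chi_square(rule) that returns the strength of a single rule as described in Sec. 5.3.2; the exact weighting and normalisation used by CMAR are not reproduced here, and summing the group strengths is a simplification for illustration.

    from collections import defaultdict

    def cmar_style_predict(applicable_rules, weighted_chi_square):
        # applicable_rules: list of (label, rule) pairs covering the test case.
        # weighted_chi_square: assumed callable returning the strength of a rule.
        labels = {label for label, _ in applicable_rules}
        if len(labels) == 1:
            # All covering rules agree: assign that class directly.
            return labels.pop()
        # Otherwise group the rules by class and aggregate the group strengths.
        strength = defaultdict(float)
        for label, rule in applicable_rules:
            strength[label] += weighted_chi_square(rule)
        return max(strength, key=strength.get)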
After the introduction of CMAR, a few AC algorithms have exploited its prediction method (Ye et al., 2008; Baralis et al., 2004). Furthermore, Antonie and Zaïane (2002) used a prediction method closely related to that of CMAR, where the class of the subset of rules in Rs with the dominant class gets assigned to the test case t. Experimental tests (Li et al., 2001) showed that classification procedures that employ a group of correlated rules for prediction slightly improve the prediction rate when compared to other methods.

6.4. Discussion on class forecasting methods

There is a definite advantage in using just one rule for predicting test cases, since only the highest-ranked applicable rule in the classifier is used for class allocation, which is a simple and efficient approach. Further, the measure used for choosing the rule for prediction represents a likelihood that a test data case belongs to the appropriate class (Thabtah and Cowling, 2007). However, utilising just a single rule for class assignment has been criticised, seeing that there could be multiple rules applicable to a test case with slightly different confidence values. Moreover, for data sets that are unbalanced, using just one rule may be unsuccessful since there will be very large numbers of rules for the majority class(es) and few or no rules for the minority class(es) (Li et al., 2001; Liu et al., 2003). Thus, some scholars (Antonie and Zaïane, 2004; Abdelhamid et al., 2012a) suggested using a group of rules for class assignment of test cases, mainly due to majority decisions and to overcome deficiencies associated with single rule prediction methods. Table 10 displays common class forecasting methods in AC.

Table 10. Summary of class forecasting methods in AC.

Method name                                               Common algorithms
One rule full matching with class similarity              CBA, CBA(2), ADT, CAAR, L3G, L3, etc.
Multiple rules label based on weighted chi-square         CMAR
Multiple label based on Laplace expected accuracy         CPAR
Aggregated rules scores                                   CAEP
One rule full matching without class similarity           MAC
Dominant factor multiple label                            Negative Rules ARC-BC, ARC-BC
Group of rules full matching without class similarity     Enhanced MAC
Highest group confidence                                  Modified MCAR

7. Future Work

7.1. Immune systems based AC

One of the effective learning approaches that originated from the Natural Immune System (NIS) and
has been successfully applied in optimization, online security and data mining is the Artificial Immune System (AIS). As a matter of fact, AIS has been utilised in the classification problem over the last decade and has produced competitive performance results in accuracy. Examples of known classification algorithms that are based on AIS are clonal selection and negative selection (Do et al., 2009). We believe that AIS can be used in AC especially to minimise the search space for rules by reducing the number of candidate rules. Hereunder, two attempts at using AIS within AC are outlined.

There have been some initial attempts to adapt the learning methodology of NIS, especially clonal selection, in the AC context, which have resulted in an algorithm named artificial immune system-associative classification (AIS-AC) (Do et al., 2009). The AIS-AC algorithm was proposed in 2005 and extended in 2009, and follows an evolutionary process that reduces the search space of the candidate rules by keeping just the highly predictive rules. This process is accomplished by extracting frequent 1-ruleitems after passing over the initial training data set, and generating the possible candidate ruleitems at iteration N from the results derived at iteration N − 1, and so forth. The minsupp and minconf are utilised as sharp lines to discriminate among ruleitems at each iteration. Further, two new parameters are introduced, named Clonal rate and Max generation. The clonal rate (defined below) denotes the rate at which items in the candidate rules at a given generation are extended, and is proportional to the rule confidence:

Clonal rate = (n × Clonal rate) / (∑_{i=1}^{n} conf(r_i)),   (15)

where n is the number of rules at the current iteration, and the Clonal rate appearing in the numerator is a predefined user parameter. Once the candidate rules are extracted, they are tested on the training data, keeping only those that cover one or more training examples. The algorithm terminates once the complete training data set is covered or the Max generation condition has been met (often set to 10). The candidate rules that have training data coverage are kept in the classifier. The AIS-AC algorithm applies the rules in the classifier on the test data similarly to the CBA prediction method.
prediction method. eral classes and therefore we can assign weights or class
Recently, another AIS based on AC called AC-CS was memberships in particular when classes overlap in the
proposed in (Elsayed et al., 2012). This algorithm follows training data. Thus, the decision maker can distinguish
the same track of the previously described AIS-AC and it easily to which the input data belongs to or can merge
uses the same strategies in deriving the rules and classi- multiple classes together to come up with new class label.
fying test data. One simple di®erence between AC-CS and Secondly, some of the rules in the classi¯er will be con-
AIS-AC is that AC-CS builds the candidate rules in gen- nected to set of classes and therefore calibration can assist
erations per class rather than at once and then merges in prioritising these classes (Ranking).

7.3. Non-confidence based learning

The key element which controls the number of rules produced in AC is the support threshold. If the support is set to a large value, normally the number of extracted rules is very limited, and many rules with high confidence will be missed. This may lead to discarding important knowledge that could be useful in the classification step. To overcome this problem, one has to set the support threshold to a very small value. However, this usually involves the generation of a massive number of classification rules, many of which are useless since they hold low support and confidence values. This large number of rules may cause severe problems such as overfitting.
Xu et al. (2004) argued that the rule confidence, which is the main criterion for selecting the classifier, could be misleading in some cases, especially since the rule with the largest confidence is chosen to predict the test case in the test data set. So, instead of computing the confidence from the training data set as most AC methods do, the test data should be considered when favouring rules during the prediction phase. Therefore, the authors proposed a measure of rule goodness called "predictive confidence" which is based on statistical information in the test data set (the frequencies of the test cases applicable to a rule). The new predictive confidence based AC approach is called AC-S. This approach requires calculating, for each rule (R), a "confidence decrease" = Conf_Training(R) − Conf_Test(R), in order to estimate the predictive confidence of each rule before predicting test cases.
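A rough sketch of this idea is given below. It is an assumed formulation for illustration only, since the exact AC-S computation is not reproduced in this review; the counts of matched and correctly predicted cases are placeholders supplied by the caller.

    def rule_confidence(matched, correct):
        # Confidence of a rule on a data set: correctly predicted / matched cases.
        return correct / matched if matched else 0.0

    def confidence_decrease(train_matched, train_correct, test_matched, test_correct):
        # Assumed reading of AC-S's measure: the drop in a rule's confidence when
        # it is evaluated on the test cases it applies to rather than on the
        # training data.
        return (rule_confidence(train_matched, train_correct)
                - rule_confidence(test_matched, test_correct))

    # Example: a rule with 0.90 confidence on training data that only achieves
    # 0.75 on the test cases it covers has a confidence decrease of about 0.15;
    # rules with smaller decreases would be preferred at prediction time.
    print(confidence_decrease(100, 90, 20, 15))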
The AC-S algorithm depends on several parameters that must be known at the time of prediction, for each test case, before the algorithm chooses the rule most applicable to the test case. Precisely, the support and confidence for each candidate rule must be computed from both the training and the testing data sets so that AC-S is able to estimate the predictive accuracy for each rule. This is indeed time consuming and can be a burden in circumstances where the training data set is highly correlated. Further, it is impractical to estimate the support and confidence for each rule in the testing data set in advance, since we do not know which rule will be used for prediction. Yet, we can utilise the test data during the prediction step to narrow down the candidate rules. This can be seen as a new research path for enhancing the current "predictive confidence" approach. A comparison between AC-S and other known AC algorithms such as CBA, CBA (2) and CMAR was conducted against some UCI data sets. The accuracy results showed that AC-S is competitive to CBA, though the CBA (2) and CMAR algorithms derived higher quality classifiers than AC-S.

8. Conclusions

Associative classification (AC) is an integration of association rule discovery and classification in data mining that has recently attracted several scholars since it derives highly accurate classifiers that contain simple chunks of knowledge. In this paper, we reviewed common approaches in the literature related to each step in AC mining, including data representation, learning the rules, rule ranking, building the classifier and predicting class labels for test cases, and critically compared the different methods in each step. For data representation, algorithms that employ a vertical layout or a semi-vertical one (distributed AC), such as MCAR, CACA and MAC, are more appropriate for rule learning than those which utilise horizontal data layouts like CBA (2) and LCA. This is because these algorithms avoid repeatedly scanning the original databases and employ efficient search methods based on TIDs intersections to figure out frequent ruleitems. Moreover, cutting down the number of candidate rules seems to be a necessity for the success and applicability of AC algorithms in real applications. Recent studies revealed new attempts to develop rule pruning methods, particularly pruning based on mathematical formulas like IG, besides database-like pruning such as "partial matching". These provide promising research directions to accomplish this task. Furthermore, calibrated AC like the CLAC algorithm prunes by minimising the search space for candidate rules when it comes to classifying test data. Finally, despite the computation time of class allocation procedures that are based on group-of-rules prediction, algorithms that employ this type of prediction, such as CMAR, CPAR and MAC, are more accurate than single rule based procedures (CBA, MCAR). It is the firm belief of the authors that, due to the rapid advances in hardware technology and storage like cloud services infrastructure, processing large amounts of data is no longer a huge setback, since services such as processors can be leased directly from cloud service providers. Thus, new AC research areas like distributed AC become feasible in this era.

In the near future, we intend to develop a new AC algorithm for structured and unstructured textual documents that generates not only single label classifiers but also multi-label ones.

References

Abdelhamid, N, A Ayesh and F Thabtah (2013). Associative classification mining for website phishing classification. In Proc. ICAI '2013, pp. 687–695, USA.
Abdelhamid, N, A Ayesh, F Thabtah, S Ahmadi and W Hadi (2012a). MAC: A multiclass associative classification algorithm. Journal of Information and Knowledge Management, 11(2), 1250011-1–1250011-10.
Abdelhamid, N, A Ayesh and F Thabtah (2012b). An experimental study of three different rule ranking formulas in associative classification mining. In Proc. 7th Int. Conf. for Internet Technology and Secured Transactions (ICITST-2012).
Abumansour, H, W Hadi, L McCluskey and F Thabtah (2010). Associative text categorisation rules pruning method. In Proc. Linguistic and Cognitive Approaches to Dialog Agents Symposium (LaCATODA-10), Rzepka, R (ed.), at the AISB 2010 convention, pp. 39–44. April 2010, UK.
Aburrous, M, MA Hossain, K Dahal and F Thabtah (2010). Intelligent phishing detection system for e-banking using fuzzy data mining. Expert Systems with Applications: An International Journal, 7913–7921.
Al-Maqaleh, B (2013). Discovering interesting association rules: A multi-objective genetic algorithm approach. International Journal of Applied Information Systems, 5(3), 47–52.
Apache JIRA (2009). Mumak Hadoop MapReduce Simulator, https://issues.apache.org/jira/browse/MAPREDUCE-728.
Agrawal, R and R Srikant (1994). Fast algorithms for mining association rules. In Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pp. 487–499.
Antonie, M and O Zaïane (2004). An associative classifier based on positive and negative rules. In Proc. 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pp. 64–69. Paris, France.
Antonie, M and O Zaïane (2002). Text document categorization by term association. In Proc. IEEE Int. Conf. Data Mining, pp. 19–26. Maebashi City, Japan.
Arunasalam, B and S Chawla (2006). CCCS: A top-down associative classifier for imbalanced class distribution. In KDD 2006, pp. 517–522.
Baralis, E and P Garza (2012). I-prune: Item selection for associative classification. International Journal of Intelligent Systems, 27(3), 279–299.
Baralis, E, S Chiusano and P Garza (2004). On support thresholds in associative classification. In Proc. 2004 ACM Symp. Applied Computing, pp. 553–558. Nicosia, Cyprus.
Baralis, E and P Torino (2002). A lazy approach to pruning classification rules. In Proc. 2002 IEEE ICDM'02, p. 35.
Cendrowska, J (1987). Prism: An algorithm for inducing modular rules. International Journal of Man-Machine Studies, 27(4), 349–370.
Cerf, L, D Gay, N Selmaoui and F Boulicaut (2008). A parameter-free associative classification method. DaWaK 2008, LNCS 5182, Song, I-Y, J Eder and TM Nguyen (eds.), pp. 293–304.
Chen, J, J Yin and J Huang (2005). Mining correlated rules for associative classification. In Proc. ADMA, pp. 130–140.
Chien, Y and Y Chen (2010). Mining associative classification rules with stock trading data — A GA-based method. Knowledge-Based Systems, 23, 605–614.
Clare, A and R King (2001). Knowledge discovery in multi-label phenotype data. In Proc. PKDD '01, De Raedt, L and A Siebes (eds.), Vol. 2168, Lecture Notes in Artificial Intelligence, pp. 42–53.
Dhok, J and V Varma (2010). Using pattern classification for task assignment in MapReduce. In Proc. 10th IEEE/ACM Int. Conference CCGrid 2010. Melbourne, Australia.
Do, TD, SC Hui, ACM Fong and B Fong (2009). Associative classification with artificial immune system. IEEE Transactions on Evolutionary Computation, 13, 217–228.
Do, TD, SC Hui and ACM Fong (2005). Associative classification with prediction confidence. In Proc. 2005 Int. Conf. Machine Learning and Cybernetics, Vol. 4, pp. 1993–1998.
Dong, G, X Zhang, L Wong and J Li (1999). CAEP: Classification by aggregating emerging patterns. In DS'99, pp. 30–42.
Duda, R and P Hart (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.
Elsayed, S, S Rajasekaran and R Ammar (2012). AC-CS: An immune-inspired associative classification algorithm. ICARIS, pp. 139–151.
Han, J, TY Lin, J Li and N Cercone (2007). Constructing associative classifiers from decision tables. In Proc. Int. Conf. Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing — RSFDGrC, pp. 305–313.
Han, J, J Pei and Y Yin (2000). Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data, pp. 1–12.
Jabbar, MA, BL Deekshatulu and P Chandra (2013). Knowledge discovery using associative classification for heart disease prediction. Advances in Intelligent Systems and Computing, 182, 29–39.
Jabez, C (2011). A statistical approach for associative classification. European Journal of Scientific Research, 58(2), 140–147.
Jensen, D and P Cohen (2000). Multiple comparisons in induction algorithms. Machine Learning, 38(3), 309–338.
Kundu, G, M Islam, S Munir and M Bari (2008). ACN: An associative classifier with negative rules. In Proc. 11th IEEE Int. Conf. Computational Science and Engineering, pp. 369–375.
Kundu, G, S Munir, M Md. Islam and K Murase (2007). A novel algorithm for associative classification. In Proc. Int. Conf. Neural Information Processing — ICONIP, pp. 453–459.
Yu, K, X Wu, W Ding and H Wang (2011). Causal associative classification. In Proc. 11th IEEE Int. Conf. Data Mining (ICDM '11), December 11–14, 2011, Vancouver, Canada, pp. 914–923.
Lan, Y, D Janssens, G Chen and G Wets (2006). Improving associative classification by incorporating novel interestingness measures. Expert Systems with Applications, 31(1), 184–192.
Li, X, D Qin and C Yu (2008). ACCF: Associative classification based on closed frequent itemsets. In Proc. Fifth Int. Conf. Fuzzy Systems and Knowledge Discovery — FSKD, pp. 380–384.
Li, W, J Han and J Pei (2001). CMAR: Accurate and efficient classification based on multiple-class association rule. In Proc. IEEE Int. Conf. Data Mining — ICDM, pp. 369–376.
Liu, B, Y Ma and C-K Wong (2001). Classification using association rules: Weakness and enhancements. In Kumar, V et al. (eds.), Data Mining for Scientific Applications.
Liu, B, W Hsu and Y Ma (1998). Integrating classification and association rule mining. In Proc. Knowledge Discovery and Data Mining Conference (KDD), pp. 80–86. New York, NY.
Liu, Y, Y Yang and J Carbonell (2002). Boosting to correct inductive bias in text classification. In Proc. Eleventh Int. Conf. Information and Knowledge Management, pp. 348–355. McLean, VA.
Mei, Q, D Xin, H Cheng, J Han and CX Zhai (2006). Generating semantic annotations for frequent patterns with context analysis. In Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 337–346. Philadelphia, PA, USA.
Merz, C and P Murphy (1996). UCI repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science.
Niu, Q, S Xia and L Zhang (2009). Association classification based on compactness of rules. In Proc. Second Int. Workshop on Knowledge Discovery and Data Mining — WKDD, pp. 245–247.
Pal, PR and RC Jain (2010). Combinatorial approach of associative classification. International Journal of Advanced Networking and Applications, 2(1), 470–474.
Quinlan, J (1998). Data mining tools See5 and C5.0. Technical Report, RuleQuest Research.
Quinlan, J (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Quinlan, J and R Cameron-Jones (1993). FOIL: A midterm report. In Proc. European Conf. Machine Learning, pp. 3–20. Vienna, Austria.
Su, Z, W Song, D Cao and J Li (2008). Discovering informative association rules for associative classification. In Proc. Knowledge Acquisition and Modeling Workshop — KAM Workshop, pp. 1060–1063.
Tang, Z and Q Liao (2007). A new class based associative classification algorithm. IMECS 2007, pp. 685–689.
Taiwiah, CA and V Sheng (2013). A study on multi-label classification. Advances in Data Mining: Applications and Theoretical Aspects, Lecture Notes in Computer Science, Vol. 7987, pp. 137–150.
Thabtah, F and S Hammoud (2013). MR-ARM: A MapReduce association rule mining. Parallel Processing Letters, 23, 1350012.
Thabtah, F, W Hadi, N Abdelhamid and A Issa (2011). Prediction phase in associative classification. International Journal of Software Engineering and Knowledge Engineering, 21(6), 855–876.
Thabtah, F, Q Mahmood, L McCluskey and H Abdeljaber (2010). A new classification based on association algorithm. Journal of Information and Knowledge Management, 9(1), 55–64.
Thabtah, F and P Cowling (2007). A greedy classification algorithm based on association rule. Applied Soft Computing, 7(3), 1102–1111.
Thabtah, F, P Cowling and Y Peng (2005). MCAR: Multi-class classification based on association rule approach. In Proc. 3rd IEEE Int. Conf. Computer Systems and Applications, pp. 1–7. Cairo, Egypt.
Thabtah, F, P Cowling and Y Peng (2004). MMAC: A new multi-class, multi-label associative classification approach. In Proc. Fourth IEEE Int. Conf. Data Mining (ICDM'04), pp. 217–224. Brighton, UK (nominated for the best paper award).
Veloso, A, W Meira, M Zaki, M Goncalves and H Mossri (2011). Calibrated lazy associative classification. Information Sciences: An International Journal, 13(181), 2656–2670.
Veloso, A, W Meira, M Gonçalves and M Zaki (2007). Multi-label lazy associative classification. In Proc. Principles of Data Mining and Knowledge Discovery — PKDD, pp. 605–612.
Wang, X, K Yue, W Niu and Z Shi (2011). An approach for adaptive associative classification. Expert Systems with Applications: An International Journal, 38(9), 11873–11883.
Wang, G, AR Butt, P Pandey and K Gupta (2009). Using realistic simulation for performance analysis of MapReduce setups. In Proc. 1st ACM Workshop on Large-Scale System and Application Performance (LSAP '09), p. 19.
Witten, I and E Frank (2002). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.
Wu, G, H Li, X Hu, Y Bi, J Zhang and X Wu (2009). MReC4.5: C4.5 ensemble classification with MapReduce. In Proc. ChinaGrid Annual Conf., pp. 249–255.
Xu, X, G Han and H Min (2004). A novel algorithm for associative classification of image blocks. In Proc. Fourth IEEE Int. Conf. Computer and Information Technology, pp. 46–51.
Ye, Y, Q Jiang and W Zhuang (2008). Associative classification and post-processing techniques used for malware detection. In Proc. 2nd Int. Conf. Anti-Counterfeiting, Security and Identification (ASID 2008), pp. 276–279.
Yin, X and J Han (2003). CPAR: Classification based on predictive association rule. In Proc. SIAM Int. Conf. Data Mining (SDM), pp. 369–376.
Zaki, M and K Gouda (2003). Fast vertical mining using diffsets. In Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 326–335.
Zaki, M and CJ Hsiao (2002). CHARM: An efficient algorithm for closed itemset mining. In Proc. 2002 SIAM Int. Conf. Data Mining (SDM'02), pp. 457–473.
Zaki, M, S Parthasarathy, M Ogihara and W Li (1997). New algorithms for fast discovery of association rules. In Proc. 3rd KDD Conf., pp. 283–286.
Zhao, Z, H Ma and Q He (2009). Parallel k-means clustering based on MapReduce. Cloud Computing, Lecture Notes in Computer Science, Vol. 5931, p. 674. Springer-Verlag Berlin Heidelberg.
Zhu, Y, W Luo, G Chen and J Ou (2012). A multi-label classification method based on associative rules. Journal of Computational Information Systems, 8(2), 791–799.