A Meta-Learning Framework For Detecting Financial Fraud
Author(s): Ahmed Abbasi, Conan Albrecht, Anthony Vance and James Hansen
Source: MIS Quarterly, December 2012, Vol. 36, No. 4, pp. 1293-1327
Published by: Management Information Systems Research Center, University of
Minnesota
Financial fraud can have serious ramifications for the long-term sustainability of an organization, as well as adverse effects on its employees and investors, and on the economy as a whole. Several of the largest bankruptcies in U.S. history involved firms that engaged in major fraud. Accordingly, there has been considerable emphasis on the development of automated approaches for detecting financial fraud. However, most methods have yielded performance results that are less than ideal. Consequently, financial fraud detection remains an important challenge for business intelligence technologies.
In light of the need for more robust identification methods, we use a design science approach to develop MetaFraud, a novel meta-learning framework for enhanced financial fraud detection. To evaluate the proposed framework, a series of experiments are conducted on a test bed encompassing thousands of legitimate and fraudulent firms. The results reveal that each component of the framework significantly contributes to its overall effectiveness. Additional experiments demonstrate the effectiveness of the meta-learning framework over state-of-the-art financial fraud detection methods. Moreover, the MetaFraud framework generates confidence scores associated with each prediction that can facilitate unprecedented financial fraud detection performance and serve as a useful decision-making aid. The results have important implications for several stakeholder groups, including compliance officers, investors, audit firms, and regulators.
Keywords: Fraud detection, financial statement fraud, feature construction, meta-learning, business
intelligence, design science
Recent developments in business intelligence (BI) technologies have elevated the potential for discovering patterns associated with complex problem domains (Watson and Wixom 2007), such as fraud (Bolton and Hand 2002). Broadly, BI technologies facilitate historical, current, and predictive views of business operations (Shmueli et al. 2007), and may suggest innovative and robust methods for predicting the occurrence of fraud (Anderson-Lehman et al. 2004). In fact, fraud detection is recognized as an important application area for predictive BI technologies (Brachman et al. 1996; Michalewicz et al. 2007). Since BI tools facilitate an improved understanding of organizations' internal and external environments (Chung et al. 2005), enhanced financial fraud detection methods could greatly benefit the aforementioned stakeholder groups: investors, audit firms, and regulators. The research objective of this study is to develop a business intelligence framework for detecting financial fraud using publicly available information with demonstrably better performance than that achieved by previous efforts.

…enhance fraud detection capabilities over existing techniques. Further, we provide (3) a confidence-level measure that identified a large subset of fraud cases at over 90 percent legitimate and fraud recall, making fraud detection using public information practicable for various stakeholders.

The remainder of this paper is organized as follows. The next section reviews previous efforts to identify financial fraud, and shows the need for more robust methods. The subsequent section introduces the meta-learning kernel theory, and describes its usefulness in the context of financial fraud. It also introduces six hypotheses by which the design artifact was evaluated. The design of the MetaFraud framework is then outlined, and details of the financial measures derived from publicly available information used to detect fraud are presented. The fifth section describes the experiments used to evaluate the MetaFraud framework, and reports the results of the hypothesis testing. A discussion of the experimental results and their implications follows. Finally, we offer our conclusions.
Annual Statement-Based Fraud Detection Studies

Study | Features | Method(s) | Sample | Performance
Green and Choi (1997) | 5 financial and 3 accounting measures | Neural net | 95 firm-years; 46 fraud, 49 non-fraud | Overall: 71.7%; Fraud: 68.4%
Fanning and Cogger (1998) | 26 financial and 36 accounting measures | Discriminant analysis, Logistic regression, Neural net | 204 firm-years; 102 fraud, 102 non-fraud | Overall: 63.0%; Fraud: 66.0%
Beneish (1999a) | 8 financial measures | Probit regression | 2,406 firm-years; 74 fraud, 2,332 non-fraud | Overall: 89.5%; Fraud: 54.2%
Spathis (2002)a | 10 financial measures | Logistic regression | 76 firm-years; 38 fraud, 38 non-fraud | Overall: 84.2%; Fraud: 84.2%
Spathis et al. (2002)a | 10 financial measures | Logistic regression, UTADIS | 76 firm-years; 38 fraud, 38 non-fraud | Overall: 75.4%; Fraud: 64.3%
Lin et al. (2003) | 6 financial and 2 accounting measures | Logistic regression, Neural net | 200 firm-years; 40 fraud, 160 non-fraud | Overall: 76.0%; Fraud: 35.0%
Kirkos et al. (2007)a | 10 financial measures | Bayesian net, ID3 decision tree, Neural net | 76 firm-years; 38 fraud, 38 non-fraud | Overall: 90.3%; Fraud: 91.7%
Gaganis (2009)a | 7 financial measures | Discriminant analysis, Logistic regression, Nearest neighbor, Neural net | 398 firm-years; 199 fraud, 199 non-fraud | Overall: 87.2%; Fraud: 87.8%
Cecchini et al. (2010) | 23 financial variables used to generate ratios | SVM using custom financial kernel | 3,319 firm-years; 132 fraud, 3,187 non-fraud | Overall: 90.4%; Fraud: 80.0%
Dikmen and Küçükkocaoğlu (2010)b | 10 financial measures | Three-phase cutting plane algorithm | 126 firm-years; 17 fraud | Overall: 67.0%; Fraud: 81.3%
Dechow et al. (2011) | 7 financial measures | Logistic regression | 79,651 firm-years; 293 fraud, 79,358 non-fraud | Overall: 63.7%; Fraud: 68.6%

aData taken from Greek firms. bData taken from Turkish firms.
…1995; Spathis 2002; Summers and Sweeney 1998). While most prior studies using larger feature sets did not attain good results (e.g., Fanning and Cogger 1997; Kaminski et al. 2004), Cecchini et al. (2010) had greater success using 23 seed financial variables to automatically generate a large set of financial ratios.

The most commonly used classification methods were logistic regression, neural networks, and discriminant analysis, while decision trees, Bayesian networks, and support vector machines (SVM) have been applied in more recent studies (Cecchini et al. 2010; Gaganis 2009; Kirkos et al. 2007). The number of fraud firms in the data sets ranged from 38 firms to 293 firms. Most studies used a pair-wise approach in which the number of non-fraud firms was matched with the number of fraud firms (Fanning and Cogger 1998; Gaganis 2009; Kirkos et al. 2007; Persons 1995; Spathis 2002; Summers and Sweeney 1998). However, a few studies had significantly larger sets of non-fraud firms (Beneish 1999a; Cecchini et al. 2010; Dechow et al. 2011).

In terms of results, the best performance values were achieved by Cecchini et al. (2010), Gaganis (2009), Kirkos et al. (2007), and Spathis (2002). Only these four studies had overall accuracies and fraud detection rates of more than 80 percent. The latter three were all conducted using a data set composed of Greek firms (mostly in the manufacturing sector), which were governed by Greek legislation and Athens Stock Exchange regulations regarding what is classified as unusual behavior by a firm. Given the differences in auditing and reporting standards between Greece and the United States, as well as the larger international community (Barth et al. 2008), it is unclear how well those methods generalize or translate to other settings. Cecchini et al. used an SVM classifier that incorporated a custom financial kernel. The financial kernel was a graph kernel that used input financial variables to implicitly derive numerous financial ratios. Their approach attained a fraud detection rate of 80 percent on a 25 fraud-firm test set.

With respect to the remaining studies, Beneish (1999a) attained an overall accuracy of 89.5 percent, but this was primarily due to good performance on non-fraud firms, whereas the fraud detection rate was 54.2 percent. With the exception of Cecchini et al., no other prior study on U.S. firms has attained a fraud detection rate of more than 70 percent.

Our analysis of these studies motivated several refinements that are incorporated in our meta-learning framework, namely (1) the inclusion of organizational and industry-level context information, (2) the utilization of quarterly and annual statement-based data, and (3) the adoption of more robust fraud classification methods. The MetaFraud framework and its components are discussed in detail in the following sections.

Using Meta-Learning as a Kernel Theory for Financial Fraud Detection

As mentioned in the previous section, and evidenced by Table 2, prior studies using data from U.S. firms have generally attained inadequate results, with fraud detection rates typically less than 70 percent. These results have caused some to suggest that data based on financial statements is incapable of accurately identifying financial fraud (at least in the context of U.S. firms). In one of the more recent studies on U.S. firms, Kaminski et al. (2004, p. 17) attained results only slightly better than chance, causing the authors to state, "These results provide empirical evidence of the limited ability of financial ratios to detect fraudulent financial reporting." A more recent study conducted by researchers at PricewaterhouseCoopers attained fraud detection rates of 64 percent or lower (Bay et al. 2006). The limited performance of these and other previous studies suggests that the financial measures and classification methods employed were insufficient. Prior studies have generally relied on 8 to 10 financial ratios, coupled with classifiers such as logistic regression or neural networks. In light of these deficiencies, more robust approaches for detecting financial fraud are needed (Cecchini et al. 2010).

Design science is a robust paradigm that provides concrete prescriptions for the development of IT artifacts, including constructs, models, methods, and instantiations (March and Smith 1995). In the design science paradigm, "Methods define processes. They provide guidance on how to solve problems, that is, how to search the solution space" (Hevner et al. 2004, p. 79). Several prior studies have utilized a design science approach to develop BI technologies encompassing methods and instantiations (Abbasi and Chen 2008a; Chung et al. 2005). Accordingly, we were motivated to develop a framework for enhanced financial fraud detection (i.e., a method).

When creating IT artifacts in the absence of sufficient design guidelines, many studies have emphasized the need for design theories to help govern the development process (Abbasi and Chen 2008a; Markus et al. 2002; Storey et al. 2008; Walls et al. 1992). We used meta-learning as a kernel theory to guide the development of the proposed financial fraud detection framework (Brázdil et al. 2008). In the remainder of this section, we present an overview of meta-learning and discuss how meta-learning concepts can be utilized to address the aforementioned research gaps, resulting in enhanced financial fraud detection capabilities. Testable research hypotheses are also presented. A framework based on meta-learning for financial fraud detection is then described and evaluated.

Declarative bias specifies the representation of the space of hypotheses; it is governed by the quantity and type of attributes incorporated (i.e., the feature space). Procedural bias pertains to the manner in which classifiers impose constraints on the ordering of the inductive hypotheses.
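To make the class-level evaluation measures concrete, the following minimal sketch (our own illustration; the confusion-matrix counts are hypothetical, not figures from any of the studies above) shows how overall accuracy can look strong while the fraud detection rate (fraud recall) stays low on an imbalanced test bed:

```python
# Illustrative only: hypothetical confusion-matrix counts for an
# imbalanced fraud test bed (not figures from the paper).
def class_metrics(tp, fn, fp, tn):
    """Return overall accuracy, fraud recall, and legitimate recall."""
    overall = (tp + tn) / (tp + fn + fp + tn)
    fraud_recall = tp / (tp + fn)     # fraud firms correctly flagged
    legit_recall = tn / (tn + fp)     # legitimate firms correctly passed
    return overall, fraud_recall, legit_recall

# 100 fraud firms among 2,000: a classifier that favors the majority class
overall, fraud_r, legit_r = class_metrics(tp=55, fn=45, fp=60, tn=1840)
print(f"overall={overall:.3f} fraud_recall={fraud_r:.3f} legit_recall={legit_r:.3f}")
```

This asymmetry is why the studies above are compared on class-level fraud detection rates in addition to overall accuracy.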
…of contextual information generally omitted by previous studies are organizational and industry-level contexts.

Organizational context information can be derived by comparing a firm's financial performance relative to its performance in prior periods. Auditors commonly compare firms' financial measures across consecutive time periods in order to identify potential irregularities (Ameen and Strawser 1994; Green and Choi 1997). Further, prior financial fraud detection studies suggest that utilizing measures from the preceding year can provide useful information (Cecchini et al. 2010; Fanning and Cogger 1998; Persons 1995; Virdhagriswaran and Dakin 2006). Consideration of data across multiple time periods can reveal organizational trends and anomalies that are often more insightful than information derived from single-period snapshots (Chen and Du 2009; Coderre 1999; Green and Calderon 1995; Kinney 1987).

Industry-level context information can be derived by comparing a firm's financial performance relative to the performance of its industry peers. Prior studies have found that certain industries have financial statement irregularity patterns that are unique and distinctly different from other industries (Maletta and Wright 1996).

…organizational and industry-level context information will result in improved financial fraud detection performance in terms of overall accuracy and class-level f-measure, precision, and recall.

H1a: Combining yearly financial measures with organizational context features will outperform the use of yearly financial measures alone.

H1b: Combining yearly financial measures with industry-level context features will outperform the use of yearly financial measures alone.

H1c: Combining yearly financial measures with industry-level and organizational context features will outperform the use of yearly financial measures alone.

H1d: Combining quarterly financial measures with organizational context features will outperform the use of quarterly financial measures alone.

H1e: Combining quarterly financial measures with industry-level context features will outperform the use of quarterly financial measures alone.

H1f: Combining quarterly financial measures with industry-level and organizational context features will outperform the use of quarterly financial measures alone.

Beasley et al. (2000) analyzed …ences, and that concealment methods differ dramatically between the end of the first three quarters and the end of the year-end quarter. In their analysis, one large company that manipulated its financial statements used unsupported topside entries (entries that remove the discrepancy between actual operating results and published financial reports) as the major way to commit fraud at the end of the first three quarters, but harder-to-detect, sophisticated revenue and expense frauds at the end of the year. Another company used topside entries at the end of the first three quarters but shifted losses and debt to unconsolidated, related entities at the end of the year.

H2a: Combining yearly and quarterly statement-based features will outperform the use of only yearly features in terms of fraud detection performance.

H2b: Combining yearly and quarterly statement-based features will outperform the use of only quarterly features in terms of fraud detection performance.
As a specific example, consider the cash flow earnings difference ratio for the Enron fraud (Figure 1). Figure 1a shows this ratio for Enron and its industry model, over a two-year period, using only annual numbers. While Enron's values are slightly lower than the industry model's, the graph exhibits no recognizable pattern. Figure 1b shows this ratio for Enron and its industry model on a quarterly basis. Enron's figures are primarily positive for the first three quarters, and then sharply negative in the fourth quarter. Throughout the first three quarters, Enron's management was using various accounting manipulations to make their income statement look better (Albrecht, Albrecht, and Albrecht 2004; Kuhn and Sutton 2006). At the end of the year, Enron's management "corrected" the discrepancy by shifting those losses to off-balance sheet, nonconsolidated special purpose entities (now called variable interest entities). This difference in manipulation methods between quarters is not apparent when analyzing annual data, but it shows up in the cash flow earnings difference ratio when used with quarterly data.

Improving Procedural Bias Using Stacked Generalization and Adaptive Learning

Prior financial fraud detection studies have used several different classification methods, with logistic regression, neural networks, and discriminant analysis being the most common. However, no single classifier has emerged as a state-of-the-art technique for detecting financial fraud (Fanning and Cogger 1998; Gaganis 2009; Kirkos et al. 2007). Therefore, the need remains for enhanced classification approaches capable of improving procedural bias. Meta-learning strategies for enhancing procedural bias include stacked generalization and adaptive learning (Brázdil et al. 2008). Stacked generalization involves the use of a top-level learner capable of effectively combining information from multiple base learners (Vilalta and Drissi 2002; Wolpert 1992). Adaptive learning entails constant relearning and adaptation (i.e., dynamic bias selection) to changes in the problem environment, including concept drift (Brázdil et al. 2008).
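The quarterly-versus-annual effect can be sketched numerically. The values below are hypothetical illustrations, not Enron's actual figures, and the annual view is simplified as the mean of the quarterly figures:

```python
# Illustrative sketch (hypothetical numbers): a quarterly anomaly
# that is largely invisible in the annual aggregate.
quarterly_cfed = [0.04, 0.05, 0.06, -0.14]    # ratio value per quarter
annual_cfed = sum(quarterly_cfed) / len(quarterly_cfed)

print(f"annual view: {annual_cfed:+.3f}")      # small, unremarkable value
swing = max(quarterly_cfed) - min(quarterly_cfed)
print(f"quarterly swing: {swing:.2f}")         # large Q4 reversal stands out
```

The positive first-three-quarter values and the sharp fourth-quarter reversal nearly cancel in the annual figure, mirroring the pattern described above.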
…classifiers frequently provide complementary information that could be useful if exploited in unison (Tsoumakas et al. 2005). Prior financial fraud detection studies utilizing multiple classification methods have also observed some levels of noncorrelation between the classifiers' predictions, even when overall accuracies were equivalent. For instance, Fanning and Cogger (1998) attained the best overall accuracy using a neural network, yet the fraud detection rates were considerably higher (i.e., 12 percent) when using discriminant analysis. While comparing logistic regression and neural network classifiers, Lin et al. (2003) noted that the two classifiers achieved somewhat comparable overall accuracies; however, logistic regression had 11 percent better performance on non-fraudulent firms, while the neural network obtained 30 percent higher fraud detection rates. Similarly, Gaganis (2009) observed equally good overall results using a UTADIS scoring method and a neural network; however, the respective false positive and false negative rates for the two methods were exact transpositions of one another. These findings suggest that methods capable of conjunctively leveraging the strengths of divergent classifiers could yield improved financial fraud detection performance.

Stacked generalization (also referred to as stacking) provides a mechanism for harnessing the collective discriminatory power of an ensemble of heterogeneous classification methods (Wolpert 1992). Stacking involves the use of a top-level classification model capable of learning from the predictions (and classification biases) of base-level models in order to achieve greater classification power (Brázdil et al. 2008; Hansen and Nelson 2002; Ting and Witten 1997; Wolpert 1992). As Sigletos et al. (2005, p. 1751) noted, "The success of stacking arises from its ability to exploit the diversity in the predictions of base-level classifiers and thus predicting with higher accuracy at the meta-level." This ability to learn from underlying classifiers makes stacking more effective than individual classifier-based approaches or alternate fusion strategies that typically combine base-level classifications using a simple scoring or voting scheme (Abbasi and Chen 2009; Dzeroski et al. 2004; Hu and Tsoukalas 2003; Lynam and Cormack 2006; Sigletos et al. 2005). Consequently, stacking has been effectively utilized in related studies on insurance and credit card fraud detection, outperforming the use of individual classifiers (Chan et al. 1999; Phua et al. 2004).

Given the performance diversity associated with fraud detection classifiers employed in prior research, the use of stacked generalization is expected to be highly beneficial, facilitating enhanced financial fraud detection capabilities over those achieved by individual classifiers. We predict this performance gain will be actualized irrespective of the specific context-based feature set utilized.

H3a: When yearly context-based features are used, stack classifiers will outperform individual classifiers in terms of fraud detection performance.

H3b: When quarterly context-based features are used, stack classifiers will outperform individual classifiers in terms of fraud detection performance.

Adaptive Learning

While fraud lies behind many of the largest bankruptcies in history, there are considerable differences between the types of frauds committed and the specific obfuscation tactics employed by previous firms. For instance, the $104 billion WorldCom fraud utilized a fairly straightforward expense capitalization scheme (Zekany et al. 2004). In contrast, the $65 billion Enron fraud was highly complex and quite unique; the use of special-purpose entities as well as various other tactics made detection very difficult (Kuhn and Sutton 2006). These examples illustrate how financial fraud cases can be strikingly different in terms of their complexity and nature. Effective financial fraud detection requires methods capable of discovering fraud across a wide variety of industries, reporting styles, and fraud types over time.

Fraud detection is a complex, dynamic, and evolving problem (Abbasi et al. 2010; Bolton and Hand 2002). Given the adversarial nature of fraud detection, the classification mechanisms used need constant revision (Abbasi et al. 2010). Adaptive learning methods have the benefit of being able to relearn, in either a supervised or semi-supervised capacity, as new examples become available (Brázdil et al. 2008; Fawcett and Provost 1997). The ability to adaptively learn is especially useful in the context of financial fraud because the risk environment surrounding fraud is expected to change, making it more difficult to detect (Deloitte 2010). Virdhagriswaran and Dakin (2006, p. 947) noted that adaptive learning could greatly improve fraud detection capabilities by "identifying compensatory behavior" by fraudsters "trying to camouflage their activities." Similarly, Fawcett and Provost (1997, p. 5) observed that "it is important that a fraud detection system adapt easily to new conditions. It should be able to notice new patterns of fraud." An adaptive, learning-based classifier that is aware of its changing environment and able to constantly retrain itself accordingly should outperform its static counterpart.

H4: The use of an adaptive learning mechanism, capable of relearning as new information becomes available, will outperform its static counterpart in terms of fraud detection performance.
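As a minimal sketch of the stacking idea (our own illustration, not the MetaFraud implementation; the data and base "classifiers" are synthetic rules), a meta-classifier can be trained on the prediction patterns of two deliberately complementary base classifiers:

```python
# Minimal stacked-generalization sketch: a meta-classifier learns from
# the predictions of two complementary base classifiers (illustrative).
from collections import Counter

# Toy firm records: (receivables_growth, margin_drop) -> 1 = fraud
train = [((0.9, 0.1), 1), ((0.8, 0.0), 1), ((0.1, 0.9), 1), ((0.0, 0.8), 1),
         ((0.1, 0.1), 0), ((0.2, 0.0), 0), ((0.0, 0.2), 0), ((0.2, 0.2), 0)]

base1 = lambda x: int(x[0] > 0.5)   # flags revenue-style frauds only
base2 = lambda x: int(x[1] > 0.5)   # flags margin-style frauds only

# Meta-level training set: the base predictions become the features.
meta_votes = {}
for x, y in train:
    meta_votes.setdefault((base1(x), base2(x)), Counter())[y] += 1

def stack_predict(x):
    """Meta-classifier: majority true label seen for this prediction pattern."""
    key = (base1(x), base2(x))
    return meta_votes[key].most_common(1)[0][0]

print(stack_predict((0.85, 0.05)))  # caught via base1's strength -> 1
print(stack_predict((0.05, 0.85)))  # caught via base2's strength -> 1
```

In a faithful stacked-generalization setup, the base predictions used for meta-training would come from held-out folds to avoid leakage (Wolpert 1992), and the base learners would be models such as SVMs or neural networks rather than hand-written rules.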
Collective Impact Attributable to Improving Declarative and Procedural Bias

Based on the previous four hypotheses, we surmise that financial fraud detection methods incorporating meta-learning principles pertaining to the improvement of declarative and procedural bias are likely to provide enhanced discriminatory potential (Brázdil et al. 2008). Specifically, we expect that the use of industry and organizational context information derived from both yearly and quarterly statements for declarative bias improvement (H1-H2), coupled with stacked generalization and adaptive learning for procedural bias improvement (H3-H4), will facilitate improvements in overall financial fraud detection capabilities. Accordingly, we hypothesized that, collectively, a meta-learning framework that incorporates these principles will outperform existing state-of-the-art financial fraud detection methods.

H5: A meta-learning framework that includes appropriate provisions for improving declarative and procedural bias in concert will outperform existing methods in terms of fraud detection performance.

Prior studies have effectively used ensemble approaches in concert with semi-supervised learning (Balcan et al. 2005; Ando and Zhang 2007; Zhou and Goldman 2004). For instance, Zhou and Li (2005) markedly improved the performance of underlying classifiers on several test beds, in various application domains, by using a three-classifier ensemble in a semi-supervised manner. It is, therefore, conceivable that such ensemble-based semi-supervised methods could also facilitate improved procedural bias for financial fraud detection. However, given the reliance of such methods on voting schemes across base classifiers (Balcan et al. 2005; Zhou and Li 2005), we believe that ensemble semi-supervised learning methods will underperform meta-learning strategies that harness the discriminatory potential of stacked generalization and adaptive learning.

…presented in the previous section, namely (1) the use of organizational and industry contextual information, (2) the use of quarterly and annual data, and the use of more robust classification methods using (3) stacked generalization and (4) adaptive learning. In this section, we demonstrate how our meta-learning framework fulfills each of these requirements to enhance financial fraud detection.

The MetaFraud framework utilizes a rich feature set, numerous classification methods at the base and stack level, and an adaptive learning algorithm. Each component of the framework (shown in Figure 2) is intended to enhance financial fraud detection capabilities. Beginning with a set of yearly and quarterly seed ratios, industry-level and organizational context-based features are derived to create the yearly and quarterly feature sets (bottom of Figure 2). These feature sets are intended to improve declarative bias. The features are used as inputs for the yearly and quarterly context-based classifiers. The classifications from these two categories of classifiers are then used as input for a series of stack classifiers. The adaptive, semi-supervised learning algorithm, shown at the top of Figure 2, uses the stack classifiers' predictions to iteratively improve classification performance. The stack classifiers and adaptive learning algorithm are intended to improve procedural bias.

As a simple example, think of Stack_Classifier1 as an SVM, which takes input from the bottom (1) yearly context-based classifiers and (2) quarterly context-based classifiers, such as SVM, J48, BayesNet, NaiveBayes, etc. Stack_Classifier2 might be a J48 classifier, which accepts inputs from the same bottom yearly and quarterly context-based classifiers: SVM, J48, BayesNet, NaiveBayes, etc. Output from the stack classifiers is aggregated and input to the adaptive learner. The four components of the framework are closely related to the research hypotheses. Each of these components is explained below.
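The adaptive, semi-supervised component can be sketched as a simple self-training loop. This is an illustration under assumptions (a one-dimensional "fraud score" feature, a midpoint-threshold classifier, and synthetic data); the paper's actual algorithm instead operates on the aggregated stack classifier predictions:

```python
# Illustrative self-training sketch: confidently classified new cases are
# added to the training pool and the model is refit, so the decision
# boundary tracks a drifting environment. Not the paper's algorithm.

def fit_threshold(labeled):
    """Place the boundary midway between the class means of a 1-D feature."""
    fraud = [x for x, y in labeled if y == 1]
    legit = [x for x, y in labeled if y == 0]
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2

labeled = [(0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1)]
threshold = fit_threshold(labeled)            # initially 0.5

# New unlabeled firm-years arrive; fraud scores have drifted upward.
for x in [0.95, 1.05, 0.25, 1.10]:
    margin = abs(x - threshold)
    if margin > 0.3:                          # use confident predictions only
        labeled.append((x, int(x > threshold)))
        threshold = fit_threshold(labeled)    # adapt: refit with the new case

print(round(threshold, 3))                    # boundary has shifted upward
```

Cases falling inside the confidence band are simply skipped, which is the usual safeguard against reinforcing the model's own mistakes in semi-supervised relearning.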
1. Asset Quality Index (AQI): AQI is the ratio of non-current assets other than property, plant, and equipment, to total assets, for time period t relative to time period t-1. An AQI greater than 1 indicates that the firm has potentially increased its involvement in cost deferral, a possible indicator of asset overstatement fraud (Beneish 1999a; Dikmen and Küçükkocaoğlu 2010).

2. Asset Turnover (AT): AT is the ratio of net sales to total assets. When revenue fraud is being committed, net sales are often increased artificially and rapidly, resulting in a large AT value (Cecchini et al. 2010; Kirkos et al. 2007; Spathis 2002; Spathis et al. 2002).

3. Cash Flow Earnings Difference (CFED): CFED assesses the impact of accruals on financial statements (Beneish 1999a; Dechow et al. 2011). This ratio is often positive when revenue fraud is occurring or when employees are engaging in cash theft.

5. Depreciation Index (DEPI): DEPI is the ratio of the rate of depreciation in period t-1 as compared to period t. Fictitious assets accelerate the depreciation rate, resulting in smaller values for DEPI (Beneish 1999a; Cecchini et al. 2010; Dikmen and Küçükkocaoğlu 2010).

6. Gross Margin Index (GMI): GMI is the ratio of the gross margin in period t-1 to the gross margin in period t. A GMI greater than 1 suggests that gross margins have deteriorated, a condition rarely encountered when a firm is engaging in revenue fraud (Beneish 1999a; Lin 2003).

7. Inventory Growth (IG): IG assesses whether inventory has grown in period t as compared to period t-1. IG is used to detect whether ending inventory is being overstated to decrease cost of goods sold and increase gross margin (Cecchini et al. 2010; Dikmen and Küçükkocaoğlu 2010; Persons 1995).

8. Leverage (LEV): LEV is the ratio of total debt to total assets in period t relative to period t-1. LEV is used to detect whether firms are fictitiously including assets on the balance sheet without any corresponding debt (Cecchini et al. 2010; Beneish 1999a; Kirkos et al. 2007; Persons 1995; Spathis 2002; Spathis et al. 2002).

9. Operating Performance Margin (OPM): OPM is calculated by dividing net income by net sales. When fraudulent firms add fictitious sales revenues, they often fall to the bottom line without additional costs, thus inflating the value of OPM (Cecchini et al. 2010; Persons 1995; Spathis 2002; Spathis et al. 2002).

10. Receivables Growth (RG): RG is the amount of receivables in period t divided by the amount in period t-1. Firms engaging in revenue fraud often add fictitious revenues and receivables, thereby increasing RG (Cecchini et al. 2010; Dechow et al. 2011; Summers and Sweeney 1998).

11. Sales Growth (SG): SG is equal to net sales in period t divided by net sales in period t-1. In the presence of revenue fraud, the value of SG generally increases (Beneish 1999a; Cecchini et al. 2010; Gaganis 2009; Dikmen and Küçükkocaoğlu 2010; Persons 1995; Summers and Sweeney 1998).

12. SGE Expense (SGEE): SGEE is calculated by dividing the ratio of selling and general administrative expenses to net sales in period t by the same ratio in period t-1. When firms are engaging in revenue fraud, SGE expenses represent a smaller percentage of the artificially inflated net sales, thereby causing SGEE to decrease (Beneish 1999a; Cecchini et al. 2010; Dikmen and Küçükkocaoğlu 2010).

Yearly and Quarterly Context-Based Feature Sets

The yearly and quarterly context-based feature sets used the aforementioned seed ratios to derive industry-level and organizational context features. The context features were developed using feature construction: the process of applying constructive operators to a set of existing features in order to …declarative bias in situations where the hypothesized set generated by a particular feature set needs to be expanded (Brázdil et al. 2008). In prior BI studies, feature construction was used to derive complex and intuitive financial ratios and metrics from (simpler) seed accounting variables (Piramuthu et al. 1998; Zhao et al. 2009). These new features were often generated by combining multiple seed measures using arithmetic operators such as multiplication and division (Langley et al. 1986; Zhao et al. 2009).

We used subtraction and division operators to construct new features indicative of a firm's position relative to its own prior performance (organizational context) or its industry (industry-level context). First, the organizational context features were constructed by computing the difference between (-) and the ratio of (/) the firms' seed financial ratios (described in the previous section) in the current time period relative to their values for the same ratios in the previous time period.

Second, to generate the industry-level context features, we developed industry-representative models designed to characterize what is normal for each industry. Each firm's industry affiliation was defined by its North American Industry Classification System (NAICS) code. NAICS was used since it is now the primary way Compustat and other standards bodies reference industries. Two types of models were developed. Top-5 models were created by averaging the data from the five largest companies in each industry-year (in terms of sales), and then generating the 12 seed financial ratios from these averaged values. Hence, each industry had a single corresponding top-5 model. Closest-5 models were created for each firm by averaging the data from the five companies from the same industry-year that were most similar in terms of sales. Hence, each firm had a corresponding closest-5 model. The intuition behind using these two types of models was that the top-5 models represent the industry members with the greatest market share (and therefore provide a single reference model for all firms in the industry), while closest-5 models represent the firms' industry peers. As revealed in the evaluation section, both types of models had a positive impact on fraud detection performance.

For a given model, total assets were calculated as the average of total assets of the five companies, while the accounts receivable was the average accounts receivable of the same companies. Multiple firms were used to construct each model in order to smooth out any non-industry-related fluctuations attributable to individual firms (Albrecht, Albrecht, and Dunn
2001). On the other extreme, using too many firms produced
generate new features (Matheus and Rendell 1989). Feature models that were too aggregated. Therefore, in our prelimi-
construction facilitates the fusion of data and domain knowl- nary analysis, we explored the use of different numbers of
edge to construct features with enhanced discriminatory firms and found that using five provided the best balance.
potential (Dybowski et al. 2003). In meta-learning, feature The industry-level context features were then constructed by
construction is recommended as a mechanism for improving computing the difference between (-) and the ratio of (/) the
___
R1-C1 , R2-C2,...R12-C12 12
Industry-level context: Closest-5 Model
R1-P1 , R2-P2....R12-P12 12
Organizational context
Total 84
firms' see
firms' ratios in a particular quarter against those from the pre-
and viouscloses
quarter (e.g., R1Q2/R1Q1 denotes the ratio of a firm's
Asset Quality Index in quarter 2 as compared to quarter 1).
This resulted in a quarterly
Table 3 feature set composed
sh of 336
attributes.
firm, we
as 48 indu
correspo
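The construction rules above can be sketched in a few lines. The following is a minimal illustration (function and variable names are ours, not from the MetaFraud implementation): for each seed ratio R, it emits difference and ratio features against the prior-period value (P), the top-5 model value (T), and the closest-5 model value (C).

```python
# Illustrative sketch of context-based feature construction.
# current/previous: a firm's 12 seed ratios for periods t and t-1;
# top5/closest5: the corresponding industry-model ratio values.

def context_features(current, previous, top5, closest5):
    """Build organizational and industry-level context features.

    Each argument is a list of the 12 seed ratios (R1..R12)."""
    features = {}
    for i, (r, p, t, c) in enumerate(zip(current, previous, top5, closest5), start=1):
        # Organizational context: firm vs. its own prior period (R - P, R / P)
        features[f"R{i}-P{i}"] = r - p
        features[f"R{i}/P{i}"] = r / p
        # Industry-level context: firm vs. top-5 model (R - T, R / T)
        features[f"R{i}-T{i}"] = r - t
        features[f"R{i}/T{i}"] = r / t
        # Industry-level context: firm vs. closest-5 model (R - C, R / C)
        features[f"R{i}-C{i}"] = r - c
        features[f"R{i}/C{i}"] = r / c
    return features
```

With 12 seed ratios this yields 72 context features, which together with the 12 seed ratios gives the 84-attribute yearly set; applying the same rules per quarter produces the 336-attribute quarterly set.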
Yearly and Quarterly Context-Based Classifiers

The yearly and quarterly context-based feature sets were coupled with an array of supervised learning classification methods. Prior studies have mostly used logistic regression and neural network classifiers (Fanning and Cogger 1998; Green and Choi 1997; Lin et al. 2003; Persons 1995; Spathis 2002). However, additional classification methods have also attained good results for financial fraud detection (e.g., Kirkos et al. 2007), as well as related fraud detection problems (e.g., Abbasi and Chen 2008b; Abbasi et al. 2010), including support vector machines, tree classifiers, and Bayesian methods. Given the lack of consensus on best methods, as described earlier, a large number of classifiers were used in order to improve overall fraud detection performance. Moreover, the use of a large set of classifiers also provided a highly useful confidence-level measure, which is described in the evaluation section.

Accordingly, we incorporated several classifiers in addition to logistic regression and neural networks. Three support vector machine (SVM) classifiers were utilized: linear, polynomial, and radial basis function (RBF) kernels (Vapnik 1999). Two Bayesian classifiers were used: Naïve Bayes and Bayesian networks (Bayes 1958). Various tree-based classifiers were employed, including the J48 decision tree, Naïve Bayes Tree (NBTree), ADTree, Random Forest, and REPTree (Breiman 2001; Freund and Mason 1999; Kohavi 1996; Quinlan 1986). Two rule-based classifiers were also included: nearest neighbor (NNge) and JRip (Cohen 1995; Martin 1995). These 14 classifiers were each run using the yearly and quarterly feature sets, resulting in 28 classifiers in total: 14 yearly context-based classifiers and 14 quarterly context-based classifiers.

Stacked Generalization

In the third component of the framework, we utilized stacked generalization to improve procedural bias, where the classifications from the underlying individual classifiers were used as input features for a top-level classifier (Brázdil et al. 2008; Hansen and Nelson 2002; Hu and Tsoukalas 2003). All 14 classifiers described in the previous section were run as top-level classifiers, resulting in 14 different stack arrangements. Stacking can be highly effective when incorporating large quantities of predictions from underlying classifiers as input features (Lynam and Cormack 2006). Accordingly, for each stack, we utilized all 28 individual classifiers as inputs for the top-level classifier: 14 yearly context-based classifiers and 14 quarterly context-based classifiers.

The testing data for the top-level classifiers was composed of the individual (i.e., bottom-level) classifiers' classifications on the testing instances. The training data for the top-level classifiers was generated by running the bottom-level classifiers using 10-fold cross-validation on the training instances (Dzeroski and Zenko 2004; Ting and Witten 1997). In other words, the training data was split into 10 segments. In each fold, a different segment was used for testing, while the remaining 9 segments were used to train the bottom-level classifiers. The bottom-level classifiers' classifications for the test instances associated with these 10 secondary folds collectively constituted the top-level classifiers' training data. This approach was necessary to ensure that feature values in the training and testing instances of the stack classifiers were consistent and comparable (Abbasi and Chen 2009; Witten and Frank 2005).

Adaptive Learning

The ability of adaptive learning approaches to dynamically improve procedural bias is a distinct advantage of meta-learning (Brázdil et al. 2008), particularly for complex and evolving problems such as fraud detection (Fawcett and Provost 1997). We propose an adaptive semi-supervised learning (ASL) algorithm that uses the underlying generalized stacks. ASL is designed to exploit the information provided by the stack classifiers in a dynamic manner; classifications are revised and improved as new information becomes available. When semi-supervised learning is used, a critical problem arises when misclassified instances are added to the training data (Tian et al. 2007). This is a major concern in the context of financial fraud detection, where models need to be updated across years (i.e., semi-supervised active learning), since classification models can incorporate incorrect rules and assumptions, resulting in amplified error rates over time.

ASL addresses this issue in two ways. First, the expansion process is governed by the stack classifiers' predictions. Only test instances that have strong prediction agreement across the top-level classifiers in the generalized stacks are added to the training data. Second, during each iteration, the training data set is reset and all testing instances are reclassified in order to provide error correction.

A high-level description of ASL's steps is as follows:

1. Train the bottom-level classifiers and run them on the entire test bed.

2. Train the top-level classifiers in the generalized stack, using the training and testing data generated by the bottom-level classifiers.

3. Reset the training data to include only the original training instances.

4. Rank the test instances based on the top-level classifiers' predictions.

5. If the stopping rule has not been satisfied, add the d test instances with the highest rank to the training data (with class labels congruent with the top-level classifiers' predictions) and increment d. Otherwise go to step 7.

6. If d is less than the number of instances in the test bed, repeat steps 1-5, using the expanded training data for steps 1 and 2.

7. Output the predictions from the top-level classifiers in the generalized stacks.
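The 10-fold procedure used to generate the top-level classifiers' training data (described in the Stacked Generalization discussion above) can be sketched as follows. This is a pure-Python illustration; the toy threshold "classifier" stands in for the paper's bottom-level learners, and all names are ours.

```python
# Sketch: out-of-fold predictions as top-level (stack) training features.

def cross_val_predictions(train_X, train_y, fit, predict, k=10):
    """Return one out-of-fold prediction per training instance.

    In each fold, 9 segments train the bottom-level classifier and the
    held-out segment is classified; the concatenated held-out predictions
    form one input feature column for the top-level classifier."""
    n = len(train_X)
    preds = [None] * n
    folds = [list(range(i, n, k)) for i in range(k)]  # simple round-robin split
    for held_out in folds:
        held = set(held_out)
        fit_X = [x for i, x in enumerate(train_X) if i not in held]
        fit_y = [y for i, y in enumerate(train_y) if i not in held]
        model = fit(fit_X, fit_y)
        for i in held_out:
            preds[i] = predict(model, train_X[i])
    return preds

# Toy bottom-level classifier: predict 1 ("legitimate") when the single
# feature exceeds the training mean, else -1 ("fraudulent").
def fit_mean(X, y):
    return sum(row[0] for row in X) / len(X)

def predict_mean(model, x):
    return 1 if x[0] > model else -1
```

Because every training instance is classified by a model that never saw it, the top-level classifier's training features are comparable to the bottom-level classifiers' outputs on the true test set, which is the consistency property the paragraph above requires.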
Given training examples T = [t1, t2, ..., tn], training class labels L = [l1, l2, ..., ln], and testing instances R = [r1, r2, ..., rm]
Let c denote the number of classification algorithms utilized (in this study, c = 14)
Initialize variable d to track the number of items from R to add to T, where p is a predefined constant and d = p
While d < m
    Derive the yearly classifiers' test prediction matrix Y = [y1 = Yearly1(T, L, R), ..., yc = Yearlyc(T, L, R)] and training data cross-validation prediction matrix W = [w1 = Yearly1(T, L), ..., wc = Yearlyc(T, L)]
    Derive the quarterly classifiers' test prediction matrix Q = [q1 = Quarterly1(T, L, R), ..., qc = Quarterlyc(T, L, R)] and training data cross-validation prediction matrix V = [v1 = Quarterly1(T, L), ..., vc = Quarterlyc(T, L)]
    Derive the top-level stack classifiers' prediction matrix S = [s1 = Stack1([W, V], L, [Y, Q]), ..., sc = Stackc([W, V], L, [Y, Q])]
    Reset the training data to the original set of instances T = [t1, t2, ..., tn] and training class labels L = [l1, l2, ..., ln]
    Compute the aggregated prediction scores P = [p1, ..., pm], where pi = si1 + si2 + ... + sic
    Compute the test instance weights X = [x1, ..., xm], where xi = 1 if |pi| = c, and xi = 0 otherwise
    If x1 + x2 + ... + xm >= d
        Determine the instances' class labels Z = [z1, ..., zm], where zi = 1 if pi > 0, and zi = -1 otherwise
        Add the selected d test instances to the training data, T = [t1, t2, ..., tn, ...], with labels L = [l1, l2, ..., ln, ...] taken from Z
        Increment the selection quantity variable d = d + p
    Else
        Exit Loop
    End If
Loop
Output S

Figure 3. The ASL Algorithm
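The loop in Figure 3 can be sketched compactly as follows. This is an illustrative simplification (not the authors' implementation): the generalized stacks are passed in as plain prediction functions, and "highest-ranked" is reduced to taking the first d unanimous instances, since the toy functions provide no richer ranking.

```python
# Sketch of the ASL loop: expand training data with unanimously classified
# test instances, resetting the training set each iteration for error
# correction. Labels: 1 = legitimate, -1 = fraudulent.

def asl(train, labels, test, stack_fns, p=1):
    """stack_fns: list of c functions f(train, labels, instance) -> -1 or 1.

    Returns the final per-instance stack prediction matrix S."""
    T, L = list(train), list(labels)
    d = p
    m = len(test)
    S = None
    while d < m:
        # Run every top-level classifier on every test instance (matrix S).
        S = [[f(T, L, r) for f in stack_fns] for r in test]
        # Reset the training data to its original instances (error correction).
        T, L = list(train), list(labels)
        scores = [sum(row) for row in S]              # aggregated scores p_i
        unanimous = [i for i, s in enumerate(scores)  # weight 1 iff |p_i| = c
                     if abs(s) == len(stack_fns)]
        if len(unanimous) < d:
            break  # stopping rule: not enough unanimous instances
        # Add d unanimous test instances with their predicted labels.
        for i in unanimous[:d]:
            T.append(test[i])
            L.append(1 if scores[i] > 0 else -1)
        d += p  # a larger batch is added in the next iteration
    return S
```

Note how the reset before expansion means an instance mislabeled in one iteration is re-judged from scratch in the next, which is the error-correction property the text emphasizes.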
During each iteration, the training set is used to train (and run) the bottom-level classifiers on the entire test bed. The top-level classifiers are then run using the training and testing instance feature values generated by the bottom-level classifiers. The testing instances are ranked based on the top-level classifiers' predictions, where instances with greater prediction agreement across the classifiers are given a higher rank. The selected instances are added to the original training data, where the number of instances added is proportional to the iteration number (i.e., an increasing number of test instances are added during each subsequent iteration). Test instances are added with the predicted class label (as opposed to the actual label), since we must assume that the actual labels of the test instances are unknown (Chapelle et al. 2006). The instances added in one iteration are not carried over to the next one. The steps are repeated until all testing instances are added during an iteration or the stopping rule has been reached.

Figure 3 shows the detailed mathematical formulation of the ASL algorithm. In each iteration, the yearly and quarterly context-based classifiers are run with the training data T and class labels L. These yearly and quarterly classifiers are each run in two ways. First, they are trained on T and run on the testing data R to generate the two m x c test data prediction matrices Y and Q. Next, they are run on T using 10-fold cross-validation in order to generate the two n x c training data matrices (W and V) for the generalized stacks' top-level classifiers (as described earlier). The predictions from the top-level classifiers are used to construct the stack prediction matrix S. Once the stack predictions have been made, the training set is reset to its original instances in order to allow error correction in subsequent iterations in the event that an erroneous classification has been added to the training set. Next, the top-level classifiers' predictions for each instance are aggregated across classifiers (in P), and only those instances with unanimous agreement (i.e., ones deemed legitimate or fraudulent by all top-level classifiers) are given a weight of 1 in X. If the number of instances in X with a value of 1 is greater than or equal to the selection quantity variable d, we add d of these test instances to our training set T with class labels that correspond to the top-level classifiers' predictions (Z). We then increment d so that a larger number of instances will be added in the following iteration. If there are insufficient unanimous agreement instances in X, we do not continue, since adding ones where the top-level classifiers disagree increases the likelihood of inserting misclassified instances into the training set. Otherwise, the process is repeated until all testing instances have been added to the training set (i.e., d > m).

Evaluation

Consistent with Hevner et al. (2004), we rigorously evaluated our design artifact. We conducted a series of experiments to assess the effectiveness of our proposed financial fraud detection framework; each assessed the utility of a different facet of the framework. Experiment 1 evaluated the proposed yearly and quarterly context-based feature sets in comparison with a baseline feature set composed of annual statement-based financial ratios (H1). Experiment 2 assessed the effectiveness of using stacked classifiers. We tested the efficacy of combining yearly and quarterly information over using either information level alone (H2) and also compared stacked classifiers against individual classifiers (H3). Experiment 3 evaluated the performance of adaptive learning versus a static learning model (H4). Experiments 4 and 5 assessed the overall efficacy of the proposed meta-learning framework in comparison with state-of-the-art financial fraud detection methods (H5) and existing ensemble semi-supervised learning techniques (H6).

We tested the hypotheses using a test bed derived from publicly available annual and quarterly financial statements. The test bed encompassed 9,006 instances (815 fraudulent and 8,191 legitimate), where each instance was composed of the information for a given firm, for a particular year. Hence, for each instance in the test bed, the 12 financial ratios (described earlier in the section "Financial Fraud Detection Feature Sets") were derived from the annual and quarterly financial statements for that year.

The data collection approach undertaken was consistent with the approaches employed in previous studies (e.g., Cecchini et al. 2010; Dechow et al. 2011). The fraudulent instances were identified by analyzing all of the SEC Accounting and Auditing Enforcement Releases (AAERs) posted between 1995 and 2010. Based on these AAERs, fraudulent instances for the fiscal years ranging from 1985 to 2008 were identified. The information gathered from the AAERs was verified with other public sources (e.g., business newspapers) to ensure that the instances identified represented bona fide financial statement fraud cases. Consistent with prior research, firms committing fraud over a two-year period were treated as two separate instances (Cecchini et al. 2010; Dikmen and Küçükkocaoğlu 2010; Persons 1995). Thus, the 815 fraudulent instances were associated with 307 distinct firms.

The legitimate instances encompassed all firms from the same industry-year as each of the fraud instances (Beneish 1999a; Cecchini et al. 2010). After removing all non-fraud firm-year instances in which amendments/restatements had been filed, as well as ones with missing statements (Cecchini et al. 2010), 8,191 legitimate instances resulted. As noted by prior studies, although the legitimate instances included did not appear in any AAERs or public sources, there is no way to guarantee that none of them have engaged in financial fraud (Bay et al. 2006; Dechow et al. 2011; Kirkos et al. 2007).

Consistent with prior work (Cecchini et al. 2010), the test bed was split into training and testing data based on chronological order (i.e., firm instance years). All instances prior to 2000 were used for training, while data from 2000 onward was used for testing. The training data was composed of 3,862 firm-year instances (406 fraudulent and 3,456 legitimate), while the testing data included 5,144 firm-year instances (409 fraudulent and 4,735 legitimate). All 14 classifiers described in the section "Yearly and Quarterly Context-Based Classifiers" were employed. For all experiments, the classifiers were trained on the training data and evaluated on the 5,144-instance test set.

For financial fraud detection, the error costs associated with false negatives (failing to detect a fraud) and false positives (considering a legitimate firm fraudulent) are asymmetric. Moreover, these costs also vary for different stakeholder groups. For investors, prior research has noted that investing in a fraudulent firm results in losses attributable to decreases in stock value when the fraud is discovered, while failing to invest in a legitimate firm comes with an opportunity cost (Beneish 1999a). Analysis has revealed that the median drop in stock value attributable to financial fraud is approximately 20 percent, while the average legitimate firm's stock appreciates at a rate of 1 percent, resulting in an investor cost ratio of 1:20 (Beneish 1999a, 1999b; Cox and Weirich 2002). From the regulator perspective, failing to detect fraud can result in significant financial losses (Albrecht, Albrecht, and Dunn 2001; Dechow et al. 2011). On the other hand, false positives come with unnecessary audit costs. According to the Association of Certified Fraud Examiners (2010), the median loss attributable to undetected financial statement fraud is $4.1 million (i.e., cost of false negatives), while the median audit cost (i.e., cost of false positives) is $443,000 (Charles et al. 2010). For regulators, this results in an approximate cost ratio of 1:10. Accordingly, in this study we used cost ratios of 1:20 and 1:10 to reflect the respective situations encountered by investors and regulators.

It is important to note that, consistent with prior work, we only consider error costs (Beneish 1999a; Cecchini et al. 2010). In the case of the regulator setting, the cost breakdown is as follows:

• True Negatives: Legitimate firms classified as legitimate (no error cost)
• True Positives: Fraudulent firms classified as fraudulent (no error cost since the audit was warranted)
• False Negatives: Fraudulent firms classified as legitimate (fraud-related costs of $4.1 million)
• False Positives: Legitimate firms classified as fraudulent (unnecessary audit costs of $443,000)

Due to space constraints, in the following three subsections we report performance results only for the investor situation, using a cost ratio of 1:20. Appendices A, B, and C contain results for the regulator situation (i.e., using a cost setting of 1:10). However, in the final two subsections, when comparing MetaFraud against other methods, results for both stakeholder groups' cost settings are reported. The evaluation metrics employed included legitimate and fraud recall for the two aforementioned cost settings (Abbasi et al. 2010; Cecchini et al. 2010). Furthermore, area under the curve (AUC) was used in order to provide an overall effectiveness measure for methods across cost settings. Receiver operating characteristic (ROC) curves were generated by varying the false negative cost between 1 and 100 in increments of 0.1, while holding the false positive cost constant at 1 (e.g., 1:1, 1:1.1, 1:1.2, etc.), resulting in 991 different cost settings. For each method, AUC was derived from these ROC curves. Moreover, all hypothesis testing results in the paper also incorporated multiple cost settings (including the investor and regulator settings).

Context-Based Classifiers Versus Baseline Classifiers

Experiment 1 evaluated the effectiveness of the yearly and quarterly context-based feature sets (described earlier and in Tables 3 and 4) in comparison with a baseline feature set composed of the 12 ratios described earlier. For the baseline, these 12 ratios were derived from the annual statements, as done in prior research (Kaminski et al. 2004; Kirkos et al. 2007; Summers and Sweeney 1998). The three feature sets were run using all 14 classifiers described in the previous section.

Table 5 shows the results for the baseline classifiers. Tables 6 and 7 show the results for the yearly and quarterly context-based classifiers (i.e., the 14 classifiers coupled with the 84 and 336 yearly and quarterly context-based features, respectively). Due to space limitations, we report only the overall AUC, legitimate/fraud recall (shaded columns), and legitimate/fraud precision when using the 1:20 investor cost setting (Cecchini et al. 2010). Results for the regulator cost setting can be found in Appendix A.

For all three feature sets, the best AUC results were attained using NBTree and Logit. These methods also provided the best balance between legitimate/fraud recall rates for the investor cost setting. In comparison with the baseline classifiers, the yearly and quarterly context-based classifiers had higher overall AUC values, with an average improvement of over 10 percent. For the quarterly and yearly context-based classifiers, the most pronounced gains were attained in terms of fraud recall (17 percent and 23 percent higher on average, respectively). The results for the various yearly and quarterly context-based classifiers were quite diverse: six of the quarterly classifiers had fraud recall rates over 80 percent while three others had legitimate recall values over 90 percent. Classifiers such as SVM-Linear and REPTree were able to identify over 85 percent of the fraud firms (but with false positive rates approaching 40 percent). Conversely, tree-based classifiers such as J48, Random Forest, and NBTree had false positive rates below 10 percent, but with fraud recall rates of only 25 to 50 percent. This diversity in classifier performance would prove to be highly useful when using stacked generalization (see the subsection "Evaluating Stacked Classifiers").
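The cost-based ROC procedure described above (varying the false-negative cost from 1 to 100 in 0.1 increments while fixing the false-positive cost at 1) can be sketched as follows. The minimum-expected-cost decision rule on an estimated fraud probability is our illustrative stand-in for the classifiers' cost-sensitive output; none of these names come from the paper.

```python
# Sketch: one ROC operating point per cost setting, then trapezoidal AUC.

def roc_points(probs, labels, costs):
    """probs: estimated P(fraud) per instance; labels: 1 = fraud, 0 = legit.

    Assumes both classes are present. For each false-negative cost c_fn,
    predict fraud when the expected cost of missing it exceeds the audit
    cost, and record the resulting (FPR, TPR) point."""
    pts = []
    n_fraud = sum(labels)
    n_legit = len(labels) - n_fraud
    for c_fn in costs:
        preds = [1 if p * c_fn > (1 - p) * 1.0 else 0 for p in probs]
        tp = sum(1 for pr, y in zip(preds, labels) if pr == 1 and y == 1)
        fp = sum(1 for pr, y in zip(preds, labels) if pr == 1 and y == 0)
        pts.append((fp / n_legit, tp / n_fraud))
    return sorted(set(pts))

def auc(points):
    """Trapezoidal area under the ROC points, anchored at (0,0) and (1,1)."""
    pts = [(0.0, 0.0)] + points + [(1.0, 1.0)]
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# The 991 cost settings used in the paper: 1.0, 1.1, ..., 100.0.
costs = [1 + i / 10 for i in range(991)]
```

Sweeping the cost rather than a score threshold gives an overall effectiveness measure that is comparable across methods and cost settings, which is how the AUC figures in the tables should be read.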
[Classifier result tables (AUC with legitimate/fraud precision and recall at the 1:20 investor cost setting). The precision/recall digits are garbled in extraction, so only the AUC values are reproduced here. First block: SVM-Lin 0.694; LogitReg 0.791; J48 0.669; BayesNet 0.752; NaiveBayes 0.716; SVM-RBF 0.645; SVM-Poly 0.729; ADTree 0.773; RandForest 0.785; NBTree 0.814; REPTree 0.624; JRip 0.626; NNge 0.703; NeuralNet 0.619. Second block: SVM-Lin 0.733; LogitReg 0.780; J48 0.739; BayesNet 0.645; NaiveBayes 0.724; SVM-RBF 0.745; SVM-Poly 0.742; ADTree 0.741; RandForest 0.689; NBTree 0.724; REPTree 0.761; JRip 0.670; NNge 0.703; NeuralNet 0.652.]
[Hypothesis-test and feature-ranking tables survive only as fragments. A "Fraud Precision" row lists six p-values, all < 0.001. Top-ranked features by rank, with scores where legible: yearly — 2 R7/P7; 3 R3 (0.0249); 4 R9 (0.0234); 5 R2-T2; 6 R8-C8; 7 R2-C2 (0.0188); 8 R1/P1; 9 R7-T7; 10 R8-P8; 11 R7; 12 R7/T7; 13 R1-C1; 14 R8/T8; 15 R8-T8 (0.0147); quarterly — 3 R7Q3-C7Q3 (0.0503); 4 R8Q3-T8Q3 (0.0499); 7 R1Q3-R1Q2 (0.0482); 15 R8Q4/T8Q4 (0.0427).]
[Stack classifier result tables (AUC with legitimate/fraud precision and recall at the 1:20 investor cost setting); most rows are garbled in extraction. Legible AUC values from the partial blocks: SVM-RBF 0.853, SVM-Poly 0.797, NNge 0.827, NeuralNet 0.823, NaiveBayes 0.805, JRip 0.837. Final block (per the surrounding discussion, the combined stacks): SVM-Lin 0.904; LogitReg 0.875; J48 0.839; BayesNet 0.893; NaiveBayes 0.851; SVM-RBF 0.894; SVM-Poly 0.865; ADTree 0.896; RandForest 0.857; NBTree 0.887; REPTree 0.888; JRip 0.870; NNge 0.865; NeuralNet 0.865.]
H2 paired t-test p-values:

                                      Legit                    Fraud
  Metrics                        Precision    Recall     Precision    Recall
  H2a: Combined versus Yearly      < 0.001    < 0.001      < 0.001    < 0.001
  H2b: Combined versus Quarterly   < 0.001      0.001      < 0.001    < 0.001†

  †Opposite to hypothesis
that particular instance's 14 classification scores. Since the classifiers assigned a "-1" to instances classified as fraudulent and a "1" to instances considered legitimate, the aggregated scores varied from -14 to 14 (in intervals of 2). Hence, a score of 12 meant that 13 classifiers considered the instance legitimate and 1 deemed it fraudulent. The absolute value |x| of an aggregated score x represents the degree of agreement between stack classifiers. For each |x|, class-level precision and recall measures were computed as follows: all instances with positive aggregated scores (i.e., x > 0) were considered legitimate (i.e., scores of 2, 4, 6, 8, 10, 12, and 14), while those with negative scores (i.e., x < 0) were considered fraudulent. These score-based predictions were compared against the actual class labels to generate a confusion matrix for each |x|. From each confusion matrix, legit/fraud precision and recall were computed, resulting in performance measures for every level of classifier agreement (i.e., minimal agreement of |x| = 2 to maximal agreement of |x| = 14). It is important to note that the 5.8 percent of instances in the test bed that had an aggregated score of 0 were excluded from the analysis (i.e., test instances where the classifiers were evenly split between predictions of legitimate and fraudulent).

Figure 4 shows the analysis results. The first column shows |x|, the absolute value of the aggregated score x for an instance. Columns two, three, five, and six show the legitimate/fraud precision and recall percentages attained on instances with that particular score. Columns four, seven, and eight depict the percentage of the legit, fraud, and total instances in the test bed covered by that score, respectively. The chart at the bottom left shows plots of the precision rates (i.e., columns 2 and 5) and recall rates (columns 3 and 6) for instances with that score. The chart at the bottom right shows cumulative precision and recall rates if using that score as a threshold. The results can be interpreted as follows: 32.7 percent of all instances in the test set had a score of -14 or 14. These instances accounted for 33.4 percent of all legitimate test instances and 24 percent of all fraud test instances. Of these instances, 96.5 percent of the legitimate instances were correctly classified (i.e., x = 14) while the remaining 3.5 percent were misclassified as fraudulent (i.e., x = -14).

The results reveal that these aggregated scores provide a useful mechanism for assessing the confidence level of a particular classification.
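The agreement-score analysis described above can be sketched as follows (illustrative Python, not the authors' code): sum each instance's 14 stack outputs, skip ties (x = 0), and tally a per-level confusion matrix from which precision and recall follow.

```python
# Sketch: precision/recall at each classifier-agreement level |x|.
# Stack outputs and actual labels use 1 = legitimate, -1 = fraudulent.
from collections import defaultdict

def agreement_report(score_rows, actual):
    """score_rows: per-instance lists of -1/1 stack outputs."""
    by_level = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for row, y in zip(score_rows, actual):
        x = sum(row)
        if x == 0:
            continue  # classifiers evenly split: excluded from the analysis
        pred = 1 if x > 0 else -1
        cell = by_level[abs(x)]
        if pred == 1:                                 # called "legitimate"
            cell["tp" if y == 1 else "fp"] += 1
        else:                                         # called "fraudulent"
            cell["tn" if y == -1 else "fn"] += 1
    report = {}
    for level, c in by_level.items():
        report[level] = {
            "legit_precision": c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0,
            "legit_recall":    c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0,
            "fraud_precision": c["tn"] / (c["tn"] + c["fn"]) if c["tn"] + c["fn"] else 0.0,
            "fraud_recall":    c["tn"] / (c["tn"] + c["fp"]) if c["tn"] + c["fp"] else 0.0,
        }
    return report
```

Ranking instances by |x| is then a direct way to prioritize audit or investment attention, in line with the resource-prioritization use the text suggests.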
When all 14 combined stacks agreed (i.e., the absolute value of the sum of their classification scores was 14), the legit and fraud recall rates were 96.5 and 90.8 percent, respectively. Moreover, this accounted for 32 percent of the test instances. Looking at the chart on the left, as expected, lower scores generally resulted in diminished recall rates (one exception was fraud recall, which increased when using a score of 12). Based on the chart on the right, the performance degradation was quite gradual for thresholds greater than 4. For example, using a threshold of 6 or better resulted in legitimate and fraud recall rates of over 90 percent on instances encompassing 79.2 percent of the test bed. Using the aggregated scores as a confidence-level measure can be useful for prioritizing regulator or investor resources. Moreover, the stack scores can also be exploited in a semi-supervised learning manner to further improve performance, as discussed in the next subsection.

We also analyzed the combined stack classifiers' fraud detection rates for different sectors by categorizing the 409 fraud firm-years in the test set based on their top-level NAICS classifications (also referred to as business sectors). While the 20 top-level classifications (e.g., manufacturing, wholesale, retail, construction, mining, etc.) are more general than the 1,017 bottom-level industry classifications in the NAICS hierarchy, which were used to build the industry-level models, aggregating results at the top level provided interesting insights. Figure 5 shows the results for all top-level NAICS codes with at least 10 fraud firm-years in the test set. The table on the left lists the results for the combined stack classifiers, while the chart on the right shows the combined stacks' results relative to the yearly context and baseline classifiers.

The combined stacks performed best on merchandise firms at various stages of the supply chain (e.g., manufacturing, wholesale, retail), with fraud recall of between 94 and 100 percent on such firms. The yearly context and baseline classifiers also performed best on these sectors, although with lower detection rates. The enhanced performance on manufacturing firms is consistent with previous work that has also attained good fraud detection rates on such data (Kirkos et al. 2007; Spathis 2002). The combined stacks had relatively lower fraud recall rates on firms in the information and finance/insurance sectors (73 percent and 77 percent), although still higher than the yearly context (approximately 60 percent) and baseline classifiers (less than 50 percent). Analysis of the test bed revealed that fraud firms in these two sectors were 25 percent more prevalent in the test set, as compared to the training data. Since fraud is often linked to financial distress (Chen and Du 2009; Zhao et al. 2009), the increased number of fraud firms from the information sector appearing in the year 2000 onward could be related to the dot-com bubble burst. Similarly, fraud in the financial sector continues to grow and play a pivotal role in the economy (Stempel 2009). Such temporal changes in fraud patterns attributable to overall economic and industry-level conditions further underscore the importance of adaptive learning approaches (discussed in the next subsection).

H3: Yearly and Quarterly Stack Classifiers Versus Context-Based Classifiers

We conducted paired t-tests to compare the performance of the yearly stacks against the yearly context-based classifiers (H3a) and the quarterly stacks against the quarterly context-based classifiers (H3b).
H3 paired t-test p-values:

                                        Legit                    Fraud
  Metrics                          Precision    Recall     Precision    Recall
  H3a: Yearly Stack-Individual       < 0.001    < 0.001      < 0.001      0.011
  H3b: Quarterly Stack-Individual    < 0.001      0.482        0.008    < 0.001

[Adjacent result rows are garbled in extraction; legible entries include SVM-RBF AUC 0.906, SVM-Poly AUC 0.895, NNge AUC 0.884, and NeuralNet AUC 0.876.]
stack for each year of the test bed (2000-2008). ASL had higher legit/fraud recall rates across years. Margins seemed to improve in later years. The improved performance on both legitimate and fraud firms suggests that adaptive learning is important not only to better detect fraud firms, but also in order to react to changes in non-fraud firms. The biggest improvements in fraud recall were in the information and financial and insurance sectors (over 5 percent), the two sectors where the combined stacks underperformed. Given the adversarial nature of fraud detection (Virdhagriswaran and Dakin 2006), one may expect fraud detection rates for static models to deteriorate over time, as evidenced by the decreasing performance for the combined stacks from 2004 onward. In contrast, ASL's fraud performance holds steady across these years.

However, this analysis assumes that no new training data is made available during the test time period 2000-2008 (i.e., only data through 1999 was used for training). In order to evaluate the impact of having additional training data on the performance of ASL and the combined stacks, we used an expanding window approach. For a given test year a, all instances up to and including year a-2 were used for training (e.g., test year 2002 was evaluated using training data through 2000). A two-year window was used since prior research has noted that the median time needed to detect financial fraud is 26.5 months (Beneish 1999b). The dotted lines in Figure 7 show the results for these "dynamic" ASL and stack classifiers. Based on the results, as expected, the dynamic stacks outperformed the static combined stacks. The gains in legitimate and fraud recall were most pronounced for 2005 onward. Dynamic ASL also improved performance over ASL, particularly for fraud recall. Moreover, ASL outperformed the dynamic stack in legitimate recall for all 7 years and on fraud recall for 6 of 7 years, while dynamic ASL dominated the dynamic stack with respect to legitimate/fraud recall. Appendix F provides additional analysis which shows that ASL outperformed dynamic stacks for any window length between 1 and 5 years. The results suggest that ASL is able to effectively leverage existing knowledge, including knowledge gained during the detection process, toward enhanced subsequent detection of firms.

To illustrate this point, we analyzed fraud firm-years correctly detected by ASL that were not identified by the combined stacks (we refer to these instances as y firm-years). Sensitivity analysis was performed to see how fraud firms previously added to the classification models (x firm-years) subsequently impacted ASL's prediction scores for each of these y firm-years. This was done by failing to add each x firm-year to the model, one at a time, in order to assess their individual impact on the ASL prediction scores for the y firm-years. One example generated as a result of the analysis is depicted in Figure 8. The four gray nodes represent y firm-years: fraud instances correctly identified by ASL (that were misclassified by the combined stacks). White nodes represent x firm-years: fraud instances from the test set that were added to the classification models by ASL. Node labels indicate firms' names and top-level NAICS categories. The x-axis displays the years associated with these firm-years (2000-2005). A directed link from a white node to a gray one can be interpreted as that x firm-year influencing the prediction score of the y firm-year. The nature and extent of the influence is indicated by the number along the link. For example, a "2"
means that the absence of that particular x firm-year from the classification model worsened the y firm-year's ASL score by two. For example, the addition of MCSi in 2002 improved the prediction score for Natural Health in 2005 by one.

Figure 8 provides several important insights regarding the adaptive learning component of the MetaFraud framework. It reveals that new fraud firm-year detections were not necessarily attained simply by adding one or two additional fraud cases. Rather, they were sometimes the result of a complex series of modifications to the training models. In some cases, these correct predictions were the culmination of modifications spanning several years of data. For example, the detection of Interpublic Group and Natural Health in 2005 leveraged 6-10 x firm-years between 2000 and 2005. However, firms from the same year also played an important role, as evidenced by the positive impact Diebold and Dana Holding had on Natural Health in 2005. Interestingly, business sector affiliations also seemed to play a role in the adaptive learning process: several of the links in the figure are between firms in the same sector. For example, three of the firms that influenced Natural Health were also from the wholesale trade sector, while one or two of the firms that impacted Charter Communications and Interpublic Group were from the information and professional services sectors, respectively. It is also important to note that not all x firm-years' additions to the models had a positive impact. For example, both Veritas Solutions and Cornerstone (from 2000) worsened Bennett Environmental's prediction score by one. Thus, Figure 8 sheds light on how ASL was able to improve the detection of fraudulent firms. It is important to note that in our analysis, consistent with prior work, we represent each fraud firm based on the year in which the fraud occurred (Cecchini et al. 2011). An interesting future direction would be to also consider when the fraud was actually discovered, and to use this information to retrospectively identify previously undetected frauds committed in earlier years.

Evaluating MetaFraud in Comparison with Existing Fraud Detection Methods

We evaluated MetaFraud in comparison with three prior approaches that attained state-of-the-art results: Kirkos et al. (2007), Gaganis (2009), and Cecchini et al. (2010). Each of these three approaches was run on our test bed, in comparison with the overall results from the proposed meta-learning framework. Kirkos et al. and Gaganis each attained good results on Greek firms, with overall accuracies ranging from 73 to 90 percent and 76 to 87 percent, respectively. Cecchini et al. attained excellent results on U.S. firms, with fraud detection rates as high as 80 percent. In comparison, prior studies using public data all had fraud detection rates of less than 70 percent. The details regarding the three comparison approaches are as follows.

Kirkos et al. used a set of 10 ratios/measures in combination with three classifiers: ID3 decision tree, neural network, and Bayesian network. We replicated their approach by including the same 10 ratios/measures: debt to equity, sales to total assets, sales minus gross margin, earnings before income tax, working capital, Altman's Z score, total debt to total assets, net profit to total assets, working capital to total assets, and gross profit to total assets. We also tuned the three classification algorithms as they did in their study (e.g., the number of hidden layers and nodes on the neural network).

Gaganis used 7 financial ratios in combination with 10 classification algorithms. We replicated his approach by including the same seven ratios: receivables to sales, current assets to current liabilities, current assets to total assets, cash to total assets, profit before tax to total assets, inventories to total assets, and annual change in sales. We ran many of his classification methods that had performed well, including neural network, linear/polynomial/RBF SVMs, logistic regression, k-nearest neighbor, and different types of discriminant analysis. The parameters of all techniques were tuned as was done in the original study. We included the results for the three methods with the best performance: linear SVM, Neural Net, and Logit.

Cecchini et al. used an initial set of 40 variables. After preprocessing, the remaining 23 variables were used as input in their financial kernel. Following their guidelines, we began with the same 40 variables and removed 19 after preprocessing, resulting in 21 input variables for the financial kernel.

All comparison techniques were run using the same training and testing data used in our prior experiments. ROC curves were generated, and AUC for the ROC curves was computed (as with the previous experiments). For MetaFraud, we generated a final prediction for each test case by aggregating the predictions of the 14 ASL classifiers to derive a single stack score for each instance (as described earlier). We used the results of the ASL classifiers since these results are based on the amalgamation of all four phases of MetaFraud and therefore signify the final output of the meta-learning framework.

Table 16 and Figure 9 show the experiment results. Table 16 depicts the results for MetaFraud (MF) and the comparison methods using both the regulator (1:10) and investor (1:20) cost settings. In addition to legitimate and fraud recall, we report precision and the error cost per firm-year across
Table 16 (fragment):

                          Regulator (1:10)                          Investor (1:20)
                          Legit           Fraud                     Legit           Fraud
Setting          AUC      Prec.   Rec.    Prec.   Rec.    Cost      Prec.   Rec.    Prec.   Rec.    Cost
MetaFraud (MF)   0.931    98.3    90.2    41.8    81.5    $100.5    98.4    (remaining values illegible)
Cecchini         0.818    97.8    79.8    25.4    79.5    $149.2    98.0    74.7    21.9    82.2    $620.5
various cost settings. With respect to the comparison methods, the financial kernel outperformed Kirkos et al. and Gaganis in terms of overall AUC and legit/fraud recall for both cost settings, as evidenced by Table 16. While the best results for the financial kernel were somewhat lower than those attained in the Cecchini et al. study, the overall accuracy results for the Kirkos et al. and Gaganis methods were significantly lower than those reported in their studies (i.e., 15-20 percent). As previously alluded to, this is likely due to the application of these approaches to an entirely different set of data: U.S. firms from various sectors as opposed to Greek manufacturing firms (Barth et al. 2008).

In order to further assess the effectiveness of the MetaFraud framework, we ran MetaFraud using the ratios/measures utilized by Kirkos et al., Gaganis, and Cecchini et al. We called these MF-Kirkos, MF-Gaganis, and MF-Cecchini. For all three feature sets, we derived the industry and organizational context measures from the quarterly and annual statements. For instance, the 7 Gaganis ratios were used to generate 49 annual and 196 quarterly attributes (see Tables 3 and 4 for details). Similarly, the 21 Cecchini et al. measures were used to develop 147 annual and 588 quarterly features. We then ran the combined stack and ASL modules and computed a single performance score for all 991 cost settings (i.e., 1:1 to 1:100), as done in the previous experiment. Table 17 and Figure 10 show the experiment results.

Based on the table, the results for MF-Cecchini, MF-Gaganis, and MF-Kirkos were considerably better than the best comparison results reported in Table 16 for both cost settings. Figure 10 shows the ROC curves for MF-Cecchini, MF-Kirkos, and MF-Gaganis and the comparison techniques. MF is also included to allow easier comparisons with the results from Table 16 and Figure 9. Using MetaFraud improved performance for all three feature sets over the best techniques adopted in prior studies. MF-Cecchini outperformed MF-Kirkos and MF-Gaganis. The lower performance of MF-Kirkos and MF-Gaganis relative to MF-Cecchini was attributable to the fact that the ratios of these methods were less effective on nonmanufacturing firms. Interestingly, the MF-Cecchini ROC curve was very similar to the one generated using MetaFraud with the 12 ratios (i.e., MF). This is because the measures employed by Cecchini et al. (2010) include many of these baseline ratios. Their financial kernel implicitly developed 8 of the 12 ratios (see "Financial Ratios" for details): asset turnover (R2), depreciation index (R5), inventory growth (R7), leverage (R8), operating performance margin (R9), receivables growth (R10), sales growth (R11), and SGE expense (R12). Moreover, they also used variations of asset quality index (R1) and gross margin index (R6). On the flip side, while the larger input feature space for MF-Cecchini did garner slightly better fraud recall rates (relative to MF), the legitimate recall values were 4 to 5 percent lower for both cost settings. Consequently, MF had a better financial impact than MF-Cecchini. Overall, the results demonstrate the efficacy of MetaFraud as a viable mechanism for detecting financial fraud.

H5: MetaFraud Versus Comparison Methods

We conducted paired t-tests to compare the performance of MetaFraud against the seven comparison settings shown in Table 16. We compared cost settings of 5 through 50 in increments of 1 (n = 46). MetaFraud significantly outperformed the comparison methods on precision and recall (all p-values < 0.001). We also ran t-tests to compare MF-Kirkos, MF-Gaganis, and MF-Cecchini against their respective settings from Table 16. Once again, using MetaFraud significantly improved performance, with all p-values less than 0.001. The t-test results support H5 and suggest that the proposed meta-learning framework can enhance fraud detection performance over the results achieved by existing methods.

Evaluating MetaFraud in Comparison with Existing Semi-Supervised Learning Methods

In order to demonstrate the effectiveness of the procedural bias improvement mechanisms incorporated by MetaFraud over existing ensemble-based semi-supervised learning methods, we compared MetaFraud against Tri-Training (Zhou and Li 2005). Tri-Training has outperformed existing semi-supervised learning methods on several test beds, across various application domains. It uses an ensemble of three classifiers. In each iteration, the predictions for all test instances on which two classifiers j and k agree are added to the third classifier's (i.e., classifier i's) training set with the predicted class labels, provided that the estimated error rate for instances where j and k agree has improved since the previous iteration. We ran Tri-Training on the base ratios as well as those proposed by the three comparison studies. In order to isolate the impact of MetaFraud's procedural bias improvement methods, we ran Tri-Training using the context-based features for all four sets of ratios (as done with MetaFraud). For Tri-Training, we evaluated various combinations of classifiers and found that the best results were generally attained when using Bayes Net, Logit, and J48 in conjunction. For these three classifiers, we then ran all 991 cost settings as done in the H5 experiments. Consistent with the settings used by MetaFraud's ASL algorithm, Tri-Training was run on test instances in chronological order (i.e., all year 2000 instances were evaluated before moving on to the 2001 data).
Table 17 (fragment):

                          Regulator (1:10)                          Investor (1:20)
                          Legit           Fraud                     Legit           Fraud
Setting          AUC      Prec.   Rec.    Prec.   Rec.    Cost      Prec.   Rec.    Prec.   Rec.    Cost
MF-Cecchini      0.922    98.2    86.0    33.6    81.7    $116.7    98.3    84.8    32.1    83.4    $485.7
MF-Kirkos        (values illegible in source)
MF-Gaganis       (values illegible in source)
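The Cost columns in Tables 16 and 17 report an average error cost per firm-year under a relative cost setting (the regulator and investor settings weight a missed fraud at roughly 10 and 20 times a false alarm, respectively). One simple way to compute such a metric is sketched below; the unit cost and confusion counts are illustrative placeholders, not the study's figures:

```python
def avg_error_cost(tp, fp, tn, fn, fp_unit_cost=1.0, cost_ratio=10):
    """Average misclassification cost per firm-year.

    cost_ratio: how much costlier a missed fraud (false negative) is than a
    false alarm (false positive), e.g., 10 (regulator) or 20 (investor).
    fp_unit_cost is an illustrative unit cost, not the study's dollar figure.
    """
    total = tp + fp + tn + fn
    return (fp * fp_unit_cost + fn * fp_unit_cost * cost_ratio) / total

# Illustrative confusion counts for a test bed of 1,950 firm-years:
for ratio in (10, 20):
    cost = avg_error_cost(tp=80, fp=150, tn=1700, fn=20, cost_ratio=ratio)
    print(f"1:{ratio} setting -> average cost {cost:.3f} per firm-year")
```

Because missed frauds are weighted more heavily, the same confusion matrix yields a higher per-firm-year cost under the investor setting than under the regulator setting.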
Table 18 and Figure 11 show the experiment results, including overall AUC, legit/fraud recall, and error cost per firm-year for both cost settings. Figure 11 shows the ROC curves for MF-Cecchini, MF-Kirkos, and MF-Gaganis in comparison with the three Tri-Training classifiers. Based on the results, it is evident that MetaFraud outperformed Tri-Training, both in terms of best performance results (see the top of Table 16 and Table 17 versus Table 18) and across cost settings. While some settings of Tri-Training improved performance over the original methods run with those features, the performance gains were small in comparison to the improvements garnered by MF-Cecchini, MF-Gaganis, MF-Kirkos, and MF.

The results from experiment 1 (and H1) demonstrated that incorporating industry-level and organizational context information, whether at the yearly or quarterly level, can improve performance over just using annual-statement-based ratios devoid of context information. In experiment 2, combining yearly and quarterly information yielded the best results, as they provide complementary information (H2). Experiment 2 also supported the notion that the ability of stack classifiers to exploit disparate underlying classifiers enables them to improve classification performance (H3). Experiment 3 revealed that the proposed adaptive semi-supervised learning algorithm further improved performance by leveraging the underlying classifications with the highest confidence levels (H4). Experiments 4 and 5 showed that, collectively, the proposed meta-learning framework was able to outperform comparison state-of-the-art methods (H5 and H6).

Our research contribution is the MetaFraud framework for financial fraud detection.
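The confidence-based selection underlying the adaptive semi-supervised learning result (H4) summarized above can be sketched as follows. This is a simplified illustration of picking the highest-confidence classifications for model updating; it is not the study's exact ASL algorithm, and the scores shown are placeholders:

```python
def select_confident(scores, threshold=0.9):
    """Select instances whose ensemble fraud probability is most extreme.

    scores: list of (instance_id, p_fraud) pairs from the underlying
    classifiers' aggregated output. Instances passing the confidence
    threshold receive pseudo-labels and can be added to the training data
    (a simplification of the adaptive learning update step).
    """
    selected = []
    for idx, p_fraud in scores:
        confidence = max(p_fraud, 1.0 - p_fraud)   # distance from the 0.5 boundary
        if confidence >= threshold:
            label = "fraud" if p_fraud >= 0.5 else "legitimate"
            selected.append((idx, label, confidence))
    return selected

# Illustrative ensemble scores (not from the study):
scores = [(0, 0.97), (1, 0.55), (2, 0.08), (3, 0.92), (4, 0.40)]
for idx, label, conf in select_confident(scores):
    print(f"firm-year {idx}: pseudo-labeled {label} (confidence {conf:.2f})")
```

Only the confidently classified firm-years are fed back into the models; borderline cases (e.g., p_fraud near 0.5) are left out, which is what lets the adaptive step improve subsequent detections without amplifying noisy predictions.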
The findings also have implications for the design and development of financial fraud detection systems that integrate predictive and analytical business intelligence technologies, thereby allowing analysts to draw their own conclusions (Bay et al. 2006; Coderre 1999; Virdhagriswaran and Dakin 2006). By combining a rich feature set with a robust classification mechanism that incorporates adaptive learning, meta-learning-based systems could provide fraud risk ratings, real-time alerts, analysis and visualization of fraud patterns (using various financial ratios in the feature sets), and trend analyses of fraud detection patterns over time (utilizing the adaptive learning component). As business intelligence technologies continue to become more pervasive (Watson and Wixom 2007), such systems could represent a giant leap forward, allowing fraud detection tools to perform at their best when combined with human expertise.

References

Association of Certified Fraud Examiners. 2010. "2010 Global Fraud Study: Report to the Nations on Occupational Fraud and Abuse," Association of Certified Fraud Examiners, Austin, TX.
Balcan, M. F., Blum, A., and Yang, K. 2005. "Co-Training and Expansion: Towards Bridging Theory and Practice," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou (eds.), Cambridge, MA: MIT Press, pp. 89-96.
Bankruptcydata.com. "20 Largest Public Domain Company Bankruptcy Filings 1980-Present" (http://www.bankruptcydata.com/Research/Largest Overall All-Time; accessed July 8, 2010).
Barth, M., Landsman, W., and Lang, M. 2008. "International Accounting Standards and Accounting Quality," Journal of Accounting Research (46:3), pp. 467-498.
Bay, S., Kumaraswamy, K., Anderle, M. G., Kumar, R., and Steier, D. M. 2006. "Large Scale Detection of Irregularities in Accounting Data," in Proceedings of the 6th IEEE International Conference on Data Mining, Hong Kong, December 18-22, pp. 75-86.
Bayes, T. 1958. "Studies in the History of Probability and Statistics: XI. Thomas Bayes' Essay Towards Solving a Problem in the Doctrine of Chances," Biometrika (45), pp. 293-295.
Beasley, M. S., Carcello, C. V., Hermanson, D. R., and Lapides, P. D. 2000. "Fraudulent Financial Reporting: Consideration of Industry Traits and Corporate Governance Mechanisms," Accounting Horizons (14:4), pp. 441-454.
Beneish, M. D. 1999a. "The Detection of Earnings Manipulation," Financial Analysts Journal (55:5), pp. 24-36.
Beneish, M. D. 1999b. "Incentives and Penalties Related to Earnings Overstatements that Violate GAAP," The Accounting Review (74:4), pp. 425-457.
Bolton, R. J., and Hand, D. J. 2002. "Statistical Fraud Detection: A Review," Statistical Science (17:3), pp. 235-255.
Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., and Simoudis, E. 1996. "Mining Business Databases," Communications of the ACM (39:11), pp. 42-48.
Brázdil, P., Giraud-Carrier, C., Soares, C., and Vilalta, R. 2008. Metalearning: Applications to Data Mining, Berlin: Springer-Verlag.
Breiman, L. 2001. "Random Forests," Machine Learning (45:1), pp. 5-32.
Carson, T. 2003. "Self-interest and Business Ethics: Some Lessons of the Recent Corporate Scandals," Journal of Business Ethics (43:4), pp. 389-394.
Cecchini, M., Aytug, H., Koehler, G., and Pathak, P. 2010. "Detecting Management Fraud in Public Companies," Management Science (56:7), pp. 1146-1160.
Chai, W., Hoogs, B., and Verschueren, B. 2006. "Fuzzy Ranking of Financial Statements for Fraud Detection," in Proceedings of the IEEE International Conference on Fuzzy Systems, Vancouver, BC, July 16-21, pp. 152-158.
Chan, P. K., Fan, W., Prodromidis, A. L., and Stolfo, S. J. 1999. "Distributed Data Mining in Credit Card Fraud Detection," IEEE Intelligent Systems (14:6), pp. 67-74.
Chan, P., and Stolfo, S. 1993. "Toward Parallel and Distributed Learning by Meta-Learning," in Proceedings of the Knowledge Discovery in Databases Workshop, pp. 227-240.
Chapelle, O., Schölkopf, B., and Zien, A. 2006. Semi-Supervised Learning, Cambridge, MA: MIT Press.
Charles, S. L., Glover, S. M., and Sharp, N. Y. 2010. "The Association Between Financial Reporting Risk and Audit Fees Before and After the Historic Events Surrounding SOX," Auditing: A Journal of Practice and Theory (29:1), pp. 15-39.
Chen, W., and Du, Y. 2009. "Using Neural Networks and Data Mining Techniques for the Financial Distress Prediction Model," Expert Systems with Applications (36), pp. 4075-4086.
Chung, W., Chen, H., and Nunamaker Jr., J. F. 2005. "A Visual Framework for Knowledge Discovery on the Web: An Empirical Study of Business Intelligence Exploration," Journal of Management Information Systems (21:4), pp. 57-84.
Coderre, D. 1999. "Computer-Assisted Techniques for Fraud Detection," The CPA Journal (69:8), pp. 57-59.
Cohen, W. W. 1995. "Fast Effective Rule Induction," in Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, July 9-12, pp. 115-123.
Cox, R. A. K., and Weirich, T. R. 2002. "The Stock Market Reaction to Fraudulent Financial Reporting," Managerial Auditing Journal (17:7), pp. 374-382.
Dechow, P., Ge, W., Larson, C., and Sloan, R. 2011. "Predicting Material Accounting Misstatements," Contemporary Accounting Research (28:1), pp. 1-16.
Deloitte. 2010. "Deloitte Poll: Majority Expect More Financial Statement Fraud Uncovered in 2010, 2011," April 27 (http://www.deloitte.com/view/en_US/us/Services/Financial-Advisory-Services/7ba0852e4de38210VgnVCM200000bb42f00aRCRD.htm; accessed July 8, 2010).
Dikmen, B., and Küçükkocaoğlu, G. 2010. "The Detection of Earnings Manipulation: The Three-Phase Cutting Plane Algorithm Using Mathematical Programming," Journal of Forecasting (29:5), pp. 442-466.
Dull, R., and Tegarden, D. 2004. "Using Control Charts to Monitor Financial Reporting of Public Companies," International Journal of Accounting Information Systems (5:2), pp. 109-127.
Dybowski, R., Laskey, K. B., Myers, J. W., and Parsons, S. 2003. "Introduction to the Special Issue on the Fusion of Domain Knowledge with Data for Decision Support," Journal of Machine Learning Research (4), pp. 293-294.
Dzeroski, S., and Zenko, B. 2004. "Is Combining Classifiers with Stacking Better than Selecting the Best One?" Machine Learning (54:3), pp. 255-273.
Fanning, K. M., and Cogger, K. O. 1998. "Neural Network Detection of Management Fraud Using Published Financial Data," International Journal of Intelligent Systems in Accounting and Finance Management (7), pp. 21-41.
Fawcett, T., and Provost, F. 1997. "Adaptive Fraud Detection," Data Mining and Knowledge Discovery (1), pp. 291-316.
Freund, Y., and Mason, L. 1999. "The Alternating Decision Tree Learning Algorithm," in Proceedings of the 16th International Conference on Machine Learning, Bled, Slovenia, June 27-30, pp. 124-133.
Gaganis, C. 2009. "Classification Techniques for the Identification of Falsified Financial Statements: A Comparative Analysis," International Journal of Intelligent Systems in Accounting and Finance Management (16), pp. 207-229.
Giraud-Carrier, C., Vilalta, R., and Brázdil, P. 2004. "Introduction to the Special Issue on Meta-Learning," Machine Learning (54), pp. 187-193.
Green, B. P., and Calderon, T. G. 1995. "Analytical Procedures and the Auditor's Capacity to Detect Management Fraud," Accounting Enquiries: A Research Journal (5:2), pp. 1-48.
Green, B. P., and Choi, J. H. 1997. "Assessing the Risk of Management Fraud through Neural Network Technology," Auditing (16:1), pp. 14-28.
Hansen, J. V., and Nelson, R. D. 2002. "Data Mining of Time Series Using Stacked Generalizers," Neurocomputing (43), pp. 173-184.
Hevner, A. R., March, S. T., Park, J., and Ram, S. 2004. "Design Science in Information Systems Research," MIS Quarterly (28:1), pp. 75-105.
Hu, M. Y., and Tsoukalas, C. 2003. "Explaining Consumer Choice through Neural Networks: The Stacked Generalization Approach," European Journal of Operational Research (146:3), pp. 650-660.
Kaminski, K. A., Wetzel, T. S., and Guan, L. 2004. "Can Financial Ratios Detect Fraudulent Financial Reporting?" Managerial Auditing Journal (19:1), pp. 15-28.
Kinney, W. R., Jr. 1987. "Attention-Directing Analytical Review Using Accounting Ratios: A Case Study," Auditing: A Journal of Practice and Theory, Spring, pp. 59-73.
Kirkos, E., Spathis, C., and Manolopoulos, Y. 2007. "Data Mining Techniques for the Detection of Fraudulent Financial Statements," Expert Systems with Applications (32), pp. 995-1003.
Kohavi, R. 1996. "Scaling Up the Accuracy of Naïve Bayes Classifiers: A Decision Tree Hybrid," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, August 2-4, pp. 202-207.
Kuhn, J. R., Jr., and Sutton, S. G. 2006. "Learning from WorldCom: Implications for Fraud Detection and Continuous Assurance," Journal of Emerging Technologies in Accounting (3), pp. 61-80.
Kuncheva, L. I., and Whitaker, C. J. 2003. "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy," Machine Learning (51:2), pp. 181-207.
Langley, P., Zytkow, J. M., Simon, H. A., and Bradshaw, G. L. 1986. "The Search for Regularity: Four Aspects of Scientific Discovery," in Machine Learning: An Artificial Intelligence Approach, Vol. II, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell (eds.), San Francisco: Morgan Kaufmann, pp. 425-470.
Lin, J. W., Hwang, M. I., and Becker, J. D. 2003. "A Fuzzy Neural Network for Assessing the Risk of Fraudulent Financial Reporting," Managerial Auditing Journal (18:8), pp. 657-665.
Lynam, T. R., and Cormack, G. V. 2006. "On-line Spam Filtering Fusion," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, August 6-11, pp. 123-130.
Maletta, M., and Wright, A. 1996. "Audit Evidence Planning: An Examination of Industry Error Characteristics," Auditing: A Journal of Practice and Theory (15), pp. 71-86.
March, S. T., and Smith, G. 1995. "Design and Natural Science Research on Information Technology," Decision Support Systems (15:4), pp. 251-266.
Markus, M. L., Majchrzak, A., and Gasser, L. 2002. "A Design Theory for Systems That Support Emergent Knowledge Processes," MIS Quarterly (26:3), pp. 179-212.
Martin, B. 1995. "Instance-Based Learning: Nearest Neighbor with Generalization," unpublished Master's Thesis, University of Waikato, Computer Science Department, Hamilton, New Zealand.
Matheus, C. J., and Rendell, L. A. 1989. "Constructive Induction on Decision Trees," in Proceedings of the 11th International Joint Conference on Artificial Intelligence, San Mateo, CA: Morgan Kaufmann, pp. 645-650.
Michalewicz, Z., Schmidt, M., Michalewicz, M., and Chiriac, C. 2007. Adaptive Business Intelligence, New York: Springer.
Mitchell, T. M. 1997. Machine Learning, New York: McGraw-Hill.
Persons, O. S. 1995. "Using Financial Statement Data to Identify Factors Associated with Fraudulent Financial Reporting," Journal of Applied Business Research (11:3), pp. 38-46.
Phua, C., Alahakoon, D., and Lee, V. 2004. "Minority Report in Fraud Detection: Classification of Skewed Data," ACM SIGKDD Explorations Newsletter (6:1), pp. 50-59.
Piramuthu, S., Ragavan, H., and Shaw, M. J. 1998. "Using Feature Construction to Improve the Performance of Neural Networks," Management Science (44:3), pp. 416-430.
Quinlan, R. 1986. "Induction of Decision Trees," Machine Learning (1:1), pp. 81-106.
Rendell, L., Seshu, R., and Tcheng, D. 1987. "Layered Concept-Learning and Dynamically-Variable Bias Management," in Proceedings of the 10th International Joint Conference on Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 308-314.
Shmueli, G., Patel, N., and Bruce, P. 2010. Data Mining for Business Intelligence (2nd ed.), Hoboken, NJ: Wiley & Sons.
Sigletos, G., Paliouras, G., Spyropoulos, C. D., and Hatzopoulos, M. 2005. "Combining Information Extraction Systems Using Voting and Stacked Generalization," Journal of Machine Learning Research (6), pp. 1751-1782.
Spathis, C. T. 2002. "Detecting False Financial Statements Using Published Data: Some Evidence from Greece," Managerial Auditing Journal (17:4), pp. 179-191.
Spathis, C. T., Doumpos, M., and Zopounidis, C. 2002. "Detecting Falsified Financial Statements: A Comparative Study Using Multicriteria Analysis and Multivariate Statistical Techniques," The European Accounting Review (11:3), pp. 509-535.
Stempel, J. 2009. "Fraud Seen Growing Faster in Financial Sector," Reuters, October 19 (http://www.reuters.com/article/2009/10/19/businesspro-us-fraud-study-idUSTRE59I55920091019).
Storey, V., Burton-Jones, A., Sugumaran, V., and Purao, S. 2008. "CONQUER: A Methodology for Context-Aware Query Processing on the World Wide Web," Information Systems Research (19:1), pp. 3-25.
Summers, S. L., and Sweeney, J. T. 1998. "Fraudulently Misstated Financial Statements and Insider Trading: An Empirical Analysis," The Accounting Review (73:1), pp. 131-146.
Tian, Y., Weiss, G. M., and Ma, Q. 2007. "A Semi-Supervised Approach for Web Spam Detection Using Combinatorial Feature-Fusion," in Proceedings of the ECML Graph Labeling Workshop's Web Spam Challenge, Warsaw, Poland, September 17-21, pp. 16-23.
Ting, K. M., and Witten, I. H. 1997. "Stacked Generalization: When Does It Work?" in Proceedings of the 15th International Joint Conference on Artificial Intelligence, San Francisco: Morgan Kaufmann, pp. 866-871.
Tsoumakas, G., Angelis, L., and Vlahavas, I. 2005. "Selective Fusion of Heterogeneous Classifiers," Intelligent Data Analysis (9), pp. 511-525.
Vapnik, V. 1999. The Nature of Statistical Learning Theory, Berlin: Springer-Verlag.
Vercellis, C. 2009. Business Intelligence: Data Mining and Optimization for Decision Making, Hoboken, NJ: Wiley.
Vilalta, R., and Drissi, Y. 2002. "A Perspective View and Survey of Meta-Learning," Artificial Intelligence Review (18), pp. 77-95.
(36), pp. 2633-2644.
Zhou, Y., and Goldman, S. 2004. "Democratic Co-learning," in Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pp. 594-202.
Zhou, Z., and Li, M. 2005. "Tri-Training: Exploiting Unlabeled Data Using Three Classifiers," IEEE Transactions on Knowledge and Data Engineering (17:11), pp. 1529-1541.

About the Authors

ACM Transactions on Information Systems. He is a member of the AIS and IEEE.

security consultant and fraud analyst for Deloitte. His work is published in MIS Quarterly, Journal of Management Information Systems, European Journal of Information Systems, Journal of the American Society for Information Science and Technology, Information & Management, Journal of Organizational and End User Computing, and Communications of the AIS. His research interests are information security and trust in information systems.