
Received 25 March 2024, accepted 2 April 2024, date of publication 11 April 2024, date of current version 9 May 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3387841

A Distributed Knowledge Distillation Framework for Financial Fraud Detection Based on Transformer

YUXUAN TANG1 AND ZHANJUN LIU2
1 School of Accounting, Southwestern University of Finance and Economics, Chengdu, Sichuan 611130, China
2 School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Corresponding author: Yuxuan Tang ([email protected])

The associate editor coordinating the review of this manuscript and approving it for publication was Jolanta Mizera-Pietraszko.

ABSTRACT Financial fraud cases that cause serious damage to the interests of investors are not uncommon. As a result, a wide range of intelligent detection techniques have been put forth to support financial institutions' decision-making. Existing methods, however, suffer from problems such as poor detection accuracy, slow inference speed, and weak generalization ability. We therefore propose a distributed knowledge distillation architecture for financial fraud detection based on Transformer. Firstly, the multi-head attention mechanism is used to weight the features, feed-forward neural networks then extract high-level features that incorporate the relevant information, and finally a neural network classifies financial fraud. Secondly, to address the inconsistent financial data indicators and unbalanced data distributions found across different industries, a distributed knowledge distillation algorithm is proposed. This algorithm combines the detection knowledge of a multi-teacher network and migrates that knowledge to student networks, which detect the financial data of the different industries. The final experimental results show that the proposed method outperforms traditional detection methods in terms of F1 score (92.87%), accuracy (98.98%), precision (81.48%), recall (95.45%), and AUC score (96.73%).

INDEX TERMS Transformer, knowledge distillation, financial fraud detection.

I. INTRODUCTION
The number of listed firms is increasing quickly due to ongoing social and economic development, and their place in the global economy is vital. However, cases of financial fraud are frequent, causing great losses to the majority of investors and arousing discussion in all sectors of society. In China, the number of criminals involved in financial counterfeiting activities in 2019 exceeded 961, with a total value of more than US$8 billion [1]. These instances have damaged the faith of numerous investors, had a detrimental impact on the capital markets, and increased financial market volatility [2], [3]. To address these counterfeiting issues, the development of new detection methods is imperative. Currently, there are two main means of detecting counterfeiting by listed companies: one is to audit and analyze the company's financial data, and the other is to detect whether there is any suspicion of counterfeiting through big data-driven machine learning algorithms [4]. Manual audits and reviews of publicly traded corporations' financial statements are examples of traditional financial analysis techniques; however, they are expensive, time-consuming, and prone to error [4]. These methods are not foolproof, and as the methods of financial fraud continue to evolve, it is difficult for practitioners to detect new patterns of fraud. Also, certain anomalies may be legitimate business practices, rendering such methods less feasible. Big data-driven machine learning algorithms were then applied to detect financial fraud, an area where computers are more adept at data analysis than people when dealing with large amounts of data, particularly high-dimensional features. The effectiveness of machine learning models in financial forgery detection has been demonstrated in the literature [5]. But forgers continue to innovate and adopt new concealment methods, making it difficult for traditional machine learning detection methods to detect and identify new forgery methods in a timely manner, and correlations between data features are difficult for these models to learn. It is hard for a model to extract the information most critical to the task at hand from complex and large sets of data features, so the performance of existing counterfeiting detection models is greatly limited. More significantly, by identifying the relationships between features, an attention model can uncover more concealed counterfeiting information and investigate more counterfeiting patterns. For example, the literature [6] proposes a two-level attention model that captures deep representations of features at the data sample level and the feature level, respectively.

Existing financial fraud detection methods are mostly based on machine learning and deep learning algorithms [4]. These techniques pay less attention to the internal correlations within financial data and instead concentrate on mining the fundamental features of the data. Additionally, different industries may encounter different challenges in financial data fraud, and the internal correlations of financial data features differ across industries. Furthermore, with the continuous growth in the scale of financial data, these models become increasingly deep and complex, resulting in issues such as model bloat and slow inference speed. Therefore, how to effectively mine the internal correlations of financial data, compress the model size, and enhance the model's ability to detect financial data falsification in different industries is a new direction for researchers to explore. To address the above problems, this research proposes a distributed knowledge distillation architecture based on Transformer. The method uses a multi-head attention mechanism to extract the internal correlations of the data; high-level features that contain the information relevant to the financial data are then extracted through a feed-forward neural network, which is combined with a neural network classifier to classify financial data fraud. Secondly, to address the problem of inconsistent financial data indicators and unbalanced data distributions across different industries, and to reduce the complexity of the financial fraud detection model while improving its accuracy, this paper proposes a distributed knowledge distillation algorithm. The algorithm migrates the detection knowledge of the multi-teacher network to the student network separately, and the student network detects the financial data of different industries. The final experimental results show that the proposed method achieves a better F1 score, accuracy, precision, recall, and AUC score than traditional detection methods, improving the accuracy of financial forgery detection.

The following are the primary contributions of our research:
(1) For financial fraud detection, considering that the Transformer has strong generalization and expressive ability, it adapts easily to diverse financial data. We therefore propose a financial fraud detection model based on Transformer, which utilizes the multi-head attention mechanism and a feed-forward neural network to mine high-level features that incorporate the relevant information of financial data, thus improving the characterization of data relevance.
(2) To address the problem of inconsistent financial data indicators and unbalanced data distributions across different industries, and to reduce the complexity of the financial fraud detection model while improving its accuracy, this paper proposes a distributed knowledge distillation algorithm. The algorithm migrates the detection knowledge of the multi-teacher network to the student network separately, and the student network detects the financial data of different industries.
(3) The proposed distributed network was evaluated on the dataset of the 9th ''TipDM Cup'' listed company financial analysis competition. Experimental results demonstrate that our proposed financial fraud detection method based on Transformer with distributed knowledge distillation outperforms traditional tree models and ensemble models in key performance metrics on the dataset. This confirms the feasibility and effectiveness of our proposed method.

The rest of the paper is structured as follows: the second part reviews related research; the third part introduces our proposed model for financial fraud detection; the fourth part describes the distributed knowledge distillation framework for detecting fraudulent data in different industries; and the experimental results are discussed in the fifth part. Finally, Part VI summarizes the conclusions of this study.
II. BACKGROUND AND RELATED WORK
A. TRADITIONAL FINANCIAL FRAUD DETECTION METHODS
Financial fraud detection technology can lower investor losses, preserve equity and justice in the trading market, and assist the China Securities Regulatory Commission (CSRC) in determining whether listed businesses are suspected of fraud. Traditional approaches for determining a listed firm's involvement in fraudulent operations rely on analyzing financial data, information from listed firms, and third-party evidence. With the continuous development of science and technology, detection methods for fraud have also made significant progress. Artificial intelligence technologies driven by big data have been widely applied and have shown promising results in fraud detection. The core idea is to train a model with strong generalization capabilities, supported by big data, enabling the model to accurately detect the likelihood of listed companies engaging in financial data fraud. According to whether the sample data is labeled, these methods can be roughly divided into two categories: supervised learning and unsupervised learning.

In a supervised learning approach, the model used for financial forgery detection can be viewed as performing a binary classification task, i.e., deciding whether the company is a forgery or not, and the result is often given in the form of a probability, where a higher probability means the company is more likely to be a forgery. Many classification algorithms have been proposed and have achieved good results in various industries. Based on whether the distribution of the observed variables is modeled, supervised learning models can be divided into two categories: discriminative models and generative models. Generative models include Naive Bayes (NB), the Restricted Boltzmann Machine (RBM), and the Hidden Markov Model (HMM). Discriminative models include Logistic Regression (LR), the Multilayer Perceptron (MLP), the Support Vector Machine (SVM), K-Nearest Neighbors (KNN), the Maximum Entropy Model (ME), Conditional Random Fields (CRF), Decision Trees (DT), and Random Forests (RF). For example, in reference [7], the accuracy of four machine learning algorithms (LR, RF, DT, and CatBoost) is analyzed and compared as the subject of financial fraud detection is explored through several algorithms. Using a dataset of financial fraud, Liu et al. applied the RF technique and contrasted it with other algorithms such as LR, KNN, DT, and SVM; they found that the RF algorithm had the best interpretability and the highest accuracy [8]. Unsupervised learning does not require labeling the data; it is similar in nature to a statistical tool that detects anomalous data to determine whether samples that do not belong to the main class are deceptive. Two common types of algorithms for unsupervised learning are clustering and dimensionality reduction. Clustering algorithms include K-means clustering and hierarchical clustering, while dimensionality reduction algorithms include Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). For example, reference [9] proposed a model framework that separates clusters using the K-means method and compared its performance with two of the most important financial fraud detection systems. Reference [10] introduced an unsupervised learning approach that combines Particle Swarm Optimization (PSO) and K-means clustering, demonstrating better performance in financial fraud detection compared to K-means alone.

B. DEEP LEARNING COUNTERFEIT DETECTION METHODS
Classical machine learning algorithms typically use shallow models, which are effective for linearly separable tasks or simple non-linear tasks. In contrast, deep learning algorithms employ deep models, providing stronger non-linear modeling capabilities and better performance on complex real-world tasks. For tasks with higher complexity and deeper concealment, such as financial data fraud detection, deep learning algorithms generally outperform machine learning algorithms [4]. For example, Rushin et al. compared the performance of LR, gradient boosted trees, and deep learning in detecting credit card fraud, indicating that deep learning methods outperform the other two approaches [11]. In addition, deep learning algorithms can deeply explore the potential connections within the data, thereby uncovering more methods for detecting financial fraud and enhancing the effectiveness of detection. In many classical approaches, the classification results depend on features constructed from domain-specific knowledge, without considering other attributes of the data, such as temporal attribution. Jurgovsky et al. instead treated fraud detection as a sequence classification task and utilized Long Short-Term Memory (LSTM) networks for prediction; experimental results show that LSTM effectively improves the accuracy of credit card fraud detection compared to random forest [12]. Zhou et al. used a graph embedding algorithm to learn topological features from financial network graphs and represent them as low-dimensional dense vectors; in this way, they utilized deep neural networks to intelligently and efficiently classify and predict data samples from large-scale datasets [13]. The literature [14], taking into account the homogeneity of the data structure, proposes a graph learning algorithm capable of learning topological features and transaction amount features in financial transaction network graphs. In the literature [15], a novel graph neural network (GNN) architecture with a time de-biasing constraint based on adversarial loss is proposed; this architecture captures fraud patterns that exhibit fundamental consistency over time and performs well in fraud detection tasks. In the literature [16], a new credit card fraud detection model named CCFD-Net is introduced, featuring a hybrid architecture combining 1D convolutions and a Residual Neural Network (ResNet); this model demonstrates good effectiveness and robustness in credit card fraud detection.

C. MULTI-TEACHER KNOWLEDGE DISTILLATION METHODS
The single-student, multi-teacher distillation paradigm has made significant progress in converting complicated, multi-attribute teacher knowledge into lightweight student networks. Multi-teacher distillation research focuses on designing appropriate distillation strategies for instructing students. In 2017, You et al. [17] proposed a framework for multi-teacher distillation; this approach averages the soft labels of the logits produced by several teacher models and provides them to student models for learning. Shi et al. [18] took another route, directly splicing the logits of multiple teachers and then performing PCA dimensionality reduction in a face recognition model. Shin [19] extended the multi-teacher, single-student distillation architecture to a visual multi-attribute recognition task, in which each teacher specializes in learning one attribute, and the multi-teacher knowledge is then synthesized and transferred to the student to achieve multi-attribute recognition. Furthermore, in a recent study, Hailin et al. [20] proposed an adaptive multi-teacher knowledge distillation strategy that allows diverse teacher knowledge to be jointly utilized to improve student performance. The multi-teacher knowledge distillation paradigm proposed in the literature [21] empowers students to integrate and capture a variety of knowledge from different sources. Although many studies have used a multi-teacher distillation framework, less attention has been paid to the uneven distribution of positive and negative samples. In this research, we employ a multi-teacher knowledge distillation strategy to aggregate the teachers' knowledge of financial fraud detection across industries into a lightweight student model. The goal is to enhance the model's performance in detecting imbalanced distributions of positive and negative data using a simple and effective multi-teacher distillation architecture. One distinction between our technique and other multi-teacher approaches is that our multi-teacher model learns about financial fraud in different industries separately, whereas our student network learns about financial fraud in each industry from all of the teacher models, allowing the model to generalize efficiently in the presence of an imbalanced data distribution.
Machine learning techniques are heavily used in the field of financial fraud detection, and graph network-based approaches have made significant progress in recent years [4]. However, these methods focus only on the topological features and data features of the network, ignoring the dependencies between data features. Table 1 summarises the existing work related to our problem. Compared to other methods, the method proposed in this paper exploits the dependencies between financial indicators for forgery detection, and uses multi-teacher distributed knowledge distillation to improve the speed of model inference and the generalisation of the model when the data is unbalanced. These capabilities are not available in the other models.

TABLE 1. Comparative analysis of methods used to falsify financial statements.

III. FINANCIAL FRAUD DETECTION MODEL BASED ON TRANSFORMER
A. FINANCIAL FRAUD DETECTION METHODS AND PROCESSES
The Transformer is an advanced deep learning model that was first proposed by Vaswani et al. in 2017 and was initially used for natural language processing tasks [22]. However, due to its robust parallelism and expressive capabilities, it has been successfully applied to other domains, including image processing and classification.

One of the Transformer's basic features is the self-attention mechanism, which allows the model to process all positions in the input sequence at once rather than step by step like a recurrent or convolutional neural network. The self-attention mechanism enables the model to capture correlations by assigning different attention weights to different sections of the input sequence. To better capture various sorts of relationships, the self-attention mechanism is expanded to several attention heads, each capable of learning different attention weights. The structure of the Transformer encoder is shown in Figure 1. The encoder typically includes a multi-head attention layer, a feed-forward neural network layer, residual connections, and layer normalization. Transformers are usually made up of multiple encoders and decoders stacked on top of each other, and these stacked layers help the model learn complex feature representations.

FIGURE 1. Transformer encode block.

To enhance the accuracy of data analysis and modeling, the financial dataset is first preprocessed. Subsequently, multi-head attention scores are calculated for the financial data to obtain a representation of the correlations between features. These attention scores are then fed through a feed-forward network to extract higher-level features that integrate the relevant information more comprehensively. Following this, a neural network maps these higher-level features to the output probability of fraud. Finally, the cross-entropy loss of the samples is computed, and the model parameters are updated through gradient descent based on the loss value.

B. MODEL ARCHITECTURE
The architecture of the financial fraud detection model based on Transformer is illustrated in Figure 2. The model consists of three modules. The first module is a multi-industry data processing module. The second module is a Transformer Encode Block module, which includes a multi-head attention module and a fully connected feed-forward neural network; the feed-forward network comprises a linear transformation and a ReLU non-linear activation function, along with a residual connection and a layer normalization operation. The third module is the output neural network module, containing a linear neural network for output and a softmax function for result normalization.

FIGURE 2. The architecture of the financial fraud detection model based on transformer.

1) MULTI-HEAD ATTENTION
The financial dataset is represented as D = {(X_n, Y_n)}_{n=1}^{N}, where the matrix X = {x_1, x_2, ..., x_m} represents the financial data features. Here, x_m is a vector of dimension d_model, and Y = {y_1, y_2, ..., y_m | y_m ∈ {0, 1}}, where 0 indicates no fraud and 1 indicates fraud. For a single sample X, the first step involves computing the self-attention scores for its features. Here, we define three matrices for the scaled dot-product operation: Query (Q), Key (K), and Value (V). Additionally, three learnable weight matrices W_q, W_k, W_v are introduced to map each input feature to query, key, and value vectors:

Q = X W_q    (1)
K = X W_k    (2)
V = X W_v    (3)

where Q, K, V ∈ R^{m×d} and W_q, W_k, W_v ∈ R^{d_model×d}.
Then, for the query matrix Q, we calculate its similarity score matrix S with the key matrix K. To prevent excessively large scores that could lead to exploding gradients, each score is divided by \sqrt{d}:

S = QK^T / \sqrt{d}    (4)

where S ∈ R^{m×m}; the scores represent the correlation between each financial data feature and the other features.

Finally, the scores are normalized using the softmax function, and the normalized correlation scores are multiplied by the value matrix V to obtain the self-attention scores O for the financial data features:

O = softmax(S) V    (5)

where O ∈ R^{m×d}. The self-attention computation for financial data features can thus be summarized as formula (6):

Attention(Q, K, V) = softmax(QK^T / \sqrt{d}) V    (6)

The multi-head attention mechanism enables the model to capture richer correlations among financial data features, facilitating a more in-depth exploration of patterns related to data falsification. Multi-head attention performs the self-attention mechanism multiple times, essentially having h individuals focus their attention on different positions of the financial data features. This approach increases the likelihood of detecting crucial information related to data falsification:

MultiHeadAtt(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W_o    (7)

where head_i = Attention(Q_i, K_i, V_i), i ∈ {1, ..., h}, and W_o ∈ R^{hd×d}.
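To make formulas (1)-(7) concrete, the following is a minimal PyTorch sketch of the multi-head self-attention described above (PyTorch is the framework used in Section V). It is an illustrative re-implementation, not the authors' released code; the batched projection layout and the example sizes (m = 20 features, d_model = 64) are our own assumptions.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of formulas (1)-(7): Q/K/V projections, scaled
    dot-product scores, softmax, and output concatenation."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d = d_model // num_heads          # per-head dimension d
        self.w_q = nn.Linear(d_model, d_model) # W_q, formula (1)
        self.w_k = nn.Linear(d_model, d_model) # W_k, formula (2)
        self.w_v = nn.Linear(d_model, d_model) # W_v, formula (3)
        self.w_o = nn.Linear(d_model, d_model) # W_o, formula (7)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, d_model), m = number of financial features
        b, m, _ = x.shape
        # Project, then split into h heads: (batch, h, m, d)
        q = self.w_q(x).view(b, m, self.h, self.d).transpose(1, 2)
        k = self.w_k(x).view(b, m, self.h, self.d).transpose(1, 2)
        v = self.w_v(x).view(b, m, self.h, self.d).transpose(1, 2)
        # Formula (4): S = Q K^T / sqrt(d)
        s = q @ k.transpose(-2, -1) / math.sqrt(self.d)
        # Formulas (5)-(6): O = softmax(S) V
        o = torch.softmax(s, dim=-1) @ v
        # Formula (7): concatenate the h heads and apply W_o
        o = o.transpose(1, 2).reshape(b, m, self.h * self.d)
        return self.w_o(o)

# Usage: 32 samples, m = 20 financial indicators, d_model = 64
attn = MultiHeadSelfAttention(d_model=64, num_heads=2)
scores = attn(torch.randn(32, 20, 64))   # -> (32, 20, 64)
```

In practice torch.nn.MultiheadAttention implements the same computation; the explicit version is shown only to mirror the notation of this section.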
2) FEEDFORWARD NEURAL NETWORK
The multi-head attention scores obtained from formula (7) undergo a residual connection and a layer normalization operation. The residual connection addresses the training issues of deep networks by adding the output to the original input, enhancing the network's representational capacity [25]. Layer normalization normalizes all inputs to have a mean of 0 and a standard deviation of 1, which helps alleviate the problem of internal covariate shift in neural network training, providing more stable and faster training:

LayerNorm(X + MultiHeadAtt(Q, K, V))    (8)

Subsequently, the multi-head attention scores, after the residual connection and layer normalization, undergo further processing through two linear transformations and a ReLU activation function. This step extracts higher-level features with richer contextual information:

FFN(X) = max(0, X W_1 + b_1) W_2 + b_2    (9)

While the linear transformations are the same across the different positions in the encoder, the parameters differ between layers.

In order to prevent overfitting, we apply dropout to the output of each fully connected layer to ensure the model's generalization. Dropout randomly discards each neuron with probability p. The neurons that are not discarded have their values scaled by 1/(1 − p), maintaining the expected value of the data. By training a different network structure in each iteration, dropout introduces variability, eliminating and weakening the interdependence among neuron nodes, thereby enhancing the model's ability to generalize the internal correlations in financial data. The dropout computation is given in formula (10):

dropout(X) = \begin{cases} 0, & \text{with probability } p \\ X/(1-p), & \text{with probability } 1-p \end{cases}    (10)
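Combining formulas (8)-(10), one encoder block can be sketched as follows, reusing the MultiHeadSelfAttention module from the previous sketch. The feed-forward width (dim = 1024) and dropout rate (0.2) follow the hyperparameters listed in Algorithm 1; the exact dropout placement is one common choice, and nn.Dropout already implements the inverted 1/(1 − p) scaling of formula (10).

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Sketch of formulas (8)-(10): attention + residual/LayerNorm,
    then a two-layer ReLU feed-forward sub-layer with dropout."""

    def __init__(self, d_model=64, num_heads=2, dim_ff=1024, p_drop=0.2):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Formula (9): FFN(X) = max(0, X W1 + b1) W2 + b2
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_ff),
            nn.ReLU(),
            nn.Dropout(p_drop),        # formula (10), inverted dropout
            nn.Linear(dim_ff, d_model),
            nn.Dropout(p_drop),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Formula (8): LayerNorm(X + MultiHeadAtt(Q, K, V))
        x = self.norm1(x + self.attn(x))
        # Second residual/LayerNorm around the feed-forward sub-layer
        return self.norm2(x + self.ffn(x))
```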

3) OUTPUT NEURAL NETWORK
After the financial data passes through the stacked encoders, the high-level features X extracted by the last encoder, which contain the internal correlation information, are mapped through a linear layer, and the output is normalized using the softmax function. The normalization calculation is shown in formula (11):

Y^{pre} = softmax(W · X^T + b)    (11)

where Y^{pre} ∈ R^{1×2} is the probability distribution vector, W is the neural network weight matrix, and b is the bias vector.

4) OVERALL LOSS CALCULATION
The financial dataset D = {(X_n, Y_n)}_{n=1}^{N} is passed into the Transformer-based financial fraud detection model. After extracting the high-level features of the data, the model maps each sample to a predicted label value f(W, X). The true label values Y_n and the predicted label values f(W, X_n) are then used to calculate the cross-entropy loss through formula (12):

f_cls(W, X_n, Y_n) = −[Y_n · \log(f(W, X_n)) + (1 − Y_n) · \log(1 − f(W, X_n))]    (12)

where W represents the model's parameter matrix and f(W, X) represents the mapping of feature X through the model's parameter matrix W; its value is the predicted probability of fraud.

The model is trained on the data samples to update the model parameters W. Here, we provide the general formula for the gradient descent parameter update:

W = W − η · ∂F_cls(W)/∂W    (13)

where η represents the learning rate, and F_cls(W) represents the total loss function over the financial dataset D. Its calculation formula is as follows:

F_cls(W) = (1/N) \sum_{(X_n, Y_n) ∈ D} f_cls(W, X_n, Y_n)    (14)
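A sketch of the complete model and one training step, tying together the encoder stack of Section III-B, the output head of formula (11), and the loss of formulas (12)-(14). The mean-pooling over the m feature positions before the linear head is our assumption, since the paper does not state how the encoder output is reduced to a single vector; nn.CrossEntropyLoss applies the softmax of formula (11) internally.

```python
import torch
import torch.nn as nn

class FraudDetector(nn.Module):
    """Sketch of the full detection model: stacked encoder blocks
    plus the linear + softmax output head of formula (11)."""

    def __init__(self, d_model=64, num_heads=2, num_layers=2, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            [TransformerEncoderBlock(d_model, num_heads) for _ in range(num_layers)]
        )
        self.head = nn.Linear(d_model, n_classes)  # W, b of formula (11)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for enc in self.encoders:
            x = enc(x)             # feed each encoder's output to the next
        x = x.mean(dim=1)          # assumed pooling over the m features
        return self.head(x)        # logits; softmax is applied in the loss

model = FraudDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # Algorithm 1 settings
loss_fn = nn.CrossEntropyLoss()                      # formulas (12)/(14)

x = torch.randn(32, 20, 64)      # a batch of encoded financial features
y = torch.randint(0, 2, (32,))   # hard labels: 0 = no fraud, 1 = fraud
loss = loss_fn(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()         # update (13)
```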
IV. DISTRIBUTED KNOWLEDGE DISTILLATION DETECTION FRAMEWORK
On the one hand, because different industries face different challenges related to financial data manipulation, the financial data of different industries exhibit distinct characteristics and internal correlations. Moreover, there are significant differences in the financial data indicators that different industries focus on. Therefore, it is challenging to use one universal model to detect financial data with such substantial variations. On the other hand, traditional models suffer from issues such as complex structures, deep model depths, and slow inference speeds, making them difficult to deploy in practical application scenarios. Based on these considerations, this paper uses a distributed architecture to train multiple teacher detection models for multiple industries, and proposes a distributed knowledge distillation algorithm to migrate the detection knowledge from the multi-teacher network to the lightweight student network separately. On the one hand, the detection model is compressed to suit practical application scenarios; on the other hand, the generalisation ability of the model under unbalanced data distributions is improved.

The distributed knowledge distillation detection framework is shown in Figure 3 and works as follows. Firstly, datasets from various industries are prepared, and these datasets are used to train the teacher models. Subsequently, untrained student models with simpler structures than the teacher models are prepared. A knowledge distillation algorithm is then used to migrate the knowledge from the multi-teacher model separately to the student network, which finally detects the financial data from the different industries.

FIGURE 3. Distributed knowledge distillation framework for financial data detection.

A. MULTI-TEACHER MODEL
The knowledge distillation algorithm is a model compression technique. It transfers knowledge from a large model (usually referred to as the teacher model) to a smaller model (typically known as the student model), with the aim of retaining the performance of the teacher model in a relatively smaller student model [26].

The teacher model adopts the Transformer-based financial fraud detection model described in Section III. The multi-industry financial dataset is represented as the set I = {D_1, D_2, ..., D_m}, where D_m = {(X_n, Y_n)}_{n=1}^{N} represents the financial dataset of an industry such as manufacturing or transportation. The multi-teacher model is trained using this collection of financial datasets, and its performance is further optimized by adjusting hyperparameters. The training of the multi-teacher model is illustrated in Algorithm 1.

B. STUDENT MODEL
For a classification task, the final output of the model is the set of probabilities for each class, which are referred to as soft targets. The true labels for each sample are called hard targets. The difference is that soft targets not only tell us the most likely class for a sample but also provide probabilities for the other classes, so soft targets contain more information than hard targets. Therefore, when training the multi-teacher model, we use hard targets; the predictions produced by the trained teacher network on a sample can then convey more information to the student network. Consequently, we can use the soft targets from the teacher network to guide the training of the student network.

The student network adopts a smaller Transformer-based financial fraud detection model with fewer parameters. For the financial dataset D = {(X_n, Y_n)}_{n=1}^{N}, let Z^{(t)} ∈ R^{B×C} and Z^{(s)} ∈ R^{B×C} represent the logits output by the teacher network and the student network, respectively, where B is the batch size and C is the number of categories. Y ∈ {0, 1} represents the hard targets of the samples. After applying the softmax function to the outputs Z^{(t)} and Z^{(s)} of the teacher and student networks, the probability distributions range from 0 to 1. If we find that the relative sizes of the categories are not sufficiently informative, we introduce a distillation temperature T. A higher T softens the distribution, making the relative information carried by the smaller class probabilities more visible. Introducing T amounts to dividing the original logits by T before applying the softmax. In theory, the distillation effect improves as T increases, but an excessively large T can cause the relative differences between categories to disappear. Therefore, it is necessary to choose an appropriate value for T. The distillation process is represented as formula (15):

softmax(Z/T)_i = exp(z_i / T) / \sum_j exp(z_j / T)    (15)

where Z = {z_1, z_2, ..., z_n}.
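Formula (15) in code is a one-liner; the toy logits below are made up and only illustrate how a larger T softens the distribution (and why an excessively large T erases the differences between categories).

```python
import torch

def softmax_with_temperature(z: torch.Tensor, T: float) -> torch.Tensor:
    # Formula (15): divide the logits by T before applying softmax
    return torch.softmax(z / T, dim=-1)

z = torch.tensor([4.0, 1.0])                # illustrative logits
print(softmax_with_temperature(z, T=1.0))   # ~[0.95, 0.05] (sharp)
print(softmax_with_temperature(z, T=7.0))   # ~[0.61, 0.39] (softened)
```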

The guidance of the teacher model in training the student model involves two steps. The first step is to compute the distillation loss: formulas (16) and (17) calculate the soft targets P^{(t)} and P^{(s)} from the outputs Z^{(t)} and Z^{(s)} of the teacher and student networks, respectively, and the KL divergence loss between these soft targets is then calculated using formula (18):

P^{(t)} = softmax(Y_T^{pre} / T)    (16)
P^{(s)} = softmax(Y_S^{pre} / T)    (17)
L_KD = (T^2 / B) \sum_{i=1}^{B} \sum_{j=1}^{C} \log(p_{i,j}^{(s)} / p_{i,j}^{(t)})    (18)
The second step is to compute the student loss. This involves applying the temperature softmax (with T = 1) to the output Z^{(s)} of the student network to calculate the soft targets P, and then calculating the cross-entropy loss between P^{(T=1)} and the hard targets Y of the financial data using formula (19):

L_cls = (1/B) \sum_{i=1}^{B} \sum_{j=1}^{C} −[Y_{i,j} · \log(P_{i,j}) + (1 − Y_{i,j}) · \log(1 − P_{i,j})]    (19)

The final knowledge distillation loss is obtained by taking the weighted sum of the distillation loss and the student loss:

L_tr = α · L_cls + β · L_KD    (20)

where α and β are weight coefficients determining the contribution of each loss term to the final knowledge distillation loss.

The model is trained on the data samples to update the model parameters W. The specific algorithm for model training is shown in Algorithm 2. Here, we provide the general formula for the parameter update:

W = W − η · ∂L_tr/∂W    (21)

where η represents the learning rate.
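The loss of formulas (16)-(20) can be sketched as follows. The text describes formula (18) as a KL divergence between the soft targets, so the sketch uses PyTorch's kl_div (the printed form of (18) omits the probability weighting that KL divergence carries); α = β = 0.5 is an illustrative choice, as the paper does not report the weight values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_s, z_t, y, T=7.0, alpha=0.5, beta=0.5):
    """Sketch of formulas (16)-(20). z_s, z_t: (B, C) student and
    teacher logits; y: (B,) hard labels. alpha/beta are illustrative."""
    # Formulas (16)-(17): soft targets at temperature T
    log_p_s = F.log_softmax(z_s / T, dim=-1)
    p_t = F.softmax(z_t / T, dim=-1)
    # Formula (18): KL divergence between soft targets, scaled by T^2
    l_kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * (T ** 2)
    # Formula (19): student cross-entropy against hard targets (T = 1)
    l_cls = F.cross_entropy(z_s, y)
    # Formula (20): weighted sum of the two terms
    return alpha * l_cls + beta * l_kd

# Inside a training step (teacher frozen, update (21) via Adam):
#   loss = distillation_loss(student(x), teacher(x).detach(), y)
#   loss.backward(); optimizer.step()
```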


V. EXPERIMENT
In this section, we first describe the structure of the dataset. Subsequently, we compare the performance metrics of the teacher model and the student model. We then compare the student model with other machine learning algorithms, followed by visualization and parameter analysis.

Algorithm 1 Multi-Teacher Model Training Algorithm
Hyperparameters: Input feature dimension d; number of attention heads nhead = 6; number of feed-forward neurons dim = 1024; random dropout dropout = 0.2; number of encoder layers layers = 2; learning rate η = 0.001; number of iterations T = 100; amount of training data N_1; batch size n_1 = 32; optimizer = Adam.
Input: Multi-industry financial dataset collection I = {D_1, D_2, ..., D_m}, where D_m = {(X_n, Y_n)}_{n=1}^{N}.
Output: Teacher model convergence parameters W^{(t)}.
1: Randomly initialize W^{(t)} ← N(0, 1);
2: Randomly sort the different industries in the collection I;
3: while t ≤ T do
4:   for n = 1 : N_1/n_1 do
5:     Select a batch of samples (X_n, Y_n) from dataset I;
6:     for k = 1 : layers do
7:       for i = 1 : nhead do
8:         Calculate Q_i, K_i, V_i from X_n using formulas (1), (2), (3);
9:         Calculate head_i from Q_i, K_i, V_i using formula (6);
10:      end for
11:      Calculate the multi-head attention score M from head_i according to formula (7);
12:      Calculate the residual connection and layer normalization L from X and M according to formula (8);
13:      Feed L through the feed-forward neural network FFN(L) based on formula (9), applying random dropout to each fully connected layer according to formula (10);
14:      Calculate the residual connection and layer normalization to obtain the encoder output X̄ based on formula (8);
15:      Feed the output back to the input and stack the encoder: X = X̄;
16:    end for
17:    Apply the linear output layer to the output of the last encoder based on formula (11) to obtain the output result Y^{pre};
18:    Calculate the cross-entropy loss for the dataset based on formula (14);
19:    Update the model parameters W based on formula (13);
20:  end for
21: end while
22: return The convergence parameters W^{(t)} of the teacher model.

Algorithm 2 Student Model Training Algorithm
Hyperparameters: Input feature dimension d; number of attention heads nhead = 2; number of feed-forward neurons dim = 1024; random dropout dropout = 0.2; number of encoder layers layers = 2; learning rate η = 0.001; distillation temperature Tem = 7; number of iterations T = 100; amount of training data N_1; batch size n_1 = 32; optimizer = Adam.
Input: Multi-industry financial dataset collection I = {D_1, D_2, ..., D_m}, where D_m = {(X_n, Y_n)}_{n=1}^{N}.
Output: Multi-industry student network convergence parameters W^{(s)} = {w_1^{(s)}, w_2^{(s)}, ..., w_n^{(s)}}, where w_n^{(s)} denotes the network convergence parameters for a given industry.
1: Randomly initialize W^{(s)} ← N(0, 1);
2: for i = 1 : m do
3:   Select the industry dataset D_i from the collection I;
4:   Randomly sort the samples of the industry dataset D_i;
5:   while t ≤ T do
6:     for n = 1 : N_1/n_1 do
7:       Select a batch of samples (X_n, Y_n) from dataset D_i;
8:       Calculate the output Z_n^{(t)} of the teacher network from X_n and the teacher network parameters W^{(t)} obtained from Algorithm 1;
9:       Calculate the output Z_n^{(s)} of the student network from X_n and the student network parameters w_n^{(s)};
10:      According to formulas (16) and (17), distill the classification results Z_n^{(t)} and Z_n^{(s)} with distillation temperature Tem, resulting in the distilled outputs P_n^{(t)} and P_n^{(s)};
11:      According to formula (15), distill the classification result Z_n^{(s)} of the student network with distillation temperature Tem = 1, obtaining the distilled output P_n;
12:      Calculate the final loss L_n for dataset D_i based on formulas (18), (19) and (20);
13:      Finally, update the parameters w_n^{(s)} of the student network from the final loss L_n using formula (21);
14:    end for
15:  end while
16: end for
17: return The multi-industry student network convergence parameters W^{(s)} = {w_1^{(s)}, w_2^{(s)}, ..., w_n^{(s)}}, where w_n^{(s)} denotes the network convergence parameters for a given industry.
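The outer control flow of Algorithm 2 amounts to a short driver loop: one student per industry, each guided by the teacher from Algorithm 1. In the sketch below, make_student and industry_loaders are hypothetical placeholders for the student constructor and the per-industry data loaders, and distillation_loss is the sketch from Section IV-B.

```python
import torch

# Sketch of Algorithm 2's outer structure (names are placeholders).
students = {}
for industry, loader in industry_loaders.items():        # I = {D1, ..., Dm}
    student = make_student(nhead=2, dim=1024, layers=2)  # smaller network
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for epoch in range(100):                             # T = 100 iterations
        for x, y in loader:                              # batch size n1 = 32
            with torch.no_grad():
                z_t = teacher(x)                         # frozen teacher logits
            loss = distillation_loss(student(x), z_t, y, T=7.0)  # Tem = 7
            opt.zero_grad(); loss.backward(); opt.step() # update (21)
    students[industry] = student.state_dict()            # w_n^(s) per industry
```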

A. DATASET DESCRIPTION
The dataset used in this experiment is from the 9th ''TipDM Cup'' Financial Analysis Competition for Listed Companies. All listed companies in the dataset come from 19 different industries. Among them, manufacturing companies significantly outnumber companies from the other industries, with 2,667 companies, while the distribution of companies across the other industries is relatively even, totaling only 1,496. Due to this uneven distribution of data across industries, we divide the entire dataset into two categories: manufacturing and other industries. We separately train student models for the manufacturing industry and for the other industries, and these two models serve as subsystems in the distributed framework. The experiment involves training on 70% of the data, with the remaining 30% used as a validation set.

TABLE 2. Summary of the analyzed data sets.

B. TEACHER MODEL AND STUDENT MODEL PERFORMANCE COMPARISON ANALYSIS
Our experiment was conducted on a hardware platform with a 13th Gen Intel(R) Core(TM) i9-13900KF 3.00 GHz CPU and an NVIDIA GeForce RTX 3060 Ti GPU. The primary software environment for the experiment includes Python 3.9.1, torch 2.0.1, numpy 1.22.4, and pandas 2.1.1. All machine learning algorithms were implemented using the third-party library Scikit-Learn. The Transformer-based financial fraud detection model was constructed using the PyTorch deep learning framework.

For the final detection criteria, we utilize the following metrics:

precision = TP / (TP + FP)    (22)
recall = TP / (TP + FN)    (23)
f1_score = 2 · (recall · precision) / (recall + precision)    (24)
accuracy = (TP + TN) / (TP + FP + TN + FN)    (25)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.

TABLE 3. Comparison of evaluation metrics between teacher and student models.

After training the proposed model on the training set, evaluation was conducted on the test set to assess the detection performance and speed of both the teacher model and the student model. As shown in Table 3, in terms of detection performance, the distilled student model had average accuracy values of 98.98% and 98.83% on the other-industries and manufacturing datasets, respectively, compared to only 97.54% and 97.38% for the teacher model, implying that the student model outperformed the teacher model in detection accuracy. The average recall on the other-industries and manufacturing datasets was 92.51% and 90.12% for the teacher model, and 95.45% and 92.70% for the student model, suggesting that the student model also outperforms the teacher model at correctly detecting fraud.

TABLE 4. Inference time comparison between teacher and student models.

In terms of model inference speed, as shown in Table 4, on the other-industries dataset the teacher model has average inference times of 825.5 µs and 121.4 µs on CPU and GPU, respectively. In comparison, the student model has average inference times of 187.3 µs and 31.2 µs, which are faster by 638.2 µs and 90.2 µs, respectively. On the manufacturing dataset, the teacher model has average inference times of 1048.7 µs and 262.3 µs on CPU and GPU, while the student model has average inference times of 256.4 µs and 40.7 µs; the student model is faster by 792.3 µs and 221.6 µs, respectively. From the tables, it can be observed that the inference speed of the student model is consistently faster than that of the teacher model. This is because the student model has fewer parameters and a simpler structure than the teacher model, leading to faster inference. Additionally, inference on the GPU is faster than on the CPU, as GPUs are better suited to matrix operations. The comparison of the teacher model and the student model shows that, after multi-teacher distributed knowledge distillation, the student model surpasses the teacher network in detection performance, generalization ability, and inference speed.

C. COMPARISON RESULTS OF STUDENT MODEL DETECTION PERFORMANCE WITH OTHER ALGORITHMS
In order to further evaluate the performance of the student model, we compared the proposed method with advanced machine learning algorithms, including Log Reg [27], SVM linear [28], DT [29], RF [30], XGBoost [31], and Adaboost [32].

FIGURE 4. Comparative analysis of MAE values of the proposed method with other models.

FIGURE 5. Comparative analysis of RMSE values of the proposed method with other models.

FIGURE 6. Comparative analysis of MCC values of the proposed method with other models.

TABLE 5. Comparison of evaluation metrics between student model and machine learning model.

To begin, this study uses the MAE and RMSE to assess each model's error on the test data. Figures 4 and 5 present a comparative analysis of MAE and RMSE; the findings indicate that our proposed model has a lower MAE and RMSE than the other models.

Second, the MCC is used to evaluate classification performance; the MCC provides a more reliable performance assessment on unbalanced datasets, and the closer the MCC metric is to 1, the better the model's classification performance. The MCC comparison is displayed in Figure 6, and the findings reveal that our proposed model has a higher MCC than the other models, implying better classification performance on unbalanced datasets.

Performance was also assessed based on accuracy, precision, recall, and F1 score. As indicated in Table 5, our proposed method achieved the highest accuracy, 98.98% on other industries and 98.83% on manufacturing. Log Reg and linear SVM achieved the lowest accuracy on other industries and manufacturing, with values of 84.01% and 81.47%, respectively. Our proposed method achieved the highest recall on other industries at 95.45%, while the Tree algorithm slightly surpassed our model on manufacturing with a recall of 93.36%. Our proposed method also achieved the highest precision, with values of 81.48% and 67.74% for other industries and manufacturing, respectively; the Tree algorithm lagged slightly behind, with precision values of 72.41% and 66.66%. Furthermore, our proposed method obtained the highest F1 scores, 92.87% and 87.65% for other industries and manufacturing, respectively, while the F1 scores of the Tree algorithm were lower, at 89.20% and 87.56%.

The ROC curve is a measure of a model's overall classification performance, and the area under the ROC curve is the AUC; the closer the AUC value is to one, the better the model's classification performance, and the closer it is to zero, the worse. Figures 7 and 8 display the ROC curves of the proposed method and the other machine learning algorithms on the other-industries and manufacturing datasets. The ROC curves of the proposed method are positioned closest to the top-left corner of the graphs, indicating the superior performance of the proposed fraud detection model on both datasets. These results demonstrate the effectiveness of our proposed method.


FIGURE 7. The proposed method and AUC curves compared to other machine learning algorithms on datasets from various industries.

FIGURE 8. The proposed method and AUC curves for a manufacturing dataset compared to other ML algorithms.

Precision and recall are important metrics for comparing classifier performance. Precision-recall (PR) curves can be plotted from precision and recall, and the quality of a system can be judged from these curves. The PR curve is plotted with recall on the x-axis and precision on the y-axis. Figures 9 and 10 illustrate the PR curves of our proposed method and the other machine learning algorithms. The PR curve of the proposed method is positioned in the upper-right corner of the graphs, indicating good performance on both datasets. Additionally, the PR curve of the proposed method lies above the PR curves of the other algorithms, suggesting that the proposed method achieves a better precision-recall trade-off than the other machine learning algorithms.

FIGURE 9. The proposed method and precision-recall curves on datasets from various industries compared to other ML algorithms.

FIGURE 10. The proposed method and precision-recall curves on a manufacturing dataset compared to other ML algorithms.

VI. CONCLUSION
The detection of fraudulent financial data in listed companies is of significant importance for safeguarding the interests of shareholders and investors. This paper proposes a distributed knowledge distillation framework based on Transformer for detecting fraudulent financial data in listed companies. Experimental validation was conducted using the dataset from the 9th ''TipDM Cup'' Financial Analysis Competition for Listed Companies. The performance of the proposed method was evaluated by comparing it with other advanced machine learning algorithms, including logistic regression, linear support vector machine, decision tree, random forest, XGBoost, and Adaboost. The experimental results demonstrate that the proposed method outperforms the other machine learning algorithms, achieving the highest performance in terms of AUC, accuracy, precision, recall, and F1 score.

REFERENCES
[1] C. Defang and L. Baichi, "SVM model for financial fraud detection," Northeastern Univ., Natural Sci., vol. 40, pp. 295-299, Feb. 2019.
[2] T. Shahana, V. Lavanya, and A. R. Bhat, "State of the art in financial statement fraud detection: A systematic review," Technological Forecasting Social Change, vol. 192, Jul. 2023, Art. no. 122527.
[3] W. Xiuguo and D. Shengyong, "An analysis on financial statement fraud detection for Chinese listed companies using deep learning," IEEE Access, vol. 10, pp. 22516-22532, 2022.
[4] M. N. Ashtiani and B. Raahemi, "Intelligent fraud detection in financial statements using machine learning and data mining: A systematic literature review," IEEE Access, vol. 10, pp. 72504-72525, 2022.
[5] M. El-Bannany, A. H. Dehghan, and A. M. Khedr, "Prediction of financial statement fraud using machine learning techniques in UAE," in Proc. 18th Int. Multi-Conf. Syst., Signals Devices (SSD), Mar. 2021, pp. 649-654.
[6] R. Cao, G. Liu, Y. Xie, and C. Jiang, "Two-level attention model of representation learning for fraud detection," IEEE Trans. Computat. Social Syst., vol. 8, no. 6, pp. 1291-1301, Dec. 2021.
[7] A. Singh, A. Singh, A. Aggarwal, and A. Chauhan, "Design and implementation of different machine learning algorithms for credit card fraud detection," in Proc. Int. Conf. Electr., Comput., Commun. Mechatronics Eng. (ICECCME), Nov. 2022, pp. 1-6.
[8] C. Liu, Y.-C. Chan, S. H. Alam, and H. Fu, "Financial fraud detection model: Based on random forest," in Econometrics: Econometric Model Construction, 2015.
[9] H. Shivraman, U. Garg, A. Panth, A. Kandpal, and A. Gupta, "A model frame work to segregate clusters through K-means method," in Proc. 2nd Int. Conf. Comput. Sci., Eng. Appl. (ICCSEA), Sep. 2022, pp. 1-6.
[10] N. Sharma and V. Ranjan, "Credit card fraud detection: A hybrid of PSO and K-means clustering unsupervised approach," in Proc. 13th Int. Conf. Cloud Comput., Data Sci. Eng. (Confluence), Jan. 2023, pp. 445-450.
[11] G. Rushin, C. Stancil, M. Sun, S. Adams, and P. Beling, "Horse race analysis in credit card fraud—Deep learning, logistic regression, and gradient boosted tree," in Proc. Syst. Inf. Eng. Design Symp. (SIEDS), Apr. 2017, pp. 117-121.
[12] J. Jurgovsky, M. Granitzer, K. Ziegler, S. Calabretto, P.-E. Portier, L. He-Guelton, and O. Caelen, "Sequence classification for credit-card fraud detection," Exp. Syst. Appl., vol. 100, pp. 234-245, Jun. 2018.
[13] H. Zhou, G. Sun, S. Fu, L. Wang, J. Hu, and Y. Gao, "Internet financial fraud detection based on a distributed big data approach with node2vec," IEEE Access, vol. 9, pp. 43378-43386, 2021.
[14] R. Li, Z. Liu, Y. Ma, D. Yang, and S. Sun, "Internet financial fraud detection based on graph learning," IEEE Trans. Computat. Social Syst., vol. 10, no. 3, pp. 1394-1401, 2023.
[15] A. Singh, A. Gupta, H. Wadhwa, S. Asthana, and A. Arora, "Temporal debiasing using adversarial loss based GNN architecture for crypto fraud detection," in Proc. 20th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2021, pp. 391-396.
[16] X. Liu, K. Yan, L. Burak Kara, and Z. Nie, "CCFD-Net: A novel deep learning model for credit card fraud detection," in Proc. IEEE 22nd Int. Conf. Inf. Reuse Integr. Data Sci. (IRI), Aug. 2021, pp. 9-16.
[17] S. You, C. Xu, C. Xu, and D. Tao, "Learning from multiple teacher networks," in Proc. 23rd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2017, pp. 1285-1294.
[18] W. Shi, G. Ren, Y. Chen, and S. Yan, "ProxylessKD: Direct knowledge distillation with inherited classifier for face recognition," 2020, arXiv:2011.00265.
[19] M. Shin, "Semi-supervised learning with a teacher-student network for generalized attribute prediction," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 509-525.
[20] H. Zhang, D. Chen, and C. Wang, "Adaptive multi-teacher knowledge distillation with meta-learning," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2023, pp. 1943-1948.
[21] A. Amirkhani, A. Khosravian, M. Masih-Tehrani, and H. Kashiani, "Robust semantic segmentation with multi-teacher knowledge distillation," IEEE Access, vol. 9, pp. 119049-119066, 2021.
[22] B. An and Y. Suh, "Identifying financial statement fraud with decision rules obtained from modified random forest," Data Technol. Appl., vol. 54, no. 2, pp. 235-255, May 2020.
[23] P. Craja, A. Kim, and S. Lessmann, "Deep learning for detecting financial statement fraud," Decis. Support Syst., vol. 139, Dec. 2020, Art. no. 113421.
[24] J. Geng and B. Zhang, "Credit card fraud detection using adversarial learning," in Proc. Int. Conf. Image Process., Comput. Vis. Mach. Learn. (ICICML), 2023, pp. 891-894.
[25] E. Orhan, "Skip connections as effective symmetry-breaking," 2017, arXiv:1701.09175.
[26] H. Hong and H. Kim, "Feature distribution-based knowledge distillation for deep neural networks," in Proc. 19th Int. SoC Design Conf. (ISOCC), Oct. 2022, pp. 75-76.
[27] D. Varmedja, M. Karanovic, S. Sladojevic, M. Arsenovic, and A. Anderla, "Credit card fraud detection—Machine learning methods," in Proc. 18th Int. Symp. Infoteh-Jahorina (INFOTEH), Mar. 2019, pp. 1-5.
[28] T. Priyaradhikadevi, S. Vanakovarayan, E. Praveena, V. Mathavan, S. Prasanna, and K. Madhan, "Credit card fraud detection using machine learning based on support vector machine," in Proc. 8th Int. Conf. Sci. Technol. Eng. Math. (ICONSTEM), Apr. 2023, pp. 1-6.
[29] C.-C. Lin, A.-A. Chiu, S. Y. Huang, and D. C. Yen, "Detecting the financial statement fraud: The analysis of the differences between data mining techniques and experts' judgments," Knowl.-Based Syst., vol. 89, pp. 459-470, Nov. 2015.
[30] V. Arora, R. S. Leekha, K. Lee, and A. Kataria, "Facilitating user authorization from imbalanced data logs of credit cards using artificial intelligence," Mobile Inf. Syst., vol. 2020, pp. 1-13, Oct. 2020.
[31] L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu, "Machine learning—XGBoost analysis of language networks to classify patients with epilepsy," Brain Informat., vol. 4, no. 3, pp. 159-169, Sep. 2017.
[32] P. Yu and X. Liu, "Construction and application of bid fraud prediction model based on AdaBoost algorithm," in Proc. 2nd Int. Conf. Electron. Inf. Eng. Comput. Technol. (EIECT), Oct. 2022, pp. 292-295.
[33] T. Zhang and S. Gao, "Graph attention network fraud detection based on feature aggregation," in Proc. 4th Int. Conf. Intell. Inf. Process. (IIP), Oct. 2022, pp. 272-275.
[34] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Neural Inf. Process. Syst., 2017, pp. 1-11.

YUXUAN TANG is currently pursuing the bachelor's degree in accounting with the School of Accounting, Southwestern University of Finance and Economics, Chengdu, Sichuan, China. Her current research interests include financial big data analysis, financial fraud detection, credit card fraud detection, machine learning, and deep learning.

ZHANJUN LIU received the Ph.D. degree in circuits and systems from Chongqing University, Chongqing, China, in 2018. He is currently a Professor with the School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, China. His current research interests include network intelligence, big data analysis, and deep learning.
