Information and Communication Technology
Information and
Communication Technology
13th International Symposium, SOICT 2024
Danang, Vietnam, December 13–15, 2024
Proceedings, Part I
Communications
in Computer and Information Science 2350
Series Editors
Gang Li, School of Information Technology, Deakin University, Burwood, VIC,
Australia
Joaquim Filipe, Polytechnic Institute of Setúbal, Setúbal, Portugal
Zhiwei Xu, Chinese Academy of Sciences, Beijing, China
Rationale
The CCIS series is devoted to the publication of proceedings of computer science con-
ferences. Its aim is to efficiently disseminate original research results in informatics
in printed and electronic form. While the focus is on publication of peer-reviewed full
papers presenting mature work, inclusion of reviewed short papers reporting on work in
progress is welcome, too. Besides globally relevant meetings with internationally repre-
sentative program committees guaranteeing a strict peer-reviewing and paper selection
process, conferences run by societies or of high regional or national relevance are also
considered for publication.
Topics
The topical scope of CCIS spans the entire spectrum of informatics ranging from foun-
dational topics in the theory of computing to information and communications science
and technology and a broad variety of interdisciplinary application fields.
Information for Volume Editors and Authors
Publication in CCIS is free of charge. No royalties are paid, however, we offer registered
conference participants temporary free access to the online version of the conference
proceedings on SpringerLink (http://link.springer.com) by means of an http referrer from
the conference website and/or a number of complimentary printed copies, as specified
in the official acceptance email of the event.
CCIS proceedings can be published in time for distribution at conferences or as post-
proceedings, and delivered in the form of printed books and/or electronically as USBs
and/or e-content licenses for accessing proceedings at SpringerLink. Furthermore, CCIS
proceedings are included in the CCIS electronic book series hosted in the SpringerLink
digital library at http://link.springer.com/bookseries/7899. Conferences publishing in
CCIS are allowed to use Online Conference Service (OCS) for managing the whole
proceedings lifecycle (from submission and reviewing to preparing for publication) free
of charge.
Publication process
The language of publication is exclusively English. Authors publishing in CCIS have
to sign the Springer CCIS copyright transfer form, however, they are free to use their
material published in CCIS for substantially changed, more elaborate subsequent publi-
cations elsewhere. For the preparation of the camera-ready papers/files, authors have to
strictly adhere to the Springer CCIS Authors’ Instructions and are strongly encouraged
to use the CCIS LaTeX style files or templates.
Abstracting/Indexing
CCIS is abstracted/indexed in DBLP, Google Scholar, EI-Compendex, Mathematical
Reviews, SCImago, Scopus. CCIS volumes are also submitted for the inclusion in ISI
Proceedings.
How to start
To start the evaluation of your proposal for inclusion in the CCIS series, please send an
e-mail to [email protected].
Wray Buntine · Morten Fjeld · Truyen Tran ·
Minh-Triet Tran · Binh Huynh Thi Thanh ·
Takumi Miyoshi
Editors
Information and
Communication Technology
13th International Symposium, SOICT 2024
Danang, Vietnam, December 13–15, 2024
Proceedings, Part I
Editors
Wray Buntine, VinUniversity, Hanoi, Vietnam
Morten Fjeld, University of Bergen, Bergen, Norway
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Singapore Pte Ltd. 2025
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors
or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Honorary Chairs
General Chairs
Program Chairs
Track Chairs
AI Applications
Multimedia Processing
Software Engineering
Generative AI
Tutorial Chairs
Organizing Chairs
Publication Chairs
Publicity Chairs
Web Chairs
Program Committee
Organizers
Technical Sponsors
Financial Sponsors
Contents – Part I
Multimedia Processing
Distortion-Resilient DIBR for Novel View Synthesis from a Single Image . . . . . 287
Yuchen Liu, Eiji Kamioka, and Phan Xuan Tan
Operations Research

FDE-Net: Lightweight Depth Estimation for Monocular Cameras
V.-T. Nguyen et al.
1 Introduction
Automated navigation is primarily aimed at avoiding obstacles accurately and efficiently.
Building a dedicated vision system for such complex platforms is expensive. Compact and
low-cost monocular cameras offer the advantage of capturing rich contextual information
and are easy to integrate into deployments [1]. Accurate estimation of the depth of objects
in the scene allows the system to determine the location of obstacles within a certain
distance. Depth values can be obtained through several approaches, typically by combining
cameras with LiDAR. While LiDAR-based depth estimation is highly accurate
and efficient, its high cost and complex computational resource requirements limit its
application in many contexts [2]. Vision-based perception systems are often used due to
their low cost and easy integration. However, monocular cameras do not allow the direct
extraction of depth information [3], which limits their effectiveness in understanding and
reconstructing poses and 3D maps during operation. Binocular vision costs more than
monocular vision, but it provides more accurate depth information. However, it is not
well suited to a wide depth range, and it requires fusing information from multiple
viewpoints, which is complex [4].
In order to accurately estimate the depth of a 2D scene, the features and relationships
of the specific details in the image and the overall context of the scene must be extracted
and processed. Leveraging the ability to learn from both local and global contexts, deep
convolutional neural networks (DCNNs) have been used extensively in recent studies
to estimate monocular depth. Gao et al. proposed an unsupervised learning method to
simultaneously predict both monocular depth and the ego-motion trajectory [5]. Then,
Xiong et al. proposed robust geometric losses to maintain consistency of the depth and
pose estimation [6]. The self-supervised monocular depth estimation is stabilized by
incorporating scale-consistent geometric constraints into the loss functions. Subsequently,
Nguyen et al. employed a deep learning model to ease the training, based on a
self-collected dataset [7]. Nevertheless, such extensive networks contain a large number of
parameters, causing substantial computational expense and extensive memory demands.
Godard et al. generated disparity images by using a reconstruction loss for depth prediction
from poor-quality depth images [8]. As a result, the distortion of depth information near
edges significantly reduces the accuracy of subsequent tasks such as 3D reconstruction.
Therefore, the objective of the 3D perception system is to retrieve the 3D bounding box,
described in the coordinate frame of the 3D environment, as well as the mobile robot's
bird's-eye view.
This paper proposes the lightweight FDE-Net, built on PPLC-Net as the backbone
and a fast convolution block as the decoder. Firstly, the proposed backbone improves the
network's performance on multiple tasks. Then, the fast convolution block decodes the
information from the backbone and returns a depth map. By using only the current data,
a new transform eliminates redundant computations without the need for the overall
overlapped data. Furthermore, the number of parameters is significantly reduced for
low-resource embedded systems. Arithmetic analysis shows that data processing speed is
significantly improved, implying that the reduced transform size gives an additional
advantage in data manipulation. In summary, the prediction layer produces the final segment
map for identifying obstacles and constructing the real-time global path in the mobile
robot's environment.
Main contributions are as follows:
• The authors present the FDE-Net model, which combines a fast convolutional block
decoder with the lightweight PPLC-Net backbone for efficient depth estimation on
resource-limited systems.
• The combination of L1 and SSIM loss functions helps to bring efficiency and balance
in the model training process.
• Based on experiments on the NYU-V2, Cityscapes, and KITTI datasets, the proposed
model shows superior performance compared to state-of-the-art monocular camera-
based depth estimation methods.
2 Related Work
Depth estimation from perspective images has garnered considerable interest over the
past decade through the utilization of deep-learning-based monocular perspective depth
estimation. To improve the precision of depth-map prediction, Eigen and Fergus intro-
duced a novel multi-dimensional monocular depth estimation approach that combines
the fundamental aspects of global and local perspectives [9]. Zhou et al. devised an inno-
vative method to reduce reliance on ground truth data by concurrently enhancing the
accuracy of depth estimation and pose estimation [10]. This was accomplished by lever-
aging an input image from monocular video sequences. Godard et al. have made signifi-
cant advancements with the introduction of MonoDepth2, an advanced model designed
to effectively handle occluded pixels [8]. By filtering out unsuitable training pixels
with camera motion, the reprojection loss was minimized at a per-pixel level through
an auto-masking loss framework. The key features include a redesigned arrangement
of skip connections and the incorporation of suitable attributes to achieve exceptional
high-resolution output. Wofk et al. introduced FastDepth, a proficient and lightweight
encoder-decoder network structure that reduces computational complexity and latency
[11]. However, challenges persist, such as the loss of intricate details and blurring of
predicted depth map edges. Rudolph et al. reconstructed high-resolution depth maps
using guided upsampling blocks in the decoder [12]. Zhou et al. iteratively improved
the depth map through a recurrent multi-scale feature modulation [13]. In an effort to
incorporate global contexts, Zhang et al. proposed the Lite-Mono architecture, which
combines a lightweight CNN with a transformer [14]. Consequently, this architecture not
only reduces model size but also maintains accuracy. Nevertheless, with the increased
data processing in 3D image reconstruction, these lightweight approaches must over-
come limitations in representation and computational resources. Following the lead of
CondConv, Zhang et al. chose to replace regular convolutions with CondConv to enhance
network scale and capabilities while preserving performance and inference costs [14].
By dynamically adapting the convolution kernel using CondConv and integrating sub-
pixel convolution, the authors introduce a spatially aware dynamic lightweight depth
estimation network. This strategy enables accurate depth estimation with minimal com-
putational overhead. In essence, the challenge lies in developing depth estimation models
that offer enhanced efficiency with minimal resource requirements and reliable real-time
operation.
3 Proposed Method
In this section, the authors propose the FDE-Net architecture shown in Fig. 1. We
propose harnessing features through the PPLC-Net structure to accelerate processing in
the depth estimation model. Through the utilization of DepthSepConv, the model is
capable of delivering precise outcomes on various CPU or GPU devices. The SE module
is incorporated to enhance convolutional efficiency by adaptively reweighting the
individual channels.
3.1 Encoder
CNNs have demonstrated remarkable progress in the realm of computer vision tasks in
recent years. These networks exhibit the capability to undergo training and application
in a wide range of scenarios, including depth estimation. Within this study, the authors
have developed a lightweight depth estimation network by leveraging the PPLC-Net
architecture [15], renowned for its superior alignment with the MKLDNN acceleration
strategy. We advocate for the utilization of the PPLC-Net model for feature extraction to
enhance the processing speed of the depth estimation model. By employing DepthSep-
Conv, the model can achieve high accuracy when functioning on CPU or GPU devices.
Each convolution block consists of multiple sequential convolution layers that utilize
filters to extract features from the input image, with the size and quantity of filters being
determined by the network's architecture. Typically, smaller filters such as 3×3 or 5×5
are favored to reduce the network's parameters while improving computational efficiency.
Given the demand for precise per-pixel accuracy in tasks like single-camera
depth estimation (e.g., obstacle avoidance or object centering), aggregating features at
various scales becomes crucial for the decoder to accurately decode these features. To
enhance the understanding of image semantics at different scales, we propose integrating
convolution layers with distinct kernel sizes directly associated with the model outputs
at sizes 112×112, 56×56, 28×28, 7×7. Subsequently, the decoded segments are com-
bined to produce the final output of the model. Initializing the convolution layers for
each encoder output size assists in detailing the decoded segments.
Moreover, our approach involves breaking down the convolution operation into two
stages: depthwise convolution (DW) and pointwise convolution (PW). Global average
pooling (GAP) is utilized, and the activation function H-swish is selected for its effi-
ciency and adaptability in scenarios of data imbalance or interference. The SE module is
positioned near the network’s end to ensure improved balance and accuracy, aiming to
harness higher-level features effectively [18]. The activation functions employed include
ReLU and Sigmoid.
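To make the encoder description more concrete, below is a minimal PyTorch sketch of a DepthSepConv-style block with an SE module, following the description above (depthwise convolution, optional SE channel reweighting, pointwise convolution, H-swish activations). The channel sizes, reduction ratio, and layer ordering are illustrative assumptions, not the authors' exact PPLC-Net configuration.

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation: global average pooling followed by channel reweighting."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class DepthSepConv(nn.Module):
    """Depthwise (DW) + pointwise (PW) convolution with optional SE and H-swish."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, use_se=False):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                      padding=kernel_size // 2, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Hardswish(inplace=True),
        )
        self.se = SEModule(in_ch) if use_se else nn.Identity()
        self.pw = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Hardswish(inplace=True),
        )

    def forward(self, x):
        return self.pw(self.se(self.dw(x)))

# Example: a feature map after the stem of a 224x224 input
x = torch.randn(1, 16, 112, 112)
block = DepthSepConv(16, 32, kernel_size=3, stride=2, use_se=True)
print(block(x).shape)  # torch.Size([1, 32, 56, 56])
```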
3.2 Decoder
The decoder is responsible for disentangling and merging the distinct features obtained
from the encoder. Consequently, this stage generates a prediction map containing detailed
information for each individual pixel. The proposed method utilizes 2D convolutional
layers (conv2D) with a variety of kernel sizes. These fundamental components are
carefully designed to reveal features across different scales and sizes. Each layer processes
the input from its corresponding DepthSepConv component. The outputs of each layer
are combined and fed into the next layer. Decoder branches with varied kernel dimensions
enrich the final predictions. The characteristics and roles of the decoder layers are as
follows (Fig. 2):
• 3×3 Conv2D: Extracts global-scale features from the Stem conv/h-swish layer.
• 3×2 Conv2D: Extracts horizontal direction features from the first DepthSepConv
layer.
• 2×3 Conv2D: Like the previous decoder layer, but a 2×3 kernel is used to capture
vertical information.
• 2×2 Conv2D: Employs a small kernel to extract detailed features from the last
DepthSepConv layer.
The decoder comprises four upconvolution modules featuring a reduction in the number
of channels alongside an increase in the size of the feature map. Within each module, the
sequence of blocks unfolds as follows: unpooling, convolution, batch normalization, and
a Rectified Linear Unit (ReLU). Diverse information is gathered across various scales and
merged through a concatenation block to reintegrate the details into a unified prediction
map. Then, the decoded data undergoes interpolation to produce a full-resolution depth
estimate. Finally, a filter is added to denoise the values and normalize the predicted
values.
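The decoder described above could be sketched as follows: multi-scale branches fused by concatenation, up-convolution modules (upsampling, convolution, batch normalization, ReLU), and a final interpolation to full resolution. For simplicity this sketch uses 3×3 kernels in all branches rather than the 3×2/2×3/2×2 variants listed above, and all channel counts are placeholders rather than the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpConvBlock(nn.Module):
    """One decoder module: unpool (here nearest upsampling), conv, batch norm, ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return F.relu(self.bn(self.conv(x)))

class FastDecoder(nn.Module):
    """Decodes multi-scale encoder features, fuses them, and predicts a dense depth map."""
    def __init__(self, enc_channels=(32, 64, 128, 256), mid_ch=32):
        super().__init__()
        # One branch per encoder scale (the paper mixes 3x3, 3x2, 2x3 and 2x2 kernels).
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, 3, padding=1) for c in enc_channels]
        )
        self.up = nn.Sequential(
            UpConvBlock(mid_ch * len(enc_channels), 64),
            UpConvBlock(64, 32),
        )
        self.head = nn.Conv2d(32, 1, 3, padding=1)  # final filter producing the depth map

    def forward(self, feats, out_size):
        # feats: list of encoder feature maps at decreasing resolution.
        target = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(b(f), size=target, mode="bilinear", align_corners=False)
             for b, f in zip(self.branches, feats)], dim=1)
        depth = self.head(self.up(fused))
        return F.interpolate(depth, size=out_size, mode="bilinear", align_corners=False)

# Example with dummy multi-scale features from a 224x224 input
feats = [torch.randn(1, 32, 56, 56), torch.randn(1, 64, 28, 28),
         torch.randn(1, 128, 14, 14), torch.randn(1, 256, 7, 7)]
print(FastDecoder()(feats, out_size=(224, 224)).shape)  # torch.Size([1, 1, 224, 224])
```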
The structural similarity (SSIM) between two images x and y is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$ and $\mu_y$ are the means of the two images; $\sigma_x^2$ and $\sigma_y^2$ are the variances of the
two images; $\sigma_{xy}$ is the covariance of the two images; and $c_1$ and $c_2$ are constants stabilizing
the SSIM calculation.
The standard loss function L1 is the sum of the absolute difference between the target
value and the estimated value. Hence, the L1 loss is illustrated as follows:
$$L_1 = \sum_{i=1}^{n} \left| Y_i - Y_{GT,i} \right| \qquad (3)$$
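A minimal PyTorch sketch of the combined SSIM + L1 training objective mentioned in the contributions is shown below. The SSIM window size, the balance factor α, and the simplified average-pooling SSIM are assumptions; the authors' exact formulation is not given in the text.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=3):
    """Simplified local SSIM computed with average pooling (assumes inputs in [0, 1])."""
    mu_x = F.avg_pool2d(x, win, 1, padding=win // 2)
    mu_y = F.avg_pool2d(y, win, 1, padding=win // 2)
    sigma_x = F.avg_pool2d(x * x, win, 1, padding=win // 2) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, win, 1, padding=win // 2) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, win, 1, padding=win // 2) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1).mean()

def depth_loss(pred, gt, alpha=0.85):
    """Weighted combination of SSIM and L1 terms; alpha is an assumed balance factor."""
    l1 = torch.mean(torch.abs(pred - gt))
    return alpha * (1.0 - ssim(pred, gt)) + (1.0 - alpha) * l1

pred = torch.rand(2, 1, 128, 160)
gt = torch.rand(2, 1, 128, 160)
print(depth_loss(pred, gt).item())
```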
Mean Squared Error (MSE) is defined as a measure of the average squared error between
the predicted depth and the true depth value. MSE is calculated using the following
formula:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2, \qquad (4)$$

where $n$ is the total number of predicted pixels, and $Y_i$ and $\hat{Y}_i$ are the $i$-th predicted depth
and actual depth, respectively.
Mean Absolute Error (MAE) is a common loss function for deep learning-based methods.
The authors use this metric to represent the per-pixel difference between the ground truth
and the predicted depth, averaged over all pixels in the image. MAE is calculated using
the following formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|. \qquad (5)$$
Absolute relative error (Abs Rel): measures the average of the absolute relative
difference between predicted depth values and actual ground-truth depth values, normalized
by the ground-truth depth values. Abs Rel is calculated as follows:
$$\mathrm{Abs\_Rel} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| Y_i - \hat{Y}_i \right|}{\hat{Y}_i}. \qquad (7)$$
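For reference, the evaluation metrics above, together with the threshold accuracies (denoted σ < 1.25, σ < 1.25², σ < 1.25³ in the result tables), can be computed directly; the following NumPy sketch uses illustrative variable names and a simple validity mask as an assumption.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """MAE, MSE, Abs Rel, and threshold accuracies max(pred/gt, gt/pred) < 1.25^k."""
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > eps               # ignore pixels without ground-truth depth (assumption)
    pred, gt = pred[valid], gt[valid]
    mae = np.mean(np.abs(pred - gt))
    mse = np.mean((pred - gt) ** 2)
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]
    return {"MAE": mae, "MSE": mse, "Abs-Rel": abs_rel,
            "sigma<1.25": deltas[0], "sigma<1.25^2": deltas[1], "sigma<1.25^3": deltas[2]}

pred = np.random.rand(480, 640) + 0.1
gt = np.random.rand(480, 640) + 0.1
print(depth_metrics(pred, gt))
```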
Based on the results observed during the training phase, FDE-Net demonstrates its
competitive advantages. As illustrated in Fig. 3, the monitored values converge quickly
to the thresholds necessary for the model to work effectively.
Parametric fluctuations do not occur, indicating the stability of the model. The proposed
approach shows a high degree of adaptability to the datasets used for training. Specifically,
this is demonstrated in Table 1 with three datasets: NYU-V2, KITTI and Cityscapes.
Although trained for only a modest number of epochs, the proposed model still provides
highly competitive accuracy indicators compared with other methods in the same training
scenario. The predicted values differ only slightly from the ground truth.
Dataset MAE MSE Abs-Rel σ < 1.25 σ < 1.25² σ < 1.25³
NYU-V2 0.0493 0.0064 0.162 0.8722 0.9641 0.9811
KITTI 0.030 0.0030 0.5484 0.7015 0.8007 0.9042
Cityscapes 0.040 0.0033 0.4599 0.733 0.8126 0.9287
FDE-Net has been designed with a moderate size and resource cost. By harnessing the
benefits of DepthSepConv layers, the model effectively utilizes its trained parameters.
The authors examine the performance of FDE-Net and Resnet18 + Upconv based on
various metrics, including the number of parameters, FLOPs, MACs, latency, and
throughput, with the outcomes laid out in Table 2.
Table 2. Comparison with other methods using different backbones on the parameters of
flexibility and model weight.
Remarkably, FDE-Net uses only 0.04 times the number of parameters of Resnet-Upconv.
Furthermore, its FLOPs and MACs demonstrate much more efficient computation in
comparison to the alternative model. The latency of FDE-Net is notably 4.2 times lower,
while its throughput is 3.9 times higher than that of Resnet18 + Upconv. Compared with
the other methods, the proposed model shows superior agility and speed of operation by
significantly reducing the computational volume.
These findings underscore FDE-Net as a nimble model with a small number of parameters.
The effective use of a small parameter count significantly reduces computational and
storage costs for embedded or edge devices, thereby notably enhancing computational
performance and data throughput. Superior compatibility with limited-resource systems
and mobile robots is thus achieved.
Comparison with previously introduced methods shows that FDE-Net is remarkably
competitive. The evaluation metrics illustrated in Table 3 demonstrate its effectiveness in
predicting the depth map of the environment. To summarize, FDE-Net achieves
outstanding inference accuracy with a small model size and weight. The results show
that, compared to current best practices, the proposed strategy performs quite well in the
experiments. When coupled with efficient machine learning methods such as dropout and
knowledge distillation, FDE-Net's size advantage and flexibility make it suitable for
limited-resource systems.
Model MSE MAE Abs-Rel σ < 1.25 σ < 1.25² σ < 1.25³
MRF [19] 0.0082 0.074 0.623 0.800 0.928 0.280
Alex-Net [9] 0.030 0.0030 0.5484 0.7015 0.8007 0.9042
FCN [20, 21] 0.040 0.0033 0.4599 0.733 0.8126 0.9287
SSDM [22] 0.0068 0.052 0.803 0.935 0.946 0.183
Ours 0.0063 0.050 0.709 0.949 0.9641 0.184
Next, Fig. 4 illustrates the inference results obtained on the KITTI dataset. It is readily
apparent that the proposed method's output closely matches the objects' real depth values.
The predictions do not show incorrect depth assignments or confusion in overlapping
regions, which would otherwise distort the final projection. The proposed method
demonstrates reliable effectiveness in creating depth maps from monocular cameras.
Furthermore, semantic segmentation data is successfully integrated with the depth data,
gradually building up a knowledge model that helps the system accurately grasp the
characteristics of the environment.
5 Conclusion
The paper proposes FDE-Net, a novel monocular depth estimation solution that efficiently
extracts depth information with minimal computational cost. Moreover, the integration of
DepthSepConv, combined with the model optimization techniques of squeeze-and-excitation
and the Adam optimizer, improves the efficiency of the proposed model. Experiments on
the three datasets Cityscapes, KITTI, and NYU-V2 yield remarkable evaluation results.
Notably, FDE-Net uses only 0.04 times the number of parameters of Resnet-Upconv. Its
computational efficiency is emphasized by the FLOP and MAC counts, demonstrating a
notable advantage over alternative models. FDE-Net achieves a 4.2 times shorter latency,
along with a 3.9 times increase in throughput, compared to Resnet18-Upconv. Consequently,
this model has great potential for seamless integration into both real-time scenarios and
simulated environments. Additionally, the distance data from the viewpoint to the center
of the object, as provided by
FDE-Net, is confirmed to be valid. Future work will investigate and evaluate the ability
to combine depth maps with semantic information about the environment to enhance
knowledge-based systems for mobile robots.
Acknowledgments. This work was supported by Vingroup Innovation Foundation (VINIF) under
Project code VINIF.2023.DA089.
References
1. Dang, T.V., Bui, N.T.: Multi-scale fully convolutional network-based semantic segmentation
for mobile robot navigation. Electronics 12(3), 533 (2023)
2. Dang, T.V., Bui, N.T.: Obstacle avoidance strategy for mobile robot based on monocular
camera. Electronics 12(8), 1932 (2023)
3. Huang, K.C., Wu, T.H., Su, H.T., Hsu, W.H.: MonoDTR: monocular 3D object detection with
depth-aware transformer. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 4002–4011 (2022)
4. Sun, H., et al.: Transformer-based Stereo-Aware 3D Object Detection from Binocular Images.
arXiv:2304.11906 (2023)
5. Gao, R., et al.: Unsupervised learning of monocular depth and ego-motion in outdoor/indoor
environments. IEEE Internet Things J. 9(17), 16247–16258 (2022)
6. Xiong, M., et al.: Self-supervised monocular depth and visual odometry learning with scale-
consistent geo-metric constraints. In: Proceedings of the Twenty-Ninth International Joint
Conference on Artificial Intelligence, pp. 963–969 (2020)
7. Nguyen, V.T., Nguyen, A.T., Nguyen, V.T., Bui, H.A.: A real-time human tracking system
using convolutional neural network and particle filter. In: ICISN 2021, Intelligent Systems
and Networks 50(243), 411–417 (2021)
8. Godard, C., Aodha, O.M., Firman, M., Brostow, G.J.: Digging into self-supervised monocular
depth estimation. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
pp. 3827–3837 (2019)
9. Eigen, D., Fergus, R.: Predicting depth, surface normal and semantic labels with a common
multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference
on Computer Vision (ICCV), pp. 2650–2658 (2015)
10. Zhou, T., Brown, M.A., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-
motion from video. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 6612–6619 (2017)
11. Wofk, D., Ma. F., Yang. T.J., Karaman, S., Sze, V.: FastDepth: fast monocular depth estimation
on embedded systems. In International Conference on Robotics and Automation (ICRA),
pp. 6101–6108 (2019)
12. Rudolph, M.B., Dawoud, Y., Guldenring, R., Nalpantidis, L., Belagiannis, V.: Lightweight
monocular depth estimation through guided decoding. In: International Conference on
Robotics and Automation (ICRA), pp. 2344–2350 (2022)
13. Zhou, Z., Fan, X., Shi. P., Xin, Y.: R-MSFM: recurrent multi-scale feature modulation
for monocular depth estimating. IEEE/CVF International Conference on Computer Vision
(ICCV), pp. 12757–12766 (2021)
14. Zhang, N., Nex, F., Vosselman, G., Kerle, N.: Lite-mono: a lightweight CNN and transformer
architecture for self-supervised monocular depth estimation. In: IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR) (2023)
15. Cheng, C., et al.: PP-LCNet: A Lightweight CPU Convolutional Neural Network. arXiv:2109.
15099v1 (2021)
16. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
17. Tu, X., et al.: Efficient monocular depth estimation for edge devices in internet of things.
IEEE Trans. Industr. Inf. 17(4), 2821–2832 (2021)
18. Kubilay, M.S., Esat, K., Fatma, B.C., Ahmet, A.: Comparative parotid gland segmentation
by using ResNet-18 and MobileNetV2 based DeepLab v3+ architectures from MR images.
Concurrency and Computation Practice and Experience 35(1), e7405 (2023)
19. Saxena, A., Sun, M., Ng, A.Y.: Make3D: learning 3D scene structure from a single still image.
IEEE Trans. Pattern Anal. Mach. Intell. 31(5), 824–840 (2009)
20. Dang, T.V., Tran, D.M.C., Phan, X.T.: IRDC-net: lightweight semantic segmentation network
based on monocular camera for mobile robot navigation. Sensors 23(15), 6907 (2023)
21. Dang, T.V., Phan, X.T.: Hybrid mobile robot path planning using safe JBS-A*B algorithm
and improved dwa based on monocular camera. J. Intell. Rob. Syst. 110(151), 1–21 (2024)
22. Israr, H., Shunquan, T., Jiwu, H.: A semi-supervised deep learning approach for cropped
image detection. Expert Syst. Appl. 243(5), 122832 (2023)
Language-Guided Video Object
Segmentation
1 Introduction
Referring video object segmentation (RVOS) is an increasingly important task
in computer vision that combines visual and natural language processing to
segment objects in video frames based on textual descriptions [13, 14]. Unlike
traditional video object segmentation, which depends solely on visual cues,
2 Related Work
Referring Video Object Segmentation (RVOS) is an evolving field [13, 14] in com-
puter vision that combines visual and linguistic information to segment objects
in videos based on natural language descriptions. Early approaches primarily
focused on leveraging static visual attributes such as color and shape to iden-
tify objects, often extending techniques from image segmentation to video by
using per-frame mask propagation and memory attention modules. Although
these methods achieved some success, they struggled with dynamic and motion-
centric content inherent in videos [1].
Traditional RVOS datasets, including DAVIS17-RVOS [15] and Refer-
YouTube-VOS [3], focused on salient objects with limited motion information.
These datasets often allowed the target object to be identified based on static
features, enabling strong performance using models initially designed for image
segmentation [1]. However, these benchmarks neglected the importance of tem-
poral dynamics, which are essential for real-world applications where motion,
rather than static appearance, defines the identity of the object [1].
To address these limitations, the MeViS dataset was developed to emphasize
motion expressions in RVOS tasks. MeViS is distinguished by its complexity: it
includes multiple objects of the same category within a video, longer sequences,
and focuses on dynamic attributes. This poses unique challenges for existing
RVOS methods that typically rely on static information [1]. The dataset is a
crucial contribution as it forces models to incorporate temporal context and
motion understanding into segmentation decisions, representing a significant leap
from previous datasets [1].
Recent advancements in RVOS architectures have shifted towards
transformer-based methods, which have proven effective in combining visual and
linguistic cues. Approaches such as MTTR [16] and ReferFormer [17] introduced
end-to-end frameworks that use transformers to model multi-modal interactions
between object queries and textual descriptions. These models rely on robust
multi-modal alignments and temporal-aware interactions to achieve strong per-
formance on existing benchmarks. However, when applied to motion-centric
datasets like MeViS, these models often struggle due to their reliance on static
visual cues and shorter video sequences used during training [1].
Innovative methods such as SOC [18], MUTR [19] have emerged to address
these challenges by focusing on temporal and motion-aware interactions. These
models attempt to unify object detection across frames by leveraging transform-
ers and multi-modal attention mechanisms to maintain coherent segmentation
throughout a video sequence. Despite this progress, achieving consistent per-
formance on long and complex videos remains a challenge, particularly when
motion plays a critical role in identifying objects.
3 Methods
3.1 Language-Guided Motion Perception and Matching
In this paper, we leverage the LMPM [1] framework as our main architec-
ture. Figure 1 illustrates the overall architecture of LMPM. This architecture
is inspired by VITA [9] and introduces several improvements over it.
In Sects. 3.2 and 3.3, we discuss in more depth how the previous version VITA [9], as
well as the frame-level detector Mask2Former [6], works, in order to better understand
this architecture.
The input first goes through the frame-level detector Mask2Former [6]. The object
embeddings produced by its multi-scale transformer decoder represent the objects and
provide instance-specific information, which reduces the computational requirements
compared to dense features [7, 8]. LMPM then puts the object embeddings
through layers of Transformer Encoder. Indeed, these layers perform motion per-
ception by inter-frame self-attention on the object embeddings to obtain a global
view across T frames. Motion perception enables object embeddings to capture
temporal contextual information that spans multiple frames or even the entire
video. We discuss these Encoders more in Sect. 3.3 [9]. Then, it uses N2 language
queries as the query and the object embeddings after Motion Perception as the
key and value for the Transformer Decoders. The Transformer Decoders decode
language-related information from all object embeddings and aggregate relevant
information to predict object trajectories. Finally, it matches the language fea-
tures with the predicted object trajectories to identify the target object(s) by
using a matching threshold σ, as we mentioned above.
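A minimal sketch of this final matching step is given below: trajectory embeddings are compared with the sentence-level language feature, and all trajectories whose similarity exceeds the threshold σ are kept as targets. The use of cosine similarity and the projection details are assumptions, since the text only specifies that a threshold (σ = 0.8 later in the paper) is applied.

```python
import torch
import torch.nn.functional as F

def match_trajectories(traj_embed, lang_embed, sigma=0.8):
    """Select target trajectories whose similarity to the language feature exceeds sigma.

    traj_embed: (N, C) trajectory embeddings from the Transformer decoder.
    lang_embed: (C,) sentence-level language feature.
    Returns indices of the selected trajectories (possibly several, or none).
    """
    sim = F.cosine_similarity(traj_embed, lang_embed.unsqueeze(0), dim=-1)  # (N,)
    return torch.nonzero(sim > sigma).flatten()

traj = F.normalize(torch.randn(10, 256), dim=-1)
lang = F.normalize(torch.randn(256), dim=-1)
print(match_trajectories(traj, lang, sigma=0.8))
```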
VITA consists of a Frame-level Detector, an Object Encoder, and an Object Decoder.
Each of these components plays a distinct role in ensuring the
system’s ability to capture object information from individual frames and aggre-
gate it over time to accurately track instances across an entire video sequence.
The Frame-level Detector processes each of the T input frames independently,
generating two key features that serve as the foundation for further processing within
VITA. The first feature is the Frame Queries $\{f^t\}_{t=1}^{T} \in \mathbb{R}^{C \times T \times N_f}$, which encapsulate
object-centric information distilled from each frame. The second feature is the Per-Pixel
Embeddings $\{M^t\}_{t=1}^{T} \in \mathbb{R}^{C \times T \times \frac{H}{S} \times \frac{W}{S}}$, which are produced by the pixel decoder of the
Frame-level Detector. These embeddings provide dense pixel-level representations, which
will be used later in the model for mask prediction.
The Object Encoder gathers the Frame Queries from all frames and trans-
forms them into object tokens through a linear layer. To handle long video
sequences efficiently, VITA employs a window-based self-attention mechanism
inspired by the Swin Transformer [12]. This mechanism partitions the object
tokens into local windows along the temporal axis, facilitating communication
between frames without the prohibitive computational cost of a naive self-
attention approach. By alternately shifting these windows across frames, VITA
ensures that object tokens from different frames can effectively exchange infor-
mation, allowing it to handle long sequences in a scalable manner.
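The window-based temporal self-attention could be sketched roughly as follows: object tokens of shape (T, N, C) are partitioned into non-overlapping windows along the time axis, self-attention is applied within each window, and windows are optionally shifted between layers. The window size, padding scheme, and shift handling here are simplifications, not VITA's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    """Self-attention over object tokens within non-overlapping temporal windows."""
    def __init__(self, dim=256, heads=8, window=6):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, shift=False):
        # tokens: (T, N, C) -- T frames, N object tokens per frame, C channels
        T, N, C = tokens.shape
        w = self.window
        if shift:                                # alternate shifted windows across layers
            tokens = torch.roll(tokens, shifts=w // 2, dims=0)
        pad = (w - T % w) % w                    # pad T to a multiple of the window size
        if pad:
            tokens = torch.cat([tokens, tokens.new_zeros(pad, N, C)], dim=0)
        x = tokens.view(-1, w, N, C)             # (num_windows, w, N, C)
        x = x.reshape(x.shape[0], w * N, C)      # flatten tokens inside each window
        x, _ = self.attn(x, x, x)                # windowed self-attention
        x = x.reshape(-1, w, N, C).reshape(-1, N, C)[:T]
        if shift:
            x = torch.roll(x, shifts=-(w // 2), dims=0)
        return x

tokens = torch.randn(20, 10, 256)                # 20 frames, 10 object tokens each
layer = TemporalWindowAttention()
print(layer(tokens, shift=True).shape)           # torch.Size([20, 10, 256])
```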
To address the challenges posed by dynamic scenes and long videos, VITA
replaces traditional dense spatio-temporal features with object tokens in its
Object Decoder. The Object Decoder uses Nv trainable video queries to extract
object-wise information from all object tokens across the video. These video
queries are decoded into the final predictions, which include class probabilities
and mask logits. VITA’s class head acts as a linear classifier, predicting the
class probabilities of each instance, while the mask head dynamically generates
mask embeddings for each video query, corresponding to an object’s trajectory
across frames. This approach enables faster convergence and more accurate video
instance segmentation by effectively capturing video contexts and aggregating
relevant information into compact representations.
VITA introduces clip-wise losses to optimize the model’s predictions while
maintaining consistency across frames. Instance Matching in VITA is designed
to efficiently pair predictions with ground-truth annotations by extending the
Mask2Former cost function to the temporal axis, considering the video con-
text. Using the Hungarian algorithm, VITA determines optimal matching pairs
between the Nv video queries and the ground-truth annotations, eliminating the
need for post-processing techniques such as Non-Maximum Suppression (NMS).
This ensures that the model’s predictions are well-aligned with the ground-truth,
even across complex video sequences.
VITA further improves instance tracking through the use of Similarity Loss,
which helps preserve object identity across frames. Inspired by MaskTrack R-
CNN, the similarity loss encourages consistency between frame-level and video-
level queries by using binary cross-entropy to measure the similarity between
matching object queries. Queries representing the same object are assigned a
label of 1, while those representing different objects are labeled as 0. This loss
function helps cluster queries with the same identity closer together in the latent
space, improving the model’s ability to track objects throughout the video.
The total loss used for training VITA is a weighted combination of several key
components. First, the frame-level loss is applied to per-frame outputs, following
the approach used in Mask2Former. Second, the video-level loss Lv is applied to
video-level outputs and is computed in a similar way to the frame-level loss but
extended across the temporal axis to account for video context. Finally, the sim-
ilarity loss Lsim is included to enhance identity consistency between frame and
video queries. The total loss is expressed as a weighted sum of these components:
Ltotal = λv Lv + λf Lf + λsim Lsim . This comprehensive loss formulation ensures
that VITA performs robustly across both frame and video levels, enabling accu-
rate and efficient video instance segmentation.
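A minimal sketch of the similarity loss and the weighted total loss described above is given below; how matched frame/video query pairs and their 0/1 labels are produced, as well as the loss weights, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def similarity_loss(frame_q, video_q, labels):
    """BCE between frame-level and video-level query similarities.

    frame_q: (Nf, C) frame queries, video_q: (Nv, C) video queries,
    labels: (Nv, Nf) with 1 where the pair refers to the same object, else 0.
    """
    sim = torch.sigmoid(video_q @ frame_q.t())          # (Nv, Nf) similarity scores
    return F.binary_cross_entropy(sim, labels.float())

def total_loss(l_frame, l_video, l_sim, lambdas=(1.0, 1.0, 0.5)):
    """L_total = lambda_v * L_v + lambda_f * L_f + lambda_sim * L_sim (weights assumed)."""
    lv, lf, ls = lambdas
    return lv * l_video + lf * l_frame + ls * l_sim

frame_q, video_q = torch.randn(100, 256), torch.randn(20, 256)
labels = (torch.rand(20, 100) > 0.9)
print(similarity_loss(frame_q, video_q, labels).item())
```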
4 Experiments
4.1 Datasets
In our study, we use two datasets, these are MeViS [1] and Refer-Youtube-Vos-
2021 [3]. MeViS (Motion Expressions Video Segmentation) [1] is a new large-
scale dataset focusing on video object segmentation guided by motion expres-
sions. Unlike previous datasets that rely on static attributes, MeViS emphasizes
motion, with 2,006 videos and 28,570 motion-based expressions referring to 8,171
objects. The dataset is divided into training, validation, and testing subsets
and includes complex scenarios where multiple objects with similar appearances
require motion-based identification. Compared to Refer-Youtube-VOS, MeViS
features longer videos, more objects (4.28 per video), and exclusively motion-
centric expressions, making it more challenging and realistic. The dataset sup-
ports multi-object expressions and requires understanding temporal motion for
effective segmentation, offering a valuable resource for studying video under-
standing in dynamic environments.
We adjusted the model parameters from the original paper [1] and experimented
with various backbones [2], including Tiny Swin Transformer, Base Swin Trans-
former, and Large Swin Transformer. Unlike the original study, which used
Roberta, we opted for Roberta-Large as the text encoder. We trained the modi-
fied models with either 100,000 or 150,000 iterations on the Refer-Youtube-VOS
2021 [3] and MeViS dataset [1], respectively, and then evaluated them on the
MeViS validation set. For each version of the Swin Transformer, we also modify
configuration parameters such as the dimensionality of the feature embeddings,
the number of transformer blocks at each stage, the number of attention heads
in the multi-head self-attention, the size of the local window for self-attention,
and the pretrain image size to suit the current backbone.
The MeViS dataset poses significant challenges in detecting and understand-
ing object motions in both video and language contexts. Language expressions
describe motions that can span a random number of frames, requiring metic-
ulous frame-by-frame analysis to detect short-term actions and a comprehen-
sive understanding of extended motions. Current state-of-the-art methods, as
referenced in studies [1], typically rely on random frame sampling, risking omit-
ting crucial information. These methods also struggle with extracting tempo-
ral context, often defaulting to spatial-temporal feature extractors due to high
computational demands. Additionally, language expressions can describe vary-
ing numbers of objects, necessitating variable outputs from the model. Some
parameters remained unchanged, such as the number of layers in motion percep-
tion (six layers) and the Transformer decoder (three layers), as well as specific
hyperparameters (σ, N1 , and N2 ) set to 0.8, 20, and 10, respectively.
In Table 1, we report our results on the validation-unit set in the MeViS dataset,
as we utilized the baseline method [1], adjusted the backbone architecture and
5 Conclusion
In this paper, we leverage LMPM as our baseline method to address the problem of
Referring Video Object Segmentation posed by the MeViS challenge. We experiment
with several Swin Transformer backbone architectures and obtain improved results in the
ablation studies. We also apply knowledge distillation to obtain a lightweight model that
nevertheless achieves better performance
when evaluated on the validation set. Despite these successes, we will focus on the failure
cases to find better methods and further enhance the results of our model in the near
future.
References
1. Henghui, D., et al.: MeViS: a large-scale benchmark for video segmentation with
motion expressions. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision (2023)
2. Liu, Z., et al.: Video Swin transformer. In: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition (2022)
3. Seo, S., Lee, J.-Y., Han, B.: URVOS: unified referring video object segmentation
network with a large-scale benchmark. In: Computer Vision ECCV 2020: 16th
European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XV
16. Springer (2020)
4. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1),
79–86 (1951)
5. Cheng, B., et al.: Masked-attention mask transformer for universal image segmen-
tation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (2022)
6. Cheng, B., Choudhuri, A., Misra, I., Kirillov, A., Girdhar, R., Schwing, A.G.:
Mask2Former for video instance segmentation (2021)
7. Li, X., et al.: Transformer-based visual segmentation: a survey. IEEE Trans. Pat-
tern Anal. Mach. Intell. (2024)
8. Li, X., et al.: Tube-Link: a flexible cross tube framework for universal video segmen-
tation. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (2023)
9. Heo, M., et al.: Vita: video instance segmentation via object token association.
Adv. Neural. Inf. Process. Syst. 35, 23109–23120 (2022)
10. Carion, N., et al.: End-to-end object detection with transformers. In: European
Conference on Computer Vision. Springer, Cham (2020)
11. He, K., et al.: Deep residual learning for image recognition. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (2016)
12. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (2021)
13. Liu, S., et al.: Cross-modal progressive comprehension for referring segmentation.
IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 4761–4775 (2021)
14. Hui, T., et al.: Collaborative spatial-temporal modeling for language-queried video
actor segmentation. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2021)
15. Jordi, P.-T., et al.: The 2017 Davis challenge on video object segmentation. arXiv
preprint arXiv:1704.00675 (2017)
16. Botach, A., Zheltonozhskii, E., Baskin, C.: End-to-end referring video object seg-
mentation with multimodal transformers. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (2022)
17. Wu, J., et al.: Language as queries for referring video object segmentation. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (2022)
18. Luo, Z., et al.: Soc: semantic-assisted object cluster for referring video object seg-
mentation. Adv. Neural Inf. Process. Syst. 36 (2024)
19. Yan, S., et al.: Referred by multi-modality: a unified temporal transformer for
video object segmentation. In: Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 38. no. 6 (2024)
20. Hinton, G.: Distilling the knowledge in a neural network. arXiv preprint
arXiv:1503.02531 (2015)
MythraGen: Two-Stage Retrieval
Augmented Art Generation Framework
1 Introduction
In art analysis, content and style are two fundamental elements. Content
describes the concepts depicted in the image, such as objects, people, or loca-
tions. Style, on the other hand, describes the visual appearance of the artwork,
including its color, composition, and shape. Furthermore, each artist expresses
his/her own style, creating unique features in his/her works. Through the unique
combination of content and the artist’s individual style that makes each piece of
art special.
Recent advances in deep learning have facilitated powerful breakthroughs in
text-to-image generation (T2I) [1, 6, 14, 19, 22, 26]. T2I methods can incorporate
a specific style into generated images by taking a textual description of the style
as a prompt. However, conveying artistic style through text descriptions has
limitations. These descriptions are often less expressive and informative than
visual representations of style, so the style features of T2I outputs are often
rough and lack details.
Recent fine-tuning techniques such as DreamBooth [20], Textual Inversion
[8], and Low-Rank Adaptation (LoRA) [11] can enhance adaptability to specific
T2I generation tasks, and show convincing capability in creating images with
unique content and style. Among these methods, LoRA stands out and has gained
extensive adoption among art designers and T2I enthusiasts, due to its advantages
of low cost and computational efficiency, making it user-friendly and suitable
for consumer devices. However, an artist can paint in many different styles.
When extending this to hundreds of artists, the need to fine-tune or retrain the
model for each artist’s style becomes impractical. This process requires a vast
amount of computational resources and time, making these methods unrealistic
for large-scale application.
To address these issues, we propose MythraGen, a simple yet efficient retrieval
augmented art generation framework. First, we employ a retrieval technique to
search for paintings from an external database that have the highest similarity in
content, genre, and style to the artist described in the prompt. Our art retrieval
technique leverages BLIP-2 to encode both the visual features of each image and
its associated metadata, including captions, genre, style, and artist information.
These encoded features are combined into a comprehensive feature vector, which
is indexed using FAISS [7] to facilitate the retrieval of relevant images. Then,
we utilize the LoRA algorithm [11] to fine-tune Stable Diffusion on the identified
paintings. This two-stage framework allows for the flexible combination of dif-
ferent styles from various artists and content, optimizing image generation while
ensuring the quality of the created images.
In this paper, the WikiArt dataset [23], consisting of 80,000 unique images
from more than 1,100 artists across 27 styles, was used as the external art
database. We also leveraged a zero-shot classifier based on Visual Question
Answering (VQA) to annotate genres for around 16,000 images with missing labels.
Extensive experiments and a user study showcase the impressive performance of our
MythraGen, which outperforms existing open-source and commercial image
generation methods in all evaluation metrics.
Our contributions can be summarized as follows:
2 Related Work
2.1 Art Retrieval
Similarity search algorithms have played a crucial role in many artificial intelli-
gence applications, with K-Nearest Neighbors (KNN) and Approximate Nearest
Neighbors (ANN) being commonly used methods. KNN works by finding the
closest points in the dataset to a query point based on a specified distance metric,
making it useful for classification and regression tasks. ANN, on the other hand,
provides faster search results by approximating the nearest neighbors, making
it suitable for high-dimensional data where exact searches are computationally
expensive.
The Facebook AI Similarity Search (FAISS) library [7] is a powerful
tool for efficient similarity search and clustering of high-dimensional data,
enabling developers to quickly find similar items in large datasets. It is particu-
larly useful for tasks such as image retrieval, recommendation systems, and nat-
ural language processing, where finding similar items in large datasets is crucial.
Therefore, this paper utilized FAISS due to its ability to quickly and accurately
search for similar embedding vectors in latent space. This is especially useful
when working with large datasets like WikiArt (around 80k images), allowing
us to rapidly gather relevant images to support image generation processes.
To ensure accurate retrieval, the alignment of image and text embeddings
is crucial for effective cooperation. Inspired by the vision-language pre-training
model BLIP-2 [12], which produces high-quality text-aligned visual representa-
tions, we adapted it to extract text-aligned subject representations, improving
the model’s ability to understand and generate content across modalities.
By integrating BLIP-2 [12] with FAISS, we harness the strengths of advanced
vision-language pre-training models and efficient similarity search algorithms.
This combination allows us to improve the accuracy of image-text retrieval,
providing a more precise and comprehensive dataset for further applications.
Fig. 2. Overview of the proposed MythraGen framework, with two main stages: (a)
the Art Retrieval retrieves images to enhance the image generation process and (b)
the Art Generation generates images based on the user’s input text combined with the
images provided by the Art Retrieval module.
Early text-to-image models [17, 24, 25, 27] made significant progress by utiliz-
ing Generative Adversarial Networks (GANs) [8] trained on large paired image-
caption datasets, which can lead to mode collapse issues [2, 5, 10]. Recently, diffu-
sion models [9, 15, 18, 22] have become powerful in text-to-image (T2I) tasks due
to their ability to generate high-quality images and their flexibility in adapting
images to the context of the text. GLIDE [15] and Imagen [21] employ classifier-
free guidance by replacing the label in the class-conditioned diffusion model with
text descriptions of the images. Stable Diffusion [19] utilizes VQ-GAN for the
latent representation of images, enhancing photorealism through an adversar-
ial objective. DALL-E2 [16] employs a multimodal latent space where image
and text embeddings are aligned to generate images reflecting a deeper level of
language understanding. However, these models often struggle when handling
prompts containing less common genres or styles related to artists. Instead of
generating a suitable image, they tend to either create nonexistent genres or
styles or use similar but more popular ones, which does not align with the user’s
intent, leading to a mismatch between the original prompt and the final image.
Unifying both image retrieval and generation processes, Re-Imagen [4]
addressed augmenting rare entities to improve image quality. However, our goal
is to use retrieval to enhance the less common drawing styles of various artists,
while ensuring that when these styles are applied, the main content of the prompt
remains preserved. Additionally, Re-Imagen trains its image generation process
on the cascaded diffusion model [10], while our method uses LoRA [11] to fine-
tune the Stable Diffusion model [19] to reduce the required resources and speed
up the training process.
3 Proposed Method
3.1 Overview
In this section, we present our approach for retrieval from an external database
to find images with the highest similarity in content, genre, and style to the
artist described in the input query. Figure 3 provides a visual representation
of the Art Retrieval module, built around the Bootstrapping Language Image
Pre-training (BLIP-2 [12]) architecture. First, the 256-dimensional vectors from
the feature, genre, artist, and style databases, which correspond to the same artwork,
are concatenated into a single 1024-dimensional vector. This vector is then fed
to the FAISS system for indexing to support the retrieval process. When the
input query is processed by the Text Embedding Generator, generated by the
Q-Former component of BLIP-2 [12], the resulting vector is sent to the FAISS
system to return a set of images with the highest similarity in content, genre,
and style to the artist described in the input query.
The Image Embedding Generator consists of two main components: a frozen
pre-trained image encoder and a multimodal encoder called Q-Former (see
Fig. 4). The process begins with the input image, which is passed through the
image encoder to extract visual features. These features are then combined with
learnable query tokens and fed into the Q-Former. The output of the Q-Former
is then passed through a fully connected layer and normalized to produce the
final image feature vector. Similarly, the Text Embedding Generator tokenizes
Fig. 4. Multimodal representation, where image and text embeddings processes rely
primarily on the Q-former.
the input text and processes it through the Q-Former. The output from the Q-
Former is then passed through a fully connected layer and normalized to produce
the final text feature vector.
To improve the retrieval performance, we combine the image embedding with the
caption and genre embeddings. Let $E_i$ with $i \in \{\text{image}, \text{caption}, \text{genre}\}$ denote
the image, caption, and genre embeddings. We compute the weighted sum of the
embeddings as follows:

$$V = \sum_{i} W_i \cdot E_i, \qquad (1)$$
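A minimal sketch of how the weighted embedding of Eq. (1) and the FAISS index might be wired together is shown below. The BLIP-2 feature extraction is abstracted away, the weights follow the values reported later in the implementation section, and cosine similarity via an inner-product index is an assumption; the concatenation of the four 256-dimensional vectors described in the overview is not reproduced here.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 256  # per-modality embedding dimension

def combined_embedding(e_image, e_caption, e_genre, w=(1.0, 0.9, 0.75)):
    """Weighted sum V = sum_i W_i * E_i over image, caption, and genre embeddings (Eq. 1)."""
    v = w[0] * e_image + w[1] * e_caption + w[2] * e_genre
    return v / (np.linalg.norm(v) + 1e-8)          # normalize for cosine similarity

# Build the index over the external art database (vectors assumed precomputed).
db = np.random.rand(1000, d).astype("float32")      # stand-in for ~80k WikiArt embeddings
faiss.normalize_L2(db)
index = faiss.IndexFlatIP(d)                        # inner product == cosine on unit vectors
index.add(db)

# Retrieve the top-k artworks most similar to a text query embedding.
query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)
print(ids[0])
```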
Fig. 5. LoRA combination in Art Generation module. The first LoRA model is fine-
tuned for the artist and style (Wstyle+artist ), while the second LoRA model is fine-
tuned for the genre and style (Wgenre+style ). Both are combined and applied to the
Stable Diffusion model to generate an image that faithfully reflects the input prompt,
incorporating the specific style, artist, and genre.
4 Experiments
4.1 Implementations
In our experiments, we leveraged the PyTorch deep learning framework on a
computer equipped with an NVIDIA RTX A4500 GPU with 24 GB. For the Art
Fig. 6. Visual Question Answering (VQA) model used to classify the genre of an image
when the genre is unknown.
Retrieval module, we used BLIP-2 [12] pre-trained with ViT-L/14 to embed the
image and its related metadata. The weights in Eq. 1 are set to $W_{image} = 1.0$,
$W_{caption} = 0.9$, and $W_{genre} = 0.75$.
Regarding the Art Generation module, we employed Stable Diffusion V1.5
[19] as the backbone. We utilized two LoRA models for our experiments: one
fine-tuned for genre + style and another fine-tuned for artist + style. For
both LoRA, we set max_train_steps to 4095, using AdamW8bit as the opti-
mizer to save memory, with xformers enabled for memory optimization and
mixed_precision = “bf16” for faster computation without compromising qual-
ity. The learning rates were set to 1 × 10−4 for the U-Net and 5 × 10−5 for
the text encoder. The LoRA network was configured with network_dim = 32
and network_alpha = 1, adjusting the rank and scaling of the LoRA layers for
efficient fine-tuning.
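Assuming a recent diffusers release with PEFT-backed LoRA support, combining the two fine-tuned LoRA models (artist + style and genre + style) on top of Stable Diffusion v1.5 at inference time might look like the sketch below; the checkpoint paths, adapter weights, and prompt are placeholders, not the authors' settings.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Hypothetical paths to the two LoRA checkpoints produced by fine-tuning.
pipe.load_lora_weights("loras/artist_style.safetensors", adapter_name="artist_style")
pipe.load_lora_weights("loras/genre_style.safetensors", adapter_name="genre_style")

# Blend both adapters so the output reflects artist, style, and genre together.
pipe.set_adapters(["artist_style", "genre_style"], adapter_weights=[0.8, 0.6])

prompt = "a landscape of a fishing village at dawn in the style of Claude Monet"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("output.png")
```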
4.2 Dataset
We used WikiArt as the external database for all experiments in this paper.
WikiArt has complete labels for artists and styles, but over 16,452 images
are missing genre labels. Therefore, we employed a Visual Question Answering
(VQA) model (i.e., ShareGPT4V [3]) to label these 16,452 images according to 12
corresponding genres (genre painting, illustration, landscape, etc.) as described
in Fig. 6, where the model analyzes each image and assigns the appropriate genre
label based on visual content.
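The genre-labeling procedure of Fig. 6 can be sketched as follows. Here vqa_answer stands for any callable wrapping the VQA model (e.g., ShareGPT4V [3]); the exact inference call depends on its released implementation, and the genre list shown is only an illustrative subset of the 12 genres.

```python
GENRES = ["genre painting", "illustration", "landscape", "portrait",
          "religious painting", "still life"]  # illustrative subset of the 12 genres

def label_missing_genres(image_paths, vqa_answer):
    """vqa_answer: callable (image_path, question) -> str, wrapping a VQA model."""
    question = "Which genre best describes this painting? Answer with one of: " + ", ".join(GENRES)
    labels = {}
    for path in image_paths:
        answer = vqa_answer(path, question).lower().strip()
        labels[path] = answer if answer in GENRES else "unknown"
    return labels
```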
We performed retrieval on the WikiArt dataset using three separate categories: artists, styles, and genres, as well as a combined category that includes all three.
Genres
Model                            | mAP@5 | mAP@15 | mAP@25 | mAP@40 | mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP)  | 100%  | 98.4%  | 98.4%  | 98.6%  | 98.6%
BLIP2 using ViT-L/14 (CLIP)      | 100%  | 99.8%  | 99.7%  | 99.6%  | 99.5%
BLIP2 finetuned on COCO [13]     | 100%  | 98.8%  | 98.3%  | 97.8%  | 97.7%

Artists, Styles
Model                            | mAP@5 | mAP@15 | mAP@25 | mAP@40 | mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP)  | 100%  | 100%   | 100%   | 100%   | 100%
BLIP2 using ViT-L/14 (CLIP)      | 100%  | 100%   | 100%   | 100%   | 100%
BLIP2 finetuned on COCO [13]     | 100%  | 100%   | 100%   | 100%   | 100%

Genres + Artists + Styles
Model                            | mAP@5 | mAP@15 | mAP@25 | mAP@40 | mAP@50
BLIP2 using ViT-g/14 (EVA-CLIP)  | 90.5% | 92.1%  | 92.4%  | 92.4%  | 92.6%
BLIP2 using ViT-L/14 (CLIP)      | 92.8% | 94.8%  | 94.4%  | 93.8%  | 94.2%
BLIP2 finetuned on COCO [13]     | 93.4% | 94%    | 93.7%  | 93.7%  | 93.7%
All models achieve high accuracy for the top 5 results. For genres, BLIP2 with ViT-L/14 (CLIP) scores the highest mAP@50 at 99.5%, followed by ViT-g/14 (EVA-CLIP) at 98.6%, and the COCO-finetuned model at 97.7%. In artist-based and style-based retrieval, all models reach a perfect mAP of 1.0 at all levels, showing equal skill in retrieving artist and style information. However, when combining genre, artist, and style, the COCO-finetuned model performs best at mAP@5 with 93.4%, but BLIP2 with ViT-L/14 (CLIP) obtains the highest scores at the larger cutoffs (e.g., 94.2% at mAP@50).
Metric   | MythraGen | BingAI | Midjourney | SD
CLIP-T ↑ | 30.68     | 27.77  | 30.13      | 29.61
CLIP-I ↑ | 79.84     | 66.82  | 65.19      | 75.29
FID ↓    | 322.9     | 373.85 | 329.22     | 325.79
1 https://www.bing.com/images/create/.
2 https://www.midjourneyfree.ai/.
Fig. 7. Qualitative comparison with SOTA methods based on style reference image.
Methods like BingAI, Midjourney, and SD 2.0 lack specific stylistic information drawn
from the original artist, leading to difficulties in balancing content and style. In contrast,
MythraGen performs better in generating both style and content as intended.
For CLIP-I, MythraGen obtains the highest score (79.84), ahead of SD (75.29), BingAI (66.82), and Midjourney (65.19). Finally, for FID, which measures the quality of the images, MythraGen obtains the lowest score (indicating better performance) at 322.9, compared to SD (325.79), Midjourney (329.22), and BingAI (373.85).
Figure 7 illustrates the visual results of these methods. While SD [19] strug-
gles to balance style and content due to poor representation of the text extracted
from the reference image, the content of the images generated by BingAI and Midjourney is relatively closer to the prompt, but their style differs from the
style reference. In contrast, our method produces images that are more faith-
ful to the style of the reference image, especially regarding brushstrokes, lines,
etc. This demonstrates that our MythraGen method achieves a better balance
between content similarity, style similarity, and generated quality according to
objective metrics.
Fig. 8. Humans evaluate the methods based on two criteria: Faithfulness and Natural-
ness.
The user study involved 32 participants (48.4% of whom were male), aged between 11 and 60 (most between 11 and 20). Each participant was asked to score the outputs from the different methods on a scale of 1 (worst) to 5 (best) based on two primary criteria: faithfulness and naturalness. Each participant evaluated a total of 30 images per method.
Quantitative Results. Fig. 8 shows the quantitative results of the human eval-
uation, where MythraGen achieves the highest scores for both faithfulness and
naturalness. For faithfulness, MythraGen gets an average score of 3.09, clearly
improving over SD 2.0 (2.88), Midjourney (2.97), and BingAI (2.53). This result
shows that MythraGen is better at generating images that accurately match
the descriptions, both visually and textually. Similarly, for naturalness, Mythra-
Gen also outperforms other methods with an average score of 3.28. Participants
find MythraGen’s images more visually appealing and better aligned with the
described styles in the input prompt. These results demonstrate that MythraGen
not only excels in generating images that are faithful to their descriptions but
also produces outputs that accurately reflect the requested artistic style, proving
its effectiveness over SOTA methods in both accuracy and quality.
Our experimental results show that MythraGen is highly effective at generat-
ing images that faithfully represent both the text prompt and the desired artistic
style. By using a retrieval-augmented approach, MythraGen leverages existing
artwork to fine-tune the generation process, producing high-quality outputs that
closely match user expectations. Additionally, the use of LoRA for efficient fine-
tuning helps reduce computational costs while maintaining impressive perfor-
mance, making the model accessible even on less powerful hardware.
5 Conclusion
We have presented MythraGen, a retrieval-augmented art generation framework that not only improves the stylistic and contextual accuracy of generated artworks but also enables the incorpo-
ration of diverse artistic styles that meet the user’s expectations. Experimental
results demonstrate that MythraGen outperforms existing methods in gener-
ating images that faithfully reflect text descriptions and are highly natural, as
evidenced by user studies. We further demonstrate that our model is particularly
effective at generating images from text that requires a greater diversity of artis-
tic genres and periods. We believe that our work can inspire further innovations
in the intersection of art and artificial intelligence, fostering deeper engagement
with both creators and audiences.
References
1. Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time repre-
sentation for text-to-image personalization. ACM Trans. Graph. 42(6), 243 (2023).
https://doi.org/10.1145/3618322
2. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity
natural image synthesis. arXiv preprint (2018). https://api.semanticscholar.org/
CorpusID:52889459
3. Chen, L., et al.: ShareGPT4v: improving large multi-modal models with better
captions. arXiv preprint arXiv:2311.12793 (2023)
4. Chen, W., Hu, H., Saharia, C., Cohen, W.W.: Re-Imagen: retrieval-augmented
text-to-image generator. arXiv preprint (2022). https://api.semanticscholar.org/
CorpusID:252596087
5. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In:
NeurIPS, vol. 34, pp. 8780–8794 (2021). https://proceedings.neurips.cc/paper_
files/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf
6. Ding, M., et al.: CogView: mastering text-to-image generation via transformers.
In: NeurIPS, pp. 19822–19835 (2021). https://proceedings.neurips.cc/paper_files/
paper/2021/file/a4d92e2cd541fca87e4620aba658316d-Paper.pdf
7. Douze, M., et al.: The Faiss library. arXiv preprint arXiv:2401.08281 (2024)
8. Gal, R., et al.: An image is worth one word: personalizing text-to-image generation
using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
9. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. arXiv (2020).
https://api.semanticscholar.org/CorpusID:219955663
10. Ho, J., Saharia, C., Chan, W., Fleet, D.J., Norouzi, M., Salimans, T.: Cascaded
diffusion models for high fidelity image generation. JMLR 23(47), 1–33 (2022).
http://jmlr.org/papers/v23/21-0635.html
11. Hu, E.J., et al.: Lora: low-rank adaptation of large language models. In: ICLR
(2022). https://openreview.net/forum?id=nZeVKeeFYf9
12. Li, J., Li, D., Savarese, S., Hoi, S.C.H.: BLIP-2: bootstrapping language-image pre-
training with frozen image encoders and large language models. In: ICML (2023).
https://api.semanticscholar.org/CorpusID:256390509
13. Lin, T.Y., et al.: Microsoft coco: common objects in context. arXiv preprint
arXiv:1405.0312 (2014)
14. Liu, D., Fan, H., Liu, J.: Expogenius: robust personalized human image generation
using diffusion model for exposure variation and pose transfer. In: ICMR, pp. 239–
247 (2024). https://doi.org/10.1145/3652583.3658071
15. Nichol, A., et al.: Glide: towards photorealistic image generation and editing with
text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2022)
16. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-
conditional image generation with clip latents. arXiv preprint arXiv:2204.06125
(2022)
17. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative
adversarial text to image synthesis. In: ICML, Proceedings of Machine Learn-
ing Research, vol. 48, pp. 1060–1069 (2016). https://proceedings.mlr.press/v48/
reed16.html
18. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR, pp. 10674–10685 (2021).
https://api.semanticscholar.org/CorpusID:245335280
19. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution
image synthesis with latent diffusion models. In: CVPR, pp. 10684–10695 (2022)
20. Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dream-
booth: fine tuning text-to-image diffusion models for subject-driven generation.
arXiv preprint arXiv:2208.12242 (2022)
21. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language
understanding. NeurIPS 35, 36479–36494 (2022)
22. Saharia, C., et al.: Photorealistic text-to-image diffusion models with deep language
understanding. In: NeurIPS, p. 2643 (2024). https://doi.org/10.5555/3600270.
3602913
23. Tan, W.R., Chan, C.S., Aguirre, H., Tanaka, K.: Improved ArtGAN for conditional
synthesis of natural image and artwork. IEEE Trans. Image Process. 28(1), 394–
409 (2019). https://doi.org/10.1109/TIP.2018.2866698
24. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional
generative adversarial networks. In: CVPR (2018)
25. Zhang, H., et al.: StackGAN++: realistic image synthesis with stacked generative
adversarial networks. TPAMI 41(8), 1947–1962 (2019). https://doi.org/10.1109/
TPAMI.2018.2856256
26. Zhou, Y., et al.: Towards language-free training for text-to-image generation. In:
CVPR, pp. 17886–17896 (2022). https://doi.org/10.1109/CVPR52688.2022.01738
27. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative
adversarial networks for text-to-image synthesis. In: CVPR, pp. 5795–5803 (2019).
https://api.semanticscholar.org/CorpusID:91183909
Towards Unsupervised Speaker Diarization
System for Multilingual Telephone Calls
Using Pre-trained Whisper Model
and Mixture of Sparse Autoencoders
Phat Lam1(B) , Lam Pham2(B) , Truong Nguyen1 , Dat Ngo3 , Thinh Pham4 ,
Tin Nguyen1 , Loi Khanh Nguyen1 , and Alexander Schindler2
1 Ho Chi Minh University of Technology, Ho Chi Minh City, Vietnam
{phat.lamhcmutddk21,truongnguyen,tin.nguyen112101bku,nkloi}@hcmut.edu.vn
2 Austrian Institute of Technology, Vienna, Austria
{lam.pham,alexander.schindler}@ait.ac.at
3 University of Essex, Colchester, UK
[email protected]
4 Ho Chi Minh City University of Science, Ho Chi Minh City, Vietnam
[email protected]
1 Introduction
Sound-based applications have drawn significant attention from the research community and are at the forefront of driving innovation. These applications involve advanced audio processing techniques to analyze and interpret various types of sound data (e.g., acoustic scenes [27, 28], sound events [20], machinery sound [21], human speech [16, 26]), enabling the core functionality of many intelligent systems. In human speech analysis, speaker diariza-
tion plays a crucial role by identifying and segmenting audio streams based on
speaker identity, making it essential for various applications such as communi-
cation (e.g., customer support calls), security (e.g., voice tracking), healthcare
(e.g., patient monitoring), smart home (e.g., personal assistants), etc. Typically,
a cluster-based speaker diarization system consists of five modules. The tradi-
tional approach to such a system is illustrated at the top of Fig. 1. The prepro-
cessing module first converts raw audio into a suitable format, followed by the
voice activity detection (VAD) module extracting speech segments. These seg-
ments are then divided into fixed-length speaker segments. The speaker embed-
ding extractor converts these speaker segments into vectors representing speaker
characteristics, and a clustering algorithm assigns speaker labels. Among these
modules, speaker embedding and clustering are crucial components to enhance
the performance of a cluster-based speaker diarization system [24].
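A minimal sketch of such a cluster-based pipeline is shown below (preprocessing and VAD are omitted for brevity); embed_fn stands for any speaker-embedding extractor, e.g., a Whisper-based one, and the window and hop lengths are illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(audio, sr, num_speakers, embed_fn, win=1.5, hop=0.75):
    """Cluster-based diarization sketch: fixed-length segments -> embeddings -> clustering.
    Preprocessing and VAD are omitted; embed_fn maps a waveform chunk to a speaker embedding."""
    win_s, hop_s = int(win * sr), int(hop * sr)
    segments = [(t, t + win_s) for t in range(0, max(len(audio) - win_s, 1), hop_s)]
    embeddings = np.stack([embed_fn(audio[s:e]) for s, e in segments])
    labels = AgglomerativeClustering(n_clusters=num_speakers).fit_predict(embeddings)
    return [(s / sr, e / sr, int(lab)) for (s, e), lab in zip(segments, labels)]
```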
Regarding the speaker embedding extractor, numerous approaches have been
proposed for speaker embedding extraction, including metric-based models (GLR
[8], BIC [37], etc.), probabilistic models (GMM-UBM [33], i-vectors [6], etc.),
and neural network-based models (d-vectors [38], x-vectors [35], etc.). All these
methods require a substantial amount of annotated data, especially for neu-
ral network-based approaches, to optimize speaker feature extractors. However,
training these extractors on one type of dataset could reduce the model’s ability
to generalize to diverse or unseen data, particularly from different domains. In
addition, datasets for speaker diarization mainly support one single language,
due to the labor-intensive and time-consuming nature of collecting data and
insufficient availability of data from diverse languages, limiting the effectiveness
of speaker diarization systems in multilingual speech analysis applications.
Concerning the clustering module, common methods such as Agglomera-
tive Hierarchical Clustering (AHC) [9], k-Means [40], Mean-shift [36] have been
proposed. However, these methods operate directly on the input vector space
and rely heavily on distance-based metrics, without leveraging representation
learning techniques to uncover deeper patterns. While some deep learning-based
frameworks, such as DNN [14], GAN [23], and Autoencoder [12], incorporate
representation learning for speaker embeddings, they often require pre-extracted
embeddings (e.g., x-vectors) that fit on certain datasets and are primarily eval-
uated on single-language datasets, typically English.
To address existing limitations, we aim to develop an unsupervised speaker
diarization system that does not rely on large-scale training datasets and sup-
ports multiple languages. For speaker embedding extraction, we use the mul-
tilingual Whisper model [30]. This foundation model was trained on diverse
audio data for relevant tasks such as speech recognition, language identification,
and translation. Whisper's representations have been applied to several
downstream classification or detection tasks (e.g., speaker change detection [10],
dysarthric severity-level classification [32], vocal intensity categorization [11],
audio deepfake detection [26]), indicating that these representations can cap-
ture a wide range of speech features such as acoustic characteristics, speaker
attributes, vocal details, etc. [41]. However, its applicability to the speaker diarization task has not been widely explored. Thus, leveraging Whisper's scalability
and robustness, we explore its potential to produce high-quality speaker embed-
dings for diarization, assuming that as a general-purpose model, Whisper can
learn representations that incorporate different aspects of large training data
(e.g., phonetic content, speaker characteristics, acoustic features) that may be
useful for various downstream tasks, hypothetically including speaker diariza-
tion, despite being primarily designed for automatic speech recognition. For
speaker clustering, we propose an unsupervised deep clustering network called
Mixture of Sparse Autoencoders (Mix-SAE) to cluster the extracted embeddings.
Overall, our key contributions can be summarized as follows:
– We explored the Whisper model’s capability in the diarization task by using
it as an alternative to conventional speaker embedding extractors, eliminating
the need for annotated training data in developing diarization systems.
– Inspired by [5], we proposed the Mix-SAE network for speaker clustering,
which enhances both speaker representation learning and clustering by using
a mixture of sparse autoencoders with additional pseudo-label supervision.
– Through extensive experiments, we indicated that speaker diarization can
be effectively integrated into Whisper-based systems, enabling comprehensive and multilingual speech analysis applications that combine speech-to-text and speaker diarization.
Let a_j^(l)(z_j^(i)) denote the activation of hidden unit j at the l-th hidden layer, where z_j^(i) is the input of the i-th sample that leads to hidden unit j. We obtain the average activation of hidden unit j at the l-th layer over one batch of N samples, which is written as:

\hat{\rho}_j^{(l)} = \frac{1}{N} \sum_{i=1}^{N} g\big( a_j^{(l)}(z_j^{(i)}) \big)   (1)

where the mapping g(.) uses the sigmoid function, which aims to scale the activation parameter to [0; 1] and avoid too large a value of ρ̂_j^(l). The sparsity constraint ensures the average activation ρ̂_j^(l) is close to the sparsity parameter ρ, which is quite small. This helps the model learn meaningful features while avoiding copying or memorizing the input by enforcing a limited number of active neurons in the hidden layer. To achieve the approximation ρ̂_j ≈ ρ, we leverage the Kullback-Leibler divergence penalty term [19]. The KL penalty term applied to the l-th hidden layer, which has n^(l) hidden units, can be written as:

L_{pen}^{(l)} = \sum_{j=1}^{n^{(l)}} \mathrm{KL}\big(\rho \,\|\, \hat{\rho}_j^{(l)}\big) = \sum_{j=1}^{n^{(l)}} \left[ \rho \log \frac{\rho}{\hat{\rho}_j^{(l)}} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j^{(l)}} \right]   (2)

Then, the penalty term is calculated for all hidden layers of the autoencoder A (except the latent layer) by taking the sum of the KL terms as:

L_{pen} = \sum_{l=1}^{2L} \sum_{j=1}^{n^{(l)}} \left[ \rho \log \frac{\rho}{\hat{\rho}_j^{(l)}} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j^{(l)}} \right]   (3)

We also apply the MSE loss to the pair of input data x and its reconstruction over one batch of N samples as:

L_{MSE} = \frac{1}{2N} \sum_{i=1}^{N} \big\| x_i - D(E(x_i)) \big\|_2^2   (4)

Given the KL penalty and MSE losses, we define the final objective function for the optimization of one individual sparse autoencoder A as:

L_{SAE} = L_{MSE} + \beta L_{pen}   (5)

where β is the weight controlling the contribution of the sparsity penalty.
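The per-autoencoder objective (Eqs. 1-5) can be sketched in PyTorch as follows, using ρ = 0.2 and β = 0.01 as in Sect. 4; the exact batch scaling of the MSE term and the treatment of the hidden activations are implementation assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def sparse_ae_loss(x, x_hat, hidden_activations, rho=0.2, beta=0.01):
    """Sparse autoencoder loss: MSE reconstruction (Eq. 4) + beta * KL sparsity penalty (Eqs. 1-3).
    hidden_activations: list of hidden-layer outputs (all layers except the latent one)."""
    mse = 0.5 * F.mse_loss(x_hat, x, reduction="mean")   # Eq. (4), up to a constant scaling
    kl = x.new_zeros(())
    for h in hidden_activations:
        rho_hat = torch.sigmoid(h).mean(dim=0).clamp(1e-6, 1 - 1e-6)   # Eq. (1), g = sigmoid
        kl = kl + (rho * torch.log(rho / rho_hat)
                   + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()  # Eqs. (2)-(3)
    return mse + beta * kl                               # Eq. (5)
```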
The role of the Gating Projection (G) is to assign weights p̂ = [p̂_1, p̂_2, ..., p̂_k] to the outputs of the k sparse autoencoders based on the input data. Given the weights p̂ = [p̂_1, p̂_2, ..., p̂_k], the Gating Projection is also utilized to assign labels to clusters during the inference phase. In this work, the Gating Projection leverages an MLP architecture with a single linear layer, followed by a Leaky ReLU activation and a final softmax layer. Given the input data x, the Gating Projection (G) produces the weights p̂ = [p̂_1, p̂_2, ..., p̂_k] as:

\hat{p} = \mathrm{softmax}\big( \mathrm{LeakyReLU}(W x + b) \big)   (6)

where W ∈ R^{k×m} and b ∈ R^k are the trainable weights and bias of the linear layer in the gating projection.
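A direct PyTorch rendering of the gating projection (Eq. 6) is given below; the embedding dimension and the number of clusters are illustrative.

```python
import torch
import torch.nn as nn

class GatingProjection(nn.Module):
    """Single linear layer + Leaky ReLU + softmax, producing k mixture weights (Eq. 6)."""
    def __init__(self, embed_dim: int, num_clusters: int):
        super().__init__()
        self.linear = nn.Linear(embed_dim, num_clusters)   # W in R^{k x m}, b in R^k
        self.act = nn.LeakyReLU()

    def forward(self, x):                                   # x: (batch, embed_dim)
        return torch.softmax(self.act(self.linear(x)), dim=-1)

gate = GatingProjection(embed_dim=384, num_clusters=2)      # illustrative sizes (e.g., Whisper-Tiny features, k = 2)
p_hat = gate(torch.randn(16, 384))                          # mini-batch of N = 16 samples
```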
In the Pre-training step, we first train a single sparse autoencoder A_pre on the full set of extracted speaker embeddings and obtain initial pseudo-labels P^[0] from the learned latent representation of the sparse autoencoder A_pre. Next, we initialize the parameters of the k sparse autoencoders by sequentially training the j-th sparse autoencoder A_j with the subset of points such that P^[0][c = j], as shown in the lower part of Fig. 4, where c denotes the cluster index and j = 1, 2, ..., k. Notably, the training process of the k sparse autoencoders also uses Eq. 5 as the loss function.
The next Main-training step is described in Fig. 3. This step involves the joint optimization of the k sparse autoencoders, whose parameters are initialized in the Pre-training step, together with the predicted probabilities from the gating projection. Given the k sparse autoencoders {A_1(θ_1), A_2(θ_2), ..., A_k(θ_k)}, where θ_j denotes the parameters of the encoder (E_j) and decoder (D_j) of sparse autoencoder A_j, j = 1, 2, ..., k, and the parameters (W, b) of the gating projection G, the main objective function of the proposed Mix-SAE network for one batch of N samples [x^(1), x^(2), ..., x^(N)] is defined as:

L = L_{rec} + \alpha L_{ent}   (7)

where α is the parameter that constrains the effect of both terms on the main objective function.
The term L_rec is the weighted sum of the reconstruction error over the k sparse autoencoders. This term ensures that the sparse autoencoders could have information on the inter-cluster reconstruction error to further strengthen feature learning within their own clusters. We define this term as:

L_{rec} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} \hat{p}_j^{(i)} \exp\left( -\frac{1}{2} \left\| x^{(i)} - D_j\big(E_j(x^{(i)})\big) \right\|^2 \right), \quad \text{s.t.} \ \sum_{j=1}^{k} \hat{p}_j^{(i)} = 1, \ \forall i = 1, 2, \ldots, N   (8)

where D_j(E_j(x^(i))) is the output of the j-th sparse autoencoder given the input sample x^(i); the probability p̂_j^(i), which is computed from (W, b) in Eq. 6, is the weight from the gating projection assigned to the j-th reconstruction loss.
The term L_ent is referred to as the pseudo-label guided supervision loss. We denote the pseudo-labels for one batch of N samples at epoch t as P^[t] = [p_1^[t], p_2^[t], ..., p_N^[t]], where p_i^[t] ∈ R^k. The supervision loss is defined as the Cross-Entropy loss between the pseudo-labels P^[t_u], previously updated at epoch t_u, and the prediction of the gating projection P̂^[t] at the current epoch t:

L_{ent} = -\frac{1}{N} \sum_{i=1}^{N} p_i^{[t_u]} \log \hat{p}_i^{[t]}   (9)
The entropy loss L_ent uses pseudo-labels to provide additional learning signals, simulating a semi-supervised setting [15]. This aims to guide the model towards correct clustering and enhance feature learning. Notably, the pseudo-labels are periodically updated during the Main-training step so that the supervision signal reflects the current cluster assignments.
Overall, the steps in the training strategy of our proposed Mix-SAE clustering network are summarized in Table 2.
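A schematic rendering of the main objective (Eqs. 7-9) with α = 1 is shown below; it is an illustrative sketch rather than the actual training loop, and the clamp on the gating probabilities is a numerical-stability assumption.

```python
import torch

def mix_sae_loss(x, reconstructions, p_hat, pseudo_labels, alpha=1.0):
    """Mix-SAE objective: gated reconstruction term (Eq. 8) + pseudo-label cross-entropy (Eq. 9).
    reconstructions: list of k tensors D_j(E_j(x)); p_hat: (N, k) gating weights;
    pseudo_labels: (N, k) one-hot (or soft) targets P^[t_u]."""
    recon = torch.stack([torch.exp(-0.5 * ((x - r) ** 2).sum(dim=1))
                         for r in reconstructions], dim=1)          # (N, k)
    l_rec = -(p_hat * recon).sum(dim=1).mean()                      # Eq. (8)
    l_ent = -(pseudo_labels * torch.log(p_hat.clamp_min(1e-8))).sum(dim=1).mean()  # Eq. (9)
    return l_rec + alpha * l_ent                                    # Eq. (7)
```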
The proposed method was implemented with the PyTorch deep learning framework [25]. The network architecture consists of autoencoders with hidden layers [256, 128, 64, 32] for the encoder and mirrored for the decoder, using Leaky ReLU activation and Batch Normalization after each hidden layer. The latent vector size is also k (equal to the number of speakers), and the mini-batch size is N = 16. We use k-Means++ [3] to initialize the pseudo-labels in the Pre-training step.
Regarding hyperparameters, we set the sparsity parameter ρ = 0.2, the sparsity constraint β = 0.01, and the pseudo-label supervision parameter α = 1. The training process uses a learning rate of 0.001 and a weight decay of 5×10^-4. The Pre-training step involves 50 epochs for the main autoencoder A_pre and 20 epochs for each of the k sparse autoencoders. The Main-training step runs for 29 epochs and updates the pseudo-labels after 10 epochs.
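The autoencoder configuration described above can be sketched as follows; the placement of Batch Normalization relative to the activation and the final projection onto the k-dimensional latent space are assumptions about details not fully specified in the text.

```python
import torch.nn as nn

def make_encoder(in_dim, latent_dim, hidden=(256, 128, 64, 32)):
    """Encoder: Linear -> BatchNorm -> LeakyReLU per hidden layer, then project to latent (size k)."""
    layers, prev = [], in_dim
    for h in hidden:
        layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.LeakyReLU()]
        prev = h
    layers.append(nn.Linear(prev, latent_dim))
    return nn.Sequential(*layers)

def make_decoder(out_dim, latent_dim, hidden=(256, 128, 64, 32)):
    """Mirrored decoder."""
    layers, prev = [], latent_dim
    for h in reversed(hidden):
        layers += [nn.Linear(prev, h), nn.BatchNorm1d(h), nn.LeakyReLU()]
        prev = h
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

encoder, decoder = make_encoder(384, latent_dim=2), make_decoder(384, latent_dim=2)  # illustrative dims
```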
Table 3. Diarization Error Rate (DER) (%) of different systems on SD-EVAL dataset
(Whisper version: Tiny, no tolerance)
EN FR GER SPA EN FR GER SPA EN FR GER SPA EN FR GER SPA EN FR GER SPA
k-Means 44.77 51.42 49.11 48.25 43.75 51.92 43.84 47.08 38.72 46.88 40.97 42.77 40.23 46.61 44.11 44.38 42.06 47.72 46.13 44.66
AHC 38.42 46.72 41.41 42.93 47.64 52.81 46.33 50.69 40.50 48.69 42.90 43.15 38.55 45.91 43.02 43.44 42.91 47.81 47.63 44.80
SpectralNet [34] 36.18 44.62 40.02 46.03 40.44 51.63 41.22 47.52 37.06 44.68 41.29 42.69 36.11 44.67 44.16 46.42 41.88 46.08 44.31 47.23
DCN [39] 32.15 35.77 36.51 36.98 37.42 38.92 42.17 43.01 32.08 37.57 38.84 40.77 33.02 43.72 44.23 40.55 40.17 45.96 40.21 38.51
DAMIC [5] 27.78 36.22 36.93 35.21 27.97 35.96 36.14 35.11 28.11 36.67 34.66 33.31 27.22 36.91 34.78 34.22 26.95 36.91 36.11 34.65
k-DAE [22] 29.12 37.91 41.23 37.00 30.53 39.81 37.10 37.29 32.72 38.84 34.96 35.23 33.33 38.55 34.24 35.51 30.36 37.32 36.22 35.02
Mix-SAE-V1 32.18 38.61 36.07 36.78 29.02 35.92 36.51 35.04 27.28 37.01 34.98 34.03 27.90 37.51 34.42 33.83 28.00 37.88 36.18 34.29
Mix-SAE-V2 28.72 43.22 40.66 36.32 29.62 40.07 36.71 35.72 27.81 36.83 34.90 33.54 27.98 39.68 34.62 33.21 27.93 38.05 36.73 33.82
Mix-SAE 26.51 36.12 35.00 34.91 26.88 37.30 35.64 34.33 27.08 36.70 34.55 32.82 27.24 38.39 34.17 32.03 26.85 37.57 35.33 33.82
Fig. 5. Evaluation: (a) DER scores using speaker embeddings from different Whisper versions; (b) comparison of DER score versus complexity across deep clustering methods.
Fig. 6. t-SNE visualization of speaker embeddings after the pre-training step (Whisper
version: Tiny).
Mix-SAE achieves 26.51% DER with 334k parameters, striking a good balance
between accuracy and efficiency. Additionally, when combined with Whisper
Tiny (39M), the system is promising for integration into edge devices for sound
applications [7, 31].
Visualization and the Effect of the Pre-training Step: We visualized two-speaker embeddings after the Pre-training step in our Mix-SAE by applying t-SNE. As Fig. 6 shows, the sparse autoencoders effectively learn underlying patterns from the extracted speaker embeddings and map them into a latent space where the embeddings of different speakers become more separable.
5 Conclusion
This paper has presented an unsupervised speaker diarization system for mul-
tilingual telephone call applications. In this proposed system, the traditional
feature extractor was replaced with the Whisper encoder, benefiting from its
robustness and generalization on diverse data. Additionally, the Mix-SAE net-
work architecture was also proposed for speaker clustering. Experimental results
demonstrate that our Mix-SAE network outperforms other compared cluster-
ing methods. The overall performance of our system highlights the effectiveness
of our approach in exploring Whisper embeddings for the diarization task to develop unsupervised speaker diarization systems in contexts with limited annotated training data. Furthermore, the results also highlight the system's ability to integrate into Whisper-based multi-task speech analysis applications. Overall,
this work indicates a promising direction toward developing generalized speaker
diarization systems based on general-purpose models in future work.
References
1. Alexandra, C., Graff, D., Zipperlen, G.: CABank Spanish CallHome Corpus (1996).
https://doi.org/10.21415/T51K54
2. Alexandra, C., Graff, D., Zipperlen, G.: CABank English CallHome Corpus (1997).
https://doi.org/10.21415/T5KP54
3. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In:
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algo-
rithms, pp. 1027–1035. SODA ’07, Society for Industrial and Applied Mathematics,
USA (2007)
4. Canavan, A., Graff, D., Zipperlen, G.: CABank German CallHome Corpus (1997).
https://doi.org/10.21415/T56P4B
5. Chazan, S.E., Gannot, S., Goldberger, J.: Deep clustering based on a mixture of
autoencoders. In: 29th International Workshop on Machine Learning for Signal
Processing (MLSP), pp. 1–6 (2019). https://doi.org/10.1109/MLSP.2019.8918720
6. Dehak, N., et al.: Front-end factor analysis for speaker verification. IEEE Trans.
Audio Speech Lang. Process. 19(4), 788–798 (2011). https://doi.org/10.1109/
TASL.2010.2064307
25. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learn-
ing library. In: Advances in Neural Information Processing Systems, vol. 32, pp.
8024–8035. Curran Associates, Inc. (2019). http://papers.neurips.cc/paper/9015-
pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
26. Pham, L., Lam, P., Nguyen, T., Nguyen, H., Schindler, A.: Deepfake audio detec-
tion using spectrogram-based feature and ensemble of deep learning models. In:
2024 IEEE 5th International Symposium on the Internet of Sounds (IS2), pp. 1–5
(2024). https://doi.org/10.1109/IS262782.2024.10704095
27. Pham, L., et al.: Lightweight deep neural networks for acoustic scene classification
and an effective visualization for presenting sound scene contexts. Appl. Acoust.
211, 109489 (2023)
28. Pham, L., Nguyen, T., Lam, P., Ngo, D., Jalali, A., Schindler, A.: Light-weight
deep learning models for acoustic scene classification using teacher-student scheme
and multiple spectrograms. In: 4th International Symposium on the Internet of
Sounds, pp. 1–8 (2023). https://doi.org/10.1109/IEEECONF59510.2023.10335258
29. Quatra, M.L., et al.: Vad - simple voice activity detection in python. https://
github.com/MorenoLaQuatra/vad
30. Radford, A., et al.: Robust speech recognition via large-scale weak supervision. In:
International Conference on Machine Learning, pp. 28492–28518 (2023)
31. Ramírez, A., Foster, M.E.: A whisper ROS wrapper to enable automatic speech
recognition in embedded systems (2023)
32. Rathod, S., Charola, M., Patil, H.A.: Noise robust whisper features for dysarthric
severity-level classification. In: International Conference on Pattern Recognition
and Machine Intelligence, pp. 708–715. Springer (2023)
33. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted
Gaussian mixture models. Digital Sig. Process. 10(1), 19–41 (2000). https://doi.
org/10.1006/dspr.1999.0361
34. Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., Kluger, Y.: SpectralNet:
spectral clustering using deep neural networks (2018)
35. Snyder, D., et al.: X-vectors: robust DNN embeddings for speaker recognition.
In: IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pp. 5329–5333 (2018). https://doi.org/10.1109/ICASSP.2018.8461375
36. Stafylakis, T., Katsouros, V., Carayannis, G.: Speaker Clustering via the mean shift
algorithm. In: Proceedings of the Speaker and Language Recognition Workshop
(Speaker Odyssey), pp. 186 – 193. ISCA, Brno, Czech Republic (2010)
37. Tritschler, A., Gopinath, R.A.: Improved speaker segmentation and segments
clustering using the Bayesian information criterion. In: EUROSPEECH (1999).
https://api.semanticscholar.org/CorpusID:15220583
38. Wan, L., Wang, Q., Papir, A., Moreno, I.L.: Generalized end-to-end loss for speaker
verification (2020)
39. Yang, B., Fu, X., Sidiropoulos, N.D., Hong, M.: Towards k-means-friendly spaces:
simultaneous deep learning and clustering (2017)
40. Zhang, A., Wang, Q., Zhu, Z., Paisley, J., Wang, C.: Fully supervised speaker
diarization (2019)
41. Zhang, L., Jiang, N., Wang, Q., Li, Y., Lu, Q., Xie, L.: Whisper-SV: adapting
whisper for low-data-resource speaker verification. Speech Commun. 163, 103103
(2024). https://doi.org/10.1016/j.specom.2024.103103
Hybrid Compression: Integrating Pruning
and Quantization for Optimized
Neural Networks
1 Introduction
Over the past decade, the explosion of data and computational power has driven
significant advancements in deep neural networks (DNNs) [13]. As a result, many
new network architectures have emerged, each more complex and demanding
more resources than its predecessors [22]. For instance, the first Convolutional
Neural Network (CNN) model was proposed in 1998 with fewer than 1 mil-
lion parameters [19], while OpenAI’s GPT-3 model in 2020 comprised up to
175 billion parameters, requiring hundreds of gigabytes of memory for storage
M.-L. Nguyen, L.-B. Nguyen and V.-H. Huynh—Contributed equally to this research.
and thousands of teraflops for training. The rapid development of these large-
scale models has introduced significant challenges and limitations. Deploying
DNN models in real-world scenarios, such as mobile applications and Internet
of Things (IoT) devices, often becomes impractical due to constrained memory
and computing resources [22].
To address these challenges, the field of model compression has gained con-
siderable attention [3]. Model compression techniques aim to reduce the size
and computational requirements of DNNs without significantly compromising
their performance. Among these techniques, deep compression has emerged as
a robust approach, with methods such as pruning, quantization, and the Mix-
ture of Experts (MoE) achieving substantial reductions in model size and com-
putational cost [15]. Pruning methods [10] are designed to remove less impor-
tant connections in neural network layers based on various evaluation criteria.
Instead of using high-precision floating-point numbers, quantization methods
[15, 24] reduce the precision of parameters by representing them with fewer bits.
MoE [7, 16] dynamically selects a subset of parameters (or experts) for each
input, optimizing resource use by activating only the network parts relevant to
each task, enabling efficient scaling.
In this paper, we introduce a novel multi-stage method to develop a cost-
efficient CNN-based model. Our approach focuses on optimizing both the com-
plexity and the computational efficiency of the model through a series of tar-
geted stages. In the first stage, we employ well-established hard compression
techniques such as pruning and quantization to significantly reduce the model’s
complexity, including the number of parameters and the overall inference cost,
making it more feasible for deployment in resource-constrained environments.
In addition, we leverage the Neural Network Intelligence (NNI) [23] framework
to implement and automate our pruning and quantization techniques. The second stage involves the MoE paradigm, which enhances the model's adaptability and efficiency by allocating the previously compressed models as experts that specialize for different inputs.
The specialization helps enhance the performance and stability of compressed
models, which might drop due to pruning and quantization, while still leverag-
ing the low resource consumption and computational cost of these compressed
models. Experimental results on CIFAR-10 [18] and BloodMNIST [5] datasets
show that our method successfully achieved a 10x-11x reduction in FLOPs and
a 10.5x reduction in parameters, with a negligible accuracy drop on the image
classification task (See Fig. 1).
In summary, our contributions are as follows:
– We introduce a novel method that combines pruning, quantization, and
the Mixture of Experts (MoE) paradigm, demonstrating how this fusion
brings superior effectiveness and provides detailed insights into the trade-offs
between model size, computational efficiency, and accuracy.
– We investigate our method on different CNN models, providing practical
insights for implementing compression model techniques.
2 Related Work
Pruning is a widely used technique for compressing neural networks by remov-
ing redundant weights and connections. These methods identify unimportant
elements in the model, such as weights and neural connections, and prune them
by setting their values to zero, ensuring they do not participate in the back-
propagation process. Hassibi et al. [12] introduced an early pruning method that
uses the inverse Hessian matrix to identify and remove redundant weights, while
updating the remaining ones with second-order information. More recently, var-
ious pruning techniques have emerged, including magnitude-based weight prun-
ing [10], which gradually eliminates small magnitude weights to achieve network
sparsity. In CNN models, pruning is typically categorized into two approaches:
weight pruning [6], which removes individual redundant weights, and filter prun-
ing [20], which eliminates entire convolutional filters with minimal impact on
performance.
Quantization is a popular technique for compressing neural networks by low-
ering parameter precision, reducing memory usage and computational costs.
The Binarized Neural Networks quantization method [4] trains networks with binary weights and activations but still accumulates gradients in 32-bit precision, highlighting the need for high precision during training. The DoReFa-Net quantizer [28]
reduces gradient precision by quantizing them into low-bitwidth floating-point
numbers. Quantization methods generally fall into two categories: Quantization-
Aware Training (QAT) [15], where models are retrained with quantized weights
and activations, and Post-Training Quantization (PTQ) [24], which quantizes a
pretrained model without retraining. This paper focuses on the QAT approach
for CNN model quantization.
Hybrid Compression for Optimized Neural Networks 57
3 Methodology
3.1 Overview
We present an in-depth exploration of the methodologies employed in our research, aiming to compress CNN models for deployment on resource-constrained hardware, such as mobile and edge devices. As seen in Fig. 2, our proposed method consists of three phases: pruning, quantization, and integration of the MoE paradigm.
3.2 Pruning
Network pruning, crucial for reducing model size and bandwidth needs, removes
unnecessary neural connections [21]. We use Magnitude Pruning [11] to eliminate
redundancies effectively.
Specifically, we adopt the Automated Gradual Pruning (AGP) schedule [29], which gradually increases the sparsity level from an initial value to the configured target over the course of training. This gradual increase allows the model to adapt and learn from the pruned information, maintaining a balance in the importance scores across layers [26]. This balance prevents layer-collapse, as the importance scores among layers remain equivalent. In the final stage, AGP pruning imposes a high sparsity to achieve the configured target level.
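The gradual schedule follows the polynomial form proposed by Zhu and Gupta [29]; the target sparsity and step counts in the sketch below are illustrative, and in our pipeline the schedule is driven by NNI rather than implemented by hand.

```python
def agp_sparsity(step, s_initial=0.0, s_final=0.8, t0=0, n=100, dt=1):
    """Automated Gradual Pruning schedule [29]:
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3 for t in [t0, t0 + n*dt]."""
    t = min(max(step, t0), t0 + n * dt)
    return s_final + (s_initial - s_final) * (1 - (t - t0) / (n * dt)) ** 3

# Sparsity rises quickly at first, then flattens as it approaches the target level.
print([round(agp_sparsity(s), 3) for s in (0, 25, 50, 75, 100)])
```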
We also consider that pruning can potentially disrupt the neural network’s
structure, resulting in a substantial decrease in accuracy. This challenge can
be addressed by retraining the model, which incurs additional costs [10, 21]. In
our pipeline, retraining is performed after the model undergoes the Speedup technique, which optimizes the model for faster execution. Additionally, during
model construction, we observed the weight distribution of the classifier layer,
responsible for generating the logit vector for classification output. Therefore,
in our configuration, we set a target sparsity level for the entire CNN backbone
and a lower sparsity level for the classifier layer.
Model Speedup: In the pruning process, a binary mask layer is used to rep-
resent retained connections, assigning a value of 1 to kept connections and 0
to reduced ones. Consequently, during the forward and backward passes of the
model, this binary mask matrix is multiplied with the corresponding weights.
Obviously, this pruning method does not significantly enhance model inference
and training speed.
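Magnitude-based masking can be illustrated as follows; the tensor shape and the 80% sparsity are illustrative, and in our pipeline the masks are generated and applied by NNI rather than by hand.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask that keeps the largest-magnitude weights and zeroes out the rest."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

w = torch.randn(64, 128)
mask = magnitude_mask(w, sparsity=0.8)   # ~80% of the entries become 0
w_pruned = w * mask                       # the mask is re-applied in forward/backward passes
```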
To mitigate this limitation, we utilize the Model Speedup method [1], which
involves removing the feature maps that were previously pruned in the CNN
layer and retaining the weights to preserve the layer’s output. As a result, the
model achieves a smaller weight set than the original. This optimization can lead
to a latency reduction by a factor of 2 compared to the original model, albeit
with a slight trade-off in accuracy.
3.3 Quantization
Quantization reduces model size and speeds up inference by converting weights
or activations from high-precision floating points to lower bit-widths, like 8-bit
integers, with minimal accuracy loss [15, 24].
In this paper, we implement QAT [15] to maintain high accuracy post-
quantization. QAT [15] simulates the effects of quantization during training,
allowing the model to learn and adjust to the reduced precision, which mini-
mizes the accuracy degradation typically observed in post-training quantization
[10, 24].
The quantized value x̃ is obtained by mapping x onto an integer grid, x̃ = round((x − min(x))/Δ), where min(x) is the minimum value in the range of x, and Δ is the quantization step size defined as:

\Delta = \frac{\max(x) - \min(x)}{2^{n} - 1}   (3)

where n is the bit-width (e.g., 8 for 8-bit quantization) [15]. The quantized value x̃ is then converted back to a floating-point format during inference using x_q = x̃ · Δ + min(x), ensuring the model operates within the quantized value range.
For calibration, the step size is computed over a calibration set x_calib as:

\Delta = \frac{\max(x_{calib}) - \min(x_{calib})}{2^{n} - 1}   (6)

The zero point is calculated to align the quantized range with the original range, z = −min(x_calib)/Δ.
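A minimal sketch of the simulated (fake) quantization step used conceptually in QAT, following Eq. (3) and the dequantization rule above, is given below; a real QAT setup additionally requires a straight-through estimator for the rounding, which is omitted here.

```python
import torch

def fake_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Quantize x to an integer grid and dequantize it, simulating low-precision inference."""
    x_min, x_max = x.min(), x.max()
    delta = (x_max - x_min) / (2 ** n_bits - 1)     # Eq. (3)
    x_int = torch.round((x - x_min) / delta)        # integer grid value
    return x_int * delta + x_min                     # dequantized value x_q

x = torch.randn(4, 16)
x_q = fake_quantize(x)                               # differs from x by at most ~delta/2 per entry
```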
3.4 Retraining
Retraining is an essential step in the model compression pipeline to recover
the accuracy lost during pruning and quantization. After applying pruning and
quantization, the model may suffer from reduced performance due to significant
structural and precision changes. Retraining enables the model to regain per-
formance by re-optimizing its weights within the new compressed architecture.
This improves accuracy by allowing the model to adapt to altered parameters
and mitigate errors introduced during quantization [10, 21].
4 Experimental Results
4.1 Implementation Details
4.2 Datasets
4.3 Results
5 Conclusion
We have presented a novel deep learning model compression method that com-
bines pruning, quantization, and the Mixture of Experts (MoE) paradigm. Our
approach significantly reduces model size and computational requirements with-
out compromising accuracy. Experimental results demonstrated the potential
of our method for deploying sophisticated deep learning models on resource-
constrained devices. Our approach enables the use of complex neural networks
in mobile and edge applications where computational resources and energy effi-
ciency are critical constraints.
References
1. Neural network intelligence. https://github.com/microsoft/nni (2020)
2. Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization
and generalization for overparameterized two-layer neural networks. arXiv preprint
arXiv:1901.08584 (2019)
3. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: Model compression and acceleration
for deep neural networks: The principles, progress, and challenges. IEEE 35(1),
126–136 (2018)
4. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural
networks: Training deep neural networks with weights and activations constrained
to+ 1 or-1. arXiv preprint arXiv:1602.02830 (2016)
5. Doe, J., Smith, J.: Bloodmnist dataset (2022), version 1.0
6. Dong, X., Chen, S., Pan, S.: Learning to prune deep neural networks via layer-wise
optimal brain surgeon. In: Proceedings of NIPS (2017)
7. Eigen, D., Ranzato, M., Sutskever, I.: Learning factored representations in a deep
mixture of experts. In: ICLR (2014)
8. Frankle, J., Carbin, M.: The lottery ticket hypothesis: Finding sparse, trainable
neural networks. In: ICLR (2019)
9. Girdhar, R., Ramanan, D.: Attentional pooling for action recognition. In: Proceed-
ings of NIPS (2017)
10. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural net-
works with pruning, trained quantization, and huffman coding. In: ICLR, pp. 199–
203 (2016)
11. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for
efficient neural network. In: NIPS (2015)
12. Hassibi, B., Stork, D.G., Wolff, G.J.: Optimal brain surgeon and general network
pruning. In: IEEE (1993)
13. Hatcher, W.G., Yu, W.: A survey of deep learning: platforms, applications and
emerging research trends. IEEE Access 6 (2018)
14. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (2018)
15. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-
arithmetic-only inference. arXiv preprint arXiv:1712.05877 (2017)
16. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local
experts. Neural Computation 3(1) (1991)
17. Jiang, A.Q., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
18. Krizhevsky, A.: Learning multiple layers of features from tiny images (2009)
19. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. IEEE 86(11), 2278–2324 (1998)
20. Li, H., Kadav, A., Durdanovic, I., Samet, H., Graf, H.P.: Pruning filters for efficient
convnets. ArXiv abs/1608.08710 (2016)
21. Liang, T., Glossner, J., Wang, L., Shi, S., Zhang, X.: Pruning and quantization for
deep neural network acceleration: A survey. Neurocomputing 461, 370–403 (2021)
22. Liao, H., et al.: A survey of deep learning technologies for intrusion detection in
internet of things. IEEE Access (2024)
23. Microsoft: Nni automl toolkit. https://nni.readthedocs.io/en/latest/ (2021)
24. Nagel, M., Amjad, R.A., van Baalen, M., Louizos, C., Blankevoort, T.: Up or down?
adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568
(2020)
25. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.:
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.
arXiv preprint arXiv:1701.06538 (2017)
26. Tanaka, H., Kunin, D., Yamins, D.L.K., Ganguli, S.: Pruning neural networks
without any data by iteratively conserving synaptic flow
27. Wang, X., et al.: Deep mixture of experts via shallow embedding. In: PMLR (2020)
28. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: Training low
bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint
arXiv:1606.06160 (2016)
29. Zhu, M.H., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning
for model compression (2017)
AI-Generated Image Recognition
via Fusion of CNNs and Vision
Transformers
1 Introduction
Artificial intelligence (AI)-generated images have become increasingly popular
in recent years, with various tools and platforms available for users to create
captivating visuals for social media and other purposes. One of the top recent image generators is DALL·E 3, known for its ability to produce high-quality
images quickly [19]. AI has the capacity to create images from scratch using
trained artificial neural networks [1]. This technology allows for limitless cre-
ativity at the fingertips of users, with AI engines capable of producing art that
rivals human creativity [16]. Overall, AI-generated images have the potential to
be of excellent quality and offer users a range of creative possibilities. As tech-
nology continues to advance, AI image generators are likely to become even more
sophisticated and user-friendly, providing new opportunities for creativity and
visual expression.
Detecting AI-generated images is crucial due to the potential harm they can
cause in society. These images can be easily manipulated and altered to create misleading or harmful content.
Our main contribution lies in running individual detection models and fusing
their outputs to create higher-accuracy models. This approach enhances detec-
tion precision and ensures our model’s adaptability against a diverse range of
AI-generated images. By integrating the capabilities of CNNs and ViTs, our goal
is to build a detection system that effectively safeguards against the spread of
misleading or harmful content in the digital realm.
In the remaining sections of this paper, we cover related methods, method-
ology, experimental results, conclusions, and future work. We discuss existing
approaches to detecting AI-generated images in Sect. 2, detail our methodology for developing our AI-generated image recognition model in Sect. 3, present the experimental results obtained from testing our model in Sect. 4, and draw conclusions and outline future research directions in Sect. 5.
2 Related Work
2.1 Generative Models
Generative models constitute a pivotal domain within artificial intelligence,
aiming to capture and model the inherent distribution of data from observed
samples. These computational frameworks, which encompass a diverse array of
methodologies, have witnessed significant evolution over time. Early forays into
generative modeling leveraged deep neural networks, exemplified by restricted
Boltzmann machines (RBMs) and deep Boltzmann machines (DBMs). Recent
advancements have introduced a multitude of innovative approaches, includ-
ing variational autoencoders (VAEs), autoregressive models, normalizing flows,
generative adversarial networks (GANs), and diffusion models [14]. Notably, the
ProGAN/StyleGAN [12] family has demonstrated remarkable capabilities in pro-
ducing photorealistic images, predominantly focusing on single-class generation
tasks. The emergence of these sophisticated generative techniques has spurred
investigations into forensic methodologies geared towards discerning synthetic
imagery from authentic counterparts. Particularly noteworthy are recent strides
in diffusion models, which have showcased unprecedented proficiency in gener-
ating images from textual descriptions [3].
In this paper, we leverage the CIFAKE [4] dataset, which harnesses Stable
Diffusion [21] to generate synthetic images. CIFAKE serves as a valuable resource
for training and evaluating AI-generated image detection models, as it provides
a diverse collection of synthetic images across various domains. As the field
continues to push the boundaries of generative capabilities, the challenge of
effectively distinguishing between real and synthetic content grows increasingly
complex.
Earlier forensic work focused on detecting conventional manipulations such as image splicing or Photoshop warps [25]. With the proliferation of deep generative
methods, particularly in the context of GAN-based techniques [11], recent inves-
tigations have delved into the efficacy of discriminative methods for detecting
synthesized content. A central inquiry pertains to the generalizability of detec-
tors to unseen methods, with studies indicating that a classifier trained on one
GAN model can generalize to others, especially under aggressive augmentations [26].
Despite successes, challenges emerge when adapting detectors to new gener-
ators, where observed high average precision is juxtaposed with low accuracy,
indicating proficient separation between real and fake classes but suboptimal
calibration. Various techniques, including the utilization of frequency cues [10],
co-occurrence matrices [17], pretrained CLIP features [24], and augmentation
with representation mixing [5], have demonstrated effectiveness [18]. Notably,
Ojha et al. [18] demonstrate that a simple nearest neighbors classifier improves
accuracy, though at the cost of inference time. We expand upon the common
observation that even rudimentary classifiers possess a capacity for generalization
across various data generators. This exploration involves analyzing and defining
their performance within an online context.
Recent investigations into diffusion methods reveal that, contrary to GAN-
based detectors’ limitations in generalization, diffusion models are detectable
and exhibit some degree of mutual generalizability [18]. David C. Epstein et al.
[8] take these studies further by training a detector on 14 methods in an online
fashion, simulating their release dates, and releasing an accompanying dataset of
570k images. While these works detect whole images, local prediction also offers
important use-cases. For instance, in forensic analysis, there’s a growing need to
identify alterations made by conventional editing tools like Photoshop, such as
image warping and splicing [25]. Chai et al. [6] show that patch-based classifiers
can generate heatmaps for regions that contain more detectable cues. We aim to
determine whether we can localize inpainted regions. Remarkably, even in the
absence of direct access to inpainted examples, employing CutMix augmentation
[27] enables us to utilize entire images effectively for pixel-level predictions.
3 Proposed Method
3.1 Overview
Our new approach focuses on using fusion strategies to significantly enhance
the accuracy of our AI-generated image recognition model. By combining the strengths of CNNs and ViTs through fusion, we aim to create a model that is effective at capturing both fine local details and broader global context within images. This fusion not only handles the diverse types of AI-generated images but also ensures that CNNs, which excel at extracting local details, and ViTs, which capture the global picture, work well together. The fusion strategies
play a vital role in improving the model’s precision and adaptability, making it
a potent solution for the task of accurately detecting AI-generated images. Our
proposed method is outlined in Fig. 1.
Fig. 1. Illustration of our fusion method between the EfficientNetV2-B0 model and the ViT-B16 model.
Concatenation. Our first approach is fusing CNNs and ViTs using concatena-
tion method. This fusion aims to capitalize on the localized feature extraction
capabilities of CNNs, which are known for capturing intricate spatial hierarchies,
and the global context understanding of ViTs, adept at discerning long-range
dependencies within images.
The method involves training dedicated CNN and ViT models for image
feature extraction, followed by the extraction of representative features from
their intermediate layers. These features are then fused using concatenation or
merging techniques, facilitated by a fusion layer. This innovative amalgamation
of features creates a unified representation that leverages the complementary
advantages of both architectures.
Let X_CNN be the output features from the CNN model and X_ViT be the output features from the ViT model. The concatenation operation can be represented as follows:

X_{concatenated} = [X_{CNN}, X_{ViT}]   (1)

X_{final} = \mathrm{FullyConnectedLayer}(X_{concatenated})   (2)
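A Keras sketch of the concatenation fusion in Eqs. (1)-(2) is shown below; the feature dimensions, the size of the fusion layer, and the sigmoid output head are illustrative choices rather than the exact configuration used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_head(cnn_dim=1280, vit_dim=768):
    """Concatenate CNN and ViT feature vectors (Eq. 1) and classify (Eq. 2)."""
    x_cnn = layers.Input(shape=(cnn_dim,), name="cnn_features")   # e.g., EfficientNetV2-B0 features
    x_vit = layers.Input(shape=(vit_dim,), name="vit_features")   # e.g., ViT-B16 features
    x = layers.Concatenate()([x_cnn, x_vit])                      # Eq. (1)
    x = layers.Dense(256, activation="relu")(x)                   # fusion layer
    out = layers.Dense(1, activation="sigmoid")(x)                # real vs. AI-generated (Eq. 2)
    return tf.keras.Model([x_cnn, x_vit], out)

model = build_fusion_head()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```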
4 Experimental Results
Experiment Settings. We performed all experiments using the TensorFlow
framework on an Ubuntu system. Our experiments were run on two Nvidia T4
GPUs with 16GB of memory each. We selected Binary cross-entropy loss as our
loss function for two-label classification. We employed the AdamW optimizer for
training our CNN models. The optimizer’s weight decay was set to 100 for every
CNN, ensuring regularization during training. A momentum of 0.9 was utilized
to facilitate faster convergence. Our training process incorporates an early stop-
ping mechanism, stopping training if the validation loss does not improve by a
margin of 1000 within five epochs. Additionally, to safeguard model progress,
we implement a checkpoint system, preserving the model’s state each time the
validation loss experiences reduction of at least 10000.
4.1 Datasets
4.2 Results
Table 1. Results of the pure CNN and pure ViT base models.
Fusion Result. Our proposed fusion strategies have successfully achieved the
two highest accuracy scores when compared with the accuracy of the two indi-
vidual models, CNN and ViT. As depicted in Table 2, the concatenation method
achieved the highest accuracy (97.44%), representing an increase of 0.37% com-
pared to EfficientNet and 9.96% compared to ViT. Additionally, the Linear Com-
bination method surpassed our initial expectations during training, achieving an accuracy of 97.32%. This result was obtained by adjusting the weight constants, assigning a weight of 0.6 to EfficientNet and 0.4 to the ViT.
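The linear-combination fusion can be sketched as follows with the reported 0.6/0.4 weights; the sketch assumes the combination is applied to the two models' output probabilities.

```python
import numpy as np

def linear_combination(prob_cnn, prob_vit, w_cnn=0.6, w_vit=0.4):
    """Weighted linear combination of the EfficientNet and ViT predicted probabilities."""
    return w_cnn * np.asarray(prob_cnn) + w_vit * np.asarray(prob_vit)

p_fused = linear_combination([0.91, 0.12], [0.76, 0.30])   # per-image "AI-generated" scores
labels = (p_fused >= 0.5).astype(int)
```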
Reduce Brightness by 50%. After applying the fusion method, the accuracy only slightly exceeds that of the base models. Thus, we assess our model's
performance under challenging conditions by reducing the brightness of testing
images.
In this experiment, we convert the image color space from RGB to HSV and then decrease the "Value" parameter by 50% in the validation dataset,
effectively reducing image brightness by half [22]. Then we use our pretrained
CNNs base models and our proposed model to evaluate that modified validation
dataset. The performances of the CNNs base models and our custom models
described in Table 3 and Table 4 respectively.
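The brightness reduction can be reproduced with a standard RGB-to-HSV conversion; OpenCV is assumed here as the image-processing library.

```python
import cv2
import numpy as np

def reduce_brightness(image_rgb: np.ndarray, factor: float = 0.5) -> np.ndarray:
    """Scale the V (value) channel in HSV space to reduce brightness (50% by default)."""
    hsv = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
    hsv[..., 2] *= factor
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)

dark = reduce_brightness(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
```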
Table 3. Results of the pure CNN and pure ViT base models on the reduced-brightness dataset.
The observation that most CNN base models experience a sharp decline in accuracy when faced with images of reduced brightness underscores the complexity of the task at hand. Notably, the VGG16 model stands out as relatively resilient to such challenges, with only a marginal 1.5% drop in accuracy. Leveraging this insight, we devised our custom fusion model, which not only preserves the robustness of VGG16 but also harnesses the advanced capabilities of the Vision Transformer.
5 Conclusion
Our study introduces a novel method for recognizing AI-generated images, aim-
ing to enhance prediction efficiency and accuracy across several popular models.
At the heart of our proposed approach lies the fusion of Convolutional Neu-
ral Networks (CNNs) and Vision Transformer architectures. We explore two
fusion strategies, concatenation and linear combination, which yield slight accuracy improvements compared to using the models separately.
Our methodology begins by extracting feature vectors from both Efficient-
Net and Vision Transformer models. These vectors are then combined into a
unified output vector using mathematical formulas and algorithms. This fusion
process enables the models to leverage the strengths of both CNNs and Vision
Transformer, resulting in more robust predictions.
To validate the effectiveness of our approach, we subjected the testing dataset
to challenging conditions. Remarkably, our experiments reveal that the fusion of
VGG16 and ViT achieved the highest accuracy under these demanding circum-
stances. This finding underscores the resilience and effectiveness of our fusion
technique, particularly when faced with complex and varied image data.
Overall, our experiments demonstrate that our proposed fusion technique
significantly enhances feature extraction accuracy and image recognition capa-
bilities compared to individual branch models. By seamlessly integrating CNNs
and Vision Transformer architectures, we pave the way for more accurate and
efficient AI-generated image recognition systems.
References
1. Ai image generation, explained. https://www.altexsoft.com/blog/ai-image-
generation/. Accessed 01 Apr 2024
2. How can ai-generated photos harm each of us. https://www.aiornot.com/blog/
how-can-ai-generation-photos-can-harm-each-of-us. Accessed 01 Apr 2024
3. Balaji, Y., et al.: ediff-i: Text-to-image diffusion with expert denoisers. arXiv preprint
arXiv:2211.01324 (2022)
4. Bird, J., Lotfi, A.: Cifake: Image classification and explainable identification of
ai-generated synthetic images. https://www.kaggle.com/datasets/birdy654/cifake-
real-and-ai-generated-synthetic-images (Mar 2023)
5. Bui, T., et al.: RepMix: Representation Mixing for Robust Attribution of Synthesized
Images (2022). https://doi.org/10.1007/978-3-031-19781-9_9
6. Chai, L., et al.: What makes fake images detectable? (2020)
7. Collins, B.: Ai or not? how to detect if an image is ai-generated
— forbes.com. https://www.forbes.com/sites/barrycollins/2023/10/14/ai-or-not-
how-to-detect-if-an-image-is-ai-generated/?sh=6db008b83254. Accessed 01 Apr
2024
8. Epstein, D.C., et al.: Online detection of ai-generated images (2023)
9. Farid, H.: Image forgery detection. IEEE Signal Process. Mag. 26(2), 16–25 (2009).
https://doi.org/10.1109/MSP.2008.931079
10. Frank, J., et al.: Leveraging frequency analysis for deep fake image recognition. In:
ICML, pp. 3247–3258 (2020). https://doi.org/10.48550/arXiv.2302.10174
11. Karras, T., et al.: A style-based generator architecture for GANs. In: IEEE TPAMI
(2019). https://doi.org/10.1109/CVPR.2019.00453
12. Karras, T., et al.: A style-based generator for GANs. In: CVPR, pp. 4401–4410 (2019)
13. Krizhevsky, A.: Learning multiple layers of features from tiny images (2013-2017)
14. Luo, C.: Understanding diffusion models. arXiv preprint arXiv:2208.11970 (2022)
15. Maybe, M.: Ai image detector - hugging face space by umm-maybe — hugging-
face.co. https://huggingface.co/spaces/umm-maybe/AI-image-detector. Accessed
01 Apr 2024
16. Nast, C.: What ai-generated art really means for human creativity. https://
www.wired.com/story/picture-limitless-creativity-ai-image-generators/. Accessed
01 Apr 2024
17. Nataraj, L., et al.: Detecting gan generated fake images using co-occurrence matri-
ces. Electronic Imaging (2019). https://doi.org/10.2352/ISSN.2470-1173.2019.5.
MWSF-532
18. Ojha, U., et al.: Towards universal fake image detectors (2023). https://doi.org/10.
48550/arXiv.2302.10174
19. OpenAI: DALL·E 3. https://openai.com/dall-e-3. Accessed 25 Mar 2024
20. Popescu, A., Farid, H.: Exposing digital forgeries by detecting traces of re-sampling.
IEEE Trans. on Signal Process. 53, 758–767 (2005). https://doi.org/10.1109/TSP.
2004.839932
21. Rombach, R.e.a.: High-resolution image synthesis with latent diffusion models. In:
CVPR, pp. 10684–10695 (2022)
22. StackOverflow: How to fast change image brightness with python +
opencv? — stackoverflow.com. https://stackoverflow.com/questions/32609098/
how-to-fast-change-image-brightness-with-python-opencv, [Accessed 08-05-2024]
23. Tan, M., Le, Q.V.: Efficientnetv2: Smaller models and faster training (2021)
24. Tejankar, A., et al.: A fistful of words: Learning transferable visual models from bag-
of-words supervision (2021)
25. Wang, S.Y., et al.: Detecting photoshopped faces by scripting photoshop. In: ICCV
(2019). https://doi.org/10.1109/ICCV.2019.01017
26. Wang, S.Y., et al.: Cnn-generated images are surprisingly easy to spot. In: CVPR
(2020). https://doi.org/10.1109/CVPR42600.2020.00872
27. Yun, S., et al.: Cutmix: Regularization strategy for strong classifiers. In: ICCV (2019)
Decoding Deepfakes: Caption Guided
Learning for Robust Deepfake Detection
1 Introduction
Recent advances in image generation via GANs [3] and Diffusion models [4]
complicate real vs. synthetic image identification. Hyper-realistic deepfakes can
mislead audiences by fabricating actions or statements from public figures,
underscoring the need for effective detection methods to ensure digital content
authenticity.
Early detection methods combining CNN classifiers and data augmentation
[12, 21] have struggled with diffusion model-generated images. Techniques like
frequency domain analysis [2] and noise reconstruction [22] aim to enhance deep-
fake detection robustness. However, identifying invariant features remains chal-
lenging, particularly for unseen fakes. Progress in face-related deepfake detection
Fig. 1. The CLIP model enhances deepfake detection by generalizing forgery recogni-
tion. Expanding on UniFD [16], we decoded forgery features and found that captions
generated from these features often misrepresent the images, with specific words in the
captions proving essential for detecting deepfakes.
Fig. 2. Word frequency across ProGAN, BigGAN, StarGAN, and deepfake datasets
was analyzed by applying a text-decoder after the CLIP image-encoder’s adapter layer,
trained on ProGAN for deepfake classification. High-frequency words in the resulting
captions were found to enhance CLIP’s generalization in deepfake detection.
2 Related Work
Most recent research on deepfake detection typically focuses on three main devel-
opment directions: Data processing to enhance generalization capabilities, data
augmentation, and model design aimed at improving the detection of deepfake
features.
3 Proposed Method
3.1 Overview
Figure 3 illustrates our proposed Caption Guided Learning (CGL) method, consisting of three main parts. Caption Generation generates corresponding captions for the training images and appends forgery captions to obtain enhanced captions. LoRA Contrastive Learning trains LoRA modules using contrastive learning to efficiently extract semantic and deepfake-specific features. Forgery Fusion Learning combines forgery features from low level to high level across the stages of the CLIP-image encoder to predict real and fake images.
Previous studies [9, 16] show that features from the CLIP-image encoder excel
in forgery detection via linear classification. We found that these features were
semantically aligned during CLIP training. Decoding these features into text after they pass through the adapter layer trained for deepfake detection revealed key words crucial for identifying deepfakes, even though the resulting captions often misrepresent the images. Analyzing the frequency of these words helped differentiate real from fake images
(Fig. 2). Our goal is to enhance CLIP’s ability to extract both semantic and
forgery features by creating captions that fulfill two criteria:
$\tilde{C}_{enhance} = \{\tilde{c}_j\}_{j=1}^{N}$,   (3)

$\tilde{c}_j = \begin{cases} (C_{real}, c_j), & \text{if } y = 0,\\ (C_{fake}, c_j), & \text{if } y = 1, \end{cases}$   (4)
where $C_{forgery} = \{C_{real}, C_{fake}\}$ typically assigns pairs of words that differ from the image caption, for example $\{C_{real} = \text{real}, C_{fake} = \text{synthetic}\}$ or $\{C_{real} = \text{authentic}, C_{fake} = \text{deepfake}\}$. These enhanced captions are added to the image captions to enrich the textual context and guide the CLIP-image encoder in learning the cues for detecting deepfake images.
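A minimal sketch of this caption-enhancement step (Eqs. 3-4); the exact way the forgery word is concatenated with the generated caption is an assumption.

```python
def enhance_caption(caption: str, label: int,
                    c_real: str = "real", c_fake: str = "synthetic") -> str:
    """Pair the forgery word C_real / C_fake with the generated caption c_j,
    following Eqs. (3)-(4); the separator format is an assumption."""
    forgery_word = c_real if label == 0 else c_fake
    return f"{forgery_word}, {caption}"

# e.g. enhance_caption("a dog on a beach", label=1) -> "synthetic, a dog on a beach"
```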
encoding vector $f_i \in \mathbb{R}^{1 \times D}$. In this context, $H$ and $W$ denote the height and width of the image, 1 corresponds to the CLS token, $D$ is the dimension of the projected image features, $N = HW/P^2$ denotes the number of patches, and $f_i$ represents the $i$-th stage output of CLIP-ViT.
CLIP-text Encoder has the primary role of transforming input text into an embedding that can be compared with the image embeddings generated by the CLIP-image encoder. The CLIP-text encoder uses a Transformer architecture. The input text is tokenized into a sequence of tokens $x = [CLS, token_1, token_2, \ldots, token_n]$, where $CLS$ is a special classification token and $token_i$ are the word tokens. The Transformer applies a series of layers, including multi-head self-attention and feedforward layers, $z = \mathrm{Transformer}(x)$. After applying positional embeddings and passing through multiple Transformer layers, the final output at the $CLS$ token is used as the text representation $h = z_{CLS}$. This vector is then projected into a joint image-text embedding space using a learned linear projection $f_{text}(x) = W_{text} h$, where $W_{text}$ is the projection matrix.
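For reference, the projected text and image embeddings described above can be obtained with the Hugging Face CLIP implementation; the checkpoint name and image path below are assumptions, and this is only a sketch, not necessarily the authors' pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

texts = ["synthetic, a dog on a beach", "real, a dog on a beach"]
inputs = processor(text=texts, images=Image.open("sample.jpg"),
                   return_tensors="pt", padding=True)

with torch.no_grad():
    # Text embeddings f_text(x) = W_text * h, projected into the joint space.
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # Image embeddings from the CLIP-image encoder, projected likewise.
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
```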
To enable the CLIP-image encoder to learn both semantic features and forgery features, we apply the LoRA algorithm [5] at each stage of the ViT in the image encoder. LoRA is a method for updating pre-trained weights via the product of two smaller matrices, denoted $A$ and $B$, based on the 'intrinsic rank' of the downstream task. Given an input $x$, a hidden state $h$, and a weight matrix $W \in \mathbb{R}^{d_1 \times d_2}$, the weight adjustment when applying the LoRA module is carried out as follows:

$h = Wx + \gamma \Delta W x = Wx + \gamma BAx$,   (5)

where $A \in \mathbb{R}^{r \times d_2}$, $B \in \mathbb{R}^{d_1 \times r}$, and $\Delta W \in \mathbb{R}^{d_1 \times d_2}$ has rank $r$, with $r \ll \min(d_1, d_2)$, and $\gamma$ is the scaling factor. Matrix $A$ is initialized with Kaiming initialization, while matrix $B$ is initialized with zeros. This initialization means that $\Delta W = BA = 0$ before LoRA training starts, so the pre-trained weights are initially left unchanged.
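A minimal PyTorch sketch of Eq. (5); this standalone wrapper is only illustrative, since the paper applies LoRA inside the attention modules of the CLIP ViT blocks rather than to an isolated linear layer.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear layer: h = Wx + γ·BAx (Eq. 5)."""
    def __init__(self, base: nn.Linear, r: int = 2, gamma: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # pre-trained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # B = 0 => ΔW = 0 at init
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))          # Kaiming init for A
        self.gamma = gamma

    def forward(self, x):
        return self.base(x) + self.gamma * (x @ self.A.T @ self.B.T)
```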
LoRA is applied to a stack of ViT blocks of the CLIP image encoder, each
block containing a multi-head attention (MHA) module:
$\mathrm{head}_i = \mathrm{Softmax}\left(\frac{xW_Q^i (xW_K^i)^T}{\sqrt{d}}\right)(xW_V^i)$,   (6)

$\mathrm{MHA}(x) = \mathrm{concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)W_O$,   (7)

where $d$ is the scaling factor and $W_K^i$, $W_Q^i$, $W_V^i$, $W_O$ are the key, query, value, and output projection matrices, respectively.
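Eqs. (6)-(7) correspond to standard multi-head attention; the didactic sketch below (omitting batching, masking, and dropout) takes per-head projection matrices as lists and is not tied to the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(x, W_Q, W_K, W_V, W_O):
    """x: (n_tokens, d_model); W_Q/W_K/W_V: lists of per-head projection
    matrices; W_O: output projection. Implements Eqs. (6)-(7)."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)  # Eq. (6)
        heads.append(attn @ v)
    return torch.cat(heads, dim=-1) @ W_O                                        # Eq. (7)
```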
Using LoRA during the fine-tuning of the CLIP image encoder helps reduce
computation time and costs, achieving high performance as the updates are
applied to all ViT Blocks.
$u_j = \mathrm{Encoder}_{text}(\tilde{c}_j), \qquad v_j = \mathrm{Encoder}^{LoRA}_{img}(x_j)$,   (10)

$\mathcal{L}_{u \to v} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(u_i^T v_i)}{\sum_{j=1}^{N} \exp(u_i^T v_j)}$.   (13)
$\mathcal{L} = \mu_1 \mathcal{L}_{contrastive} + \mu_2 \mathcal{L}_{classification}$.   (14)
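A sketch of the contrastive objective in Eq. (13) and the total loss in Eq. (14). The L2 normalization and temperature are assumptions, and the image-to-text direction is included by symmetry; the paper's remaining loss terms are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(u, v, temperature=1.0):
    """Symmetric InfoNCE-style loss over matched caption/image embeddings;
    the caption-to-image direction corresponds to Eq. (13)."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = u @ v.T / temperature                      # pairwise similarities u_i^T v_j
    targets = torch.arange(u.size(0), device=u.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def total_loss(l_contrastive, l_classification, mu1=1.0, mu2=1.0):
    """Eq. (14): weighted sum of the contrastive and classification terms."""
    return mu1 * l_contrastive + mu2 * l_classification
```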
4 Experiments
4.1 Implementation Details
Our method involves end-to-end training on an Nvidia Tesla T4 16 GB GPU over 3 epochs with a batch size of 24. The CLIP-text encoder is frozen, with LoRA layers applied to all ViT blocks of the CLIP-image encoder. For the loss function, we set $\mu_1 = \mu_2 = 1.0$. For the LoRA parameters, we assign $r = 2$, $\alpha = 1$, and a dropout rate of 0.25. Input images were resized to $256 \times 256$ and then cropped to $224 \times 224$; only random cropping and horizontal flipping were applied. We utilized the AdamW optimizer [13] with weight decay $5 \times 10^{-2}$, betas $(0.9, 0.999)$, and an initial learning rate of $5 \times 10^{-4}$. For testing, we removed the CLIP-text encoder and merged the LoRA weights at the ViT stages of the CLIP-image encoder.
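One possible realization of this configuration uses the PEFT library (the paper does not state its tooling); the checkpoint name and the target module names, which refer to the attention projections of the Hugging Face CLIP vision encoder, are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

lora_cfg = LoraConfig(
    r=2, lora_alpha=1, lora_dropout=0.25,                        # values stated in the text
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],   # attention projections in every ViT block
)
vision = get_peft_model(vision, lora_cfg)
vision.print_trainable_parameters()

# For testing, the LoRA weights can be folded back into the frozen backbone:
# vision = vision.merge_and_unload()
```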
4.2 Datasets
We train our network on the ForenSynths dataset, incorporating real LSUN
images and ProGAN-generated fake images, with 1-class (horse), 2-class (chair,
horse), and 4-class (car, cat, chair, horse) configurations. For evaluation, we use
the same testing datasets as RINE [9], covering GANs (ProGAN, StyleGAN,
StyleGAN2, BigGAN, CycleGAN, StarGAN, GauGAN, deepfake) and Diffusion
models (PNDM, Guided, DALL-E, VQ-Diffusion, LDM, Glide).
Table 1. Comparison of accuracy and average precision (Acc/AP) between our method
and state-of-the-art techniques on the GANs dataset under three distinct training sce-
narios: 1-class, 2-class, and 4-class. The top-performing results are emphasized in
bold.
Method class ProGAN StyleGAN StyleGAN2 BigGAN CycleGAN StarGAN GauGAN deepfake Mean
Wang 1 50.4/63.8 50.4/79.3 68.2/97.4 50.2/61.3 50.0/52.9 50.0/48.2 50.3/67.6 50.1/51.5 52.5/64.9
BiHPF 1 82.5/81.4 68.0/62.8 68.8/63.6 67.0/62.5 75.5/74.2 90.1/90.1 73.6/92.1 51.6/49.9 72.1/72.1
FrePGAN 1 95.5/99.4 80.6/90.6 77.4/93.0 63.5/60.5 59.4/59.9 99.6/100.0 53.0/49.1 70.4/81.5 74.9/79.3
LGrad 1 99.4/99.9 96.0/99.6 93.8/99.4 79.5/88.9 84.7/94.4 99.5/100.0 70.9/81.8 66.7/77.9 86.3/92.7
UniFD 1 99.1/100.0 77.2/95.9 69.8/95.8 94.5/99.0 97.1/99.9 98.0/100.0 95.7/100.0 82.4/91.7 89.2/97.8
FreqNet 1 98.0/99.9 92.0/98.7 89.5/97.9 85.5/93.1 96.1/99.1 94.2/98.4 91.8/99.6 69.8/94.4 89.6/97.6
RINE 1 99.8/100.0 88.7/99.1 86.9/99.7 99.1/99.9 99.4/100.0 98.8/100.0 99.7/100.0 82.7/97.4 94.4/99.5
Ours 1 99.5/100.0 90.3/99.9 85.6/98.9 95.4/99.8 96.7/99.8 99.8/100.0 98.2/100.0 85.0/97.4 93.8/99.5
Wang 2 64.6/92.7 52.8/82.8 75.7/96.6 51.6/70.5 58.6/81.5 51.2/74.3 53.6/86.6 50.6/51.5 57.3/79.6
BiHPF 2 87.4/87.4 71.6/74.1 77.0/81.1 82.6/80.6 86.0/86.6 93.8/80.8 75.3/88.2 53.7/54.0 78.4/79.1
FrePGAN 2 99.0/99.9 80.8/92.0 72.2/94.0 66.0/61.8 69.1/70.3 98.5/100.0 53.1/51.0 62.2/80.6 75.1/81.2
LGrad 2 99.8/100.0 94.8/99.7 92.4/99.6 82.5/92.4 95.9/94.7 99.7/99.9 73.7/83.2 60.6/67.8 86.2/92.2
UniFD 2 99.7/100.0 78.8/97.4 75.4/96.7 91.2/99.0 91.9/99.8 96.3/99.9 91.9/100.0 80.0/89.4 88.1/97.8
FreqNet 2 99.6/100.0 90.4/98.9 85.8/98.1 89.0/96.0 96.7/99.8 97.5/100.0 88.0/98.8 80.7/92.0 91.0/97.9
RINE 2 99.8/100.0 84.9/99.5 76.7/99.6 98.3/99.9 99.4/100.0 99.6/100.0 99.9/100.0 66.7/96.4 90.6/99.4
Ours 2 100.0/100.0 95.8/99.9 98.0/99.7 95.9/99.9 93.4/99.9 99.9/100.0 96.3/100.0 84.27/97.9 95.4/99.6
Wang 4 91.4/99.4 63.8/91.4 76.4/97.5 52.9/73.3 72.7/88.6 63.8/90.8 63.9/92.2 51.7/62.3 67.1/86.9
BiHPF 4 90.7/86.2 76.9/75.1 76.2/74.7 84.9/81.7 81.9/78.9 94.4/94.4 69.5/78.1 54.4/54.6 78.6/77.9
FrePGAN 4 99.0/99.9 80.7/89.6 84.1/98.6 69.2/71.1 71.1/74.4 99.9/100.0 60.3/71.7 70.9/91.9 79.4/87.2
LGrad 4 99.9/100.0 94.8/99.9 96.0/99.9 82.9/90.7 85.3/94.0 99.6/100.0 72.4/79.3 58.0/67.9 86.1/91.5
UniFD 4 99.7/100.0 89.0/98.7 83.9/98.4 90.5/99.1 87.9/99.8 91.4/100.0 89.9/100.0 80.2/90.2 89.1/98.3
FreqNet 4 99.6/100.0 90.2/99.7 88.0/99.5 90.5/96.0 95.8/99.6 85.7/99.8 93.4/98.6 88.9/94.4 91.5/98.5
NPR 4 99.8/100.0 96.3/99.8 97.3/100.0 87.5/94.5 95.0/99.5 99.7/100.0 86.6/88.8 77.4/86.2 92.5/96.1
RINE 4 100.0/100.0 88.9/99.4 94.5/100.0 99.6/99.9 99.3/100.0 99.5/100.0 99.8/100.0 80.6/97.9 95.3/99.7
Ours 4 100.0/100.0 97.1/99.9 99.0/99.90 98.8/98.4 98.5/99.3 100.0/100.0 98.8/99.8 94.00/98.1 98.3/99.5
Method class PNDM Guided DALL-E VQ-Diff LDM 200 LDM w/CFG LDM 100 Glide 100-27 Glide 50-27 Glide 100-10 Mean
Wang 4 50.8/90.3 54.9/66.6 51.8/61.3 50.0/71.0 52.0/64.5 51.6/63.1 51.9/63.7 53.0/71.3 54.2/76.0 53.3/72.9 52.4/70.1
LGrad 4 69.8/98.5 86.6/100.0 88.5/97.3 96.3/100.0 94.2/99.1 95.9/99.2 94.8/99.2 87.4/93.2 90.7/95.1 89.4/94.9 89.4/97.7
UniFD 4 75.3/92.5 75.7/85.1 89.5/96.8 83.5/97.7 90.2/97.1 77.3/88.6 90.5/97.0 90.7/97.2 91.1/97.4 90.1/97.0 85.4/94.6
RINE 4 83.8/98.6 76.2/96.6 95.1/99.5 91.4/99.8 98.3/99.9 88.2/98.7 98.7/99.9 88.9/99.1 92.6/99.5 90.7/99.2 90.40/99.1
FatFormer 4 99.3/100.0 76.1/92.0 98.8/99.8 100.0/100.0 98.6/99.8 94.9/99.1 98.7/99.9 94.4/99.1 94.7/99.4 94.2/99.2 95.0/98.8
Ours 1 90.6/99.9 76.8/86.7 97.2/99.7 87.0/99.9 91.3/98.8 80.9/96.2 92.3/98.9 87.4/98.0 93.2/99.1 89.1/98.3 88.6/97.5
Ours 2 95.2/99.8 78.3/86.7 98.9/99.9 98.4/99.9 98.8/99.9 96.5/99.5 98.5/99.8 92.1/98.5 94.6/99.1 93.2/98.8 94.4/98.2
Ours 4 96.4/99.9 82.4/88.7 99.3/99.8 99.6/100.0 99.4/99.8 98.6/99.7 98.7/99.7 94.6/99.1 95.9/99.4 95.2/99.3 95.9/98.5
Notably, the 2-class configuration of our model reaches 94.4% mean accuracy on diffusion data, suggesting that less training data may enhance generalization by reducing overfitting risks.
5 Conclusion
We present a novel Caption Guided Learning (CGL) method for generalizable
image detection, incorporating three modules with CLIP to enhance feature
extraction for deepfake detection. Extensive experiments on GAN and Diffusion
model datasets show that CGL achieves state-of-the-art performance, highlight-
ing its strong generalization capability. Additionally, the simplicity and flexibility
of our approach may inspire further advancements in deepfake detection using
frozen pre-trained models.
References
1. Bi, X., et al.: Detecting generated images by real images only. arXiv preprint
arXiv:2311.00962 (2023)
2. Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever-
aging frequency analysis for deep fake image recognition. In: ICML, pp. 3247–3258.
PMLR (2020)
3. Goodfellow, I., et al.: Generative adversarial nets. NIPS 27 (2014)
4. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NIPS 33,
6840–6851 (2020)
5. Hu, E.J., et al.: Lora: Low-rank adaptation of large language models. arXiv preprint
arXiv:2106.09685 (2021)
6. Huang, B., et al.: Implicit identity driven deepfake face swapping detection. In:
CVPR, pp. 4490–4499 (2023)
7. Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: bilateral high-pass
filters for robust deepfake detection. In: WACV, pp. 48–57 (2022)
8. Jeong, Y., Kim, D., Ro, Y., Choi, J.: Frepgan: robust deepfake detection using
frequency-level perturbations. In: AAAI. vol. 36, pp. 1060–1068 (2022)
9. Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate
encoder-blocks for synthetic image detection. arXiv preprint arXiv:2402.19091
(2024)
10. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training
for unified vision-language understanding and generation. In: ICML, pp. 12888–
12900. PMLR (2022)
11. Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive
transformer for generalizable synthetic image detection. In: CVPR, pp. 10770–
10780 (2024)
12. Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in
the wild. In: CVPR, pp. 8060–8069 (2020)
13. Loshchilov, I.: Decoupled weight decay regularization. arXiv preprint
arXiv:1711.05101 (2017)
14. Luo, Y., Du, J., Yan, K., Ding, S.: LaRE^2: Latent reconstruction error based
method for diffusion-generated image detection. In: CVPR, pp. 17006–17015 (2024)
15. Mokady, R., Hertz, A., Bermano, A.H.: Clipcap: Clip prefix for image captioning.
arXiv preprint arXiv:2111.09734 (2021)
16. Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize
across generative models. In: CVPR, pp. 24480–24489 (2023)
17. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: ICML, pp. 8748–8763. PMLR (2021)
18. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake
detection: Improving generalizability through frequency space domain learning.
In: AAAI, vol. 38, pp. 5052–5060 (2024)
19. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling
operations in cnn-based generative network for generalizable deepfake detection.
In: CVPR, pp. 28130–28139 (2024)
20. Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: generalized
artifacts representation for gan-generated images detection. In: CVPR, pp. 12105–
12114 (2023)
21. Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images
are surprisingly easy to spot... for now. In: CVPR, pp. 8695–8704 (2020)
22. Wang, Z., et al.: Dire for diffusion-generated image detection. In: ICCV, pp. 22445–
22455 (2023)
23. Yan, Z., Luo, Y., Lyu, S., Liu, Q., Wu, B.: Transcending forgery specificity with
latent space augmentation for generalizable deepfake detection. In: CVPR, pp.
8984–8994 (2024)
24. Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on
imagenet. In: ICCV, pp. 558–567 (2021)
25. Zhu, M., et al.: Gendet: towards good generalizations for AI-generated image detec-
tion. arXiv preprint arXiv:2312.08880 (2023)
Minimalist Preprocessing Approach
for Image Synthesis Detection
1 Introduction
In recent years, significant advancements in image generation have been achieved,
particularly with Generative Adversarial Networks (GANs) [11] and Diffusion
models [12, 14]. These approaches produce high-quality images that closely
resemble real-world visuals [31] and have garnered attention in academic and
societal circles. Generative models have found applications in various fields,
including virtual try-ons and personalized fashion recommendations in the fash-
ion industry [25], as well as in image editing [4, 39] and interior design [6].
Despite the valuable applications of image generation technology, signifi-
cant drawbacks exist. According to a survey conducted by Bauer and Bind-
schaedlerr [2], generative models can create fake information, particularly deep-
fakes, which depict fabricated scenarios involving famous individuals. In response
to these dangers, several US states [3, 20] have outlawed the malicious use of
deepfake technology, especially for harmful content like revenge and celebrity
pornography. To address the threats posed by synthetic images on digital com-
munication platforms and social media, it is essential to develop effective coun-
termeasures for verifying image authenticity directly on mobile devices. Given
the ubiquity and portability of these devices, real-time detection of generated
images is crucial for preventing misinformation and preserving the integrity
of visual content. However, the constrained computational capacity of mobile
devices presents a significant challenge. This paper introduces a simple yet effi-
cient solution for synthesized image detection, specifically the Adjacency Differ-
ence Orientation Filter (ADOF) for data preprocessing; this filter allows us to
compute the gradient in both the $x$ and $y$ directions. The direction of the gradient
reflects the behavior of grayscale variation among neighboring pixels, assisting
in distinguishing between real and generated images. Focusing on extracting
useful low-level features, our approach ensures generalization while utilizing a
lightweight CNN architecture for detecting generated images, without demand-
ing extensive computational resources. This strategy effectively reduces irrele-
vant information, enabling the model to concentrate on fine-grained variations,
ultimately leading to improved performance and generalization. In contrast to
existing methods [21, 28, 35] that require large deep learning architectures, such
as CLIP [30], ViT [7], Resnet50 [13], and significant computational resources, our
approach demands fewer resources while still ensuring generalization and achiev-
ing comparable accuracy. Figure 1 presents a comparative overview of results,
highlighting the advantages of this strategy.
Experiments on well-known datasets [28, 36, 37] demonstrate the effectiveness of our method, achieving impressive accuracy of 94.9% on the Ojha dataset [28] and 98.3% on DiffusionForensics [37]. Additionally, there is a reduction in model complexity compared to approaches built on large backbones.
2 Related Work
Various methods have been developed to address the challenge of distinguish-
ing synthetic images from real ones, utilizing both traditional machine learn-
ing techniques and modern deep learning approaches. Durall et al. [8] applied
a Fourier Transform [1] to grayscale images and used azimuthal averaging to
convert the 2D frequency data into a 1D feature vector, retaining essential infor-
mation for classification. They then employed either Support Vector Machines
or K-means clustering to detect GAN-generated images. Alternatively, methods
like RINE [21] and Ojha et al. [28], along with similar approaches, leverage pre-
trained deep learning networks such as CLIP to enhance performance. Integrating these networks into their frameworks contributes to consistently high success rates in detecting synthesized images. Notably,
the FatFormer [23] method focuses on the contrastive objectives between adapted
image features and text prompt embeddings, providing valuable information that
enables the deep learning models to learn more robust and discriminative rep-
resentations, ultimately improving their ability to accurately classify real and
generated images.
3 Proposed Method
3.1 Overview
Generative models, such as GAN [11] and Diffusion [14], currently use CNN lay-
ers for image synthesis, meaning that neighboring pixel regions are correlated
Fig. 2. The left image represents the original image, the middle shows the gradient
calculation applied, and the right image illustrates the resulting gradient map.
at $(x, y)$. This filter captures variations in pixel intensity along the horizontal direction. Similarly, $D_y(x, y)$ captures variations in pixel intensity along the vertical direction. To determine the gradient magnitude and orientation, these values are computed from $D_x$ and $D_y$:
$G_m(x, y) = \sqrt{D_x(x, y)^2 + D_y(x, y)^2}$,   (3)

$G_o(x, y) = \arctan\left(\frac{D_y(x, y)}{D_x(x, y)}\right)$,   (4)
where $D_x(x, y)$ and $D_y(x, y)$ are as previously defined. The gradient orientation $G_o$, which represents the overall angle of gray-level changes at a pixel and indicates the direction of these combined intensity variations, is referred to as the Adjacency Difference Orientation Filter (ADOF) in this paper. Meanwhile, the gradient magnitude $G_m$ quantifies the strength of intensity changes at that pixel. The result of this computational process is illustrated in Fig. 2.
4 Experiments
4.1 Implementation Details
In practice, we are more concerned with the flat regions of an image than with the edge areas, where there is a significant variation in gray levels between the $x$ and $y$ directions. This is because, in regions with large changes in gray level in one direction compared to the other, the gradient angles are close to $\pm\pi/2$. Although these angles both indicate edge regions in the image, the gradient angles at edges typically take values of $\pm\pi/2$, which are numerically distant from each other despite conveying similar edge information. To exclude these areas, we set the gradient values approaching $\pm\pi/2$ to 0; our experiments show that this leads to higher accuracy for the model.
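A NumPy sketch of this preprocessing step, combining Eq. (4) with the $\pm\pi/2$ suppression described above; the use of forward (adjacency) differences and the threshold value are assumptions made for illustration.

```python
import numpy as np

def adof(gray: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """Adjacency Difference Orientation Filter: gradient orientation of
    neighbouring-pixel differences (Eq. 4), with angles close to ±π/2
    (strong single-direction edges) set to zero."""
    gray = gray.astype(np.float32)
    dx = np.zeros_like(gray)
    dy = np.zeros_like(gray)
    dx[:, :-1] = gray[:, 1:] - gray[:, :-1]   # D_x: difference with the right neighbour
    dy[:-1, :] = gray[1:, :] - gray[:-1, :]   # D_y: difference with the neighbour below
    # arctan(D_y / D_x); where D_x == 0 the angle would be ±π/2, which is
    # zeroed out below anyway, so it is mapped directly to 0.
    g_o = np.arctan(np.divide(dy, dx, out=np.zeros_like(dy), where=dx != 0))
    g_o[np.abs(np.abs(g_o) - np.pi / 2) < eps] = 0.0
    return g_o
```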
All experiments are conducted on a computing system using a NVIDIA RTX
A4000 GPU with 16 GB of memory and an AMD Ryzen 5 5600X 6-Core CPU.
We trained our model using parameters that are closely aligned with those used
in common methods [34–36] to ensure a fair comparison and demonstrate the
effectiveness of our method independent of specific hyperparameters. Further-
more, we utilized the source code provided by NPR [34] to streamline the training
process and maintain consistency. The model was trained using the Adam opti-
mizer with a learning rate of $2 \times 10^{-4}$ and a batch size of 32. To accelerate the
training process, we adjusted the learning rate every 5 epochs instead of every
10 epochs and utilized 4 out of the 20 classes (car, cat, chair, horse) for training,
similar to the protocol used in existing works [15, 16, 34, 36].
4.2 Dataset
Training Set. To facilitate comparison between methods, we used the same
ForenSynths dataset with existing methods [17, 28, 34–36]. This dataset con-
sists of 20 object classes selected from the LSUN dataset. Each class contains
18,000 real-world images, with corresponding generative images generated using
the ProGAN [18] model. To verify the generalization of the methods, all compared methods were trained on a subset of the ForenSynths [36] dataset consisting of 4 classes: car, cat, chair, and horse.
Method AttGAN BEGAN CramerGAN InfoMaxGAN MMDGAN RelGAN S3GAN SNGAN STGAN Mean
(each column reported as Acc. A.P.)
CNNDetection [36] 51.1 83.7 50.2 44.9 81.5 97.5 71.1 94.7 72.9 94.4 53.3 82.1 55.2 66.1 62.7 90.4 63.0 92.7 62.3 82.9
Frank [10] 65.0 74.4 39.4 39.9 31.0 36.0 41.1 41.0 38.4 40.5 69.2 96.2 69.7 81.9 48.4 47.9 25.4 34.0 47.5 54.7
Durall [9] 39.9 38.2 48.2 30.9 60.9 67.2 50.1 51.7 59.5 65.5 80.0 88.2 87.3 97.0 54.8 58.9 62.1 72.5 60.3 63.3
Patchfor [5] 68.0 92.9 97.1 100.0 97.8 99.9 93.6 98.2 97.9 100.0 99.6 100.0 66.8 68.1 97.6 99.8 92.7 99.8 90.1 95.4
F3Net 85.2 94.8 87.1 97.5 89.5 99.8 67.1 83.1 73.7 99.6 98.8 100.0 65.4 70.0 51.6 93.6 60.3 99.9 75.4 93.1
SelfBland [32] 63.1 66.1 56.4 59.0 75.1 82.4 79.0 82.5 68.6 74.0 73.6 77.8 53.2 53.9 61.6 65.0 61.2 66.7 65.8 69.7
GANDetection [26] 57.4 75.1 67.9 100.0 67.8 99.7 67.6 92.4 67.7 99.3 60.9 86.2 69.6 83.5 66.7 90.6 69.6 97.2 66.1 91.6
LGrad [35] 68.6 93.8 69.9 89.2 50.3 54.0 71.1 82.0 57.5 67.3 89.1 99.1 78.5 86.0 78.0 87.4 54.8 68.0 68.6 80.8
Ojha [28] 78.5 98.3 72.0 98.9 77.6 99.8 77.6 98.9 77.6 99.7 78.2 98.7 85.2 98.1 77.6 98.7 74.2 97.8 77.6 98.8
NPR [34] 83.0 96.2 99.0 99.8 98.7 99.0 94.5 98.3 98.6 99.0 99.6 100.0 79.0 80.0 88.8 97.4 98.0 100.0 93.2 96.6
ADOF(ours) 99.5 100.0 92.2 100.0 96.0 99.6 94.1 99.1 96.0 99.7 100.0 100.0 77.5 86.7 94.8 99.3 97.8 99.7 94.2 98.2
Method ADM DDPM IDDPM LDM PNDM VQ-Diffusion Stable Diffusion v1 Stable Diffusion v2 Mean
(each column reported as Acc. A.P.)
CNNDetection [36] 53.9 71.8 62.7 76.6 50.2 82.7 50.4 78.7 50.8 90.3 50.0 71.0 38.0 76.7 52.0 90.3 51.0 79.8
Frank [10] 58.9 65.9 37.0 27.6 51.4 65.0 51.7 48.5 44.0 38.2 51.7 66.7 32.8 52.3 40.8 37.5 46.0 50.2
Durall [9] 39.8 42.1 52.9 49.8 55.3 56.7 43.1 39.9 44.5 47.3 38.6 38.3 39.5 56.3 62.1 55.8 47.0 48.3
Patchfor [5] 77.5 93.9 62.3 97.1 50.0 91.6 99.5 100.0 50.2 99.9 100.0 100.0 90.7 99.8 94.8 100.0 78.1 97.8
F3Net [29] 80.9 96.9 84.7 99.4 74.7 98.9 100.0 100.0 72.8 99.5 100.0 100.0 73.4 97.2 99.8 100.0 85.8 99.0
SelfBland [32] 57.0 59.0 61.9 49.6 63.2 66.9 83.3 92.2 48.2 48.2 77.2 82.7 46.2 68.0 71.2 73.9 63.5 67.6
GANDetection [26] 51.1 53.1 62.3 46.4 50.2 63.0 51.6 48.1 50.6 79.0 51.1 51.2 39.8 65.6 50.1 36.9 50.8 55.4
LGrad [35] 86.4 97.5 99.9 100.0 66.1 92.8 99.7 100.0 69.5 98.5 96.2 100.0 90.4 99.4 97.1 100.0 88.2 98.5
Ojha [28] 78.4 92.1 72.9 78.8 75.0 92.8 82.2 97.1 75.3 92.5 83.5 97.7 56.4 90.4 71.5 92.4 74.4 91.7
NPR [34] 88.6 98.9 99.8 100.0 91.8 99.8 100.0 100.0 91.2 100.0 100.0 100.0 97.4 99.8 93.8 100.0 95.3 99.8
ADOF(ours) 93.5 99.0 99.6 100.0 99.2 100.0 99.9 100.0 97.4 99.9 97.1 99.8 99.8 100.0 99.9 100.0 98.3 99.8
On the diffusion benchmark, ADOF attains a mean accuracy of 98.3%, outperforming NPR [34], which only reaches 95.3% (see Table 2). It also surpasses DIRE [37], which reports 97.9% accuracy on its own dataset, despite our model being trained on ForenSynths [22], in contrast to DIRE's training on DiffusionForensics. Additionally,
this approach exceeds both RINE [21] and Ojha [28] (see Table 3), with the latter
achieving 91.1% on its dataset [28]. It is worth mentioning that both methods
utilize a large CLIP model for their evaluations.
5 Conclusion
In this paper, we proposed a simple yet highly effective filter, namely ADOF,
for capturing pixel-level variations. By treating an image as a discrete digi-
tal signal, this method eliminates the average components of the signal. These
components typically carry semantic information, which is less helpful for distin-
guishing between real and synthetic images compared to the subtle traces that
the proposed filter is designed to detect. Experimental results indicate that our
proposed method significantly reduces model complexity while enhancing both
accuracy and generalization, even on previously unseen data.
References
1. Arunachalam, S., Khairnar, S., Desale, B.: The fast fourier transform algorithm
and its application in digital image processing. New J. Chem. 35(5) (2013)
2. Bauer, L.A., Bindschaedler, V.: Generative models for security: Attacks, defenses,
and opportunities. arXiv:2107.10139 (2021)
3. Cara Curtis: California makes deepfakes illegal to curb revenge porn and doctored
political videos (2019). https://bit.ly/4f40oaX. Accessed 24 Sept 2024
4. Casteleiro-Pitrez, J.: Generative artificial intelligence image tools among future
designers: a usability, user experience, and emotional analysis. Digital 4(2), 316–
332 (2024)
5. Chai, L., Bau, D., Lim, S.N., Isola, P.: What makes fake images detectable? under-
standing properties that generalize. In: European Conference on Computer Vision
(2020)
6. Chen, Z., Wang, X.: Application of AI technology in interior design 179 (2020)
7. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. CoRR abs/2010.11929 (2020)
8. Durall, R., Keuper, M., Pfreundt, F.J., Keuper, J.: Unmasking deepfakes with
simple features. ArXiv abs/1911.00686 (2019)
9. Durall, R., Keuper, M., Keuper, J.: Watch your up-convolution: Cnn based gen-
erative deep neural networks are failing to reproduce spectral distributions, pp.
7890–7899 (2020)
10. Frank, J.C., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T.: Lever-
aging frequency analysis for deep fake image recognition. ArXiv (2020)
11. Goodfellow, I.J., et al.: Generative adversarial networks. Commun. ACM 63, 139–
144 (2014)
12. Gu, S., et al.: Vector quantized diffusion model for text-to-image synthesis. In:
CVPR, pp. 10696–10706 (2022)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition,
pp. 770–778 (2016)
14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. ArXiv
abs/2006.11239 (2020)
15. Jeong, Y., Kim, D., Ro, Y., Choi, J.: Frepgan: robust deepfake detection using
frequency-level perturbations. In: AAAI Conference on Artificial Intelligence (2022)
16. Jeong, Y., Kim, D., Min, S., Joe, S., Gwon, Y., Choi, J.: Bihpf: bilateral high-pass
filters for robust deepfake detection. In: WACV, pp. 48–57 (2022)
17. Ju, Y., Jia, S., Ke, L., Xue, H., Nagano, K., Lyu, S.: Fusing global and local features
for generalized AI-synthesized image detection. In: ICIP, pp. 3465–3469 (2022)
18. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for
improved quality, stability, and variation. ArXiv abs/1710.10196 (2017)
19. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing
and improving the image quality of stylegan, pp. 8110–8119 (2020)
20. Korosec, K.: Deepfake revenge porn is now illegal in virginia (2019). https://
techcrunch.com/2019/07/01/deepfake-revenge-porn-is-now-illegal-in-virginia/.
Accessed 24 Sep 2019
21. Koutlis, C., Papadopoulos, S.: Leveraging representations from intermediate
encoder-blocks for synthetic image detection. arXiv:2402.19091 (2024)
22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. Commun. ACM 60, 84–90 (2012)
23. Liu, H., Tan, Z., Tan, C., Wei, Y., Wang, J., Zhao, Y.: Forgery-aware adaptive
transformer for generalizable synthetic image detection. In: CVPR, pp. 10770–
10780 (2024)
24. Liu, Z., Qi, X., Torr, P.H.: Global texture enhancement for fake face detection in
the wild. In: CVPR, pp. 8060–8069 (2020)
25. Lomov, I., Makarov, I.: Generative models for fashion industry using deep neural
networks. In: ICCAIS, pp. 1–6. IEEE (2019)
26. Mandelli, S., Bonettini, N., Bestagini, P., Tubaro, S.: Detecting gan-generated
images by orthogonal training of multiple cnns, pp. 3091–3095 (2022)
27. Mickens, R.E.: Difference equations: theory, applications and advanced topics. CRC
Press (2015)
28. Ojha, U., Li, Y., Lee, Y.J.: Towards universal fake image detectors that generalize
across generative models, pp. 24480–24489 (2023)
29. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: face forgery
detection by mining frequency-aware clues. ArXiv abs/2007.09355 (2020)
30. Radford, A., et al.: Learning transferable visual models from natural language
supervision, pp. 8748–8763 (2021)
31. Britannica for Schools: Spotting AI: Knowing how to recognise real vs AI images. https://
elearn.eb.com/real-vs-ai-images/ (2024). Accessed 21 Aug 2024
32. Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images, pp.
18720–18729 (2022)
33. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Frequency-aware deepfake
detection: Improving generalizability through frequency space domain learning
38(5), 5052–5060 (2024)
34. Tan, C., Zhao, Y., Wei, S., Gu, G., Liu, P., Wei, Y.: Rethinking the up-sampling
operations in cnn-based generative network for generalizable deepfake detection,
pp. 28130–28139 (2024)
35. Tan, C., Zhao, Y., Wei, S., Gu, G., Wei, Y.: Learning on gradients: generalized arti-
facts representation for gan-generated images detection, pp. 12105–12114 (2023)
36. Wang, S.Y., Wang, O., Zhang, R., Owens, A., Efros, A.A.: Cnn-generated images
are surprisingly easy to spot... for now, pp. 8695–8704 (2020)
37. Wang, Z., et al.: Dire for diffusion-generated image detection, pp. 22445–22455
(2023)
38. Wikipedia Contributors: finite difference (2024). https://en.wikipedia.org/wiki/
Finite_difference. Accessed 21 Aug 2024
39. Wootaek Shin, P., Ahn, J.J., Yin, W., Sampson, J., Narayanan, V.: Can prompt
modifiers control bias? a comparative analysis of text-to-image generative models.
arXiv e-prints (2024)
40. Zhong, N., Xu, Y., Li, S., Qian, Z., Zhang, X.: Patchcraft: exploring texture patch
for efficient AI-generated image detection (2024)
KidRisk: Benchmark Dataset for Children
Dangerous Action Recognition
1 Introduction
2 Related Work
2.1 Action Recognition
Chen et al. [4] introduced various models based on convolutional neural net-
works (CNN) and achieved high accuracy in action recognition. This approach
can be divided into two main types: 2D CNN uses 2D filters to process each video
frame independently, capturing mainly spatial information but not explicitly
modeling temporal relationships. The advantage of 2D CNN is its smaller size
and lower computational cost. 3D CNN uses filters to process the video volume,
capturing both spatial and temporal information. However, 3D CNNs are larger
in size and more computationally expensive than 2D CNNs. Meanwhile, Lin
et al. [13] proposed Temporal Shift Module (TSM), capturing spatio-temporal
information similar to 3D-CNN models but with computational costs equivalent
to 2D-CNNs. Specifically, uni-directional TSM was developed to handle online
video processing by only using information from past frames.
CNNs perform well in action recognition. Studies indicate improved accuracy
with 3D CNNs compared to 2D CNNs. However, despite their ability to capture
both spatial and temporal information, 3D CNNs do not significantly outperform
2D CNNs in terms of accuracy. Research suggests that both 2D CNNs and
3D CNNs exhibit similar behavior regarding the learning of spatio-temporal
representations and the transfer of knowledge to new tasks.
On the other hand, recurrent neural networks (RNN) and their variant
Long Short-Term Memory (LSTM) have become powerful tools in video analysis,
particularly action recognition. RNNs are effective at capturing temporal rela-
tionships between frames, while LSTMs are designed to overcome the vanishing
gradient problem of RNNs. However, RNNs/LSTMs also have some limitations,
such as high computational costs and issues with vanishing/exploding gradients,
though LSTM mitigates this to some extent. Some improved methods have been
proposed, such as combining CNN and LSTM to reduce computational costs and
improve performance. Several works [5, 17] introduced advancements in terms of
computation and performance.
Graph Convolutional Networks (GCN) are the primary tool for analyzing
skeletal graphs and recognizing human actions. GCN helps capture the spatial
relationships between body parts. Many works [12, 20] used GCN to extract fea-
tures from skeleton graphs and achieved positive results. However, GCN models
struggle to capture complex temporal information from actions. Some studies
have employed attention mechanisms or separate streams for spatial and tem-
poral information to enhance the ability to capture temporal features.
3 Methodology
3.1 Proposed KidRisk Dataset
To address the challenges in child action recognition and safety detection, we
propose a new dataset comprising two parts: children's action videos and children's safety images.
Children’s Action Videos are developed based on the InfAct dataset [7],
which consists of short video clips capturing two actions performed by children
with a transition in between, such as sitting, standing, lying down, and crawling.
We extend this dataset by extracting and labeling additional video segments
from the source, focusing on basic child actions. After processing, the videos are
trimmed into shorter clips (up to 5 s), each containing only a single action, with
the transition periods removed (illustrated by Fig. 1).
Children’s Safety Images are compiled from various sources, including more
than 10,000 images depicting children in safe and dangerous situations. These
images are categorized into two groups: “Safe” and “Dangerous,” with dangerous
scenarios including children playing near stairs, handling sharp objects, or being
in situations near swimming pools. To increase the dataset’s diversity, we sup-
plemented it with additional dangerous situations, creating a rich dataset that
accurately reflects the real-world risks children may encounter (see Fig. 2).
These features are passed through the Q-Former block, which employs a cross-
attention mechanism to link visual and textual information, creating feature vec-
tors that represent the visual information. Similarly, action labels or dangerous
situations are converted into corresponding feature vectors of the same length
using the Q-Former. Cosine similarity (Formula 1) is then applied to compare the visual and textual vectors, helping to determine the closest matching action or dangerous situation. By applying transfer learning with BLIP-2, the model can
extract important features from each frame, which are then input into classi-
fication layers with a sigmoid activation function to predict the probability of
danger. This approach not only improves the accuracy of detecting dangerous
situations but also enhances the safety of children in everyday activities. Mon-
itoring and analyzing each moment allows parents or caregivers to intervene
promptly, reducing the risk of accidents.
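A minimal sketch of the cosine-similarity matching step (Formula 1); extracting the visual and label feature vectors with BLIP-2/Q-Former is assumed to have been done elsewhere, and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(visual_feat: torch.Tensor, label_feats: torch.Tensor):
    """visual_feat: (D,) vector from the Q-Former; label_feats: (K, D) text
    feature vectors for the K candidate labels. Returns the index of the
    closest label and the cosine-similarity scores."""
    sims = F.cosine_similarity(visual_feat.unsqueeze(0), label_feats, dim=-1)
    return int(sims.argmax()), sims

# e.g. labels = ["sitting", "standing", "lying down", "crawling"]
# idx, scores = zero_shot_classify(image_vec, text_vecs); predicted = labels[idx]
```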
Our training process involves several key steps to ensure that the model’s
parameters are optimized for achieving the best performance in action recogni-
tion and danger detection tasks. First, the input data undergoes preprocessing.
Images from the video are normalized to fit the input format of the BLIP-2
model, typically including resizing to the standard size of (224, 224) and nor-
malizing pixel values. For video data, to reduce load and retain important infor-
mation, only representative frames are selected from each second of the video for
processing. During the training process, the loss function plays a crucial role in
guiding the model to optimize its parameters. For the danger detection task, the
Binary Cross-Entropy (BCE) loss function is used. This function is suitable for
binary classification problems, where it compares the model’s predicted proba-
bilities with the actual labels. The BCE loss formula helps the model adjust its
parameters so that predictions are as close as possible to the true labels. For the
action recognition task, the Cross-Entropy loss function is applied, allowing the
model to accurately classify actions across multiple classes. A significant factor in
the training process is the issue of overfitting. To address this, L2 regularization
is employed. L2 regularization helps mitigate the risk of the model’s parame-
ters becoming too large, thereby enhancing the model’s ability to generalize to
unseen data. The regularization coefficient is adjusted to control the impact of
regularization on the loss function.
One notable challenge in training is data imbalance, especially in the case of
danger detection. Typically, the number of samples labeled as dangerous is much
fewer than those labeled as safe, leading the model to become biased toward
the more prevalent class. To overcome this, samples labeled as dangerous are
augmented by duplicating them, thus balancing the quantity with safe samples.
The training process runs with a learning rate of $\alpha = 0.0001$, and the computational resources used include a single T4 GPU, allowing the model to be optimized effectively for both tasks.
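The loss functions, L2 regularization, and duplication-based oversampling described above could be wired as in the following sketch; the weight-decay strength and the duplication factor are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader

def build_training_setup(model: nn.Module, safe_ds, dangerous_ds,
                         lr=1e-4, weight_decay=1e-4, dup_factor=3):
    """Losses, optimizer, and oversampled loader for the two tasks."""
    danger_criterion = nn.BCEWithLogitsLoss()    # binary danger detection
    action_criterion = nn.CrossEntropyLoss()     # multi-class action recognition
    # L2 regularization is realized through the optimizer's weight decay.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    # Duplicate the minority "dangerous" samples to balance the two classes.
    balanced = ConcatDataset([safe_ds] + [dangerous_ds] * dup_factor)
    loader = DataLoader(balanced, batch_size=32, shuffle=True)
    return danger_criterion, action_criterion, optimizer, loader
```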
4 Experimental Results
4.1 Children’s Action Classification
The challenge of not having access to a large-scale dataset for children’s action
recognition highlights the motivation for developing a zero-shot learning app-
roach for classifying children’s actions. In this study, we tested several advanced
models, including S3D, Alpro, and BLIP-2, to evaluate their effectiveness in clas-
sifying children’s actions using zero-shot learning. Models utilizing the ViT back-
bone demonstrated significantly higher performance compared to those based on CNN backbones.
While action recognition using zero-shot learning with the BLIP-2 model
has achieved some impressive results, it cannot yet be considered truly effective
in classifying children’s actions. Specifically, the performance of this method
remains limited, suggesting that the lack of contextual information from train-
ing data can hinder the model’s ability to accurately recognize complex actions.
However, when applying transfer learning, the results obtained are highly promis-
ing. Fine-tuning the BLIP-2 model on a specific dataset has led to a significant
improvement in classification performance, with accuracy increasing by 21.3%
compared to the previous zero-shot learning method (see Table 1). This demon-
strates that using transfer learning not only allows the model to learn from
the features of the target data but also enhances its generalization ability and
accuracy in recognizing children’s actions.
Table 2. Experimental results of transfer learning and other experiments on the danger
situation images.
Methods Accuracy
Resnet 85.1%
ViT 75.4%
BLIP-2 (zero-shot) 56.1%
BLIP-2 + Transfer learning 96.1%
5 Conclusion
In this paper, we introduced the comprehensive KidRisk dataset, encompassing
video clips of children’s actions and images of hazardous situations, designed to
push the boundaries of risk recognition in children’s activities. We also develop
a simple yet efficient approach for identifying dangerous actions in children
through the use of the vision-language BLIP-2 model. Our experimental findings
reveal that the integration of BLIP-2 with transfer learning techniques not only
delivers exceptional performance but also underscores the potential of vision-
language models in real-world applications. Experimental results demonstrated
that the combination of BLIP-2 with transfer learning techniques achieved high
performance. These results highlight the feasibility of vision-language models
in advancing child safety, paving the way for more intelligent, context-aware
monitoring systems capable of preemptively identifying and mitigating risks in
unsupervised environments.
References
1. Alayrac, J.B., et al.: Flamingo: a visual language model for few-shot learning. Adv.
Neural. Inf. Process. Syst. 35, 23716–23736 (2022)
2. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A
video vision transformer. In: Proceedings of the IEEE/CVF International Confer-
ence on Computer Vision, pp. 6836–6846 (2021)
3. Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for
video understanding? In: ICML, vol. 2, p. 4 (2021)
4. Chen, C.F.R., et al.: Deep analysis of cnn-based spatio-temporal representations
for action recognition. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pp. 6165–6175 (2021)
5. Cho, K., et al.: Learning phrase representations using rnn encoder-decoder for
statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
6. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
7. Huang, X., et al.: Posture-based infant action recognition in the wild with very
limited data. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 4912–4921 (2023)
8. Jia, C., et al.: Scaling up visual and vision-language representation learning with
noisy text supervision. In: International Conference on Machine Learning, pp.
4904–4916. PMLR (2021)
9. Li, D., Li, J., Li, H., Niebles, J.C., Hoi, S.C.: Align and prompt: Video-and-language
pre-training with entity prompts. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 4953–4963 (2022)
10. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-
training with frozen image encoders and large language models. In: International
Conference on Machine Learning, pp. 19730–19742. PMLR (2023)
11. Li, J., Li, D., Xiong, C., Hoi, S.: Blip: bootstrapping language-image pre-training
for unified vision-language understanding and generation. In: International Con-
ference on Machine Learning, pp. 12888–12900. PMLR (2022)
12. Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural
graph convolutional networks for skeleton-based action recognition. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.
3595–3603 (2019)
13. Lin, J., Gan, C., Han, S.: Tsm: temporal shift module for efficient video under-
standing. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 7083–7093 (2019)
14. Miech, A., Alayrac, J.B., Smaira, L., Laptev, I., Sivic, J., Zisserman, A.: End-to-end
learning of visual representations from uncurated instructional videos. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 9879–9889 (2020)
15. Nie, Q., Wang, X., Wang, J., Wang, M., Liu, Y.: A child caring robot for the
dangerous behavior detection based on the object recognition and human action
recognition. In: 2018 IEEE International Conference on Robotics and Biomimetics
(ROBIO), pp. 1921–1926. IEEE (2018)
16. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: International Conference on Machine Learning, pp. 8748–8763.
PMLR (2021)
17. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolu-
tional lstm network: A machine learning approach for precipitation nowcasting.
In: Advances in Neural Information Processing Systems, vol. 28 (2015)
18. Wang, C., Zhang, H., Zhai, Z., et al.: Real time dangerous action warning system
based on graph convolution neural network. Acad. J. Comput. Inform. Sci. 5(6),
89–94 (2022)
19. Wang, Z., Yu, J., Yu, A.W., Dai, Z., Tsvetkov, Y., Cao, Y.: Simvlm: sim-
ple visual language model pretraining with weak supervision. arXiv preprint
arXiv:2108.10904 (2021)
20. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for
skeleton-based action recognition. In: Proceedings of the AAAI Conference on Arti-
ficial Intelligence, vol. 32 (2018)
DOLG-CNet: Deep Orthogonal Fusion
of Local and Global Features Combined
with Contrastive Learning and Deep
Supervision for Polyp Segmentation
1 Introduction
Colorectal cancer (CRC) is the second most lethal and third most common can-
cer globally, with millions of new cases and deaths reported annually. Research
indicates that most CRCs start as adenomatous polyps, which may progress from
benign mucosa to adenocarcinoma before becoming cancerous [3]. This preva-
lence underlines the necessity for early detection and removal of these polyps
through colonoscopy to prevent CRC [32]. Despite its critical role, the effec-
tiveness of colonoscopy is compromised by a significant missed detection rate,
which ranges from 6% to 27% [1]. With CRC accounting for 9.4% of all cancer
deaths globally in 2020, enhancing the precision of polyp detection through auto-
mated segmentation technologies is imperative to improve treatment outcomes
and reduce mortality rates [24].
Early methods for polyp segmentation relied on manually crafted features
such as color, texture, shape, and appearance, utilizing classifiers that often
failed due to the limitations of these features [26]. In recent years, Convolutional
Neural Networks (CNNs) based models like UNet [21] have proven highly effec-
tive for medical image segmentation, particularly for polyps. The UNet archi-
tecture [21] features a symmetric encoder-decoder structure with skip connec-
tions that preserve information across different levels, enabling the generation of
detailed feature maps. This has established UNet as a fundamental architecture
in biomedical image segmentation [7, 14, 35].
Despite their effectiveness, CNN-based models can sometimes miss contex-
tual and spatial details due to pooling and convolution striding, which affects
their ability to model long-range dependencies [6]. To overcome these limitations,
there has been a shift towards Transformer-based approaches like TransUNet [6],
TransFuse [33], and TransNetR [15], which employ self-attention mechanisms to
capture long-range dependencies better, thus enhancing accuracy. However, these
methods can be complex and resource-demanding. Their dependence on larger
patch sizes might also impact their performance in translational equivariance,
essential for biomedical imaging tasks [16]. Moreover, Transformers may result
in less accurate segmentations as they struggle to incorporate low-level details
[6]. CNN-based or Transformer-based methods may especially encounter chal-
lenges in extracting detailed, fine-grained features. They are also constrained
by specific scenarios, easily affected by different clinical settings or changes in
augmentation, and lack extensive exploration into how local and global feature
representations are integrated. This oversight could mean missing opportunities
to enhance feature attributes.
In this paper, we introduce DOLG-CNet, a novel framework designed for
polyp segmentation. This framework uses the state-of-the-art CNN backbone,
ConvNeXt [17], for its segmentation capabilities. It also incorporates an orthog-
onal fusion module, effectively capturing global and local feature relationships.
Additionally, we propose a novel training strategy that combines contrastive
learning with segmentation training, supplemented by auxiliary deep supervi-
sion loss to enhance performance. Our contributions are four-fold:
2 Related Works
2.1 Polyp Segmentation
Over the past decade, deep learning has achieved significant advancements, par-
ticularly in early polyp diagnosis using endoscopic images. UNet [21], a well-
known architecture for medical image segmentation, features a CNN encoder-
decoder structure with a contracting path to gather context and an expanding
path to enhance detailed accuracy. Building upon the UNet architecture, sev-
eral variants [7, 14, 35] have emerged, each enhancing segmentation capabilities.
UNet++ [35] employs a complex network of nested and densely connected skip
pathways to minimize the semantic gap between encoder and decoder feature
maps. Subsequently, SFA [9] and PraNet [8] aim to delineate the precise bound-
ary of polyps from surrounding tissues. In particular, PraNet [8] employs reverse
attention modules to refine boundary details using a global feature map pro-
duced by a parallel partial decoder that utilizes high-level features. MSNet [34]
introduces a multi-scale subtraction network that effectively reduces redundancy
while harnessing complementary features across multiple scales.
In recent developments, Transformer-based models have also demonstrated
remarkable effectiveness in polyp image segmentation [6, 15]. For instance, Tran-
sUNet [6] employs a combined CNN-transformer encoder to grasp long-range
dependencies and utilizes a cascaded CNN upsampler in its decoder to discern
local contextual relationships between pixels. TransNetR [15] is a Transformer-
based residual network with an encoder-decoder structure, offering efficient seg-
mentation capabilities for both in-distribution and out-of-distribution datasets.
Recent advancements in local feature learning from images have been driven
by deep learning techniques [31]. DELF [18] is a prominent framework that
develops attentive local feature descriptors for large-scale image retrieval. Global
features are generally obtained through operations like GeM pooling [19]. Inte-
grating local and global features is beneficial as feature maps at the local scale in
image representation models act like visual words [23]. Yang et al.[30] introduce
the Deep Orthogonal Local and Global (DOLG) information fusion framework
for enhanced image retrieval. This includes an orthogonal fusion module that
merges local and global information, improving the final descriptor. Our study
applies these advancements to polyp segmentation, incorporating the Orthogonal
Fusion Module from the DOLG framework [30] to enhance detection accuracy
and robustness by leveraging both feature types.
3 Proposed Method
This section provides a general framework for our method. Figure 1 presents
an overview of DOLG-CNet. By incorporating principles of contrastive learning
and commonly used segmentation training with deep supervision, DOLG-CNet
adopts a one-stage, end-to-end framework. Its objective is to analyze image repre-
sentations in different augmentation forms and segment them, thereby improving
polyp segmentation through various types of deep supervision. The framework
includes several key components: a newly developed segmentation backbone, a
contrastive learning task that compares the representations of images under high
and low augmentations, and an auxiliary deep supervision loss.
as $x \in \mathbb{R}^{H \times W \times 3}$ for input. Initially, we extract feature maps from each stage
of ConvNeXt [17] and input them into residual blocks. The feature maps at
stage $i$ are $F_i \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times 96}$, with $i$ ranging from 1 to 4. Subsequently, these
feature maps, together with features from the Orthogonal Fusion Module, are
upsampled to match the original image resolution of $H \times W$. After concatenat-
ing these upsampled maps, they are processed through a residual block to derive
the final encoded features. The output stage involves a convolution layer with a
kernel size of 1 and a sigmoid activation function. This layer is responsible for
predicting the pixel-wise label map of the input image at the original resolution.
Additionally, for deep supervision purposes, a corresponding pixel label map is
generated at a resolution of $\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$ from the corresponding feature map $F_i$.
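The decoding path just described can be summarized in a short sketch. This is a minimal illustration, not the authors' implementation: plain convolutions stand in for the residual blocks, and the Orthogonal Fusion Module output is assumed to have the same 96-channel width as the stage features.

```python
# Minimal sketch of the decoding path (assumptions noted in the lead-in).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    def __init__(self, num_stages: int = 4, channels: int = 96):
        super().__init__()
        # One block per backbone stage (residual blocks in the paper; plain convs here).
        self.stage_blocks = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages)])
        self.fuse_block = nn.Conv2d(channels * (num_stages + 1), channels, 3, padding=1)
        self.head = nn.Conv2d(channels, 1, kernel_size=1)        # final 1x1 convolution
        self.aux_heads = nn.ModuleList(                          # deep-supervision heads
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_stages)])

    def forward(self, stage_feats, ofm_feat, out_size):
        ups, aux_maps = [], []
        for feat, block, aux in zip(stage_feats, self.stage_blocks, self.aux_heads):
            f = block(feat)
            aux_maps.append(torch.sigmoid(aux(f)))               # label map at H/2^(i+1) x W/2^(i+1)
            ups.append(F.interpolate(f, size=out_size, mode="bilinear", align_corners=False))
        ups.append(F.interpolate(ofm_feat, size=out_size, mode="bilinear", align_corners=False))
        fused = self.fuse_block(torch.cat(ups, dim=1))
        return torch.sigmoid(self.head(fused)), aux_maps         # full-resolution map + aux maps
```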
analysis, leveraging both localized and holistic information from the image.
After that, we employ a novel Orthogonal Fusion Module (OFM) [30] specif-
ically designed to aggregate the global feature tensor .fg and the local feature
tensor .fl . Following the orthogonal fusion process, a final compact descriptor is
generated, effectively integrating both local and global information.
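In the spirit of the DOLG fusion [30], the orthogonal fusion step can be sketched as below; the tensor shapes and the final channel-wise concatenation are assumptions rather than the exact module used here.

```python
# Sketch of orthogonal fusion of a local feature map and a global descriptor.
import torch

def orthogonal_fusion(f_l: torch.Tensor, f_g: torch.Tensor) -> torch.Tensor:
    """f_l: local features (B, C, H, W); f_g: global descriptor (B, C)."""
    # Projection of each local feature vector onto the global descriptor.
    dot = torch.einsum("bchw,bc->bhw", f_l, f_g)                  # (B, H, W)
    g_norm_sq = (f_g * f_g).sum(dim=1).clamp_min(1e-6)            # (B,)
    proj = dot[:, None] / g_norm_sq[:, None, None, None] * f_g[:, :, None, None]
    f_orth = f_l - proj                                           # component orthogonal to f_g
    f_g_map = f_g[:, :, None, None].expand_as(f_l)                # broadcast f_g spatially
    return torch.cat([f_orth, f_g_map], dim=1)                    # (B, 2C, H, W)
```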
Fig. 3. Architectural diagrams of (a) Orthogonal Fusion Module and (b) Residual
Block.
Residual Block. Residual blocks [12], shown in Fig. 3b, use skip connections
to learn residuals, tackling the vanishing gradient problem and boosting infor-
mation processing. These connections also enhance generalization and prevent
overfitting [11, 12]. Our architecture features sequences of convolution, batch nor-
malization, and Swish activation [20], chosen for its effectiveness across a range
of input values, outperforming traditional activations like ReLU.
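A minimal sketch of such a block is shown below; the use of two convolution-normalization pairs is an assumption, since the exact layer count is not stated above.

```python
# Residual block sketch: convolution, batch normalization, Swish (SiLU), skip connection.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),                                            # Swish activation [20]
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.body(x) + x)                         # skip connection learns the residual
```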
3.2 DOLG-CNet
The segmentation backbone was optimized using both the Dice loss ($L_{Dice}$) and
the Binary Cross Entropy loss ($L_{BCE}$), following the methodologies described in
[28]. Supervision, including deep supervision, was consistently applied at each
output layer of the model. Let $P$ and $G$ denote the predicted and ground truth
values, respectively, both assumed to be at the same resolution. The weighting
coefficients $\lambda_1$ and $\lambda_2$ were set to 1 for simplicity. Therefore, the segmentation
loss $L_{Segment}$ is formulated as:

$$L_{Segment} = \lambda_1 L_{Dice}(P, G) + \lambda_2 L_{BCE}(P, G)$$
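A minimal sketch of this objective is given below, assuming sigmoid probability maps of shape (B, 1, H, W); resizing the ground truth to each auxiliary resolution is an assumption about how the deep-supervision terms are computed.

```python
# Dice + BCE segmentation loss with deep supervision (equal weights, as above).
import torch
import torch.nn.functional as F

def dice_loss(p: torch.Tensor, g: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    inter = (p * g).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + g.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def segment_loss(pred, gt, lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    return lam1 * dice_loss(pred, gt) + lam2 * F.binary_cross_entropy(pred, gt)

def total_loss(main_pred, aux_preds, gt) -> torch.Tensor:
    loss = segment_loss(main_pred, gt)
    for aux in aux_preds:                                         # deep-supervision outputs
        gt_small = F.interpolate(gt, size=aux.shape[-2:], mode="nearest")
        loss = loss + segment_loss(aux, gt_small)
    return loss
```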
4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our DOLG-CNet on five public polyp segmentation datasets: Kvasir-
SEG [13], CVC-ClinicDB [2], CVC-ColonDB [25], CVC-T [27], and ETIS [22],
following protocols from [8, 34] with identical training and testing splits. The
training set includes 1450 images, with 550 from CVC-ClinicDB and 900 from
Kvasir-SEG, and the testing set comprises 798 images across all datasets.
For evaluation, we use six metrics: mean Dice score (mDice), mean Intersec-
tion over Union (mIoU), weighted $F_\beta$-measure ($F_\beta^w$), structure measure ($S_\alpha$),
enhanced-alignment measure ($E_\phi^{max}$), and mean absolute error (MAE). A lower
MAE indicates better performance, whereas higher values are preferable for the
other metrics.
DOLG-CNet is benchmarked against six leading methods: UNet [21],
UNet++ [35], SFA [9], PraNet [8], MSNet [34], and TransNetR [15], using pub-
lished results or replicated experiments to ensure comparable training conditions.
Fig. 5 presents a visual comparison between our method and other counterparts.
Our proposed technique performs better in segmenting polyps of various sizes
and shapes. Moreover, our approach demonstrates precise segmentation capabil-
ities, particularly for polyps that are challenging to detect, as shown in the 3rd,
4th, and 5th rows.
5 Ablation Study
performance, with mDice and mIoU increasing by 42.0% and 42.6% on CVC-
ClinicDB, and by 55.6% and 53.3% on CVC-ColonDB, respectively. These results
highlight ConvNeXt’s effectiveness in polyp segmentation, a field where minor
enhancements significantly impact clinical diagnostics. Besides, incorporating
the Orthogonal Fusion Module and modifying the loss function, including Deep
Supervision and Contrastive Loss, further improved performance by 3.8% to
14.1%, underscoring their value in enhancing diagnostic accuracy and efficiency.
For a comprehensive understanding of the models, we have utilized heatmaps
to visualize the activation within the DOLG-CNet during image processing, as
depicted in Fig. 6. These heatmaps are generated by averaging feature map chan-
nels and applying color mapping to highlight the most responsive areas of each
layer. Brighter colors indicate higher activations. The analysis ranges from the
initial to final stages, showing a progression from scattered attention in early
layers to a focused emphasis on key image features in later layers, such as the
polyp region, which is particularly enhanced by the Orthogonal Fusion Mod-
ule. This visualization aids in refining model performance by demonstrating the
model’s effective learning and recognition capabilities, especially in challenging
scenarios like distinguishing polyps from complex backgrounds.
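A compact sketch of this visualization step follows; the jet colormap is an assumed choice for the unspecified color mapping.

```python
# Channel-averaged activation heatmap for one layer's feature map.
import numpy as np
import matplotlib.cm as cm

def activation_heatmap(feature_map: np.ndarray) -> np.ndarray:
    """feature_map: (C, H, W) activations -> (H, W, 3) RGB heatmap."""
    heat = feature_map.mean(axis=0)                                 # average over channels
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)   # normalize to [0, 1]
    return cm.jet(heat)[..., :3]                                    # brighter = higher activation
```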
6 Conclusion
This paper introduces DOLG-CNet, a deep learning framework for polyp image
segmentation using ConvNeXt as its backbone. It employs contrastive learning
alongside standard segmentation training and deep supervision loss to enhance
performance. Images undergo varied augmentation levels, with the model trained
to align vector embeddings using an orthogonal fusion module for effective
global and local feature merging. With deep supervision, integrating contrastive
and segmentation losses accelerates convergence. DOLG-CNet achieves notable
results with dice scores of 0.913 (Kvasir-SEG), 0.761 (CVC-ColonDB), and 0.722
(ETIS), surpassing existing methods qualitatively and quantitatively across mul-
tiple datasets. Future work aims to optimize training efficiency for larger net-
works and enhance the utilization of local and global features for superior seman-
tic segmentation.
References
1. Ahn, S.B., Han, D.S., Bae, J.H., Byun, T.J., Kim, J.P., Eun, C.S.: The miss rate for
colorectal adenoma determined by quality-adjusted, back-to-back colonoscopies.
Gut Liver 6(1), 64 (2012)
2. Bernal, J., et al.: Wm-dova maps for accurate polyp highlighting in colonoscopy:
Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43,
99–111 (2015)
3. Bernal, J., Sánchez, J., Vilarino, F.: Towards automatic polyp detection with a
polyp appearance model. Pattern Recogn. 45(9), 3166–3182 (2012)
4. Caron, M., et al.: Emerging properties in self-supervised vision transformers. In:
Proceedings of the International Conference on Computer Vision (ICCV) (2021)
5. Chaitanya, K., Erdil, E., Karani, N., Konukoglu, E.: Contrastive learning of global
and local features for medical image segmentation with limited annotations. Adv.
Neural. Inf. Process. Syst. 33, 12546–12558 (2020)
6. Chen, J., et al.: Transunet: transformers make strong encoders for medical image
segmentation. arXiv preprint arXiv:2102.04306 (2021)
7. Diakogiannis, F.I., Waldner, F., Caccetta, P., Wu, C.: Resunet-a: a deep learning
framework for semantic segmentation of remotely sensed data. ISPRS J. Pho-
togramm. Remote. Sens. 162, 94–114 (2020)
8. Fan, D.-P., et al.: PraNet: parallel reverse attention network for polyp segmenta-
tion. In: Martel, A.L., et al. (eds.) MICCAI 2020. LNCS, vol. 12266, pp. 263–273.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-59725-2_26
9. Fang, Y., Chen, C., Yuan, Y., Tong, K.: Selective feature aggregation network
with area-boundary constraints for polyp segmentation. In: Shen, D., et al. (eds.)
MICCAI 2019. LNCS, vol. 11764, pp. 302–310. Springer, Cham (2019). https://
doi.org/10.1007/978-3-030-32239-7_34
10. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invari-
ant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR 2006), vol. 2, pp. 1735–1742. IEEE (2006)
11. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In:
Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp.
630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 770–778 (2016)
13. Jha, D., et al.: Kvasir-SEG: a segmented polyp dataset. In: Ro, Y.M., et al. (eds.)
MMM 2020. LNCS, vol. 11962, pp. 451–462. Springer, Cham (2020). https://doi.
org/10.1007/978-3-030-37734-2_37
14. Jha, D., et al.: Resunet++: an advanced architecture for medical image segmenta-
tion. In: 2019 IEEE International Symposium on Multimedia (ISM), pp. 225–2255.
IEEE (2019)
15. Jha, D., Tomar, N.K., Sharma, V., Bagci, U.: Transnetr: transformer-based residual
network for polyp segmentation with multi-center out-of-distribution testing. In:
Medical Imaging with Deep Learning, pp. 1372–1384. PMLR (2024)
16. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision, pp. 10012–10022 (2021)
17. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for
the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pp. 11976–11986 (2022)
18. Noh, H., Araujo, A., Sim, J., Weyand, T., Han, B.: Large-scale image retrieval with
attentive deep local features. In: Proceedings of the IEEE International Conference
on Computer Vision, pp. 3456–3465 (2017)
19. Radenović, F., Tolias, G., Chum, O.: Fine-tuning CNN image retrieval with no
human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 41(7), 1655–1668
(2018)
20. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function.
arXiv preprint arXiv:1710.05941 7(1), 5 (2017)
21. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
22. Silva, J., Histace, A., Romain, O., Dray, X., Granado, B.: Toward embedded detec-
tion of polyps in wce images for early diagnosis of colorectal cancer. Int. J. Comput.
Assist. Radiol. Surg. 9, 283–293 (2014)
23. Siméoni, O., Avrithis, Y., Chum, O.: Local features and visual words emerge in
activations. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 11651–11660 (2019)
24. Sung, H., et al.: Global cancer statistics 2020: Globocan estimates of incidence
and mortality worldwide for 36 cancers in 185 countries. CA: Can. J. Clin. 71(3),
209–249 (2021)
25. Tajbakhsh, N., et al.: Automated polyp detection in colonoscopy videos using shape
and context information. IEEE Trans. Med. Imaging (2015)
26. Tajbakhsh, N., Gurudu, S.R., Liang, J.: Automated polyp detection in colonoscopy
videos using shape and context information. IEEE Trans. Med. Imaging 35(2),
630–644 (2015)
27. Vázquez, D., et al.: A benchmark for endoluminal scene segmentation of
colonoscopy images. J. Healthcare Eng. 2017 (2017)
28. Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network
for polyp segmentation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol.
12901, pp. 699–708. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
87193-2_66
29. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 1492–1500 (2017)
30. Yang, M., et al.: Dolg: single-stage image retrieval with deep orthogonal fusion of
local and global features. In: Proceedings of the IEEE/CVF International confer-
ence on Computer Vision, pp. 11772–11781 (2021)
31. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: LIFT: learned invariant feature trans-
form. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS,
vol. 9910, pp. 467–483. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-
46466-4_28
32. Zauber, A.G., et al.: Colonoscopic polypectomy and long-term prevention of
colorectal-cancer deaths. N. Engl. J. Med. 366(8), 687–696 (2012)
33. Zhang, Y., Liu, H., Hu, Q.: TransFuse: fusing transformers and cnns for medi-
cal image segmentation. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS,
vol. 12901, pp. 14–24. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-
87193-2_2
34. Zhao, X., Zhang, L., Lu, H.: Automatic polyp segmentation via multi-scale subtrac-
tion network. In: de Bruijne, M., et al. (eds.) MICCAI 2021. LNCS, vol. 12901, pp.
120–130. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87193-2_12
35. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: a nested
u-net architecture for medical image segmentation. In: Deep Learning in Medi-
cal Image Analysis and Multimodal Learning for Clinical Decision Support: 4th
International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS
2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September
2018, Proceedings 4, pp. 3–11. Springer (2018)
VisChronos: Revolutionizing Image
Captioning Through Real-Life Events
Abstract. This paper aims to bridge the semantic gap between visual
content and natural language understanding by leveraging historical
events in the real world as a source of knowledge for caption gener-
ation. We propose VisChronos, a novel framework that utilizes large
language models and dense captioning models to identify and describe
real-life events from a single input image. Our framework can automati-
cally generate detailed and context-aware event descriptions, enhancing
the descriptive quality and contextual relevance of generated captions
to address the limitations of traditional methods in capturing contextual
narratives. Furthermore, we introduce a new dataset, EventCap (https://
zenodo.org/records/14004909), specifically constructed using the pro-
posed framework, designed to enhance the model’s ability to identify
and understand complex events. The user study demonstrates the effi-
cacy of our solution in generating accurate, coherent, and event-focused
descriptions, paving the way for future research in event-centric image
understanding.
1 Introduction
Image captioning aims to generate descriptive captions for images. However,
most current methods tend to produce captions with a limited understanding of
the image, focusing primarily on identifying objects, actions, and basic physical
attributes [8, 9, 14]. These approaches fall short in conveying deeper context,
as they lack the ability to infer meaningful information about the events or
interactions taking place in the image. This limitation becomes especially evident
when the goal is to describe not just what is visible but also the underlying story
or context associated with the image.
In many cases, the generated captions are too superficial to capture complex
scenarios where additional information, such as who is involved, what is hap-
pening, where and when the event took place, and its significance, is critical.

Fig. 1. Comparison between general caption by Grit [14] and our event-based caption,
highlighting additional details such as location, date, identity, and event purpose.

As a result, these methods are inadequate for providing rich, informative captions
that align with more sophisticated user needs, such as understanding or retriev-
ing images related to real-world events. To overcome this, a new approach is
needed: one that integrates contextual details and event-related information to
create comprehensive, narrative-driven captions that go beyond simple object
recognition.
Our research introduces the task of Event-Enriched Image Captioning
(EEIC), which aims to generate captions that provide richer, more compre-
hensive information about an image. This approach is demonstrated through a
sample depicted in Fig. 1, where we showcase the result caption that our method
generates. This example illustrates the enhanced descriptive quality and contex-
tual depth that EEIC can bring to image captioning. These captions go beyond
simple visual descriptions by offering deeper insights, including the names and
attributes of objects, the timing, context, outcomes of events, and other crucial
details: information that cannot be gleaned from merely observing the image.
This approach facilitates the creation of more coherent and detailed narratives,
capturing not only the visible elements but also the underlying context and sig-
nificance of the scene, ultimately offering a more complete understanding of what
the image represents.
The core idea of our approach is to harness event-related information from
credible sources while leveraging the reasoning capabilities of both vision-
language models and large language models (LLMs). We propose a fully auto-
mated four-step framework, called VisChronos, to analyze both the visual con-
tent and the temporal, event-based aspects of the scene, ensuring a more com-
prehensive understanding. VisChronos operates through a systematic process
designed to extract, analyze, and synthesize information from images and asso-
ciated events. First, a vision-language model identifies and describes the most
important aspects of the image, including both those specified by prompts and
those deemed important by the model itself. Next, in the second step, an LLM
generates questions about the image based on the key aspects identified in the
first step, which includes both mandatory and optional questions. In the third
step, another LLM answers these questions using event information that we pro-
vide. Finally, a separate LLM synthesizes and processes the information from the
previous three steps to infer and generate the final caption for the image. Unlike
traditional models that rely on learning from a massive dataset and risk generat-
ing information that may be fabricated or irrelevant, our method addresses this
issue by incorporating factual sources: credible, human-authored articles that pro-
vide real, context-rich information. By drawing directly from authentic articles
and aligning this information with image content, our framework ensures that
captions accurately represent real events in human history. This framework is
designed to establish an efficient information mining flow, systematically divid-
ing different stages of information extraction across each step. This approach
ensures that the framework can mine the most useful and relevant information
at each stage, ultimately resulting in rich and contextually accurate captions.
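For illustration, the four stages can be read as the schematic sketch below. The helpers `describe_image` and `ask_llm` and the prompt wording are hypothetical placeholders, not the authors' actual prompts or model calls.

```python
# Schematic of the four-step VisChronos flow (placeholders, see lead-in).
def vischronos_caption(image, article: str, describe_image, ask_llm) -> str:
    # Step 1: a vision-language model describes the key aspects of the image.
    dense_caption = describe_image(image)

    # Step 2: an LLM turns those aspects into mandatory and optional questions.
    questions = ask_llm(
        "Given this image description, write questions about time, context, main "
        "events, outcome, impact, people/objects, emotions, background, and future "
        f"implications:\n{dense_caption}")

    # Step 3: another LLM answers strictly from the provided event information,
    # replying "no information" when the article does not support an answer.
    answers = ask_llm(
        f"Article:\n{article}\n\nAnswer only from the article, or say 'no "
        f"information':\n{questions}")

    # Step 4: a final LLM synthesizes everything into the event-enriched caption.
    return ask_llm(
        "Write one coherent, event-focused caption for the image using:\n"
        f"Image description:\n{dense_caption}\nQuestion-answer pairs:\n{answers}")
```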
Extensive human evaluations of captions generated by the VisChronos reveal
that they are comparable to captions crafted by human annotators. These
machine-generated captions were particularly praised for their completeness,
coherence, conciseness, and the inclusion of relevant information not explicitly
visible in the images.
Using VisChronos, we have created a dataset named EventCap, consisting
of 3140 event-based image-caption pairs. Each pair has been carefully curated
using images and related information sourced from a wide range of credible
articles. This collection serves as a valuable resource for training and evaluating
the performance of image captioning models in understanding and describing
complex real-world events. To the best of our knowledge, no similar dataset
exists that is specifically designed for the task of event-enriched image captioning,
making EventCap a unique and essential resource for advancing research in this
area.
Our main contributions can be summarized as follows:
2 Related Work
2.1 Dense Captioning
Dense captioning, which aims to generate detailed descriptions for objects in
scenes or videos, remains a challenging task. Over the years, several methods
have been developed to address this problem, each employing distinct approaches
and making notable contributions. Chen et al. [5] introduced Scan2Cap, an end-
to-end method for dense captioning in RGB-D scans, utilizing 3D point cloud
inputs to generate bounding boxes and corresponding object descriptions. Build-
ing on this, Wang et al. [13] presented PDVC, a framework that formulates dense
video captioning as a task of set prediction, enabling efficient parallel decoding.
Aafaq et al. [1] proposed the ViSE framework and VSJM-Net, which leverage
early linguistic information fusion to model word-context distributional proper-
ties for improved dense video captioning. Shao et al. [10] further advanced the
field by incorporating textual context-aware methods that generate diverse and
context-rich captions. Similarly, Jiao et al. [7] presented MORE, a model that
captures complex scene relations for superior captioning accuracy. Furthermore,
Wu et al. [14] developed GRiT, a generative region-to-text transformer model
that emphasizes object understanding in dense captioning tasks.
Despite the advancements made by these approaches, they typically generate
conventional captions that lack real-world semantic depth, as they rely solely on
information from the image or video. Moreover, most existing methods produce
captions in a single pass through learned representations, which can result in
missing critical details. In contrast, our method continuously supplements the
captioning process through an interactive dialogue between models, allowing for
the extraction of more nuanced and semantically rich information.
Large Language Models (LLMs) have garnered significant attention due to their
ability to generate human-like text and solve complex tasks across various
domains. GPT (Generative Pre-trained Transformer), developed by OpenAI,
is one of the most well-known LLMs. Floridi et al. [6] introduced GPT-3, a
third-generation autoregressive language model that generates human-like text
using deep learning techniques. They explored its nature, scope, limitations, and
potential consequences, emphasizing that GPT-3 is not intended to pass complex
mathematical, semantic, or ethical tests. Brown et al. [4] further demonstrated
that scaling language models, such as GPT-3 with 175 billion parameters, sig-
nificantly improves few-shot performance across various tasks, sometimes out-
performing state-of-the-art (SOTA) approaches. In subsequent developments,
Achiam et al. [2] presented GPT-4, a large-scale multimodal model capable of
processing both image and text inputs to generate text outputs, marking a sig-
nificant advancement in multimodal language modeling. Meanwhile, Yang et
al. [15] explored GPT-4V, expanding GPT-4’s capabilities to include vision-
based tasks, opening new avenues for large multimodal models. In addition,
Wang et al. [12] evaluated the trustworthiness of GPT models, concluding that
GPT-4 is generally more reliable than GPT-3.5 but remains vulnerable to adver-
sarial attacks, such as jailbreaking or misleading prompts.
Gemma is a lightweight, SOTA open model that builds upon the research
and technology developed for Gemini models. As outlined in recent studies,
Gemma has exhibited strong performance across various benchmarks for lan-
guage understanding, reasoning, and safety [11]. Notably, Gemma has been rig-
orously evaluated alongside other large language models (LLMs) to assess its
capabilities across different languages, modalities, models, and tasks [3]. The
advancements in Gemma are a result of ongoing improvements in multimodal
research, with enhancements in understanding and processing non-textual data
inputs such as images and speech, which significantly contribute to its versatil-
ity in both academic and practical applications. Furthermore, by being available
as an open-source model, Gemma provides accessible, cutting-edge AI technol-
ogy to the broader research community, while also offering paid solutions for
advanced functionalities and commercial deployment [11].
In this framework, we integrate the use of dense captioning models with
both paid and open-source LLMs, including models such as GPT and Gemma,
to leverage their combined strengths for enhanced performance across a range
of tasks. By this framework, we also generate the EventCap dataset, the first
dataset specifically designed for event captioning, providing a unique resource
for accurately describing event contexts in diverse applications.
– Time: Questions addressing when the event occurred or its timeline to understand the temporal context.
– Context or Reason: Questions exploring the circumstances or motivations behind the event, seeking to clarify why the event took place.
– Main Events: Questions focusing on the central actions or occurrences within the event, ensuring a clear understanding of the primary narrative.
– Outcome: Questions that inquire about the result or conclusion of the event, aiming to highlight the final impact or resolution.
– Impact: Questions exploring the broader consequences or effects of the event on individuals, groups, or larger contexts.
– Objects or People: Several questions delving into the key people or objects mentioned in the dense caption, ensuring that all significant elements are covered.
– Special Figures: Specific questions about notable or important figures involved in the event, shedding light on their roles and influence.
– Emotions and Reactions: Questions designed to explore the emotional states or reactions of the people in the image, providing insight into the human element of the scene.
– Background Details: Questions addressing the setting or background elements in the image, helping to paint a fuller picture of the environment in which the event takes place.
– Future Implications: Questions speculating on the potential future outcomes or ramifications of the event, aiming to place the event within a broader temporal and societal context.
image. The questions are designed to extract more information from the accom-
panying article or external sources, leading to a comprehensive understanding of
the depicted event.
A structured approach is employed to generate questions that comprehen-
sively address the event or context depicted in the image. The questions are
crafted to ensure that all key aspects of the scenario are explored, providing a
robust foundation for the subsequent explanation and synthesis stages as shown
in Table 1.
The design of the question-generation model ensures that each of these
dimensions is adequately explored, leading to a comprehensive inquiry into the
event depicted in the image. This stage serves as the foundation for the next
phase, where the generated questions are used to retrieve detailed answers from
external knowledge sources.
The third stage focuses on extracting detailed answers to the questions gen-
erated in Stage 2 by leveraging external knowledge sources such as articles or
accompanying text related to the image. The Explanation Bot ensures that the
answers are accurate and directly relevant to the event depicted in the image.
This stage requires a highly precise approach to ensure that the model pro-
vides accurate answers grounded in the available information. A key princi-
ple guiding the model’s behavior during this phase is certainty. The model is
instructed to answer questions only when it is 100% confident in the informa-
tion it provides. This ensures that the answers are factually reliable and directly
linked to the knowledge from the article or dense caption.
Information Retrieval: The model focuses on extracting factual data from the
article or other external knowledge sources and cross-referencing this information
with the dense caption to ensure consistency and accuracy. The goal is to answer
each question as fully as possible, but only when the necessary information is
available and can be confidently inferred.
This strict adherence to certainty ensures that the answers provided are reli-
able and fact-based, maintaining the integrity of the image-captioning process.
By instructing the model to explicitly state “no information” when necessary,
the framework avoids overgeneralization or the inclusion of speculative answers.
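As a small illustration of this rule, unsupported answers can simply be dropped before the synthesis stage; the data structure below is a hypothetical example, with only the "no information" marker taken from the text.

```python
# Drop question-answer pairs the Explanation Bot could not ground in the article.
def filter_answers(qa_pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    kept = []
    for question, answer in qa_pairs:
        if answer.strip().lower().startswith("no information"):
            continue                                  # unsupported -> excluded from synthesis
        kept.append((question, answer))
    return kept
```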
specifically created for this task, there is no comparable dataset available for
direct benchmarking. To address this, we conducted a user study aimed at assess-
ing both the effectiveness of our method and the quality of the generated dataset.
Fig. 4. Sample distribution in EventCap dataset by year (best view in color & zoom-
in). (Color figure online)
Apparatus and Procedure: Our study was conducted both online and on-site in
our lab, where participants completed the tasks. Each participant received clear
instructions on evaluating captions and writing their own for comparison. They
were required to spend at least 4 min evaluating and 5 min writing the caption
for each image including reading the corresponding article. The total time for
the study sessions was approximately 120 min per participant.
First, participants were asked to write their own captions for 5-10 images
from 2 articles. After that, participants evaluated the quality of captions for
approximately 10-15 images from 4 different articles. Half of the captions were
generated by our framework, while the other half were written by humans (i.e.,
other participants).
The participants were asked to rate the performance of the captions on a
scale of 1 to 5 across three metrics, based on their individual perspectives. The
comparison was based on several key metrics:
– Faithfulness: Whether the content of the caption fully describes the key
events depicted in the image.
– Comprehensibility: Whether the caption is concise, easy to read, and free
of unnecessary information.
Fig. 5. Sample distribution in EventCap dataset by category and section (best view in
color & zoom-in). (Color figure online)
5 Conclusion
References
1. Aafaq, N., Mian, A., Akhtar, N., Liu, W., Shah, M.: Dense video captioning with
early linguistic information fusion. IEEE Trans. Multimedia (2022)
2. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., et al.: GPT-4 technical report.
arXiv preprint arXiv:2303.08774 (2023)
3. Ahuja, S., Aggarwal, D., Gumma, V., et al.: Megaverse: benchmarking large
language models across languages, modalities, models and tasks. arXiv preprint
arXiv:2308.05698 (2023)
4. Brown, T., et al.: Language models are few-shot learners. In: NIPS (2020)
5. Chen, D.Z., Gholami, A., Nießner, M., Chang, A.X.: Scan2cap: context-aware dense
captioning in RGB-D scans. arXiv preprint arXiv:2012.02202 (2020)
6. Floridi, L., Chiriatti, M.: GPT-3: its nature, scope, limits, and consequences. Minds
Mach. (2020)
7. Jiao, Y., Chen, S., Jie, Z., Chen, J., Ma, L., Jiang, Y.G.: More: multi-order relation
mining for dense captioning in 3D scenes. In: ECCV (2022)
8. Krause, J., Johnson, J., Krishna, R., Fei-Fei, L.: A hierarchical approach for gen-
erating descriptive image paragraphs. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 317–325 (2017)
9. Li, C., et al.: mPLUG: effective and efficient vision-language learning by cross-
modal skip-connections. arXiv preprint arXiv:2205.12005 (2022)
10. Shao, Z., Han, J., Debattista, K., Pang, Y.: Textual context-aware dense captioning
with diverse words. IEEE Trans. Multimedia (2023)
11. Team, G., Mesnard, T., Hardin, C., et al.: Gemma: open models based on Gemini
research and technology. arXiv preprint arXiv:2401.01234 (2024)
12. Wang, B., Chen, W., Pei, H., et al.: Decodingtrust: a comprehensive assessment of
trustworthiness in GPT models. NIPS (2023)
13. Wang, T., Zhang, R., Lu, Z., Zheng, F., Cheng, R., Luo, P.: End-to-end dense
video captioning with parallel decoding. arXiv preprint arXiv:2107.12589 (2021)
14. Wu, J., et al.: Grit: a generative region-to-text transformer for object understand-
ing. arXiv preprint arXiv:2203.15806 (2022)
15. Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.C., Liu, Z., Wang, L.: The dawn of LMMs:
preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.00332
(2023)
TI-JEPA: An Innovative Energy-Based
Joint Embedding Strategy for Text-Image
Multimodal Systems
1 Introduction
In the era of Artificial Intelligence, the ability to process and understand infor-
mation from multiple modalities simultaneously has become increasingly crucial
[31, 32]. Multimodal fusion, the process of integrating information from various
sensory inputs to form a coherent understanding, stands at the forefront of this
challenge. Among the myriad of multimodal tasks, text-image alignment has
emerged as a fundamental problem with far-reaching applications in areas such
as visual question answering, image captioning, and cross-modal retrieval [26].
2 Related Works
2.1 Multimodal Fusion
Multimodal fusion has been a growing area of interest in machine learning
[20, 21]. Early works [11, 12, 28] focused on feature-level fusion, combining repre-
sentations from different modalities using simple concatenation or averaging.
More advanced techniques have since emerged. Xu et al. [11] introduced
attention-based fusion, allowing models to dynamically focus on relevant fea-
tures across modalities. They also proposed tensor-based methods for capturing
higher-order interactions between modalities. Lu et al. [12] explored the use of
Transformer architecture for multimodal fusion, leveraging their ability to model
long-range dependencies across different data types.
Text-image alignment has seen significant advancements in recent years [3, 15,
25]. Frome et al. [3] introduced the concept of visual-semantic embeddings in
DeViSE, learning a joint space where semantically similar text and images are
close to each other. Radford et al. [25] demonstrated the power of con-
trastive learning with large-scale image-text data for text-image alignment with
the CLIP model. This approach has since been extended by methods like ALIGN
[4], which further improved performance through larger-scale training. More
recently, Li et al. [8] proposed UNIMO, a unified framework for text-image pre-
training, incorporating multiple pretext tasks to learn robust representations.
3 Our Approach
The proposed TI-JEPA architecture integrates cross-attention mechanisms to
effectively align textual and visual information for predicting masked image
patches, which is demonstrated in Fig. 1. Before getting into details, denote .I
as the original image, .Icontext as the same image but after performing context
masking, and its caption as .T . The high-level aspect of our architecture can be
described in smaller components as below:
– Image encoder $f_I$ processes the full image $I$ and the masked image $I_{context}$, gen-
erating the full embedding representation and the masked context-part (context
block) representation.
– Text encoder $f_T$ converts the image description $T$ into a dense representation
that captures semantic information.
– We employ two blocks of text-to-image (t2i) cross-attention, namely block
$X$ and block $\tilde{X}$, to align the encoded text features with the visual features
from the image encoder.
– The output of the t2i cross-attention block $X$ is passed through a predictor
$g_\phi$, which generates the final predictions for the representations of the target
patches.
Fig. 1. The proposed TI-JEPA architecture, where cross-attention between text and
image encodings is leveraged to predict masked patches.
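To fix the notation, the components listed above can be arranged as in the PyTorch-style sketch below; the encoder, cross-attention, and predictor classes are placeholders (assumptions), not the released implementation.

```python
# Structural sketch of the TI-JEPA forward pass (placeholder modules).
import torch.nn as nn

class TIJEPASketch(nn.Module):
    def __init__(self, f_img, f_txt, cross_x, cross_x_tilde, predictor):
        super().__init__()
        self.f_img, self.f_txt = f_img, f_txt       # image / text encoders
        self.cross_x = cross_x                      # t2i cross-attention X (context branch)
        self.cross_x_tilde = cross_x_tilde          # t2i cross-attention X~ (target branch)
        self.predictor = predictor                  # g_phi

    def forward(self, image, image_context, text, target_masks):
        s_i = self.f_img(image)                     # patch representations of the full image
        s_ctx = self.f_img(image_context)           # context-block representations
        s_t = self.f_txt(text)                      # token representations of the caption
        s_y = self.cross_x_tilde(s_t, s_i)          # targets s_y = X~(s_T, s_I)
        s_x = self.cross_x(s_t, s_ctx)              # context attention output
        s_y_hat = [self.predictor(s_x, m) for m in target_masks]   # one prediction per block B_i
        return s_y_hat, s_y
```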
the target image patches. Given an input image $I$, we divide it into $N$ non-
overlapping patches, which are passed through the image encoder $f_I$ to obtain
corresponding representations $s_I = \{s_{I_1}, s_{I_2}, \ldots, s_{I_N}\}$, where $s_{I_k}$ is the repre-
sentation of the $k$-th patch. The paired text $T$ is divided into $L$ tokens,
which are passed through the target encoder $f_T$ to obtain corresponding repre-
sentations $s_T = \{s_{T_1}, s_{T_2}, \ldots, s_{T_L}\}$, where $s_{T_k}$ is the representation of the $k$-th
token of the text. Finally, they go through the cross-attention to generate
the final target representations $s_y = \tilde{X}(s_T, s_I)$. To obtain the targets for the
loss, we sample $M$ blocks from those representations, which include
the patches that need to be predicted. We denote the mask corresponding to
the $i$-th block by $B_i$, and its patch-level representation by $s_y(i) = \{s_{y_j}\}_{j \in B_i}$.
target mask $B_i$, the predictor takes as input the attention output $s_x$ and a set of
mask tokens $\{m_j\}_{j \in B_i}$ for each patch that needs to be predicted. The predictor
then outputs a prediction $\{\hat{s}_{y_j}\}_{j \in B_i} = g_\phi(s_x, \{m_j\}_{j \in B_i})$. The mask tokens are
parameterized by a shared learnable vector with added positional encoding.
We obtain the final predictions $\hat{s}_y(1), \hat{s}_y(2), \ldots, \hat{s}_y(M)$ by applying the pre-
dictor $M$ times, each time relying on the mask tokens for the corresponding
target-block locations.
$$L_P = \frac{1}{M} \sum_{i=1}^{M} D\big(\hat{s}_y(i), s_y(i)\big) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j \in B_i} \big\| \hat{s}_{y_j} - s_{y_j} \big\|_2^2$$
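The loss can be restated directly in code; the sketch below assumes the predictions and targets for each block are given as (|B_i|, d) tensors.

```python
# Average over the M target blocks of the summed squared L2 patch distances.
import torch

def prediction_loss(pred_blocks, target_blocks) -> torch.Tensor:
    m = len(pred_blocks)
    total = sum(((p - t) ** 2).sum() for p, t in zip(pred_blocks, target_blocks))
    return total / m
```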
4 Training TI-JEPA
4.1 Dataset
All experiments were conducted on two NVIDIA GeForce GTX 1080 Ti GPUs.
The training process spanned 300 epochs, with the largest model requiring
approximately 188 h on our hardware setup.
1 Checkpoint available here: https://github.com/facebookresearch/ijepa?tab=readme-ov-file#pretrained-models.
Predictor Module: The predictor is inherited from the original I-JEPA pre-
dictor, which is a shallow vision transformer with a depth of 12 layers and 12
attention heads per layer.
Parameter Value
Batch size 1024
– One of the major challenges in JEPA models is energy collapse, where the
model converges to a state where multiple inputs result in similar out-
puts, significantly reducing representational diversity. By freezing the pre-
trained encoder-decoder modules, we ensure that these components retain
their capacity to extract diverse and meaningful features from text and image
inputs, thus mitigating this degenerative phenomenon.
– The encoder-decoder components used in our framework are pretrained on
extensive datasets, providing a rich set of learned representations. By freez-
ing these layers, we effectively reuse the knowledge encapsulated in the pre-
trained weights, allowing us to focus computational resources on optimizing
the cross-attention modules. This approach not only enhances the scalability
and stability of our model but also ensures that the pretrained components
contribute effectively to overall performance without being disrupted by fur-
ther training, as sketched below.
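A minimal sketch of this freezing step, reusing the (assumed) attribute names from the earlier architecture sketch:

```python
# Freeze the pretrained encoders; only cross-attention blocks and predictor remain trainable.
def freeze_pretrained(model):
    for module in (model.f_img, model.f_txt):
        for p in module.parameters():
            p.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]   # parameters left to optimize
```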
collected from Twitter and annotated for multimodal sentiment analysis. How-
ever, we conducted a preprocessing step to remove emotionally inconsistent sam-
ples, where the sentiment labels of the image and text conflicted.
For the MVSA-Single dataset, we processed the data by first addressing
instances where both the text and image labels were identical, which we retained
as trivial cases. We removed any image-text pairs where one label was positive
and the other was negative, considering such contradictions unreliable. For cases
where one component (either text or image) had a neutral label and the other a
positive or negative label, we assigned the final label based on the non-neutral
component. This ensured consistency between the text and image sentiment
annotations. The MVSA-Multi dataset, however, contains sentiment annotations
from three annotators, which required a majority voting approach for determin-
ing the final sentiment labels for both the text and image components. For each
pair, we calculated the majority sentiment for both the text and image anno-
tations. In cases where the annotations were perfectly balanced, such as one
annotator labeling the text as neutral, another as positive, and the third as neg-
ative, we considered the label ambiguous and removed the pair from the dataset.
This approach ensured that only clear sentiment pairs were retained for further
analysis.
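The label-resolution rules above can be summarized as follows; the label strings and function names are illustrative assumptions, not the authors' released preprocessing code.

```python
# Single-pair rule (MVSA-Single) and three-annotator majority vote (MVSA-Multi).
from collections import Counter
from typing import Optional

def resolve_single(text_label: str, image_label: str) -> Optional[str]:
    if text_label == image_label:
        return text_label                             # trivial agreement
    if "neutral" in (text_label, image_label):
        return text_label if image_label == "neutral" else image_label
    return None                                       # positive vs. negative -> drop pair

def majority_vote(labels: list[str]) -> Optional[str]:
    label, count = Counter(labels).most_common(1)[0]
    return label if count > 1 else None               # perfectly balanced -> ambiguous, drop
```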
After preprocessing, the MVSA-Single dataset was reduced to 4,511 pairs,
and the MVSA-Multi dataset to 17,027 pairs. The number of records for each
dataset is shown in Table 3.
We then split the revised datasets into training, validation, and test sets with
a ratio of 8:1:1. To adapt our model for this task, we fine-tuned the pre-
trained TI-JEPA by adding a classification head consisting of a simple linear
layer on top. The classification head was trained for 40 epochs using the Adam
optimizer with a learning rate of 0.001 and cross-entropy as the loss function.
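A minimal sketch of this fine-tuning setup is shown below, assuming a hypothetical `extract_features` callable that returns fixed-size TI-JEPA features for an image-text pair.

```python
# Linear classification head on top of TI-JEPA features (Adam, lr=0.001, cross-entropy).
import torch
import torch.nn as nn

def finetune_head(extract_features, train_loader, feature_dim: int, num_classes: int = 3):
    head = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.Adam(head.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()
    for _ in range(40):                               # 40 epochs
        for images, texts, labels in train_loader:
            feats = extract_features(images, texts)   # pretrained TI-JEPA representations
            loss = criterion(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```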
In our study, we compared the proposed model with several benchmark mod-
els, evaluating their accuracy and F1-score. Traditional models like SentiBank
and SentiStrength [2] rely on statistical feature extraction and struggle to cap-
ture intrinsic multimodal features, leading to relatively low performance. On the
other hand, CNNMulti [24] processes text and image features separately using
two distinct CNNs, leveraging deep learning’s capacity to capture emotional
expressiveness and improving prediction by merging these features.
The DNN-LR model [6] employs transfer learning with pretrained models and
utilizes logistic regression for decision making. The CoMemory model [29] intro-
duces a multimodal fusion mechanism, which enhances the interaction between
text and image features, improving sentiment prediction. The MVAN model [30]
applies a memory network on top of a multi-view attention mechanism, enabling
richer semantic interactions between image and text and achieving better results.
Moreover, the CLMLF model [9] utilizes contrastive learning to enhance the
representation of multimodal features, fostering stronger associations between
image and text inputs, thereby improving model performance. Besides, the ITIN
model [33] implements cross-modal alignment operations along with an adaptive
fusion module, leading to substantial gains in accuracy for sentiment analysis
tasks. And lastly, the CLIP-CA-CG model [13] utilizes pre-trained RoBERTa
and ResNet50 models to extract visual and textual features, which are further
processed through CLIP contrastive learning to acquire deeper-level features.
We compared three configurations of our proposed TI-JEPA model (Small,
Medium, and Large) against the baselines mentioned above. Table 4 presents the compar-
ative results, demonstrating the performance of each configuration of TI-JEPA
across both the MVSA-Single and MVSA-Multi datasets.
7 Conclusion
In this paper, we introduced TI-JEPA, a novel energy-based model for text-
image alignment in multimodal fusion. Our approach addresses the challenge
of bridging the semantic gap between visual and textual modalities, offering a
flexible framework for various multimodal tasks. The success of TI-JEPA can be
attributed to its joint embedding space and predictive architecture, enabling the
model to learn robust and generalizable representations.
References
1. Assran, M., et al.: Self-supervised learning from images with a joint-embedding
predictive architecture. arXiv: 2301.08243 [cs.CV] (2023)
2. Borth, D., et al.: SentiBank: large-scale ontology and classifiers for detecting senti-
ment and emotions in visual content. In: Proceedings of the 21st ACM International
Conference on Multimedia. MM ’13. Barcelona, Spain: Association for Comput-
ing Machinery, pp. 459–460 (2013). isbn: 9781450324045. https://doi.org/10.1145/
2502081.2502268
3. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Burges,
C.J., et al. (eds.) Advances in Neural Information Processing Systems, vol. 26.
Curran Associates, Inc. (2013)
4. Jia, C., et al.: Scaling up visual and vision-language representation learning with
noisy text supervision. arXiv:2102.05918 (2021). https://api.semanticscholar.org/
CorpusID:231879586
5. Korbar, B., Tran, D., Torresani, L.: Cooperative learning of audio and video models
from self-supervised synchronization. In: Bengio, S., et al. (eds.) Advances in Neural
Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
6. Krishna, R., et al.: Visual genome: connecting language and vision using crowd-
sourced dense image annotations. Int. J. Comput. Vision 123(1), 32–73 (2017).
https://doi.org/10.1007/s11263-016-0981-7
7. LeCun, Y., et al.: A tutorial on energy-based learning (2006)
8. Li, W., et al.: UNIMO: towards unified-modal understanding and generation via
cross-modal contrastive learning. In: Zong, C., et al (eds.) Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long
Papers). Association for Computational Linguistics, pp. 2592–2607 (2021). https://
doi.org/10.18653/v1/2021.acl-long.202, https://aclanthology.org/2021.acllong.202
9. Li, Z., et al.: CLMLF: a contrastive learning and multi-layer fusion method for
multimodal sentiment detection. In: Findings of the Association for Computational
Linguistics: NAACL 2022. Association for Computational Linguistics (2022)
10. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: European Con-
ference on Computer Vision (2014). https://api.semanticscholar.org/CorpusID:
14113767
11. Liu, Z., et al.: Efficient low-rank multimodal fusion with modality- specific fac-
tors. In: Annual Meeting of the Association for Computational Linguistics (2018).
https://api.semanticscholar.org/CorpusID:44131945
12. Lu, J., et al.: ViLBERT: pretraining task-agnostic visiolinguistic representations
for vision-and-language tasks. In: Wallach, H., et al. (eds.) Advances in Neural
Information Processing Systems, vol. 32. Curran Associates, Inc. (2019)
13. Lu, X., Ni, Y., Ding, Z.: Cross-modal sentiment analysis based on CLIP image-text
attention interaction. Int. J. Adv. Comput. Sci. Appl. 15(2) (2024). https://doi.
org/10.14569/IJACSA.2024.0150290
14. Nguyen, C.-D., et al.: Expand BERT representation with visual information via
grounded language learning with multimodal partial alignment. In: Proceedings of
the 31st ACM International Conference on Multimedia, pp. 5665–5673 (2023)
15. Nguyen, C.-D., et al.: Improving multimodal sentiment analysis: supervised angular
margin-based contrastive learning for enhanced fusion representation. In: Findings
of the Association for Computational Linguistics: EMNLP 2023, pp. 14714–14724
(2023)
16. Nguyen, C.-D., et al.: KDMCSE: knowledge distillation multimodal sentence
embeddings with adaptive angular margin contrastive learning. In: North Ameri-
can Chapter of the Association for Computational Linguistics (2024). https://api.
semanticscholar.org/CorpusID:268691429
17. Nguyen, T., et al.: Adaptive contrastive learning on multimodal transformer for
review helpfulness prediction. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.)
Proceedings of the 2022 Conference on Empirical Methods in Natural Language
Processing. Abu Dhabi, United Arab Emirates: Association for Computational Lin-
guistics, pp. 10085–10096 (2022). https://doi.org/10.18653/v1/2022.emnlp-main.
686, https://aclanthology.org/2022.emnlp-main.686
18. Nguyen, T., et al.: DemaFormer: damped exponential moving average transformer
with energy-based modeling for temporal language grounding. In: Findings of the
Association for Computational Linguistics: EMNLP 2023, pp. 3635–3649 (2023)
19. Nguyen, T., et al.: Video-language understanding: a survey from model architec-
ture, model training, and data perspectives. In: Ku, L.-W., Martins, A., Srikumar,
V. (eds.) Findings of the Association for Computational Linguistics ACL 2024.
Bangkok, Thailand and virtual meeting: Association for Computational Linguis-
tics, pp. 3636–3657 (2024). https://aclanthology.org/2024.findings-acl.217
20. Nguyen, T., et al.: Vision-and-language pretraining. arXiv preprint
arXiv:2207.01772 (2022)
21. Nguyen, T.T., et al.: Encoding and controlling global semantics for long-form video
question answering. arXiv preprint arXiv:2405.19723 (2024)
22. Nguyen, T.T., et al.: Topic modeling as multi-objective contrastive optimization.
In: The Twelfth International Conference on Learning Representations (2024).
https://openreview.net/forum?id=HdAoLSBYXj
23. Ou, Z.: Energy-based models with applications to speech and language processing.
Found. Trends Signal Process. 18, 1–199 (2024). https://api.semanticscholar.org/
CorpusID:268459305
24. Ouyang, X., et al.: Sentiment analysis using convolutional neural network. In:
2015 IEEE International Conference on Computer and Information Technology;
Ubiquitous Computing and Communications; Dependable, Autonomic and Secure
Computing; Pervasive Intelligence and Computing, pp. 2359–2364 (2015). https://
doi.org/10.1109/CIT/IUCC/DASC/PICOM.2015.349.
25. Radford, A., et al.: Learning transferable visual models from natural language
supervision. arXiv: 2103.00020 [cs.CV] (2021)
26. Siebert, T., et al.: Multi-modal fusion transformer for visual question answering in
remote sensing (2022)
27. Wang, H., Ren, C., Yu, Z.: Multimodal sentiment analysis based on cross-instance
graph neural networks. Appl. Intell. 54(4), 3403–3416 (2024). https://doi.org/10.
1007/s10489-024-05309-0
28. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual
attention. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Con-
ference on Machine Learning, vol. 37. Proceedings of Machine Learning Research,
pp. 2048–2057. PMLR, Lille, France (2015). https://proceedings.mlr.press/v37/
xuc15.html
29. Xu, N., Mao, W., Chen, G.: A co-memory network for multimodal sentiment anal-
ysis. In: SIGIR ’18, The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval, pp. 929–932. Association for Computing
Machinery, Ann Arbor, MI, USA (2018). https://doi.org/10.1145/3209978.3210093
30. Yang, X., et al.: Image-text multimodal emotion classification via multi-view atten-
tional network. IEEE Trans. Multimedia 23, 4014–4026 (2021). https://doi.org/
10.1109/TMM.2020.3035277
31. Zhao, G., Li, Y., Xu, Q.: From emotion AI to cognitive AI. Int. J. Network Dyn.
Intell. 1(1), 65–72 (2022). https://doi.org/10.53941/ijndi0101006, https://www.
sciltp.com/journals/ijndi/article/view/115
32. Zhao, J., et al.: Cognitive psychology-based artificial intelligence review. Front.
Neuroscience 16, 1024316 (2022). https://doi.org/10.3389/fnins.2022.1024316
33. Zhu, T., et al.: Multimodal sentiment analysis with image-text interaction network.
Trans. Multi. 25, 3375–3385 (2023). https://doi.org/10.1109/TMM.2022.3160060
A Lightweight End-to-End Multi-task
Learning System for Vietnamese Speaker
Verification
Mai Hoang Dao1,2, Son Thai Nguyen2, Duy Minh Le2, Cong Tran2(B), and Cuong Pham1,2
1 VinAI Research, Hanoi, Vietnam
[email protected]
2 Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
{congtt,cuongpv}@ptit.edu.vn
ASV in the language. In this study, we aim to address this issue by developing
a meticulously designed Vietnamese dataset suitable for novel AI model-based
tasks that are currently receiving widespread attention from the research com-
munity worldwide.
Recent deep learning models for ASV have a significant number of param-
eters and require substantial computational resources to perform accurately
[1, 3, 11, 13, 22, 29]. Deep neural networks have been widely used for ASV, either
as standalone models [6–8, 18, 32] or as feature extractors for other classifiers
[5]. Moreover, previous studies on ASV tend to explore large pretrained speech
representation models [4, 25, 30]. Although these models achieve outstanding
performance, their massive size and long inference time make them difficult to
deploy on low-capacity IoT devices. Furthermore, current ASV systems focus on
specific tasks, whereas real-world applications require the ability to handle
multiple tasks simultaneously. Using a single model for each task significantly
increases the number of operations the hardware must perform, which introduces
latency and degrades the user experience.
As a result, developing compact and fast multi-task learning models that can
be embedded in low-capacity devices is crucial for ASV applications. Despite
previous efforts, no dataset or prior models address multi-task learning for Viet-
namese speaker verification. Additionally, most existing deep learning models
are either exclusive to a single task or have an excessive number of parameters,
while our goal is to build a comprehensive and lightweight system suitable for
low-capacity devices. In this study, we aim to address these challenges by devel-
oping multi-task learning models that are both compact and fast for Vietnamese
speaker verification.
Motivated by all of the above, we push this research field forward by
introducing a new dataset and a lightweight model. Our dataset includes 6480
audio-text label pairs from 162 individuals and 65 types of AI-synthesized
voices, covering three key ASV tasks. Our Vi-LMM model is a lightweight
multi-task model that incorporates an attention layer to integrate information
between tasks. We have reduced the number of parameters of Vi-LMM by 3.5 times
using recent advances in knowledge distillation. Our contributions to this field
are summarized as follows:
– We introduce the first public Vietnamese dataset for training three sub-tasks
of ASV, namely command detection, fake voice recognition, and speaker
verification;
– We propose two lightweight models, termed Vi-LMM and Vi-LMM-S, for joint
learning of the three tasks;
– Experimental results on our dataset show that (i) while requiring a
significantly smaller number of parameters, our proposed models achieve
performance comparable to strong single-task baselines [6, 7, 11, 13, 18, 24],
and (ii) our joint learning method improves the overall performance of the model
on the three sub-tasks.
We publicly release our dataset and model implementations for research and
educational purposes.2 We hope that our dataset and model can serve as a starting
point for future Vietnamese speech processing research and applications.
2 Our Dataset
2.1 Multi-tasking Dataset
Our goal is to develop a comprehensive dataset that can be used to train a
multi-task learning model capable of performing three tasks: command detec-
tion, fake voice recognition, and speaker verification. The model should be able
to differentiate between authentic user speech and different types of
distractors, including AI-synthesized speech, non-command speech with similar
patterns, and noise from other speakers. To achieve this, we select two widely
used commands in IoT applications (“Turn on the camera” and “Close the door”)
and define four categories of speech: A) exact command, B) conversational speech
containing one command, C) speech with no command, and D) speech with words
similar to the command that should not be identified as a correct one. Examples
for each category are presented in Table 1. The annotated dataset is divided into
training, validation, and test sets in a 5/2/3 ratio, ensuring that the
distribution of utterance types and gender is well balanced across all subsets.
The dataset statistics are displayed in Table 2.
Table 1. Examples of the four speech categories and their English translations

Type | Example
A | Turn on the camera.
B | He has already come, turn on the camera.
C | It’s her scene, prepare the clothes for me.
D | The camera of this phone is so bad.
2 Our dataset and model will be released upon acceptance.
Revision. We conduct a manual quality check of each audio clip and its label file
to ensure consistency and remove samples that do not meet the criteria. For
the verification task, we label each performer and their corresponding audio
accordingly. For command detection, audios belonging to groups A and B are
labeled as True, while others are labeled as False. Additionally, all speeches
generated by HiFiGAN are labeled as AI synthesized speech for the fake voice
recognition task. The final Vietnamese dataset contains 6480 audios from 162
subjects and 65 different AI voices.
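For concreteness, the sketch below shows one way the three task labels described in this Revision step could be derived per sample; the field names (category, speaker_id, synthesized) are hypothetical placeholders and are not taken from the released data format.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    audio_path: str
    category: str      # one of "A", "B", "C", "D" (see Table 1)
    speaker_id: str    # performer identifier (or an AI-voice identifier)
    synthesized: bool  # True if the clip was generated by HiFiGAN

def make_labels(sample: Sample) -> dict:
    """Derive the three task labels used in this work from one annotated sample."""
    return {
        # Command detection: groups A and B contain a valid command (label True).
        "command": sample.category in ("A", "B"),
        # Fake voice recognition: HiFiGAN outputs are labeled as AI-synthesized speech.
        "fake_voice": sample.synthesized,
        # Speaker verification: each performer keeps their own identity label.
        "speaker": sample.speaker_id,
    }
```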
Table 2. Statistics of our Vietnamese dataset. (s) stands for “seconds” and (t) for “tokens”.
Statistic A B C D Total
# audios 810 810 810 810 3240
# subjects 162 162 162 162 162
Minimum length (s) 0.75 1.88 1.54 0.96 0.75
Maximum length (s) 5.63 10.36 10.08 6.76 10.36
Average length (s) 2.38 4.43 4.00 2.79 3.38
Minimum length (t) 3 7 6 3 3
Maximum length (t) 3 37 29 23 37
Average length (t) 3.91 11.64 11.29 6.02 8.22
# AI synthetic audios 810 810 810 810 3240
Total duration (s) 5238 9602 9552 4572 28964
2.3 Discussion
attention layer to explicitly feed the information from the fake voice detection
task to the two remaining tasks, namely command detection and speaker
verification. Specifically, the cross-task attention layer takes $\{h_f, h_c, h_r\}$
as input and produces task-specific cross-task attention weights that quantify the
influence of one task on another. Formally, to incorporate the information from
fake voice detection into speaker verification, the layer first creates a cross
information-concentrated vector $x_{fr} \in \mathbb{R}^{3D_{st}}$ by multiplying a
weight matrix $W_{fr}$ with the feature vector $h_f$. Next, we compute the
attention weight $\lambda_{fr}$ between the two tasks and concatenate the weighted
vector with the original $h_r$ vector as follows:

$$h'_r = (\lambda_{fr}\, x_{fr}) \oplus h_r,$$

where $\oplus$ denotes concatenation. The layer produces $h'_c$ in a similar
manner to integrate useful information from fake voice recognition into the
command detection task. The vectors $h'_r$, $h'_c$, and $h_f$ are then passed into
fully-connected (FC) layers, where the output of each FC layer is the predicted
label for the corresponding task.
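A minimal PyTorch sketch of the cross-task attention layer described above. The feature dimension d, the binary output heads, and treating λ_fr and λ_fc as learnable scalars are our assumptions; the paper's exact layer sizes and the way the attention weights are computed may differ.

```python
import torch
import torch.nn as nn

class CrossTaskAttention(nn.Module):
    def __init__(self, d: int, n_speakers: int):
        super().__init__()
        self.w_fr = nn.Linear(d, 3 * d, bias=False)   # W_fr: fake voice -> speaker verification
        self.w_fc = nn.Linear(d, 3 * d, bias=False)   # W_fc: fake voice -> command detection
        self.lam_fr = nn.Parameter(torch.ones(1))     # attention weight lambda_fr (assumed scalar)
        self.lam_fc = nn.Parameter(torch.ones(1))     # attention weight lambda_fc (assumed scalar)
        self.fc_f = nn.Linear(d, 2)                   # fake voice recognition head
        self.fc_c = nn.Linear(4 * d, 2)               # command detection head on h'_c
        self.fc_r = nn.Linear(4 * d, n_speakers)      # speaker verification head on h'_r

    def forward(self, h_f, h_c, h_r):
        x_fr = self.w_fr(h_f)                                   # cross vector x_fr = W_fr h_f
        x_fc = self.w_fc(h_f)                                   # cross vector x_fc = W_fc h_f
        h_r_new = torch.cat([self.lam_fr * x_fr, h_r], dim=-1)  # h'_r = (lambda_fr x_fr) concat h_r
        h_c_new = torch.cat([self.lam_fc * x_fc, h_c], dim=-1)  # h'_c = (lambda_fc x_fc) concat h_c
        return self.fc_c(h_c_new), self.fc_f(h_f), self.fc_r(h_r_new)
```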
where $\mathcal{L}_C$, $\mathcal{L}_F$, and $\mathcal{L}_R$ are cross-entropy
losses computed on the labels of command detection, fake voice recognition, and
speaker verification, respectively. The loss coefficients $\alpha$ and $\beta$ are
tuned during training to find the optimal loss function.
where $\mathcal{L}_{ST}$ is defined analogously to (1) and $\mathcal{L}_{DI}$ is
the weighted sum of cross-entropy losses computed on the soft labels from the
teacher model and the soft predictions from the student model.
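Since the display equations are not reproduced above, the following is only a hedged sketch of how the joint objective and the distillation objective could be implemented: the simple weighted sum L = L_F + α·L_C + β·L_R and the Hinton-style soft-label term for L_DI are our assumptions, not the paper's exact formulas.

```python
import torch.nn.functional as F

def multitask_loss(logits_c, logits_f, logits_r, y_c, y_f, y_r, alpha=1.0, beta=1.0):
    # Assumed form of Eq. (1): L = L_F + alpha * L_C + beta * L_R.
    l_c = F.cross_entropy(logits_c, y_c)   # command detection
    l_f = F.cross_entropy(logits_f, y_f)   # fake voice recognition
    l_r = F.cross_entropy(logits_r, y_r)   # speaker verification
    return l_f + alpha * l_c + beta * l_r

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # An L_DI-style term: cross-entropy between teacher soft labels and student soft predictions.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean() * temperature ** 2
```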
4 Experiment
We conduct experiments on our dataset to provide a quantitative comparison
between Vi-LMM, Vi-LMM-S, and recent strong methods in terms of performance,
model size, and inference time.
We compare our proposed models to five strong baselines across various domains,
including:
– Rawnet2 [24] & Rawnet3 [13]: end-to-end DNN classifiers for raw-waveform
speaker recognition.
– GFCC-ResNet101 [18]: a recent deep model designed for the Vietnamese speaker
authentication problem.
– FastAudio [7]: an end-to-end framework for the audio classification problem.
– AASIST [11]: the current state-of-the-art model on the ASVspoof 2019 LA
dataset.
– AutoSpeech [6]: a CNN architecture derived via neural architecture search for
speaker verification on the VoxCeleb1 dataset.
Note that our approach does not use large pre-trained models such as wav2vec 2.0
[2] or HuBERT [10] as encoders; therefore, models that rely on such pre-trained
encoders are not directly comparable to our system. In addition, we perform an
ablation study by removing the cross-task attention layer, yielding a model
termed Vi-LMM-C, i.e., the feature vectors are fed directly into the classifiers
after passing through the task-specific MGOs.
Table 3. Results on the test set. “Command Dec.”, “Fake Voice Rec.”, and “Speaker
Ver.” denote command detection, fake voice recognition, and speaker verification,
respectively. “Avg-EER” and “Avg-F1” denote average EER and average F1,
respectively. Here, Vi-LMM-S is the compact variant of Vi-LMM and Vi-LMM-C is
Vi-LMM without the cross-task attention layer.
Model | # Parameters | Inference Time | Command Dec. F1 | Fake Voice Rec. EER | Fake Voice Rec. F1 | Speaker Ver. EER | Speaker Ver. F1 | Avg-EER | Avg-F1
Rawnet2 | 40.14 M | 135 ms | 90.72 | 15.27 | 79.26 | 19.83 | 73.18 | 17.55 | 81.05
Rawnet3 | 52.38 M | 223 ms | 91.82 | 4.57 | 90.51 | 13.82 | 79.08 | 9.19 | 87.14
GFCC-ResNet101 | 128.4 M | 630 ms | 95.81 | 8.36 | 87.31 | 15.32 | 78.54 | 11.84 | 87.22
FastAudio | 40.2 M | 150 ms | 91.32 | 5.26 | 89.45 | 14.03 | 78.92 | 9.65 | 85.56
AASIST | 41.4 M | 174 ms | 92.19 | 4.06 | 90.72 | 13.65 | 79.12 | 8.86 | 87.34
AutoSpeech | 54 M | 267 ms | 93.27 | 7.92 | 87.65 | 15.76 | 78.24 | 11.84 | 86.39
Vi-LMM | 14 M | 64 ms | 93.58 | 4.58 | 90.45 | 13.87 | 79.43 | 9.22 | 87.82
Vi-LMM-S | 4 M | 46 ms | 91.82 | 5.86 | 88.97 | 16.52 | 77.63 | 11.19 | 86.14
Vi-LMM-C | 14 M | 60 ms | 93.21 | 4.87 | 89.95 | 14.03 | 78.96 | 9.45 | 87.37
Table 3 reports the performance of the chosen baselines and our system. It is
worth noting that each baseline is trained separately for each task; thus, for a
fair comparison, the number of parameters of each single-task model reported in
Table 3 is tripled compared to that of the original study.
In general, our findings indicate that both Vi-LMM and Vi-LMM-S demonstrate
competitive performance compared to the strong baselines while enjoying
significantly lower time and space complexity. Notably, Vi-LMM outperforms all
other methods with the highest average F1-score of 87.82%. In terms of average
EER, Vi-LMM is the third-best performer, following AASIST and Rawnet3. It is
noteworthy that Vi-LMM requires only 14 million parameters, whereas Rawnet2, the
second-smallest method, requires 40.14 million parameters.
Our system performs comparably to other models in terms of individual task
performance. For command detection, Vi-LMM achieves an F1-score of 93.58%, close
to that of the highest-performing model, GFCC-ResNet101, which has approximately
nine times more parameters. For fake voice recognition, Vi-LMM's performance is
comparable to that of AASIST, the highest-performing model, in terms of both EER
and F1-score, despite having significantly fewer parameters. For speaker
verification, AASIST has the best EER, but Vi-LMM achieves the highest F1-score
of 79.43%, showing the effectiveness of feeding information from the fake voice
detection task.
To highlight the speed advantage of Vi-LMM, we also report the inference time of
each model. It should be noted that the other models require three runs to obtain
outputs for a single data sample, whereas our model requires only one. Our results
indicate that Vi-LMM has the fastest inference time, taking only 64 ms, while
GFCC-ResNet101 and AASIST take 630 ms and 174 ms, respectively.
5 Conclusions
In this study, we introduced the first public dataset for Vietnamese speaker
verification, which comprises three sub-tasks: command detection, fake voice
recognition, and speaker verification. In addition, we proposed two simple yet
effective models, Vi-LMM and Vi-LMM-S, for jointly learning the three tasks.
Particularly, Vi-LMM extends AASIST by integrating three task-specific MGO
branches and a cross-task attention layer, while Vi-LMM-S employs knowledge
distillation techniques and has only 4 million parameters. The experimental eval-
uation shows that both models surpass most of the strong methods in terms of
Average-.F1 while using significantly fewer parameters. Furthermore, we verified
that joint learning of the three sub-tasks via a cross-task attention layer is ben-
eficial to enhance the performance of all the tasks. We hope that our dataset
and model can serve as a starting point for future Vietnamese speech processing
research and applications.
Acknowledgments. This work was supported by the research project coded DT.18/24,
funded by the Ministry of Information and Communication, 2024.
Bibliography
1. Aravind, P., Nechiyil, U., Paramparambath, N., et al.: Audio spoofing verifica-
tion using deep convolutional neural networks by transfer learning. arXiv preprint
arXiv:2008.03464 (2020)
2. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: a framework for self-
supervised learning of speech representations. CoRR (2020)
3. Bai, Z., Zhang, X.L.: Speaker recognition based on deep learning: an overview.
Neural Networks (2021)
4. Chen, Z., et al.: Large-scale self-supervised speech representation learning for auto-
matic speaker verification. In: Proceedings of 2022 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP) (2022)
5. Chen, Z., Xie, Z., Zhang, W., Xu, X.: Resnet and model fusion for automatic spoof-
ing detection. In: Proceedings of the 18th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp. 102–106 (2017)
6. Ding, S., Chen, T., Gong, X., Zha, W., Wang, Z.: Autospeech: neural architecture
search for speaker recognition. In: Proceedings of the 21st Annual Conference of the
International Speech Communication Association (INTERSPEECH), pp. 916–920
(2020)
7. Fu, Q., Teng, Z., White, J., Powell, M., Schmidt, D.C.: Fastaudio: a learnable audio
front-end for spoof speech detection. arXiv preprint arXiv:2109.02774 (2021)
8. Ge, Z., Iyer, A.N., Cheluvaraja, S., Sundaram, R., Ganapathiraju, A.: Neural net-
work based speaker classification and verification systems with enhanced features.
In: Proceedings of 2017 Intelligent Systems Conference (IntelliSys), pp. 1089–1094
(2017)
9. Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge in a neural net-
work. arXiv preprint arXiv:1503.02531 (2015)
10. Hsu, W., Bolte, B., Tsai, Y.H., Lakhotia, K., Salakhutdinov, R., Mohamed, A.:
Hubert: self-supervised speech representation learning by masked prediction of
hidden units. CoRR (2021)
11. Jung, J., et al.: Aasist: audio anti-spoofing using integrated spectro-temporal graph
attention networks. arXiv preprint arXiv:2110.01200 (2021)
12. Jung, J.W., Heo, H.S., Yu, H.J., Chung, J.S.: Graph attention networks for speaker
verification. In: ICASSP 2021-2021 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 6149–6153 (2021)
13. Jung, J.w., Kim, Y.J., Heo, H.S., Lee, B.J., Kwon, Y., Chung, J.S.: Pushing the lim-
its of raw waveform speaker recognition. In: Proceedings of the 23rd Annual Con-
ference of the International Speech Communication Association (INTERSPEECH)
(2022)
14. Kong, J., Kim, J., Bae, J.: Hifi-GAN: generative adversarial networks for efficient
and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646 (2020)
15. van Leeuwen, D.A.: Speaker verification systems and security considerations. In:
Proceedings of 8th European Conference on Speech Communication and Technol-
ogy (Eurospeech), pp. 1661–1664 (2003)
16. Nagrani, A., Chung, J.S., Zisserman, A.: Voxceleb: a large-scale speaker identifica-
tion dataset. In: Proceedings of the 18th Annual Conference of the International
Speech Communication Association (INTERSPEECH), pp. 2616–2620 (2017)
17. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: Ssdmnv2:
a real time DNN-based face mask detection system using single shot multibox
detector and mobilenetv2. Sustain. Cities Soc. (2021)
18. Nguyen, S.T., Lai, V.D., Dam-Ba, Q., Nguyen-Xuan, A., Pham, C.: Vietnamese
speaker authentication using deep models. In: Proceedings of the International
Symposium on Information and Communication Technology, pp. 177–184 (2018)
19. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: in-
verted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
20. Saquib, Z., Salam, N., Nair, R.P., Pandey, N., Joshi, A.: A survey on automatic
speaker recognition systems. In: Communications in Computer and Information
Science, pp. 134–145 (2010)
21. Sukhavasi, M., Adapa, S.: Music theme recognition using CNN and self-attention.
arXiv preprint arXiv:1911.07041 (2019)
22. Tak, H., Jung, J.w., Patino, J., Kamble, M., Todisco, M., Evans, N.: End-to-end
spectro-temporal graph attention networks for speaker verification anti-spoofing
and speech deepfake detection. In: Proceedings of 2021 Edition of the Automatic
Speaker Verification and Spoofing Countermeasures Challenge (2021)
23. Tak, H., Jung, J.w., Patino, J., Todisco, M., Evans, N.: Graph attention networks
for anti-spoofing. arXiv preprint arXiv:2104.03654 (2021)
24. Tak, H., Patino, J., Todisco, M., Nautsch, A., Evans, N., Larcher, A.: End-to-end
anti-spoofing with rawnet2. In: Proceedings of 2021 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 6369–6373 (2021)
25. Tak, H., Todisco, M., Wang, X., Jung, J.w., Yamagishi, J., Evans, N.: Automatic
speaker verification spoofing and deepfake detection using wav2vec 2.0 and data
augmentation (2022)
26. Todisco, M., et al.: Asvspoof 2019: future horizons in spoofed and fake audio de-
tection. In: Proceedings of the 20th Annual Conference of the International Speech
Communication Association (INTERSPEECH), pp. 1008–1012 (2019)
27. Van, T.P., Quang, N.T.N., Thanh, T.M.: Deep learning approach for singer voice
classification of vietnamese popular music. In: Proceedings of the International
Symposium on Information and Communication Technology, pp. 255–260 (2019)
28. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph
attention networks. In: Proceedings of 6th International Conference on Learning
Representations (2018)
29. Wang, X., Yamagishi, J.: A comparative study on recent neural spoofing coun-
termeasures for synthetic speech detection. In: Proceedings of the 22nd Annual
Conference of the International Speech Communication Association (INTER-
SPEECH), pp. 4259–4263 (2021)
30. Wang, Y., Boumadane, A., Heba, A.: A fine-tuned wav2vec 2.0/hubert benchmark
for speech emotion recognition, speaker verification and spoken language under-
standing. CoRR (2021)
31. Yamagishi, J., et al.: ASVspoof 2021: accelerating progress in spoofed and deepfake
speech detection. In: Proceedings of Edition of ASVspoof, pp. 47–54 (2021)
32. Yang, J., Das, R.K., Zhou, N.: Extraction of octave spectra information for spoof-
ing attack detection. IEEE/ACM Trans. Audio Speech Lang. Process. 2373–2384
(2019)
33. Zhang, Z., Gu, Y., Yi, X., Zhao, X.: FMFCC-a: a challenging mandarin dataset
for synthetic speech detection. arXiv preprint arXiv:2110.09441 (2021)
Domain Generalization in Vietnamese
Dependency Parsing: A Novel Benchmark
and Domain Gap Analysis
1 Introduction
Fig. 1. A dependency tree following the NIIVTB DT-1 treebank [21] format
Dependency parsing plays a crucial role in the field of NLP research because
it provides syntactic information about language. A high-quality dependency
parsing system can be deployed to improve the performance of various down-
stream tasks, such as information extraction [6], named entity recognition [8,24],
question answering [3], machine translation [2,22], text summarization [23] and
multi-task learning [4].
With the advent of deep neural networks, current dependency parsing models
have achieved significantly high performance. For example, the Biaffine model
[9] reached 94.10% LAS1 on the Penn Treebank. However, parsing models still face
difficulties when there is a distribution difference between training and
evaluation data, known as the domain gap. Studies by Blodgett et al. [1] on
African-American English and Kanerva et al. [11] on Finnish have demonstrated
that the performance of parsing systems drops remarkably in domain generalization
setups, with the worst decrease being 24.90% LAS when the parser is evaluated on
the clinical domain [11].
Additionally, we have found that there are only a limited number of studies on
Vietnamese dependency parsing. To our knowledge, no published studies examine the
domain gap challenge in this task. Furthermore, the available Vietnamese
dependency treebanks are not designed to evaluate the effects of the domain gap,
despite this being a common issue with real-world data. As a result, it is
difficult to comprehensively assess a model's performance across diverse domains.
To accommodate further research on Vietnamese cross-domain dependency parsing, we
introduce DGDT (Vietnamese Domain Generalization Dependency Treebank), a
multi-domain Vietnamese dependency treebank. Moreover, we also release DGDTMark,
a benchmark suite for evaluating a dependency parser under different scenarios,
both with and without a domain gap, in several domain-generalization settings on
our novel treebank. In our
1 There are two standard evaluation metrics in dependency parsing: UAS (unlabeled
attachment score) and LAS (labeled attachment score).
benchmark suite, we also combine our treebank with previously released datasets
to widen the domain gap and comprehensively demonstrate its effects on Vietnamese
dependency parsing.
2 Related Works
2.1 Data Arrangement Methods
3 Our Treebank
Typically, there are several approaches to creating a dependency treebank: man-
ually annotating the dependency trees, using an automatic parser on raw text
followed by manual adjustment (semi-automatic parsing), or transforming an
existing constituency treebank to a dependency treebank.
Due to the high cost of manually annotating a dependency treebank, we
decided to use an automatic converter to transform an existing constituency
treebank into our dependency treebank. After carefully reviewing available con-
verters for Vietnamese, we select the converter released by Truong et al. [21]. This
converter employs a new dependency label set with novel Vietnamese-specific
dependency labels, making it more effective at capturing Vietnamese linguis-
tic characteristics. For example, previous works [15, 17] did not propose specific
dependency classes for Sino-Vietnamese words or classifier nouns, which are
commonly seen in Vietnamese text. Moreover, Truong et al.'s approach builds the
dependency relations based on both syntactic and semantic features rather than
relying solely on functional tags, as in the VnDT treebank [15].
With the aim of constructing a multi-domain treebank, we choose the second subset
(NIIVTB-2) of the NIIVTB constituency treebank [19], because this subset is
organized into 14 distinct topics crawled from the Thanh Nien5 newspaper. Since
the authors of this dataset withhold the raw text due to copyright, we had to
collect it from the newspaper's website and match it with the corresponding
annotations. This process encountered several challenges, such as defunct
hyperlinks and inconsistencies in the number of words between the raw text and
the constituency treebank. Additionally, we found some duplicates within the
dataset, which needed to be removed to guarantee data reliability. Consequently,
our treebank contains 9,765 sentences.
Converter Validation: To ensure the quality of our treebank, we build a gold
dataset of 1,245 sentences from the Law and Life of Youth topics of NIIVTB-2 by
manually correcting the automatically parsed trees from the converter. We then
compare the gold dataset with the initially parsed trees from the converter and
obtain 95.62% UAS and 89.49% LAS. These results are sufficiently high to
guarantee our treebank's quality (Fig. 2).
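As a reference for the two metrics, a small sketch of the UAS/LAS computation used in such comparisons is given below; the tree representation (a list of (head index, label) pairs per sentence) is our assumption, not the treebank's file format.

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of trees; each tree is a list of (head_index, label) per token."""
    assert len(gold) == len(pred)
    total = uas_hits = las_hits = 0
    for g_tree, p_tree in zip(gold, pred):
        for (g_head, g_label), (p_head, p_label) in zip(g_tree, p_tree):
            total += 1
            if g_head == p_head:
                uas_hits += 1              # correct head -> counts towards UAS
                if g_label == p_label:
                    las_hits += 1          # correct head AND label -> counts towards LAS
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

# Example usage: uas, las = attachment_scores(gold_trees, converter_trees)
```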
In NIIVTB-2, every topic is present in all three sets: train, dev, and test.
This setup makes it challenging to analyze the impact of the domain gap on
the dependency parsing task. To address this issue and support the domain
generalization task, in the initial setup of DGDT, we decided to treat each topic
as a domain and rearrange the dataset so that each domain appears exclusively
in one of the train, dev, or test sets. Moreover, we perform the domain allocation
step not only to maintain a suitable sentence ratio across these sets but also
5 https://thanhnien.vn.
to emphasize the difficulty of the test set. Table 1 describes the structure of
our treebank, which contains 9,765 dependency trees (245,006 tokens) in total
across 14 domains, along with individual statistics for each domain included in
the treebank.
Besides, as shown in Table 2, we can observe that the distributions over labels
in DGDT are imbalanced. In detail, labels like PUNCT (punctuation), NN (noun
compound modifier), OBJ (object), and PREP (prepositional modifier) appear
more frequently, while labels related to expressions such as VOCATIVE, SOUND
are found only in a small minority of cases in our treebank. This imbalance is
understandable because almost all input sentences end with punctuation, and
sentences may contain reported speech (using colons and quotation marks) or
include extra information within parentheses. Additionally, NN, OBJ, and PREP
are also common components in natural language and frequently appear in text.
Meanwhile, words expressing sound are relatively rare in general, particularly in
our treebank, where the content of its domains does not relate much to sound.
Moreover, both sound expressions and vocatives are more commonly found in
spoken language than in newspaper text. Although an imbalanced data distri-
bution is not ideal in machine learning, it is often unavoidable when collecting
real-world data.
In comparison, our dataset has some key differences. Although our dataset is
second only to the VnDT treebank in terms of sentence count, it has a larger
total number of tokens than all the aforementioned treebanks. Furthermore, our
treebank also contains a large volume of long sentences, with 1,087 sentences in
DGDT containing more than 40 tokens each. This quantity is 36% more than VnDT and
1,312% more than UD Vietnamese-VTB. Hence, this dataset can effectively examine
how models handle long-distance dependency relations, which pose a challenge for
parsers. Further distinctions, including sentence distribution by length, number
of domains, and number of tokens, between DGDT and other Vietnamese dependency
treebanks are shown in Table 3 and Fig. 3.
4 Experiments
4.1 Baseline Model
Introduced in 2016, the biaffine parser [9] quickly became a highly influential
model in the field of dependency parsing. Many of today's state-of-the-art
dependency parsing models are based on the biaffine approach, with customizations
involving the encoder, the number of MLP layers, and additional linguistic
features. Due to its straightforward deployment and high performance, we select
this model as the baseline for our experiments on DGDT. Specifically, we adopt
Lastly, the non-projective MST algorithm is applied to search for the
highest-scoring dependency tree from the obtained arc scores.
The implementation of Zhang et al. [25] (Supar) customizes the biaffine parser by
replacing POS tag embeddings with CharLSTM word representation vectors. In
6 https://github.com/yzhangcs/parser/.
Treebank | No. of labels | No. of domains | No. of sentences | No. of tokens | Manual annotation
BKTreebank | 26 | unknown | 6,909 | unknown | ✓
UD-VTB | 84 | 1 | 3,323 | 58,069 | ✗
VnDT | 33 | 1 | 10,197 | 218,749 | ✗
DGDT (ours) | 40 | 14 | 9,765 | 245,006 | ✗
addition, the first-order Eisner algorithm [10] is also used instead of the
non-projective MST algorithm.
With the main goal of handling the task in Vietnamese, we replace the BiLSTMs in
the encoder layer with two options: PhoBERT [16] or XLM-RoBERTa [5] (XLM-R).
PhoBERT is a pre-trained language model for Vietnamese based on RoBERTa [14], an
advanced version of BERT [7] with some modifications to the pre-training
procedure. XLM-R, on the other hand, is a multilingual masked language model
pre-trained on text in 100 languages, also derived from RoBERTa.
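To make the architecture concrete, the following is a minimal sketch of a biaffine arc scorer on top of a pre-trained encoder, illustrated here with PhoBERT loaded through Hugging Face transformers. The MLP size, scoring over encoder subword tokens rather than words, and the zero initialization are simplifying assumptions, not the exact Supar configuration; the resulting arc scores would still be decoded with Eisner's or the MST algorithm as described above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BiaffineArcScorer(nn.Module):
    def __init__(self, encoder_name="vinai/phobert-base", mlp_dim=500):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.mlp_head = nn.Sequential(nn.Linear(hidden, mlp_dim), nn.ReLU())  # head representation
        self.mlp_dep = nn.Sequential(nn.Linear(hidden, mlp_dim), nn.ReLU())   # dependent representation
        # Biaffine weight; the extra row accounts for a bias term on the dependent side.
        self.W = nn.Parameter(torch.zeros(mlp_dim + 1, mlp_dim))

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        head = self.mlp_head(h)                                               # (B, T, d)
        dep = self.mlp_dep(h)                                                 # (B, T, d)
        dep = torch.cat([dep, torch.ones_like(dep[..., :1])], dim=-1)         # append bias column
        # scores[b, i, j] = score of token j being the head of token i
        scores = dep @ self.W @ head.transpose(1, 2)
        return scores
```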
– Scenario 1 (in-domain): We split each domain into three subsets, train, dev,
and test, with an 8:1:1 ratio, and then merge the corresponding parts from
different domains to create the overall train, dev, and test sets (see the sketch
after this list). While this setup is commonly used in most dependency parsing
experiments, it is not specifically designed to expose the effects of the domain
gap. We implement it with the aim of evaluating the model's performance in the
absence of a domain gap.
– Scenario 2 (domain-k-fold): To examine how each domain in DGDT affects the
parser differently from the others, we adopt a k-fold evaluation. Each of the 14
domains is used in turn to assess the model, while the rest of the treebank is
merged into the train set. The average result over the folds is the overall
performance of the parser.
– Scenario 3 (domain-generalization): We assign each domain to appear exclu-
sively in one of the train, dev, or test sets to observe how the model handles
the effects of the domain gap. In this setup, we use Entertainment and Infor-
mation Technology domains to construct the dev set and merge Economic
and Life domains to organize the test set, leaving other domains to form the
train set, as in Table 1.
– Scenario 4 (dataset-generalization): We hypothesize that the domain gap can be
better demonstrated by using training data from a different source than the test
data. Hence, we set up this experiment as follows: we use the train set of the
NIIVTB DT-1 dependency treebank [21] for model training, then use the dev and
test sets of DGDT for model selection and evaluation, respectively. We choose
NIIVTB DT-1 because this treebank follows the same annotation guidelines as our
dataset, which makes the model evaluation process possible. Moreover, the data
for this treebank was derived from the Tuoi Tre newspaper, while our treebank is
built on Thanh Nien, which satisfies the requirement of data-source separation.
The difference in data source may result in variations in writing style. Besides,
the data for NIIVTB DT-1 was published in the early 2000s, whereas DGDT's data
was released at the beginning of the 2010s. As vocabulary expands continuously,
we believe that the passage of time can cause shifts in both writing style and
language diversity.
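As referenced in Scenario 1, the sketch below illustrates the per-domain 8:1:1 split followed by merging the per-domain parts; the variable names and the random shuffling are illustrative assumptions and are not taken from DGDTMark itself.

```python
import random

def in_domain_split(domains, seed=0):
    """domains: dict mapping a domain name to its list of sentences."""
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for name, sentences in domains.items():
        sents = sentences[:]
        rng.shuffle(sents)
        n = len(sents)
        n_train, n_dev = int(0.8 * n), int(0.1 * n)
        train += sents[:n_train]                      # 80% of each domain
        dev += sents[n_train:n_train + n_dev]         # 10% of each domain
        test += sents[n_train + n_dev:]               # remaining ~10%
    return train, dev, test
```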
4.3 Results
Table 4. Results in the DGDTMark benchmark suite via different scenarios
To better evaluate how the model performs in our main concern, the
domain-generalization scenario, we further analyze the results by label type and
sentence length. From Table 6, we can observe that the parser performs remarkably
well on frequent cases, such as labels related to subjects, objects, and
modifiers, and on unique cases such as punctuation, numbers, or determining the
root. In contrast, difficult labels, including CCOMP (clausal complement),
PARATAXIS, and CONJ (conjunct), have low accuracy because they represent
connections between clauses, or from a token to its complement clause, which
requires the model to choose not only a suitable dependent but also the correct
main word of the dependent clause. Moreover, clause-linking relations usually do
not depend on explicit word forms, which makes the task more difficult. On the
other hand, ambiguous cases, for example NN and NP ADVMOD (noun phrase as
adverbial modifier), which both represent noun modifiers but play different
semantic roles, cause a dramatic decrease (by 24.82% LAS) in parsing performance.
This statistic demonstrates that there is still room for improvement in capturing
deep linguistic meaning.
We also found that the effectiveness of the parser is strongly influenced by the
distance of relations. The results shown in Fig. 4 indicate that the more distant
the relation, the worse the parser performs at selecting the correct head for a
word. In our view, this is because short-distance relations are far more
frequent, which biases the model. However, the performance measured by LAS rises
remarkably, from 50% to nearly 90%, when handling long-distance relations. We
attribute this interesting increase to the capped number of label options the
parser considers when annotating such relations, because
it with from 50 to 250 sentences (in steps of 50 sentences) from the test set and
evaluate its performance on the remaining portion of the test set. We stopped
feeding at 250 sentences to avoid overshadowing the domain-gap setup.
As shown in Fig. 5, allocating sentences from the test set to the train set
enhances the model's performance. Moving 250 sentences increases UAS by
approximately 1%, while the parser shows only a slight improvement in LAS. From
our perspective, the improvement occurs because the model receives only a small
amount of knowledge from the test-set domains, leading to better but not
significantly better performance. Although domain adaptation can handle the
domain gap by lowering distribution differences between the source and target
domains, it relies on the assumption that target data is accessible, which is not
always the case in practice.
5 Conclusion
References
1. Blodgett, S.L., Wei, J., O’Connor, B.: Twitter universal dependency parsing for
African-American and mainstream American English. In: Gurevych, I., Miyao,
Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Com-
putational Linguistics (Volume 1: Long Papers), pp. 1415–1425. Association for
Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.
18653/v1/P18-1131, https://aclanthology.org/P18-1131
2. Bugliarello, E., Okazaki, N.: Enhancing machine translation with dependency-
aware self-attention. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.)
Proceedings of the 58th Annual Meeting of the Association for Computa-
tional Linguistics, pp. 1618–1627. Association for Computational Linguistics,
July 2020. https://doi.org/10.18653/v1/2020.acl-main.147, https://aclanthology.
org/2020.acl-main.147
3. Chen, C., Bunescu, R., Marling, C.: A semantic parsing pipeline for context-
dependent question answering over temporally structured data. Nat. Lang. Eng.
29(3), 769–793 (2023). https://doi.org/10.1017/S1351324921000292
4. Clark, K., Luong, M.T., Manning, C.D., Le, Q.: Semi-supervised sequence mod-
eling with cross-view training. In: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii,
J. (eds.) Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, pp. 1914–1925. Association for Computational Linguistics,
Brussels, Belgium, Oct-Nov 2018. https://doi.org/10.18653/v1/D18-1217, https://
aclanthology.org/D18-1217
5. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In:
Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451.
Association for Computational Linguistics, July 2020. https://doi.org/10.18653/
v1/2020.acl-main.747, https://aclanthology.org/2020.acl-main.747
6. Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Pro-
ceedings of the 42nd Annual Meeting of the Association for Computational Lin-
guistics (ACL-04), pp. 423–429. Barcelona, Spain, July 2004. https://doi.org/10.
3115/1218955.1219009, https://aclanthology.org/P04-1054
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding. In: Burstein, J., Doran,
C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North Ameri-
can Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for
Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/
10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
8. Dou, C., Sun, X., Wang, Y., Ji, Y., Ma, B., Li, X.: Domain-adapted dependency
parsing for cross-domain named entity recognition. In: Proceedings of the Thirty-
Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference
on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on
Educational Advances in Artificial Intelligence. AAAI’23/IAAI’23/EAAI’23, AAAI
Press (2023). https://doi.org/10.1609/aaai.v37i11.26498
9. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing
(2017). https://arxiv.org/abs/1611.01734
10. Eisner, J.: Bilexical grammars and their cubic-time parsing algorithms, pp. 29–
61. Springer Netherlands, Dordrecht (2000). https://doi.org/10.1007/978-94-015-
9470-7_3
11. Kanerva, J., Ginter, F.: Out-of-domain evaluation of Finnish dependency parsing.
In: Calzolari, N., et al. (eds.) Proceedings of the Thirteenth Language Resources
and Evaluation Conference, pp. 1114–1124. European Language Resources Asso-
ciation, Marseille, France, June 2022. https://aclanthology.org/2022.lrec-1.120
12. Li, Y., Li, Z., Zhang, M.: Semi-supervised domain adaptation for dependency pars-
ing via improved contextualized word representations. In: Scott, D., Bel, N., Zong,
C. (eds.) Proceedings of the 28th International Conference on Computational Lin-
guistics, pp. 3806–3817. International Committee on Computational Linguistics,
Barcelona, Spain, December 2020. https://doi.org/10.18653/v1/2020.coling-main.
338, https://aclanthology.org/2020.coling-main.338
13. Li, Z., Peng, X., Zhang, M., Wang, R., Si, L.: Semi-supervised domain adaptation
for dependency parsing. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceed-
ings of the 57th Annual Meeting of the Association for Computational Linguistics,
pp. 2386–2395. Association for Computational Linguistics, Florence, Italy, July
2019. https://doi.org/10.18653/v1/P19-1229, https://aclanthology.org/P19-1229
14. Liu, Y., et al.: Roberta: a robustly optimized bert pretraining approach (2019).
https://arxiv.org/abs/1907.11692
15. Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.T., Nguyen, M.L.: From
treebank conversion to automatic dependency parsing for vietnamese. In: Interna-
tional Conference on Applications of Natural Language to Data Bases/Information
Systems, pp. 196–207. Springer (2014)
16. Nguyen, D.Q., Tuan Nguyen, A.: PhoBERT: pre-trained language models for Viet-
namese. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Compu-
tational Linguistics: EMNLP 2020, pp. 1037–1042. Association for Computational
Linguistics, November 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.92,
https://aclanthology.org/2020.findings-emnlp.92
17. Nguyen, K.H.: BKTreebank: building a vietnamese dependency treebank. In: Cal-
zolari, N., et al. (eds.) Proceedings of the Eleventh International Conference on
Language Resources and Evaluation (LREC 2018). European Language Resources
Association (ELRA), Miyazaki, Japan, May 2018. https://aclanthology.org/L18-
1341
18. Nguyen, P.T., Vu, X.L., Nguyen, T.M.H., Nguyen, V.H., Le, H.P.: Building a large
syntactically-annotated corpus of Vietnamese. In: Stede, M., Huang, C.R., Ide, N.,
Meyers, A. (eds.) Proceedings of the Third Linguistic Annotation Workshop (LAW
III), pp. 182–185. Association for Computational Linguistics, Suntec, Singapore,
August 2009. https://aclanthology.org/W09-3035
19. Nguyen, Q.T., Miyao, Y., Le, H., Nguyen, N.: Ensuring annotation consistency
and accuracy for Vietnamese treebank. Lang. Resour. Eval. 52(1), 269–315 (2017).
https://doi.org/10.1007/s10579-017-9398-3
20. Sato, M., Manabe, H., Noji, H., Matsumoto, Y.: Adversarial training for cross-
domain universal dependency parsing. In: Hajič, J., Zeman, D. (eds.) Proceedings
of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Uni-
versal Dependencies, pp. 71–79. Association for Computational Linguistics, Van-
couver, Canada, August 2017. https://doi.org/10.18653/v1/K17-3007, https://
aclanthology.org/K17-3007
21. Truong, C.M., Pham, T.V., Phan, M.N., Le, N.D.T., Nguyen, T.V., Nguyen, Q.T.:
Converting a constituency treebank to dependency treebank for vietnamese. In:
2022 RIVF International Conference on Computing and Communication Tech-
nologies (RIVF), pp. 256–261 (2022). https://doi.org/10.1109/RIVF55975.2022.
10013806
22. Xu, P., Kang, J., Ringgaard, M., Och, F.: Using a dependency parser to improve
SMT for subject-object-verb languages. In: Ostendorf, M., Collins, M., Narayanan,
S., Oard, D.W., Vanderwende, L. (eds.) Proceedings of Human Language Technolo-
gies: The 2009 Annual Conference of the North American Chapter of the Associ-
ation for Computational Linguistics. pp. 245–253. Association for Computational
Linguistics, Boulder, Colorado, June 2009. https://aclanthology.org/N09-1028
23. Yoshida, Y., Suzuki, J., Hirao, T., Nagata, M.: Dependency-based discourse parser
for single-document summarization. In: Moschitti, A., Pang, B., Daelemans, W.
(eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pp. 1834–1839. Association for Computational
Linguistics, Doha, Qatar, October 2014. https://doi.org/10.3115/v1/D14-1196,
https://aclanthology.org/D14-1196
24. Yu, J., Bohnet, B., Poesio, M.: Named entity recognition as dependency parsing.
In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, pp. 6470–6476.
Association for Computational Linguistic, July 2020. https://doi.org/10.18653/
v1/2020.acl-main.577, https://aclanthology.org/2020.acl-main.577
25. Zhang, Y., Li, Z., Zhang, M.: Efficient second-order TreeCRF for neural depen-
dency parsing. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceed-
ings of the 58th Annual Meeting of the Association for Computational Linguistics,
pp. 3295–3305. Association for Computational Linguistics, July 2020. https://doi.
org/10.18653/v1/2020.acl-main.302, https://aclanthology.org/2020.acl-main.302
Distribution-Guided Object Counting
with Optimal Transport and DINO-Based
Density Refinement
1 Introduction
Object counting in images is a crucial task within the field of computer
vision, with extensive applications across various domains such as surveillance,
autonomous driving, wildlife monitoring, and retail analytics. Despite signifi-
cant advancements in these areas, accurately counting unseen object categories
(those not present in the training data) remains a major challenge. Current
regression-based techniques, as highlighted in [11], typically generate a 2D
density map from which the total object count is derived by summing the density
values across all spatial locations. For images with a large number of objects,
this density-map estimation approach has been shown to be more robust than the
detect-then-count approach.
2 Related Work
Computer vision researchers have long faced difficulties in visual counting, with
much of the study concentrating on particular categories such as cars, cells,
3 Proposed Method
3.1 Overview
Fig. 1. Overall Framework of the Proposed Model. The framework consists of four key
components: encoders, feature interaction module, decoders, and the density refinement
module.
In object counting, an L2 pixel-wise loss is widely used for model optimization,
which is inappropriate since the spatial distribution is ignored. For example,
under MSE a small background change and an object-density change are penalized
equally, even though the former can cause a large localization error while the
latter is a small error. As a consequence, a model trained with MSE tends to be
misled in ambiguous cases, leading to high counting errors. Therefore, a loss
function that penalizes both count and distribution mismatches is required. To
overcome this, we propose an OT loss, a loss function that considers both
elements during model training. The total loss combines the OT loss
$\mathcal{L}_{OT}$ and the counting loss $\mathcal{L}_C$:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{OT} + \lambda_2 \mathcal{L}_C,$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters for the OT loss
($\mathcal{L}_{OT}$) and the counting loss ($\mathcal{L}_C$). The former measures
the distribution difference, while the latter measures the counting difference
between the predicted and annotated maps.
OT Loss: Before calculating the OT loss, we turn the predicted density map into a
probability distribution by normalizing it. Consider the normalized density map
$A = \{(a_i, x_i)\}_{i=1}^{n}$, where $a_i \ge 0$ and $x_i \in \mathbb{R}^2$
denote the probability and the position of pixel $i$, respectively, and the
ground-truth map $B = \{(b_j, y_j)\}_{j=1}^{m}$, where $b_j = 1$ and
$y_j \in \mathbb{R}^2$ denotes the $j$-th object location. Our loss function is
based on the Sinkhorn distance between $A$ and $B$:

where $C$ is the quadratic transport cost, defined as
$c(z_i, \hat{z}_j) = \| z_i - \hat{z}_j \|_2^2$, and $P$ is the transport plan.
The Sinkhorn distance in (3) finds the optimal transport plan $P$, whose element
$P_{ij}$ is the density transported from $x_i$ to $y_j$, that minimizes the total
transport cost. Following [9], the solution for the optimal transport plan can be
expressed in matrix form:

$$P = \operatorname{diag}(u)\, K\, \operatorname{diag}(v), \qquad (4)$$

where $K$ is the Gibbs kernel associated with the cost $C$ (see [9]). The
variables $u$ and $v$ must satisfy the following nonlinear equations, which
correspond to the mass-conservation constraints inherent in the optimal transport
problem $U(a, b)$:

$$\operatorname{diag}(u)\, K\, \operatorname{diag}(v)\, \mathbf{1}_m = a, \qquad (5)$$
$$\operatorname{diag}(v)\, K^{\top} \operatorname{diag}(u)\, \mathbf{1}_n = b. \qquad (6)$$
These two equations can be further simplified, since $\operatorname{diag}(v)\mathbf{1}_m$
is simply $v$, and the multiplication of $\operatorname{diag}(u)$ with $Kv$ gives:

$$u \odot (Kv) = a, \qquad (7)$$
$$v \odot (K^{\top} u) = b, \qquad (8)$$

where $\odot$ denotes element-wise multiplication of vectors.

An intuitive approach to solving these equations is an iterative method, where
$u$ is first adjusted to satisfy (7), followed by adjusting $v$ to satisfy (8).
These two updates define Sinkhorn's algorithm:
$$u^{(\ell+1)} = \frac{a}{K v^{(\ell)}}, \qquad (9)$$
$$v^{(\ell+1)} = \frac{b}{K^{\top} u^{(\ell+1)}}, \qquad (10)$$

which is initialized with an arbitrary positive vector $v^{(0)} = \mathbf{1}_m$.
The division operator between two vectors is understood element-wise.
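A compact NumPy sketch of the Sinkhorn iterations (9)-(10) with the quadratic cost defined above; the regularization strength `eps`, the fixed iteration count, and the fully dense cost matrix are our assumptions rather than the paper's exact settings.

```python
import numpy as np

def sinkhorn(a, b, X, Y, eps=0.1, n_iters=100):
    """a: (n,) source weights, b: (m,) target weights, X: (n, 2), Y: (m, 2) positions."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # quadratic transport cost
    K = np.exp(-C / eps)                                  # Gibbs kernel
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)            # update (9)
        v = b / (K.T @ u)          # update (10)
    P = u[:, None] * K * v[None, :]                        # P = diag(u) K diag(v)
    return P, (P * C).sum()        # transport plan and total transport cost <C, P>
```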
Count Loss: For the count loss, we simply apply MSE between the predicted and
annotated maps, following [6]:

$$\mathcal{L}_C = L(\hat{y}_i, y_i) = \frac{1}{HW} \| y_i - \hat{y}_i \|_2^2, \qquad (11)$$

where $y_i, \hat{y}_i \in \mathbb{R}^{H \times W \times 1}$ denote the
ground-truth and predicted density maps, respectively, and $H$ and $W$ are the
height and width of the image.
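Putting the pieces together, the following hedged sketch combines the OT term (reusing the `sinkhorn` helper from the sketch above) with the count loss of Eq. (11). The λ values, the impulse-style ground-truth map, and normalizing the target weights to a distribution are our simplifying assumptions; it also assumes at least one annotated point per image.

```python
import numpy as np

def total_loss(pred_density, gt_points, lambda1=0.1, lambda2=1.0):
    """pred_density: (H, W) array; gt_points: (m, 2) array of (row, col) locations."""
    H, W = pred_density.shape
    # OT term: normalize the predicted map into a probability distribution over pixels.
    a = (pred_density / (pred_density.sum() + 1e-8)).reshape(-1)
    X = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1).reshape(-1, 2)
    b = np.full(len(gt_points), 1.0 / len(gt_points))   # target weights normalized (simplification)
    _, ot_cost = sinkhorn(a, b, X.astype(float), np.asarray(gt_points, dtype=float))
    # Count term: pixel-wise MSE of Eq. (11) against an impulse ground-truth map.
    gt_map = np.zeros((H, W))
    for r, c in np.asarray(gt_points, dtype=int):
        gt_map[r, c] += 1.0
    l_count = ((gt_map - pred_density) ** 2).mean()
    return lambda1 * ot_cost + lambda2 * l_count
```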
After filtering, we select the three boxes with the highest confidence scores.
To refine the density map, we normalize it by dividing the density values within
each selected box by the total sum of the density values within that box. This
approach ensures a more accurate representation of object counts within the
density map, mitigating errors due to overlapping or clustered objects.
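A minimal sketch of this box-wise normalization step: inside each of the selected boxes, the density is rescaled so that it sums to one object. The box format (row_min, col_min, row_max, col_max) and the fixed choice of the three highest-confidence boxes follow the text; the array layout and everything else are our assumptions.

```python
import numpy as np

def refine_density(density, boxes, scores, top_k=3):
    refined = density.copy()
    order = np.argsort(scores)[::-1][:top_k]           # the three highest-confidence boxes
    for i in order:
        r0, c0, r1, c1 = boxes[i]
        patch = refined[r0:r1, c0:c1]
        total = patch.sum()
        if total > 0:
            refined[r0:r1, c0:c1] = patch / total       # each box now contributes a count of one
    return refined
```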
4 Experiment
4.1 Dataset and Metric
We experiment on FSC-147 [11], which is a multi-class few-shot object counting
dataset containing 6135 images. The number of counted objects in each image
varies widely, ranging from 7 to 3731, with an average of 56. The dataset also
provides three randomly selected object instances annotated by bounding boxes
as exemplars in each image. The training set includes 89 object categories, while
the validation and test sets each contain 29 disjoint categories, making FSC-147
an open-set object counting dataset.
We use two standard metrics to measure the performance of our model, namely Mean
Absolute Error (MAE) and Root Mean Squared Error (RMSE).
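For reference, a small sketch of the two metrics, assuming `gt` and `pred` are arrays of per-image ground-truth and predicted counts:

```python
import numpy as np

def mae_rmse(gt, pred):
    err = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    mae = np.abs(err).mean()                 # Mean Absolute Error
    rmse = np.sqrt((err ** 2).mean())        # Root Mean Squared Error
    return mae, rmse
```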
Method | MAE ↓ | RMSE ↓
OptiCount_MSE | 15.88 | 106.29
OptiCount_OT | 15.65 | 108.22
OptiCount_OT+DR | 13.84 | 107.18
addresses occlusions, thereby enhancing the quality of the samples used for nor-
malization. This process leads to a more representative dataset for accurate
density estimation, further optimizing the performance of OptiCount.
6 Conclusion
In this paper, we propose a novel framework for prompt-based object counting. By
enhancing the loss function with spatial information via an optimal transport
loss and proposing a density refinement module, our method enables the model to
reduce counting errors in challenging cases. Experiments on the FSC-147 dataset
demonstrate that our model performs reliably, especially in scenarios with
overlapping and self-similar objects. In future work, we plan to focus on
extremely dense regions to further enhance the model's performance.
References
1. Amini-Naieni, N., Amini-Naieni, K., Han, T., Zisserman, A.: Open-world text-
specified object counting. arXiv preprint arXiv:2306.01851 (2023)
2. Jiang, R., Liu, L., Chen, C.: Clip-count: towards text-guided zero-shot object
counting. In: MM, pp. 4535–4545 (2023)
3. Kang, S., Moon, W., Kim, E., Heo, J.P.: Vlcounter: text-aware visual representa-
tion for zero-shot object counting. In: AAAI, pp. 2714–2722 (2024)
4. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
5. Li, G., Li, X., Wang, Y., Wu, Y., Liang, D., Zhang, S.: Pseco: pseudo labeling and
consistency training for semi-supervised object detection. In: ECCV, pp. 457–472.
Springer (2022)
6. Liu, C., Zhong, Y., Zisserman, A., Xie, W.: Countr: transformer-based generalised
visual counting. arXiv preprint arXiv:2208.13721 (2022)
7. Liu, S., et al.: Grounding dino: marrying dino with grounded pre-training for open-
set object detection. arXiv preprint arXiv:2303.05499 (2023)
8. Paiss, R., et al.: Teaching clip to count to ten. In: ICCV, pp. 3170–3180 (2023)
9. Peyré, G., Cuturi, M.: Computational optimal transport. Found. Trends Mach.
Learn. 11(5–6), 355–607 (2019)
10. Ranjan, V., Nguyen, M.H.: Exemplar free class agnostic counting. In: ACCV, pp.
3121–3137 (2022)
11. Ranjan, V., Sharma, U., Nguyen, T., Hoai, M.: Learning to count everything. In:
CVPR, pp. 3394–3403 (2021)
12. Shi, M., Hao, L., Feng, C., Liu, C., Cao, Z.: Represent, compare, and learn: a
similarity-aware framework for class-agnostic counting. In: CVPR (2022)
13. Shi, Z., Sun, Y., Zhang, M.: Training-free object counting with prompts. In: WACV,
pp. 323–331 (2024)
14. Tyagi, A.K., et al.: DeGPR: deep guided posterior regularization for multi-class
cell detection and counting. In: CVPR, pp. 23913–23923 (2023)
15. Đukić, N., Lukežič, A., Zavrtanik, V., Kristan, M.: A low-shot object counting
network with iterative prototype adaptation. In: ICCV, pp. 18872–18881 (2023)
16. Wang, B., Liu, H., Samaras, D., Nguyen, M.H.: Distribution matching for crowd
counting. In: NeurIPS, vol. 33, pp. 1595–1607 (2020)
17. Xu, J., Le, H., Nguyen, V., Ranjan, V., Samaras, D.: Zero-shot object counting.
In: CVPR, pp. 15548–15557 (2023)
18. You, Z., Yang, K., Luo, W., Lu, X., Cui, L., Le, X.: Few-shot object counting with
similarity-aware feature enhancement. In: WACV, pp. 6315–6324 (2023)
Motion Analysis in Static Images
1 Introduction
We have long been fascinated by motion illusions, the intriguing visual puzzles that
play tricks on our eyes. This research takes a deep dive into the realm of motion illusions,
aiming to advance our understanding of how machines interpret these visual phenomena
[1]. Beyond the intrigue of optical illusions, the focus is on equipping computers with
the ability to recognize and comprehend illusory motion patterns within static images as
in Fig. 1. This introductory section sets the stage for two primary areas of exploration:
the critical role of bespoke datasets in training effective machine learning models [2]
and a preliminary observation hinting at the superiority of colored images over grayscale
ones in motion illusion classification [3].
Motion illusions, such as the iconic rotating snakes or barber pole illusions, pose
unique challenges for computational systems [4]. While human vision effortlessly nav-
igates these illusions, teaching machines to discern the intricacies of illusory motion
demands a specialized focus. This research is positioned at the intersection of cognitive
psychology and computer vision, seeking to unravel the mysteries of motion illusions
and their computational interpretation.
One of the critical revelations in our exploration lies in the recognition of the inad-
equacy of generic datasets in capturing the diverse nuances of motion illusions [5].
Consequently, we advocate for the creation of bespoke datasets, meticulously tailored to
the specific characteristics of illusory motion [6]. These datasets serve as more than just
training grounds for deep learning models; they offer insights into the features crucial
for machines to discern illusory movement. The imperative here is to understand the
impact of dataset specificity on model interpretability.
While specific details about the employed models remain undisclosed in this section,
our research delves into the intricacies of computational models when confronted with
motion illusions [7]. Deep learning architectures, known for their prowess in pattern
recognition, confront unique challenges in decoding illusory motion. The objective is
to unravel the decision-making processes within these architectures when tasked with
distinguishing illusory movement from static scenes. The aim is to offer insights that
transcend the specifics of the models used, contributing to the broader discourse on the
interpretability of deep learning in perceptual tasks.
2 Related Work
Motion illusions in static images have captivated researchers across cognitive psychology
and computer vision, prompting a multidisciplinary exploration. This section delves into
key contributions, laying the groundwork for our investigation and referencing ten studies
not covered in the introduction.
In a pioneering work, Johansson, G. [13] investigated the perceptual mechanisms
underlying motion illusions, elucidating the intricacies of how the human visual system
interprets dynamic phenomena. This foundational work serves as a compass, guiding
our understanding of the cognitive processes involved in perceiving motion illusions.
Williams et al. [14] addressed challenges in creating specialized datasets for motion
illusion studies, emphasizing the importance of tailored datasets to capture intricate
variations in illusory motion patterns. Wang et al. [15] introduces objective methods for
perceptual image quality assessment, focusing on quantifying the visibility of errors in
distorted images. It proposes a structural similarity index, demonstrating its effectiveness
through intuitive examples and subjective evaluations.
Later, Watanabe et al. [12] demonstrate that DNNs accurately replicate the direction
of illusory rotation but fail to detect motion components in a negative control. The study
sheds light on the capability of DNNs to simulate complex perceptual phenomena like
illusory motion. Overall, the findings contribute to understanding the computational
mechanisms underlying visual perception in neural networks.
Kobayashi et al. [9] investigated the extraction of motion illusion-like patterns from
photographs and artworks employing predictive deep neural networks. Their study
demonstrates the successful replication of illusory motion observed in visual stimuli
using deep learning techniques. By leveraging predictive deep neural networks, the
research contributes to understanding and reproducing complex visual phenomena.
Meanwhile, Luckiesh [11] explored visual illusions, delving into their causes,
characteristics, and practical applications. His book provides a comprehensive study of visual
illusions, offering insights into their underlying mechanisms and practical implications.
This seminal work continues to be relevant for understanding the complexities of visual
perception. Next, Sun et al. [16] explored multisensory integration in motion perception,
shedding light on how combining visual and auditory cues influences the interpretation of
motion illusions. This complements our understanding of motion illusions by incorporat-
ing a multisensory perspective. In another work, Nishida and Johnston [17] investigated
neurophysiological correlates of motion illusions, providing insights into the neural
mechanisms underlying the perception of dynamic visual phenomena. Understanding
these correlates enriches the broader discussion on motion illusion recognition.
Taylor et al. [18] explored how viewers perceive and physiologically respond to fractal
patterns in Jackson Pollock’s art. They discuss the positive responses to fractal patterns,
indicating aesthetic appreciation and physiological engagement. By analyzing both per-
ceptual and physiological aspects, the research sheds light on the intricate relationship
between art and human cognition. This investigation expands understanding of fractals’
impact on human experience. In summary, this related work section incorporates diverse
perspectives from recent research, extending our understanding of motion illusion recog-
nition within static images. Each study contributes uniquely to our exploration, forming
the mosaic of knowledge guiding our investigation.
Fig. 2. The examples of motion images in our collected dataset. Please see the color figures in
pdf with 400% zoom.
3 Dataset Collection
There have been some efforts to collect datasets [19] for motion illusion. However, these
datasets are small and not well organized. The need for creating this dataset arises from
the limited availability of publicly accessible datasets specifically designed for studying
motion perception in static images.
Therefore, in this work, we collect a new dataset, Motion Illusion in Static Scene,
dubbed MISS. We use Google Image Search Engine [20] with different input keywords,
for example, motion illusion, optical illusion, eye trick motion. Then, we use Google
Lens [21] to find similar images to the ones we initially collected with keywords. To
ensure the quality and relevance of the dataset, images were meticulously curated based
on established criteria for motion illusion stimuli. Each image was assessed for its effec-
tiveness in eliciting the perception of motion through manual inspection and validation
by seven individuals with normal vision and expertise in visual perception research.
The dataset comprises a diverse range of images with motion exhibiting different
patterns and configurations known to evoke the perception of motion in observers as
shown in Fig. 2. These patterns include but are not limited to radial, concentric, spi-
ral, and grid-like structures that exploit visual processing mechanisms to create the
illusion of movement. The MISS Dataset comprises not only images depicting motion
illusions but also a significant portion of non-motion images as shown in Fig. 3. These
non-motion images serve as crucial counterparts to their motion counterparts, providing
essential context for comparison and model training. Captured from various sources and
meticulously selected, the non-motion images encompass scenes devoid of any apparent
motion or illusionary effects. Their inclusion ensures a balanced dataset representa-
tion, enabling models to discern between genuine motion illusions and static scenes
accurately. By incorporating non-motion images, the dataset offers a comprehensive
spectrum of visual stimuli, facilitating robust model training and evaluation for motion
perception analysis. In total, the dataset consists of 600 high-resolution images, with
an equal distribution between motion and non-motion categories in both the color and
grayscale datasets. This balanced dataset composition ensures robustness and reliability
in subsequent model training and evaluation processes.
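The paper does not state how the grayscale counterparts of the images were produced; the following is a minimal sketch of one plausible preprocessing step, assuming a simple directory layout (the paths below are hypothetical, not part of the paper).

```python
import os
from PIL import Image

def build_grayscale_variant(color_dir: str, gray_dir: str) -> None:
    """Create a grayscale copy of every image in color_dir.

    The directory layout is an assumption made for illustration; the paper
    only states that color and grayscale versions of the 600 images exist.
    """
    os.makedirs(gray_dir, exist_ok=True)
    for name in os.listdir(color_dir):
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        img = Image.open(os.path.join(color_dir, name)).convert("L")  # luminance only
        img.save(os.path.join(gray_dir, name))

# Example usage (hypothetical paths):
# build_grayscale_variant("MISS/motion/color", "MISS/motion/gray")
```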
4 Experiments
4.1 Model Training
The experiments involved training multiple deep learning models, MobileNet [22],
MobileNetV2 [23], ResNet50 [24], ResNetRS200 [25], Xception [26], EfficientNetB5
[27], EfficientNetV2S [28], InceptionV3 [29], NASNetMobile [30], and NASNetLarge
[30], on both the color and grayscale versions of the MISS dataset. The training process
included feeding the models with the training dataset, comprising 272 motion images
and 128 non-motion images for the color dataset, and an equivalent distribution for the
grayscale dataset. Meanwhile, 100 images (50 motion and 50 non-motion images) are
used for the validation set, and another 100 images (50 motion and 50 non-motion) are
used for testing. Stochastic gradient descent with momentum was utilized
as the optimization algorithm, with the following update rule:
$$\theta_{t+1} = \theta_t - \alpha \cdot \nabla J(\theta_t) + \beta \cdot (\theta_t - \theta_{t-1}),$$
where θ_t is the parameter vector at iteration t, α is the learning rate, ∇J(θ_t) is the
gradient of the loss function J with respect to θ_t, and β is the momentum term.
The learning rate (α) was set to 0.001, and the momentum (β) was set to 0.9 to
balance between fast convergence and avoiding oscillations.
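As a sanity check of the update rule above, the following NumPy sketch applies a single momentum step with the stated α and β; the function name and array shapes are illustrative rather than taken from the paper.

```python
import numpy as np

def sgd_momentum_step(theta, theta_prev, grad, alpha=0.001, beta=0.9):
    """One update of theta_{t+1} = theta_t - alpha * grad_J(theta_t)
    + beta * (theta_t - theta_{t-1}), using the paper's alpha and beta."""
    theta_next = theta - alpha * grad + beta * (theta - theta_prev)
    # Return the new parameters and the value to use as theta_prev next step.
    return theta_next, theta

# Tiny usage example with dummy values:
theta = np.zeros(4)
theta_prev = np.zeros(4)
grad = np.array([0.5, -0.2, 0.1, 0.0])
theta, theta_prev = sgd_momentum_step(theta, theta_prev, grad)
```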
Fig. 4. Motion illusion in color (left) vs. grayscale (right). Please see the color figures in pdf with
200% zoom.
After training, the models were evaluated on separate testing sets containing 50 motion
and 50 non-motion images for both the color and grayscale datasets. Evaluation was
based on the testing accuracy calculated from the model predictions.
The experimental results revealed the efficacy of the trained models in accurately
classifying motion illusions in static images. In addition to testing accuracy, precision,
recall, and F1-score metrics were calculated to provide a comprehensive evaluation of
model performance.
Precision measures the accuracy of positive predictions. It is calculated as the ratio of
true positive predictions to the total number of positive predictions made by the model.
Recall, also known as sensitivity or true positive rate, measures the proportion of
actual positive instances that were correctly identified by the model. It is calculated as
the ratio of true positive predictions to the total number of actual positive instances.
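These definitions map directly onto standard library calls; the short sketch below uses scikit-learn (not mentioned in the paper, so treat it as one possible implementation) on a toy set of labels, where 1 denotes a motion-illusion image and 0 a non-motion image.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative ground-truth and predicted labels for eight test images.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1-score :", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```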
Table 1. Experimental results on the collected dataset. Each model is tested on both color and
grayscale images
We aim to assess the performance of various deep learning models on detecting motion
in static images, using both colored and grayscale datasets; the results are reported in Table 1. The
models tested include MobileNet, MobileNetV2, ResNet50, ResNetRS200, Xception,
EfficientNetB5, EfficientNetV2S, InceptionV3, NASNetMobile, and NASNetLarge. For
evaluation, the mean Average Precision (mAP) was used as the primary performance
metric.
The results clearly indicate that models generally perform better on the colored
dataset compared to the grayscale dataset. The drop in performance when switching to
grayscale is observed across all models, though the extent of the performance degradation
varies.
Top Performing Model. ResNet50 achieved the highest mAP for both colored (81%)
and grayscale (76.99%) datasets, making it the most robust across both image types.
MobileNet also performed well, with 80% mAP on the colored dataset and a 7% drop
when tested on the grayscale dataset.
Performance Impact. EfficientNetB5 and NASNetMobile had the largest drops in per-
formance when switching to grayscale. EfficientNetB5, for example, went from 75%
mAP on colored images to just 55% on grayscale. NASNetMobile also dropped signif-
icantly, from 72% on colored images to 54% on grayscale. These models seem to rely
more on color information to understand motion in static images.
Models that Adapt Well. Some models, like ResNet50 and MobileNetV2, showed
smaller performance drops when trained on grayscale data. For instance, ResNet50 only
dropped by about 4%, and MobileNetV2 by 7%. This suggests that these models are
better at finding important features in images, even without color.
The results of this experiment highlight that color images are generally more useful
than grayscale images for detecting motion in static images. Models tend to perform
better when they have access to color, which provides more detailed information. How-
ever, some models, such as ResNet50, still manage to perform well even with grayscale
images. This means they can focus on other details like textures and shapes, even when
color is missing.
Moreover, examining precision and recall values can offer deeper insights into the
models’ behavior. A high precision value indicates that the model rarely misclassifies
non-motion illusion samples, while a high recall value suggests the model effectively
captures most of the actual motion illusion samples. Balancing these two metrics is
crucial, as prioritizing one over the other may lead to biased performance evaluations.
In this paper, we explored how different deep learning models perform when detecting
motion in static images using both colored and grayscale datasets. The results of our
experiments show that color images consistently lead to better performance compared
to grayscale images across all the models tested. This highlights the importance of color
information in helping models recognize motion-related patterns.
Among the models tested, ResNet50 stood out as the best performer for both colored
and grayscale images. Although all models saw a drop in accuracy when trained on
grayscale data, some models—like ResNet50 and MobileNetV2—handled the absence
of color better than others. Models like EfficientNetB5 and NASNetMobile, on the
other hand, struggled more with grayscale images, experiencing significant drops in
performance.
Overall, our findings suggest that color information plays a key role in motion detec-
tion tasks. While some models can still perform reasonably well with grayscale images,
the results show that including color data generally leads to more accurate and reliable
motion detection. Therefore, if color data is available, it should be used to maximize the
performance of the models.
For future work, we can enhance motion perception classification by exploring novel
deep learning architectures tailored for this task and incorporating semantic segmenta-
tion and attention mechanisms. Collaboration with experts in psychology and neuro-
science can deepen our understanding of motion perception mechanisms. Expanding
and diversifying the dataset will improve model generalization. Real-world applica-
tions, such as human-computer interaction and autonomous systems, warrant explo-
ration, along with user studies to assess model impact. Developing explainable AI tech-
niques will increase model transparency and trustworthiness. Addressing these directions
will advance motion perception analysis and its application in various domains.
Acknowledgment. This research was supported by the National Science Foundation (NSF) under
Grant 2025234.
References
1. Carbon, C.C.: Understanding human perception by human-made illusions. Front. Hum.
Neurosci. 8, 566 (2014)
2. Koch, B., Denton, E., Hanna, A., Foster, J.G.: Reduced, Reused and Recycled: The Life of
a Dataset in Machine Learning Research. arXiv preprint arXiv:2112.01716 (2021)
3. Kitaoka, A.: Color-dependent motion illusions in stationary images and their phenomenal
dimorphism. Perception 43(9), 914–925 (2014)
4. Otero-Millan, J., Macknik, S.L., Martinez-Conde, S.: Microsaccades and blinks trigger
illusory rotation in the ‘rotating snakes’ illusion. J. Neurosci. 32(17), 6043 (2012)
5. Salari, A., Djavadifar, A., Liu, X., Najjaran, H.: Object recognition datasets and challenges:
a review. Neurocomputing 495, 129–152 (2022)
6. Chung, S.T., Patel, S.S., Bedell, H.E., Yilmaz, O.: Spatial and temporal properties of the
illusory motion-induced position shift for drifting stimuli. Vision. Res. 47(2), 231–243 (2007)
7. Gomez-Villa, A., Martín, A., Vazquez-Corral, J., Bertalmío, M., Malo, J.: Color illusions
also deceive CNNs for low-level vision tasks: analysis and implications. Vision. Res. 176,
156–174 (2020)
8. Sowmya, V., Govind, D., Soman, K.P.: Significance of contrast and structure features for
an improved color image classification system. In: 2017 IEEE International Conference on
Signal and Image Processing Applications (ICSIPA), pp. 210–215. IEEE (2017)
9. Kobayashi, T., Kitaoka, A., Kosaka, M., Tanaka, K., Watanabe, E.: Motion illusion-like pat-
terns extracted from photo and art images using predictive deep neural networks. Sci. Rep.
12(1), 3893 (2022)
10. Kirubeswaran, O.R., Storrs, K.R.: Inconsistent illusory motion in predictive coding deep
neural networks. Vision. Res. 206, 108195 (2023)
11. Luckiesh, M.: Visual Illusions, their Causes, Characteristics and Applications. D. Van
Nostrand Company (1922)
12. Watanabe, E., Kitaoka, A., Sakamoto, K., Yasugi, M., Tanaka, K.: Illusory motion reproduced
by deep neural networks trained for prediction. Front. Psychol. 9, 345 (2018)
13. Johansson, G.: Visual perception of biological motion and a model for its analysis. Percept.
Psychophys. 14, 201–211 (1973)
14. Williams, R.M., Yampolskiy, R.V.: Optical Illusions Images Dataset. arXiv preprint arXiv:
1810.00415, 2 (2018)
15. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error
visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
16. Sun, H.J., Campos, J.L., Chan, G.S.: Multisensory integration in the estimation of relative
path length. Exp. Brain Res. 154, 246–254 (2004)
17. Nishida, S., Johnston, A.: Marker correspondence, not processing latency, determines
temporal binding of visual attributes. Curr. Biol. 24(15), 1677–1686 (2014)
18. Taylor, R.P., Spehar, B., Van Donkelaar, P., Hagerhall, C.M.: Perceptual and physiological
responses to Jackson Pollock’s fractals. Front. Hum. Neurosci. 5, 60 (2011)
19. Akiyoshi Kitaoka’s website. Ritsumeikan University. https://www.ritsumei.ac.jp/~akitaoka/
index-e.html. Last access February 2024
20. Bitirim, Y.: Retrieval effectiveness of google on reverse image search. J. Imaging Sci. Technol.
66, 010505–010511 (2022). https://doi.org/10.2352/J.ImagingSci.Technol.2022.66.1.010505
21. Taffel, S.: Google’s lens: computational photography and platform capitalism. Media Cult.
Soc. 43(2), 237–255 (2021)
22. Sinha, D., El-Sharkawy, M.: Thin mobilenet: an enhanced mobilenet architecture. In: 2019
IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference
(UEMCON), pp. 0280–0285. IEEE (2019)
23. Dong, K., Zhou, C., Ruan, Y., Li, Y.: MobileNetV2 model for image classification. In: 2020
2nd International Conference on Information Technology and Computer Application (ITCA),
pp. 476–480. IEEE (2020)
24. Koonce, B., Koonce, B.: ResNet 50. Convolutional Neural Networks with Swift for
Tensorflow: Image Recognition and Dataset Categorization, pp. 63–72 (2021)
25. Bello, I., et al.: Revisiting resnets: improved training and scaling strategies. Adv. Neural. Inf.
Process. Syst. 34, 22614–22627 (2021)
26. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258 (2017)
27. Bhawarkar, Y., Bhure, K., Chaudhary, V., Alte, B.: Diabetic retinopathy detection from fundus
images using multi-tasking model with EfficientNet B5. In ITM Web of Conferences, 44,
p. 03027. EDP Sciences (2022)
28. Tan, M., Le, Q.: EfficientNetV2: smaller models and faster training. In: International
Conference on Machine Learning, pp. 10096–10106. PMLR (2021)
29. Wang, C., et al.: Pulmonary image classification based on inception-v3 transfer learning
model. IEEE Access 7, 146533–146541 (2019)
30. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable
image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 8697–8710 (2018)
Motorcycle Helmet Detection Benchmarking
1 Introduction
In the field of computer vision and intelligent transportation systems, the precise identifi-
cation of safety equipment, particularly helmets, is pivotal for advancing road safety. This
research embarks on a transformative journey to push the boundaries of helmet detection,
harnessing the power of sophisticated deep-learning methodologies. The imperative for
robust and efficient helmet detection becomes particularly pronounced in the domain of
traffic surveillance, where traditional methods often prove inadequate in addressing the
multifaceted challenges posed by real-world scenarios.
As urban landscapes undergo a notable surge in the prevalence of motorcycles and
electric bikes [1], the imperative to ensure the safety of riders has become an increasingly
critical concern in contemporary society. Helmets, recognized as fundamental safety
accessories, play a crucial role in mitigating the risk of head injuries during accidents.
However, the effectiveness of helmets is intricately linked to their proper usage, empha-
sizing the urgent need to develop advanced systems capable of precisely and reliably
identifying the presence of helmets in various scenarios.
This paper introduces a diverse array of innovative approaches to helmet detection,
as depicted in Fig. 1, with a deliberate focus on creating a new dataset and harnessing
the capabilities of different object detection models such as YOLO [2] (You Only Look
Once), Faster R-CNN [3], RT-DETR (Real-Time Detection Transformer) [4] and Detec-
tron2 [5]. These real-time object detection algorithms are strategically selected for their
ability to swiftly and efficiently identify objects in dynamic scenarios, rendering them
especially well-suited for applications such as traffic surveillance. The inherent robust-
ness of these models is further emphasized through the incorporation of advanced tech-
niques like Spatial Pyramid Pooling, thereby augmenting their effectiveness in intricate
and varied environments.
Moreover, the research extends its exploration into the domain of ensemble methods,
aiming to fortify the overall robustness and reliability of the helmet detection system.
This involves the integration of multiple models within an ensemble framework, with the
strategic objective of synergizing their individual strengths. By doing so, the system’s
performance is enhanced across a broad spectrum of conditions, solidifying its position
as a comprehensive solution for accurate helmet detection in settings that continuously
evolve and present dynamic challenges.
In response to the escalating prevalence of motorcycles and electric bikes, the inno-
vative approaches presented in this paper not only address the immediate concerns
surrounding helmet detection but also contribute to the broader narrative of rider safety
in urban environments of developing countries. Beyond these technical intricacies,
the research anticipates and responds to the evolving landscape of transportation,
where intelligent systems are essential components in the quest for enhanced safety and
efficiency.
The significance of this research transcends its technical intricacies and resonates
within the broader domain of intelligent transportation systems. By elevating the pre-
cision of helmet detection in challenging conditions, the outcomes directly align with
the overarching goal of such systems: the reduction of accidents and the improvement of overall road safety.
2 Related Work
The field of helmet detection has undergone profound transformations driven by the
continuous evolution of computer vision and deep learning techniques. This section
strives to offer a comprehensive review of relevant literature, shedding light on key
contributions in the domain. This exploration not only synthesizes existing knowledge
but also establishes a contextual foundation for the proposed architecture.
A noteworthy aspect of recent research involves the exploration of YOLOv5s in
the realm of object detection tasks. Huang et al. [8] pioneered an advanced YOLOv5s-
based method specifically tailored for electric bike helmet recognition. Their innovative
approach yielded enhanced detection efficiency and practicality, acting as a catalyst
for further investigations in specialized domains. This underscores the adaptability of
YOLOv5s in addressing nuanced challenges within the realm of helmet detection.
Chen et al. [9] embarked on the development of lightweight helmet detection algo-
rithms, a crucial pursuit for ensuring real-time processing in safety applications. Their
work placed significant emphasis on safety helmet-wearing detection in industrial set-
tings, advocating for algorithms that offer swift and accurate recognition. This research
substantially contributes to the intersection of real-time safety applications and com-
puter vision, recognizing the importance of expeditious and precise helmet detection in
critical environments.
In a parallel vein, Fan et al. [10] delved into the application of ensemble methods
in helmet detection. Their deep learning-based ensemble method showcased advance-
ments in minimizing false positives, ensuring a more reliable helmet detection system.
This work not only addresses the challenges of false positives but also makes valuable
strides in enhancing the overall robustness of object detection models. The integration of
ensemble methods adds a layer of complexity and efficacy to helmet detection systems.
The YOLO series has gained popularity for real-time object detection due to its
effective balance between speed and accuracy, but its performance is hindered by the
Non-Maximum Suppression (NMS) step. Transformer-based detectors like DETR offer
an alternative by eliminating NMS but suffer from high computational costs that limit
their practicality. To address these issues, Lv et al. [4] proposed the Real-Time DEtection
TRansformer (RT-DETR), an end-to-end object detector that maintains high speed and
accuracy by employing an efficient hybrid encoder for rapid multi-scale feature process-
ing and an uncertainty-minimal query selection to enhance initial query quality, while
also allowing flexible speed tuning through adjustable decoder layers.
Recent advancements in deep learning have significantly improved image classi-
fication, segmentation, and object detection, including detecting helmets on bike rid-
ers to enhance road safety. Singh et al. [5] analyze various approaches and experi-
ments with state-of-the-art models like Detectron2 and EfficientDet, demonstrating their
effectiveness in helmet detection.
The synthesis of the reviewed literature underscores the diverse approaches employed
in helmet detection, with a particular emphasis on the YOLOv5s architecture, lightweight
algorithms, integration of SPP, utilization of ensemble methods, and the significance
of curated datasets. Building upon these insights, the proposed architecture aspires to
contribute to ongoing advancements in intelligent transportation systems and road safety.
By amalgamating strengths and addressing the limitations highlighted in the literature,
the proposed architecture seeks to elevate the precision, efficiency, and adaptability of
helmet detection in dynamic real-world scenarios. This endeavor aligns with the broader
trajectory of advancements in computer vision and deep learning, fostering a safer and
more intelligent future for transportation systems.
3 Proposed Work
Our methodology started with the collection of the Motorcycle Helmet Detection Dataset
(MHDD), a foundational element crucial for developing a robust helmet detection system
capable of adapting to a myriad of diverse environmental conditions and traffic scenarios.
There are a few similar datasets available such as the Caltech Pedestrian dataset [12]
(captured from a driving vehicle) and multiple datasets focusing on bikers’ helmets. No
such public dataset is readily available that focuses on motorcyclist helmets from a
traffic camera view. This made us create a new dataset that overcomes these issues and
provides a readily available dataset for future use.
Compiling our dataset involves a comprehensive sourcing strategy, tapping into var-
ious channels to ensure a rich and diverse representation. We leverage public video
feeds from traffic cameras and surveillance systems in Vietnam from different regions.
This exhaustive selection ensures the inclusion of a broad spectrum of traffic scenarios
and environmental conditions in developing countries, significantly contributing to the
robustness and adaptability of our helmet detection system.
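The paper does not describe how individual frames were extracted from the public video feeds; a minimal OpenCV sketch, under the assumption that frames are sampled at a fixed stride (an illustrative choice, not a reported detail), could look as follows.

```python
import os
import cv2

def extract_frames(video_path: str, out_dir: str, stride: int = 30) -> int:
    """Save every `stride`-th frame of a traffic video as a JPEG.

    `stride` is a hypothetical sampling interval; the paper does not state
    how densely frames were sampled from the surveillance feeds.
    """
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```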
The cornerstone of our methodology lies in rigorous data annotation performed by
trained annotators. This involves the meticulous marking of regions of interest (ROIs)
containing motorcyclists and the precise indication of helmet presence [13]. In our work,
we use Roboflow [14] for annotation. In particular, this tool empowers annotators to
create high-quality annotations efficiently, thereby contributing to the depth and accuracy
of our dataset. The graphical user interface of this tool is illustrated in Fig. 2.
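Roboflow can export annotations in several formats; assuming a YOLO-style export (one text file per image containing `class x_center y_center width height` in normalized coordinates, which is our assumption rather than a detail stated in the paper), the labels can be parsed as follows.

```python
def read_yolo_labels(label_path: str):
    """Parse a YOLO-format annotation file into a list of
    (class_id, x_center, y_center, width, height) tuples,
    with all coordinates normalized to [0, 1]."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 5:
                continue  # skip malformed lines
            cls, xc, yc, w, h = parts
            boxes.append((int(cls), float(xc), float(yc), float(w), float(h)))
    return boxes

# Example (hypothetical file name): boxes = read_yolo_labels("labels/frame_000123.txt")
```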
Our envisioned framework for robust helmet detection in traffic videos is strategically
crafted to surmount challenges posed by fluctuating weather conditions, ultimately
ensuring the safety of motorcyclists. This comprehensive framework consists of key
components, each playing a pivotal role in the overall system, as illustrated in Fig. 3.
3.4 Implementation
We adopt pre-trained deep convolutional neural networks (CNN) that serve as the base
model for feature extraction. Consideration will be given to well-established architec-
tures such as ResNet [19], VGG [20], or Inception [21]. Then, the selected base model
will undergo fine-tuning using our annotated traffic video dataset. Transfer learning tech-
niques will be applied, leveraging knowledge from large-scale datasets like ImageNet to
adapt the model for helmet detection. An object detection head, such as a Region Pro-
posal Network (RPN), will be added to the base model to identify ROIs containing
motorcyclists’ heads. For the helmet detection head, a subnetwork for helmet detection
will be integrated within the ROIs identified in the previous step. Architecture options
include Faster R-CNN [3] or YOLO [2] for object detection.
Regarding the loss function, an appropriate loss function, combining classification
loss and localization loss (e.g., Smooth L1 loss), will be defined for helmet detection.
The model will be trained using annotated data, employing an optimizer such as Adam
[22]. Training progress will be monitored using validation data, and techniques like
learning rate schedules and early stopping will be used for optimization.
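As one concrete, but not necessarily the authors’, instantiation of this setup, the sketch below fine-tunes a COCO-pretrained torchvision Faster R-CNN with a ResNet-50 backbone; the three-class label set and the learning rate are assumptions made for illustration.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a COCO-pretrained Faster R-CNN (ResNet-50 FPN backbone) for transfer learning.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor: 3 classes = background, helmet, no-helmet (assumed label set).
num_classes = 3
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is illustrative
model.train()

def train_step(images, targets):
    """One training step on a batch; images is a list of CHW float tensors and
    targets a list of dicts with "boxes" and "labels". The data pipeline is omitted."""
    losses = model(images, targets)   # dict of classification + box regression losses
    loss = sum(losses.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```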
4 Benchmarking
Table 1. The experimental results of different models trained on traffic data and tested on traffic
data. Both training and testing data are included in the MHDD dataset. The best performance
is marked in boldface.
Beyond numerical scores and metrics, these images furnish the research community with tangible
proof of the model’s performance in practical, real-world situations. Such visual assess-
ments contribute significantly to a holistic and insightful interpretation of the models’
overall effectiveness and suitability for deployment in dynamic environments.
4.4 Discussions
The comprehensive evaluation has unveiled valuable insights into the models’ perfor-
mance as shown in Table 1. The critical metric of precision highlights the exemplary
capabilities of Models 3 and 10 (different families). Model 4 (Yolov7-w6 Grayscale)
distinguishes itself with an impressive mean Average Precision (mAP) of 0.899, closely
trailed by Model 10 (FasterRCNN-ResNet50 Grayscale) with a commendable mAP of
0.871.
In the realm of Recall, Model 3 demonstrates outstanding performance, surpassing
its counterparts with a mAP of 0.881. Model 10 exhibits competitive recall capabilities,
boasting a mAP of 0.871. Meanwhile, Model 1 (Yolov5s Colored) achieves the highest
F1-Score, with a mAP of 0.857. This metric signifies a harmonious blend of precision
and recall, positioning it as a noteworthy contender adept at balancing these two crucial
aspects of helmet detection.
A pivotal consideration lies in the mAP, where Model 10 (FasterRCNN-ResNet50
Grayscale) outperforms others with a score of 0.871. This underscores the model’s
consistency and effectiveness across a spectrum of conditions, reinforcing its reliability
in diverse helmet detection scenarios. Considering all the results we can broadly say the
model works better in grayscale over colored ones.
These insights empower users to make informed decisions by comprehending the
trade-offs between precision, recall, and adaptability. Although RT-DETR and Detectron2
performed worse than all other models, with mAP values in the 0.7–0.8 range, the varied
strengths of each model, whether prioritizing precise helmet identification or comprehensive
coverage, facilitate customized selections aligned with specific application needs.
Table 2. The experimental results on the test traffic data with different models and training
datasets.
Fig. 6. Visualization of failure cases: a) wrongly detected helmet b) wrongly detected helmet and
motorcycle c) dark hair detected as helmet and d) cap detected as helmet.
and technologies. We plan to explore multimodal data fusion, where integrating infor-
mation from various sensors could significantly elevate detection accuracy, particularly
in scenarios with challenging visibility conditions as shown in Fig. 6. The expansion
of benchmark datasets and the exploration of advanced domain adaptation techniques
are pivotal steps toward creating more robust models capable of handling diverse and
complex environments.
For future work, we focus on optimizing helmet detection systems for real-time
processing, facilitating practical deployment in traffic scenarios. Enhancing model per-
formance in adverse weather conditions remains a critical challenge, warranting further
investigation. Multimodal data fusion, encompassing data from various sensors, could
enhance detection accuracy, especially in challenging visibility conditions. Expanding
benchmark datasets, exploring domain adaptation techniques, and broadening the scope
to include anomaly detection for comprehensive traffic safety are avenues for future
exploration.
Acknowledgment. This research was supported by the National Science Foundation (NSF) under
Grant 2025234.
References
1. Ma, C., Yang, D., Zhou, J., Feng, Z., Yuan, Q.: Risk riding behaviors of urban e-bikes:
a literature review. Int. J. Environ. Res. Public Health 16(13), 2308 (2019)
2. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time
Object Detection. arXiv preprint arXiv:1506.02640 (2016)
3. Redmon, J., Farhadi, A.: You only look once: unified, real-time object detection. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 779–788 (2016)
4. Lv, W., et al.: DETRs Beat YOLOs on Real-time Object Detection (2023)
5. Singh, R., Shetty, S., Patil, G., Bide, P.J.: Helmet detection using detectron2 and efficient-
det. In: 2021 12th International Conference on Computing Communication and Networking
Technologies (ICCCNT), Kharagpur, India, pp. 1–5 (2021)
6. Nguyen, X.-D., et al.: Adaptive multi-vehicle motion counting. J. Signal Image Video
Processing 16(8), 2193–2201 (2022)
7. Nguyen, T.V., et al.: Data-driven city traffic planning simulation. ISMAR Adjunct, pp. 859–
864 (2022)
8. Huang, B., et al.: An improved YOLOv5s-based helmet recognition method for electric bikes.
Appl. Sci. 13(15), 8759 (2023)
9. Chen, J., Deng, S., Wang, P., Huang, X., Liu, Y.: Lightweight helmet detection algorithm
using an improved YOLOv4. Sensors 23(3), 1256 (2023)
10. Fan, Z., Peng, C., Dai, L., Cao, F., Qi, J., Hua, W.: A deep learning-based ensemble method
for helmet-wearing detection. PeerJ. Computer Sci. 6, e311 (2020)
11. Shen, J., Xiong, X., Li, Y., He, W., Li, P., Zheng, X.: Detecting safety helmet wearing on
construction sites with bounding-box regression and deep transfer learning. Computer-Aided
Civil and Infrastructure Eng. 36(2), 180–196 (2021)
12. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual
object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010)
13. Zhou, Y., Liu, L., Shao, L., Mellor, M.: Fast automatic vehicle annotation for urban traffic
surveillance. IEEE Trans. Intell. Transp. Syst. 19(6), 1973–1984 (2017)
14. Dwyer, B., Nelson, J., Solawetz, J., et al.: Roboflow (Version 1.0) (2022). https://roboflow.com
15. Wang, W., Zhou, T., Porikli, F., Crandall, D., Van Gool, L.: A survey on Deep Learning
Technique for Video Segmentation. arXiv e-prints, arXiv-2107 (2021)
16. Li, J., Wang, D., Li, S., Zhang, M., Song, C., Chen, X.: Deep learning based adaptive sequential
data augmentation technique for the optical network traffic synthesis. Opt. Express 27(13),
18831–18847 (2019)
17. Ren, D., Sun, T., Yu, C., Zhou, C.: Research on safety helmet detection for construction site. In:
2021 International Conference on Computer Information Science and Artificial Intelligence
(CISAI), pp. 186–189. IEEE (2021)
18. Flach, P., Kull, M.: Precision-recall-gain curves: PR analysis done right. Advances in Neural
Information Processing Systems 28 (2015)
19. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778 (2016)
20. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image
Recognition. arXiv preprint arXiv:1409.1556 (2014)
21. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception archi-
tecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 2818–2826 (2016)
22. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:
1412.6980 (2014)
MEPC: Multi-level Product Category
Recognition Image Dataset
1 Introduction
E-commerce platforms have become increasingly popular over the years.
The digital transformation 4.0 further stimulated public interest in e-commerce,
resulting in a boom in e-commerce businesses [1, 2]. As a result, the e-commerce
industry has become more competitive, driving firms to make considerable
expenditures to improve their platforms. In recent years, hierarchical classifi-
cation has emerged as a powerful tool in the online retail industry [3], assist-
ing sellers in quickly auto-filling product category information. With the rapid
growth of online products, efficiently classifying these products hierarchically has
become crucial for success in the online retail sector. Applying deep learning
and machine learning techniques to retail data enhances the seller experience on
e-commerce platforms.
Category Prediction (CP), which aims to recognize the intent categories of
given texts, is regarded as one of the most fundamental machine-learning tasks in
an e-commerce system [4]. For example, this predicted category information will
influence product ranking in the search and recommendation system. Different
from traditional classification [5, 6], Category Prediction is formally catego-
rized as a hierarchical classification problem since categories in most e-commerce
websites are organized as a hierarchical tree (we consider the situation that the
categories are organized as a hierarchical tree, but not a directed acyclic graph).
Figure 1 shows a simplified fragment of one category architecture.
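For intuition, a hierarchical label in such a tree can be represented as a path of category names from the root; the toy tree below uses invented category names and is purely illustrative.

```python
# A toy 3-level category tree; names are invented for illustration only.
CATEGORY_TREE = {
    "Electronics": {
        "Phones & Accessories": ["Smartphones", "Phone Cases"],
        "Computers": ["Laptops", "Keyboards"],
    },
    "Fashion": {
        "Men's Clothing": ["T-Shirts", "Jeans"],
    },
}

def is_valid_path(level1: str, level2: str, level3: str) -> bool:
    """Check that a predicted (level1, level2, level3) triple is consistent
    with the tree, i.e., each child belongs to its predicted parent."""
    return level3 in CATEGORY_TREE.get(level1, {}).get(level2, [])

print(is_valid_path("Electronics", "Computers", "Laptops"))  # True
print(is_valid_path("Fashion", "Computers", "Laptops"))      # False
```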
To resolve the category prediction problem in business, data has always
been one of the key drivers of AI-based category recognition research. Dif-
ferent training data may lead to different training results. Regardless of the target,
product image datasets are needed to evaluate the performance of the proposed
deep-learning models. In this article, we propose a new multi-level product
category dataset with more than 164,000 images, summarized in Table 1. We experimented
on many different pre-trained models. The results show that this new dataset
is appropriate for improving the model’s performance when researching deep
learning models to predict multiple categories.
2 Related Work
Table 1. Statistics of the MEPC dataset (training and validation splits).

Statistics                        Train      Val
Number of images                  131,292    32,824
Number of 1st level categories    28
Number of 2nd level categories    193
Number of 3rd level categories    659
Fig. 2. The summary of random images from the datasets for hierarchical image clas-
sification.
The first, CIFAR-100 [8], a commonly used benchmark for hierarchical classi-
fication, has 20 coarse (super) classes, and each coarse class is associated with five fine
classes (e.g., the coarse class “people” has five fine classes: “baby”, “boy”, “girl”,
“man”, and “woman”), for a total of 100 fine classes.
Next, the ETH Entomological Collection (ETHEC) dataset [9] comprises
images of Lepidoptera specimens together with their taxonomy tree. This real-world
dataset varies in the number of images per category and shows a significant imbal-
ance in the structure of the taxonomic tree. CUB-200-2011 (CUB) [10]
organizes the label hierarchy of birds into 200 species,
122 genera, 37 families, and 13 orders. The CUB dataset contains 11,788 images of
200 bird subcategories, 5,994 for training and 5,794 for testing.
Each image has detailed annotations: 1 subcategory label, 15 part locations, 312
binary attributes, and 1 bounding box.
Fig. 3. The category cloud of 1st/2nd/3rd-level keywords of photos. The larger
the font size, the more products in the corresponding category.
The MEPC dataset is a multi-level hierarchy image dataset focused on e-commerce
products for hierarchical image classification tasks.
MEPC dataset has multiple backgrounds to increase the variety of real
images. We also evaluate the impact of the MEPC dataset with EfficientNet,
ResNet, VGG, and MoBiNet series models. The benchmark results reach
a top-1 accuracy of 92.055% on MEPC-10 and a top-5 accuracy of
57.36% on MEPC-1000. We hope that the introduced dataset will be of good
assistance in fine-tuning models for image classification in the online retail
sector.
3 Dataset
In this section, we introduce Multi-level E-commerce Product Categorization
(MEPC) dataset.
Dataset Description. All of the images in our MEPC dataset are col-
lected from the Internet using the data collection method introduced in the research
article “End-to-End System For Data Crawling, Monitoring, And Analyzation
Of E-Commerce Websites” at ICTA2024 [14]. There are nearly 164,000 images
in total. Reflecting practical application scenarios, the distribution of images
across categories is imbalanced, as shown in Fig. 4.
We conducted important statistical analyses to better understand the struc-
ture and characteristics of the data. First, we analyzed the number of categories
by hierarchy, as illustrated in Fig. 6. This analysis helps us identify the dis-
tribution of categories and detect any imbalances among the levels. We used
4 Methods
In this study, we used the deep learning models ResNet50 [15], VGG16 [16],
MoBiNet [17] and EfficientNet [18], to evaluate a new dataset. These models have
been proven to be highly effective in various image recognition and classification
tasks. We will provide a detailed account of how we prepare the data, train the
models, and evaluate the results.
First, we resized the images to 224 × 224 and normalized the image data.
Next, we used the Adam [19] optimizer with a learning rate of 1e−3. When
visualizing a random sample of images from each class, as shown in Fig. 8, we
noticed that the MEPC data was imbalanced among the classes. Therefore, we
used the focal loss [20] function to penalize the model whenever it incorrectly
predicted classes with fewer samples. We used k-fold cross-validation and metrics
such as Top-1 accuracy for the MEPC-10 dataset to evaluate performance. The
experimental results are described in Table 3. In addition, we split the MEPC-10
and MEPC-1000 datasets into 80% training and 20% evaluation sets and used the
YOLOv8 [18] model for image classification, with results reported in Table 4.
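A minimal PyTorch sketch of the focal loss used to counter this class imbalance is shown below; the gamma and alpha values are the common defaults from [20], not values reported by the authors.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma,
    which down-weights well-classified (typically majority-class) samples.

    logits: (N, C) raw scores, targets: (N,) integer class ids.
    gamma and alpha follow the common defaults of Lin et al.; the paper
    does not report its exact settings.
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per sample
    p_t = torch.exp(-ce)                                     # probability of the true class
    loss = alpha * (1.0 - p_t) ** gamma * ce
    return loss.mean()

# Usage: loss = focal_loss(model(images), labels)
```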
Based on Tables 3 and 4, the MEPC-10 dataset reaches a Top-1 accuracy of
92.055%. Therefore, MEPC-10 works well for improving deep learning models or
for evaluating new deep learning architectures. However, the MEPC-1000
dataset has a Top-5 accuracy of only 57.36%, so this dataset poses a significant
challenge for the hierarchical image classification problem.
Fig. 5. Label-only embeddings visualizing label connections of the MEPC dataset, with
multi-colored labels for level 1, yellow for level 2, gray for level 3, red connections from
level 1 to level 2, and blue connections from level 2 to level 3. (Color figure online)
Fig. 6. Statistics of the number of multi-level categories in the two datasets MEPC-10
and MEPC-1000.
Fig. 7. Statistics of image sizes of MEPC dataset. It can be seen that the images of
the dataset are square with various aspect ratios.
Fig. 8. Random visualization of classes in MEPC dataset. It is easy to see that the
MEPC data is imbalanced.
5 Conclusion
We have released the MEPC dataset for hierarchical image classification (HIC) tasks, where the images present
numerous challenges such as background noise, uneven lighting, diverse image
sizes, complex parent-child linkages, etc. This dataset includes a variety of prod-
uct images, focusing on predicting product categories for e-commerce systems.
In our study, we also provided an overview of the dataset, visualized the data
from various perspectives, and tested this dataset with well-known deep learn-
ing models. In the future, we will collect additional semantic descriptions of the
images from the real world or from large language models to optimize and improve
the classification performance of this dataset.
References
1. Barat, M.I., Haque, M.M.: Small business boom in ecommerce: an in-depth research
exploration. Int. J. Bus. Manage. Finan. Res. 2, 1–14 (2024)
2. Azam, A., Ansari, A.M.: The emerging role of e-commerce in today’s business: a
conceptual study. Asian J. Manage. Commer. 05, 428–439 (2024)
3. Wei, Y., Tran, S., Xu, S., Kang, B., Springer, M.: Deep learning for retail product
recognition: challenges and techniques. Comput. Intell. Neurosci. 2020(1), 8875910
(2020)
4. Cevahir, A., Murakami, K.: Large-scale multi-class and hierarchical product cat-
egorization for an E-commerce giant. In: Proceedings of COLING 2016, the 26th
International Conference on Computational Linguistics: Technical Papers (Y. Mat-
sumoto and R. Prasad, eds.), (Osaka, Japan), The COLING 2016 Organizing Com-
mittee, pp. 525–535 (2016)
5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86, 2278–2324 (1998)
6. Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: Proceed-
ings of the 19th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’96, New York, NY, USA, Associa-
tion for Computing Machinery, pp. 289–297 (1996)
7. Zhu, X., Bain, M.: B-CNN: branch convolutional neural network for hierarchical
classification (2017)
8. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. Technical
Report (2009)
9. Dhall, A.: ETH Entomological Collection (ETHEC) dataset [Palearctic Macrolepi-
doptera, Spring 2019] (2019). Associated works: “Learning Representations for Images
with Hierarchical Labels” (https://arxiv.org/abs/2004.00909) and “Hierarchical Image
Classification using Entailment Cone Embeddings” (https://arxiv.org/abs/2004.03459)
10. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD
Birds-200-2011 Dataset (2011)
11. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual
classification of aircraft (2013)
12. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-
grained categorization. In: 2013 IEEE International Conference on Computer
Vision Workshops, pp. 554–561 (2013)
13. He, L., Song, D., Zheng, L.: Hierarchical image classification with a literally toy
dataset (2021)
14. Do, M.Q.: End-to-end system for data crawling, monitoring, and analyzation of
e-commerce websites (2024). https://doi.org/10.1007/978-3-031-80943-9_107
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition
(2015)
16. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (2015)
17. Phan, H., Huynh, D., He, Y., Savvides, M., Shen, Z.: MoBiNet: a mobile binary
network for image classification (2019)
18. Tan, M. and Le, Q. V.: EfficientNet: rethinking model scaling for convolutional
neural networks (2020)
19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2017)
20. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object
detection (2018)
21. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2:
inverted residuals and linear bottlenecks (2019)
22. Howard, A., et al: Searching for mobileNetV3 (2019)
A Simple Approach Towards Frame
Filtering for Efficient Gaussian Splatting
1 Introduction
Gaussian Splatting has become a popular technique for creating highly realis-
tic, real-time renderings of 3D models. Its ability to capture fine details and
produce lifelike visualizations makes it a go-to choice for various applications
in 3D rendering. However, one major hurdle in using Gaussian Splatting is the
initial step of initialization, which usually requires the Structure-from-Motion
(SfM) [5] algorithm to extract the key points of the images at the beginning,
often implemented using a tool like COLMAP [2]. While effective, this method
can be extremely slow and computationally heavy due to the overhead in the
COLMAP preprocessing period (which includes the extraction of features and
reconstruction of the sparse scene using the key points).
Recognizing this bottleneck, some researchers have been working on ways to
skip the initialization phase, aiming to speed up the process. However, with-
out COLMAP, the problem shifts—especially when processing video at slower
speeds. We believe that slowing down the video adds its layer of inefficiency, mak-
ing the overall rendering process time-consuming once again due to the duplica-
tion of frames with minor transitions compared to previous frames - which does
not bring major information gain about the scanned surface of the reconstructed
scene and might introduce blurry artifacts into the rendered model.
Multiple teams have proposed frameworks and modern scene compression
techniques in order to enhance the training process as well as to effectively
render the scene in real time. For example, Niedermayr et al. [18] proposed
a scene compression technique based on sensitivity-aware clustering of vectors,
quantization-aware training and entropy encoding. Kerbl et al. [19] suggest a
novel representation of Gaussian Splatting scenes using a hierarchical structure
for large-scale scenes within massive datasets.
As efficient as the scene compression methods are, we argue that
scene-compression-oriented methods can be hard to implement and deploy in
practice for portable usage of the Gaussian Splatting scene files, especially when
specific engineering steps are carried out to meet very specific requirements
of the rendering engines. Therefore, we would like to accelerate the training
process of the Gaussian Splatting scenes instead, as the training duration and
number of iterations are correlated with the growing number of 3D Gaussians.
To tackle this, we propose a simple yet efficient approach: a frame-skipping
method based on analyzing the blurriness of each frame. By skipping frames that
don’t contribute significantly to the quality of the final render, we can reduce
processing time while still preserving the high quality of the 3D model. This
method strikes a balance between speed and visual fidelity, offering a faster and
more efficient solution for Gaussian Splatting in real-time applications.
The key contributions of this paper can be summed up as follows:
– A frame-evaluation function for scoring the quality and potential contribution
of each frame to the Gaussian-Splatting-based reconstructed model,
specifically for the COLMAP-free 3D Gaussian Splatting approach.
– An in-depth analysis of the rendering quality, the memory usage, and the
training time of the scene reconstructed under different settings of the Gaus-
sian Splatting approaches.
The structure of the paper is as follows: Sect. 2 discusses the current
approaches in the novel view synthesis task, Gaussian Splatting, and approaches
to optimize Gaussian Splatting; Sect. 3 proposes and suggests a simple approach
to filter frames for the reconstruction of 3D models in Gaussian Splatting; Sect. 4
illustrates the experiment settings for our approach and discusses the results and
properties of the method; Sect. 5 proposes future works over the field of optimiz-
ing the Gaussian Splatting.
2 Related Works
2.1 Novel View Synthesis
For the Novel View Synthesis task, the objective is to provide a dataset of view-
points for a specific scene and augment this dataset by generating images from
new, unseen angles. Traditionally, techniques such as Image-Based Rendering
(IBR) [16] and Multi-View Stereo (MVS) [17] have been employed for this task.
IBR synthesizes new views by interpolating between existing images, relying on
depth information and pixel correspondences to create smooth transitions. MVS
reconstructs 3D geometry from multiple images to create dense point clouds or
meshes, allowing for rendering from new viewpoints.
Some new methods have been developed to tackle the problem by using depth
information. MultiDiff enables consistent novel view synthesis from a single
RGB image by leveraging depth information and video diffusion priors; it also utilizes
an attention layer, making the quality of the synthesized views realistic. Depth-
Guided Methods [15] extract depth information from provided images to guide
the synthesis process, therefore, improving the accuracy. Gaussian Splatting and
NeRF are developed as two of the new successful approaches for this task.
The Sobel operator [4] and Tenengrad [13] are widely used techniques in image
processing for assessing image sharpness and detecting blur. The Sobel operator
is an edge detection algorithm that computes the image intensity gradient at each
pixel, highlighting areas of rapid intensity change. At its core, the Sobel operator
performs convolution with two 3 × 3 kernels, which are specifically designed to
approximate derivatives in the horizontal and vertical directions. These convo-
lutional filters are crucial for detecting edge information by amplifying inten-
sity changes in the image, making the Sobel operator a fundamental tool in
convolutional-based image processing techniques.
Tenengrad, based on the Sobel operator, is a focus measure that quantifies
image sharpness by summing the squared magnitudes of these gradients across
the entire image. A higher Tenengrad value generally indicates a sharper image
with more defined edges, while a lower value suggests a blurrier image. These
methods are particularly effective because they are sensitive to high-frequency
content in images, which tends to be reduced in blurry or out-of-focus pho-
tographs. These techniques enable objective comparisons of image quality and
can be used in various applications, from autofocus systems in cameras to quality
control in image processing pipelines (Fig. 1).
3 Methods
3.1 Preliminaries: Gaussian Primitives
Fig. 1. Overview of the original Gaussian Splatting pipeline. [1] Gaussian Splatting
consists of three main components: an initialisation component to create the set of
points from the Structure-from-Motion algorithm at the beginning, a projection and
adaptive density control component to optimise the Gaussian Splats and minimise the
photometric loss, and a differentiable rasteriser to rasterise and render the projections
of 3D Gaussians for computing the loss and optimise it using gradient descent.
For each scene, the Gaussian Splatting model [1] perceives it as a set of 3D Gaus-
sian primitives, an explicit representation of the 3D scene. Each 3D Gaussian is
characterized by a set of parameters including a 3D position (mean), a 3D covariance matrix parameterized by a scaling vector and a rotation quaternion, an opacity, and spherical harmonics coefficients encoding view-dependent color.
Fig. 2. Overview of the original CF-3DGS method. [14] The method takes the sequence
of images from the scene as input to learn the 3D Gaussians for the scene and the
camera poses of the frames. Unlike the original Gaussian Splatting method, the CF-
3DGS method optimises two separate sets of Gaussians (local and global Gaussians) in
order to learn the assumed-to-be-affine transformations of the camera poses.
The scene optimization process proposed by Yang et al. suggests removing the
initialization phase using the SfM points of the scene obtained from COLMAP
and replacing it with only the camera intrinsics and the camera poses. For each
frame $I_t$ at step $t$, a set of points is initialized using a monocular-depth estimation
network (in this method, the Dense Prediction Transformer is employed) to estimate
the monocular depth $D_t$. Then, the model learns to optimize the Gaussians for
this specific frame by minimizing a linear combination of the photometric ($\mathcal{L}_1$) loss
and D-SSIM, $\mathcal{L} = (1 - \lambda)\,\mathcal{L}_1 + \lambda\,\mathcal{L}_{\text{D-SSIM}}$, where $\lambda$ is a weighting factor.
Fig. 3. Our method is to calculate the Tenengrad values of the images of each viewpoint
of each scene. The sequence of images would form a distribution of Tenengrad to take
sampling.
In order to reduce the burden of the initial datasets on the process of training
the Gaussian primitives, we propose a simple method (Fig. 3) that has proven to be
effective in discarding frames that do not drastically boost the resolution of the
rendered model, thus saving memory usage and training time.
First, considering an $m \times n$ source frame $A$, we can compute
the Sobel gradients of the image with respect to the $x$ and $y$ directions:
$$G_x = \begin{bmatrix} +1 & 0 & -1 \\ +2 & 0 & -2 \\ +1 & 0 & -1 \end{bmatrix} * A, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A \qquad (4)$$
Then, we can calculate the Tenengrad value of each image, which is essentially
the mean of the squared Sobel gradient magnitude over all pixels. This value
serves as the basis for the scoring and sampling process before reconstructing
the scene.
$$\mathrm{Tenengrad}(A) = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big( G_x^2(i,j) + G_y^2(i,j) \big) \qquad (5)$$
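To make Eqs. (4)-(5) concrete, the following is a minimal NumPy/SciPy sketch of the Tenengrad computation; the function name, the boundary mode, and the use of scipy.ndimage.convolve are illustrative choices of ours, not part of the authors' implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel kernels from Eq. (4)
SOBEL_X = np.array([[+1, 0, -1],
                    [+2, 0, -2],
                    [+1, 0, -1]], dtype=np.float64)
SOBEL_Y = np.array([[+1, +2, +1],
                    [ 0,  0,  0],
                    [-1, -2, -1]], dtype=np.float64)

def tenengrad(image: np.ndarray) -> float:
    """Mean squared Sobel gradient magnitude of a grayscale image (Eq. (5))."""
    a = image.astype(np.float64)
    gx = convolve(a, SOBEL_X, mode="nearest")
    gy = convolve(a, SOBEL_Y, mode="nearest")
    return float(np.mean(gx ** 2 + gy ** 2))
```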
We assume that the images of a single sequence of a scene form a
distribution of Tenengrad values, thus casting the scene-sampling process as
random sampling from this distribution. As mentioned before, a high Tenengrad
value indicates a sharper image with fewer blurry areas; therefore, we assign a
weight to each image based on its normalized Tenengrad value from the distribution
$T_{Scene}(\mu, \sigma)$ as follows:
$$w(A) = \frac{\mathrm{Tenengrad}(A) - \mu}{\sigma} \qquad (6)$$
Using the weights, we randomly sample a proportion $p$ of the images from
the original image set of the scene for training, and the model is trained
exclusively on this subset. We sample randomly from the image set instead of
directly removing the low-quality images from the training set to reduce the
likelihood of catastrophic disruption to the temporal continuity of the sequence,
which would leave a large section of the recorded scene poorly reconstructed.
For our experiments, we sample $p = 80\%$ of the frames from each scene at the beginning.
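As an illustration of the scoring-and-sampling step, the sketch below computes the normalized Tenengrad weights of Eq. (6) and draws a proportion p of the frames. Converting the z-scores into sampling probabilities with a softmax is our own assumption, since the text does not specify that conversion; the function name and seed are likewise hypothetical.

```python
import numpy as np

def sample_frames(tenengrad_scores, p=0.8, seed=0):
    """Sample a proportion p of frames, favouring sharper (high-Tenengrad) ones."""
    scores = np.asarray(tenengrad_scores, dtype=np.float64)
    w = (scores - scores.mean()) / (scores.std() + 1e-8)   # Eq. (6): normalized Tenengrad
    probs = np.exp(w) / np.exp(w).sum()                    # assumed conversion to probabilities
    rng = np.random.default_rng(seed)
    k = int(round(p * len(scores)))
    kept = rng.choice(len(scores), size=k, replace=False, p=probs)
    return np.sort(kept)  # keep the temporal order of the selected frames
```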
Table 1. Novel view synthesis, camera pose estimation, and resource usage. The Church
scene encountered out-of-memory issues when training at the 200% setting, so its results
are redacted in the table. For each scene, the setting with the lowest memory usage and
the shortest training duration is highlighted in bold.
Scene Size SSIM ↑ PSNR ↑ LPIPS ↓ RPE trans ↓ RPE rot ↓ ATE ↓ VRAM (GB) ↓ Time ↓ Status
Church 80% 0.939 31.889 0.083 0.011 0.022 0.002 9.86 3 h 51 m
100% 0.93 30.23 0.11 0.008 0.018 0.002 13.63 5 h 28 m
200% [RED] [RED] [RED] [RED] [RED] [RED] [RED] [RED] OOM
Francis 80% 0.936 34.827 0.125 0.061 0.198 0.006 7.64 1 h 00 m
100% 0.91 32.72 0.14 0.029 0.154 0.006 13.45 2 h 44 m
200% 0.961 37.309 0.095 0.010 0.062 0.003 11.34 3 h 47 m
Family 80% 0.957 33.819 0.061 0.102 0.08 0.004 8.22 0 h 44 m
100% 0.94 31.27 0.07 0.022 0.024 0.002 17.71 1 h 39 m
200% 0.972 36.522 0.042 0.007 0.016 0.002 16.60 5 h 12 m
Ignatius 80% 0.79 25.849 0.139 0.053 0.041 0.005 7.64 0 h 43 m
100% 0.9 28.43 0.09 0.033 0.032 0.005 16.55 1 h 47 m
200% 0.944 32.52 0.061 0.01 0.018 0.003 12.17 3 h 10 m
time and resource usage. From the dataset, we have chosen to train on four scenes:
Francis, Family, Church, and Ignatius (Fig. 4).
To create the interpolated sequences, we follow a data preprocessing protocol
similar to that of the original CF-3DGS settings, with an extra frame interpolation
procedure to mimic the effect of blurry and/or slow recordings of the scenes:
1. From the original sequence of frames for each scene, we conduct the frame
filtering process described in Sect. 3.3 to sample 80% of the frames from the
original sequence.
2. We interpolate by duplicating every frame of the video and average-pooling
consecutive frames in the sequence to impose a slow-down effect, so that the
interpolated sequence contains 200% of the frames of the original sequence
(see the sketch after this list).
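A minimal sketch of the interpolation step in item 2, assuming the frames are stored as a NumPy array of shape (T, H, W, C); the exact interpolation used by the authors may differ from this simple version.

```python
import numpy as np

def interpolate_sequence(frames: np.ndarray) -> np.ndarray:
    """Duplicate every frame and insert the average of consecutive frames,
    mimicking a slow, slightly blurry recording; output has 2x the input frames."""
    out = []
    for i in range(len(frames)):
        out.append(frames[i])                              # original frame
        if i + 1 < len(frames):
            out.append(0.5 * (frames[i] + frames[i + 1]))  # averaged in-between frame
        else:
            out.append(frames[i])                          # duplicate the last frame
    return np.stack(out)
```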
4.2 Settings
We keep most of our training settings the same as the original CF-3DGS settings:
the initial learning rate is $10^{-5}$ and gradually decays to $10^{-6}$ until
convergence. All of our experiments are conducted on a single NVIDIA RTX
A5000 GPU with 24 GB of VRAM.
4.3 Metrics
For novel view synthesis, we employ standard metrics for evaluating the ren-
dering quality of models, including PSNR, SSIM [9], and LPIPS [10], similar to
traditional settings of SfM-free NeRF tasks. For camera pose estimation, we calculate
the difference between the ground-truth camera trajectories of the scene
and those estimated by the models using ATE and RPE.
4.4 Results
Novel View Synthesis. The overall quality of rendered models is indicated
in Table 1. The sampling method that we propose is capable of producing
models with high rendering quality on certain scenes (three out of four scenes
achieve no less than 30 PSNR); however, the quality of the models is inconsistent.
This indicates that choosing a high-quality subset of images from each sequence
does help obtain high-quality real-time rendering of 3D Gaussians with much
less training time than the original results in most cases, as we show later on.
Fig. 4. Rendering results of the CF-3DGS reconstructed scene (right) and the ground
truth for the novel view synthesis task (left), as well as the pose estimation results of
the camera poses of the scene.
Resources Usage and Training Duration. The overall resource usage and
training time of the model are reported in Table 1. On average, the VRAM usage
of the pipeline when performing our sampling method is reduced by 30% to 50%,
and the training time is reduced by 20% to 30%. This shows that the trade-off
between rendering quality and resource utilization can help fit scenes under
computational constraints, enabling low-tier and mid-tier machines to produce
photorealistic reconstructions of the scenes.
5 Conclusion
Gaussians and to reduce the memory usage of the pipeline. The mechanism for
determining the significance of each frame to the reconstruction of the scene,
in other words, how much of the quality and resolution of the output scene is
attributable to each frame, needs further analysis and deeper understanding
to obtain a more efficient and robust frame-skipping mechanism.
References
1. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian Splatting for
real-time radiance field rendering. ACM Trans. Graph. 42(4) (2023)
2. Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Las Vegas (2016)
3. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng,
R.: NeRF: representing scenes as neural radiance fields for view synthesis. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
4. Sobel, I., Feldman, G.: An isotropic 3x3 image gradient operator. In: Stanford
Artificial Intelligence Project (1968)
5. Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in
3D. ACM Trans. Graph. 25(3), 835–846 (2006)
6. Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and Temples: benchmarking
large-scale scene reconstruction. ACM Trans. Graph. 36(4), 1–3 (2017)
7. Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: BARF: bundle-adjusting neural
radiance fields. In: Proceedings of IEEE International Conference on Computer
Vision (ICCV), Online (2021)
8. Bian, W., Wang, Z., Li, K., Bian, J.W., Prisacariu, V.A.: Nope-Nerf: optimising
neural radiance field with no pose prior. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), Vancouver (2023)
9. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment:
from error visibility to structural similarity. IEEE Trans. Image Process. 13(4),
600–612 (2004)
10. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effec-
tiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake
(2018)
11. Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron, J.T., Srinivasan, P.P.:
Ref-NeRF: structured view-dependent appearance for neural radiance fields. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), New Orleans (2022)
12. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction.
In: Proceedings of IEEE International Conference on Computer Vision (ICCV),
Online (2021)
13. Schlag, J.F., Sanderson, A.C., Neuman, C.P., Wimberly, F.C.: Implementation
of Automatic Focusing Algorithms for a Computer Vision System with Camera
Control. Carnegie-Mellon University (1983)
14. Fu, Y., Liu, S., Kulkarni, A., Kautz, J., Efros, A.A., Wang, X.: COLMAP-free 3D
gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (2024). arXiv:2312.07504 [cs.CV]. Available at:
https://arxiv.org/abs/2312.07504
15. Hou, Y., Solin, A., Kannala, J.: Novel view synthesis via depth-guided skip con-
nections. In: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), pp. 1892–1901 (2021)
16. Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. CoRR,
abs/2102.13090, 2021. Available at: https://arxiv.org/abs/2102.13090
17. Poggi, M., Conti, A., Mattoccia, S.: Multi-view guided multi-view stereo (2022).
arXiv:2210.11467 [cs.CV]. Available at: https://arxiv.org/abs/2210.11467
18. Niedermayr, S., Stumpfegger, J., Westermann, R.: Compressed 3D gaussian splat-
ting for accelerated novel view synthesis. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), Seattle (2024)
19. Kerbl, B., Meuleman, A., Kopanas, G., Wimmer, M., Lanvin, A., Drettakis, G.:
A hierarchical 3D gaussian representation for real-time rendering of very large
datasets. ACM Trans. Graph. 43(4), 1–15 (2024)
Enhancing Unsupervised Person
Re-identification with Multi-view Image
Representation
1 Introduction
2 Literature Reviews
2.1 Purely Unsupervised Learning for Person ReID
Unsupervised Learning for Person ReID (USL ReID) is challenging yet flexible
because it uses unlabeled data for training, which makes it better suited to
real-world deployment environments. Traditional methods use metric learning for
person retrieval. Many USL ReID methods have emerged as advances in clustering
algorithms removed earlier performance bottlenecks [16, 25]. Using pseudo-labels from
clustering algorithms or similarity estimation, these methods train models on
unlabeled data as if it were labeled [1, 3, 10]. Examples include SpCL [10], a self-
paced contrastive learning framework using instance-level memory, and Cluster
Contrast [4], which addresses inconsistent updates of memory class centroids.
ICE [1] applies camera-aware, hard-sample mining, and soft-label concepts to
contrastive learning. Some methods have ignored noise in pseudo labels generated
by unsupervised clustering algorithms. Thus, our research focuses on creating a
new USL approach to overcome unsupervised clustering algorithm limitations.
3 Proposed Method
3.1 Approach Direction
To address the challenges described in the previous sections, we approach the solution
with the following ideas: (i) modifying the ResNet-based backbone to extract
more information from the image, and (ii) constructing an unsupervised learning
architecture that integrates a memory-bank-based contrastive loss and a
diversity loss to enhance image representations. Our method is illustrated
in Fig. 1(A) and can be described in detail as follows.
Let $\mathcal{D} = \{x_i\}_{i=1}^{N}$ denote an unlabeled dataset, where each $x_i$ represents the
$i$-th image and $N$ is the total number of images. The goal of the USL ReID
task is to train an image encoder $E$ in an unsupervised manner, producing
ReID features $\mathcal{F} = \{f_i\}_{i=1}^{N}$. During inference, these ReID features are used
for identity retrieval. Typically, the training process of clustering-based USL
methods alternates between two stages:
Stage I: Clustering. At the beginning of each epoch, training samples are
clustered using the DBSCAN algorithm. The cluster ID $y_i \in \mathcal{C}$, where $\mathcal{C}$ is
the set of cluster IDs, serves as a one-hot pseudo label for network optimization.
Based on the clustering results, a cluster-based memory bank $\mathcal{M} = \{m_i\}_{i=1}^{C}$ is
initialized using cluster centroids, where $m_i = \frac{1}{|\mathcal{C}_i|} \sum_{f_j \in \mathcal{C}_i} f_j$, with $f_j$ representing
the feature of the $j$-th sample in cluster $\mathcal{C}_i$ and $|\mathcal{C}_i|$ denoting the cluster size.
Stage II: Network Training. Once the pseudo labels are obtained, the network
undergoes optimization in a manner akin to supervised learning. The training
objective employed is ClusterNCE [4], which is defined as follows:
$$L = -\log \frac{\exp(\mathrm{Sim}(f, m_+)/\tau)}{\sum_{j=1}^{C} \exp(\mathrm{Sim}(f, m_j)/\tau)} \qquad (1)$$
In this equation, $m_+$ denotes the centroid of the cluster to which the feature
vector $f$ belongs, while $m_j$ represents the $j$-th centroid within the memory
bank. The function $\mathrm{Sim}(u, v)$ calculates the cosine similarity between vectors $u$
and $v$, and $\tau$ is the temperature parameter that controls the sharpness of the
distribution.
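For illustration, a minimal PyTorch sketch of the ClusterNCE objective in Eq. (1) is given below; the tensor shapes, the temperature value, and the function name are our assumptions rather than the original implementation.

```python
import torch
import torch.nn.functional as F

def cluster_nce_loss(features, labels, memory, tau=0.05):
    """ClusterNCE-style loss (Eq. (1)).

    features: (B, D) ReID features of a mini-batch
    labels:   (B,)   cluster IDs of those features
    memory:   (C, D) cluster centroids; tau is the temperature (placeholder value).
    """
    f = F.normalize(features, dim=1)
    m = F.normalize(memory, dim=1)
    logits = f @ m.t() / tau            # cosine similarities scaled by temperature
    # cross-entropy over centroids = -log softmax at the positive centroid
    return F.cross_entropy(logits, labels)
```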
The memory bank, which stores the cluster centroids, is updated in a
momentum-based manner, similar to the approaches used in previous works
such as Momentum Contrast [11] and HHCL [14]. The normal update rule for
the memory bank is given by:
$$m_i \leftarrow \beta m_i + (1 - \beta) f \qquad (2)$$
In this context, the momentum coefficient $\beta$ determines how much the new feature
vector $f$ affects the current centroid $m_i$; $f$ denotes the feature of an instance
of the $i$-th cluster in the current mini-batch. With this momentum update technique,
centroids are continuously refined over time, incorporating fresh data while preserving
stability. However, the holistic distribution might not be captured by Eq. 2, which
uses either the hardest sample or the average centroid. To tackle this, we employ
Dynamic Clustering Contrastive Learning (DyCL) [13], where the memory momentum
updates are given dynamic weights. With this strategy, the model can fully use
reliable data in the global context. We assign weights to the instances similar to
each query instance, with harder examples receiving larger weights, following the
hard-sample mining strategy of the Triplet Loss [17]. The sample weights are
determined using a softmax function, which highlights the significance of hard
cases and keeps the weights normalized:
Fig. 2. A person image is represented as multi-views, and the response between the
different views of images is considered together to determine the match.
$$w_{ij}^{dy} = \frac{\exp(-m_i \cdot f_j / \tau_w)}{\sum_{t=1}^{N_i} \exp(-m_i \cdot f_t / \tau_w)} \qquad (3)$$
where $\tau_w$ is the temperature coefficient hyper-parameter that affects the proportion
of weights for hard instances, $N_i$ is the number of instances of the $i$-th class in a
mini-batch, and $f_t$ is the $t$-th instance feature of the $i$-th class. Note that the weights
sum to one, $\sum_{j=1}^{N_i} w_{ij}^{dy} = 1$. Thus, the $i$-th dynamic cluster centroid is the weighted
mean over the mini-batch:
$$\hat{m}_i = \sum_{j=1}^{N_i} w_{ij} f_j \quad \text{and} \quad m_i \leftarrow \gamma m_i + (1 - \gamma)\hat{m}_i \qquad (4)$$
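The following PyTorch sketch illustrates Eqs. (3)-(4), i.e. the softmax-weighted dynamic centroid and its momentum update; the hyper-parameter values and the per-cluster loop are illustrative choices of ours, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dycl_update(memory, features, labels, gamma=0.9, tau_w=0.09):
    """Dynamic, weighted momentum update of the cluster memory (Eqs. (3)-(4)).

    memory:   (C, D) cluster centroids
    features: (B, D) L2-normalised features of the current mini-batch
    labels:   (B,)   cluster IDs of those features
    gamma and tau_w are placeholder values, not the paper's settings.
    """
    for c in labels.unique():
        f_c = features[labels == c]                    # instances of cluster c
        sims = -(memory[c] @ f_c.t()) / tau_w          # harder samples get larger weights
        w = F.softmax(sims, dim=0)                     # Eq. (3), weights sum to 1
        m_hat = (w.unsqueeze(1) * f_c).sum(dim=0)      # weighted mean, Eq. (4)
        memory[c] = gamma * memory[c] + (1 - gamma) * m_hat
    return memory
```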
regions, which are crucial for object recognition. Detailed features of human
body parts, which might be overlooked by pooling operations, are essential for
identifying individuals and generating pseudo labels. Additionally, most meth-
ods treat images as individual feature embeddings, overlooking the fact that
they can be described from multiple views, especially images that may contain
diverse information. As shown in Fig. 2, an image of a person can be viewed from
different perspectives, each focusing on different parts of the image and poten-
tially overlapping regions. Different views emphasize various aspects. Therefore,
multi-view embedding offers more comprehensive semantic information, enabling
the model to better adapt to various semantic contexts.
We propose a novel architecture called the Multi-View Image Representation
(MVIR), illustrated in Fig. 1(B). Given an image $I$, we extract partial features
by horizontally dividing the feature map into $K$ uniformly partitioned regions
and applying GMP layers. This results in a set of $K$ partial features
$P = \mathrm{stack}(\{f_{p_1}, \dots, f_{p_K}\} \,|\, f_{p_i} \in \mathbb{R}^D) \in \mathbb{R}^{K \times D}$. Instead of using these
features directly for representation, we construct a multi-view representation.
Specifically, we create $m$ learnable view codes $(c_1, \dots, c_m)$ as queries for attention,
where $c_i \in \mathbb{R}^D$. We then calculate $m$ view attentions $(A_1, \dots, A_m)$, where
$A_i = \exp(P \cdot c_i) \in \mathbb{R}^K$ is a weight vector corresponding to the $K$ partial features.
$$L_{objective} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{ce}(M_q f_i, y_i) + \lambda L_{div} \qquad (6)$$
N i=1
where .ce refers to the cross-entropy loss and .λ is a control parameter of the
diversity loss.
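As a rough illustration of the MVIR idea, the sketch below lets m learnable view codes attend over the K partial features; the softmax normalisation of the attention weights and the weighted-sum aggregation of the views are our assumptions, since part of the original description is not reproduced here.

```python
import torch
import torch.nn as nn

class MultiViewRepresentation(nn.Module):
    """Sketch of MVIR: m learnable view codes used as attention queries over the
    K horizontally-pooled part features. Aggregation details are assumptions."""

    def __init__(self, feat_dim: int, num_views: int):
        super().__init__()
        self.view_codes = nn.Parameter(torch.randn(num_views, feat_dim))  # (m, D)

    def forward(self, partial_feats: torch.Tensor) -> torch.Tensor:
        # partial_feats: (B, K, D) stack of partial features P
        attn = torch.einsum("bkd,md->bmk", partial_feats, self.view_codes)
        attn = attn.softmax(dim=-1)                                 # weights over the K parts
        views = torch.einsum("bmk,bkd->bmd", attn, partial_feats)   # (B, m, D) view features
        return views
```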
4.2 Results
Table 1. Performance of the baseline model with and without our proposals (%) on
the datasets.
74.8% and 96.6% in mAP and Rank-1 (R1) metrics on the Market benchmark,
respectively, while it obtains 14.2% and 35.2% in mAP and Rank-1 (R1) on the
MSMT benchmark.
Effectiveness of MVIR and Diversity Loss. To evaluate the efficacy of our
proposed method, we conducted a series of independent experiments by incremen-
tally adding various components during the training process. Firstly, we examined
the efficiency of our proposed image representation method. To determine the
optimal number .K of local parts used to generate views, we fine-tuned .K from
1 to 6, where .K = 1 indicates that the model does not utilize MVIR. As illus-
trated in Fig. 3, the best performance in the Rank-1 metric is achieved when .K
is set to 4. Table 1 also demonstrates that the accuracy of the models increased
by approximately 1% to 2% across all metrics on both datasets. Subsequently, we
incorporated the diversity loss into the training process. The .λ parameter was fine-
tuned. As shown in Fig. 4, the best performance in the Rank-1 metric is obtained
when .λ is set to 1. However, further increasing .λ results in a significant drop in
performance. Table 1 also indicates that this loss slightly improves the model’s
performance in both Rank-1 and mAP metrics across all datasets.
Fig. 4. Evaluation on Market with varying $\lambda$ values, with $K = 4$ horizontally split parts.
Table 2. Comparison of SOTA unsupervised learning methods for person ReID (%).
Bold denotes the best while Underline indicates the second best.
MMCL [23] CVPR 45.5 80.3 89.4 92.3 11.2 35.4 44.8 49.8
we surpass GCL by nearly 10% in mAP and show a competitive Rank-1 accuracy
that is close to SpCL’s 90.8%. When evaluating the MSMT dataset, our perfor-
mance is higher than MMCL's and comparable to SpCL's results. In general, these
results highlight the robustness of our approach across well-known datasets and
underscore its potential for a new image representation in person ReID tasks.
5 Conclusions
In this study, we introduce a new way to represent images in the feature space,
called Multi-View Image Representation (MVIR), tailored for unsupervised person
ReID. Our approach leverages both global and local image contexts to
enhance the discriminativeness of representations within the feature space by
utilizing the relationships between part features. We also propose a diversity
loss that encourages the view features to explore different information, further
improving the effectiveness of MVIR. Our experiments on the challenging
Market-1501 and MSMT17 benchmarks show that the proposals bring potential
enhancements to traditional methods that use the ResNet architecture as the
backbone.
Our proposed method still has room for improvement. The current method must
synthesize local information to create new image representations. In addition,
pseudo-label generation by unsupervised clustering assumes a low percentage of
inaccurate labels; as shown on MSMT17, using raw pseudo-labels from this process
leads to low performance on large and difficult datasets. Thus, in future work we
will focus on optimizing our backbone architecture and pseudo-label generation.
References
1. Chen, H., Lagadec, B., Bremond, F.: Ice: inter-instance contrastive encoding for
unsupervised person re-identification. In: IEEE/CVF International Conference
on Computer Vision (ICCV), pp. 14940–14949 (2021). https://doi.org/10.1109/
ICCV48922.2021.01469
2. Chen, H., Wang, Y., Lagadec, B., Dantcheva, A., Bremond, F.: Joint generative
and contrastive learning for unsupervised person re-identification. In: IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2004–2013
(2021). https://doi.org/10.1109/CVPR46437.2021.00204
3. Cho, Y., Kim, W.J., Hong, S., Yoon, S.E.: Part-based pseudo label refinement
for unsupervised person re-identification. In: IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 7298–7308 (2022). https://doi.org/
10.1109/CVPR52688.2022.00716
4. Dai, Z., Wang, G., Yuan, W., Zhu, S., Tan, P.: Cluster contrast for unsupervised
person re-identification. In: ACCV: 16th Asian Conference on Computer Vision,
pp. 319–337 (2022). https://doi.org/10.1007/978-3-031-26351-4_20
5. Deng, D.: Dbscan clustering algorithm based on density. In: 2020 7th International
Forum on Electrical Engineering and Automation (IFEEA), pp. 949–953 (2020).
https://doi.org/10.1109/IFEEA51475.2020.00199
6. Ding, J., Zhou, X.: Learning feature fusion for unsupervised domain adaptive per-
son re-identification. In: 2022 26th International Conference on Pattern Recog-
nition (ICPR), pp. 2613–2619 (2022). https://doi.org/10.1109/ICPR56361.2022.
9956264
7. Du, H.P., Nguyen, A.D., Nguyen, D.T., Nguyen, H.N.: μPEWFace: parallel ensemble
of weighted deep convolutional neural networks with novel loss functions for face-
based authentication. Image Vis. Comput. 139(104819) (2023). https://doi.org/
10.1016/j.imavis.2023.104819
8. Du, H.P., Nguyen, A.D., Nguyen, D.T., Nguyen, H.N., Nguyen, D.: A novel deep
ensemble learning to enhance user authentication in autonomous vehicles. IEEE
Trans. Autom. Sci. Eng. 21(3), 2362–2373 (2024). https://doi.org/10.1109/TASE.
2023.3270764
9. Ge, Y., Chen, D., Li, H.: Mutual mean-teaching: pseudo label refinery for unsuper-
vised domain adaptation on person re-identification. In: International Conference
on Learning Representations (2020)
10. Ge, Y., Zhu, F., Chen, D., Zhao, R., Li, H.: Self-paced contrastive learning with
hybrid memory for domain adaptive object re-id. In: Proceedings of the 34th Inter-
national Conference on Neural Information Processing Systems (NIPS) (2020).
https://doi.org/10.5555/3495724.3496673
11. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised
visual representation learning. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 9726–9735 (2020). https://doi.org/10.1109/
CVPR42600.2020.00975
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
13. He, Z., Xue, M., Du, Y., Zhao, Z., Su, F.: Dynamic clustering and cluster con-
trastive learning for unsupervised person re-id with feature distribution align-
ment. In: IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), pp. 3610–3614 (2024). https://doi.org/10.1109/ICASSP48485.
2024.10447711
14. Hu, Z., Zhu, C., He, G.: Hard-sample guided hybrid contrast learning for unsu-
pervised person re-identification. In: 2021 7th IEEE International Conference on
Network Intelligence and Digital Content (IC-NIDC), pp. 91–95 (2021). https://
doi.org/10.1109/IC-NIDC54101.2021.9660560
15. Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-
text matching. In: ECCV 2018: 15th European Conference on Computer Vision,
pp. 212–228 (2018). https://doi.org/10.1007/978-3-030-01225-0_13
16. Lin, Y., Dong, X., Zheng, L., Yan, Y., Yang, Y.: A bottom-up clustering approach
to unsupervised person re-identification. In: Proceedings of the Thirty-Third AAAI
Conference on Artificial Intelligence. AAAI’19. AAAI Press (2019). https://doi.
org/10.1609/aaai.v33i01.33018738
17. Ming, Z., Chazalon, J., Luqman, M.M., Visani, M., Burie, J.C.: Simple triplet
loss based on intra/inter-class metric learning for face verification. In: 2017 IEEE
International Conference on Computer Vision Workshops (ICCVW), pp. 1656–
1664 (2017). https://doi.org/10.1109/ICCVW.2017.194
18. Nguyen, A.D., Nguyen, D.T., Dao, H.N., Le, H.H., Tran, N.Q.: Impact analysis of
different effective loss functions by using deep convolutional neural network for face
recognition. In: From Born-Physical to Born-Virtual: Augmenting Intelligence in
Digital Libraries, pp. 101–111 (2022). https://doi.org/10.1007/978-3-031-21756-
2_8
19. Nguyen, A.D., Pham, D.H., Nguyen, H.N.: GAN-based data augmentation
and pseudo-label refinement for unsupervised domain adaptation person re-
identification. In: Computational Collective Intelligence, pp. 591–605 (2023).
https://doi.org/10.1007/978-3-031-41456-5_45
20. Pham, D.H., Nguyen, A.D., Nguyen, H.N.: GAN-based data augmentation and
pseudo-label refinement with holistic features for unsupervised domain adaptation
person re-identification. Knowl.-Based Syst. 288, 111471 (2024). https://doi.org/
10.1016/j.knosys.2024.111471
21. Qu, L., Liu, M., Cao, D., Nie, L., Tian, Q.: Context-aware multi-view summariza-
tion network for image-text matching. In: Proceedings of the 28th ACM Interna-
tional Conference on Multimedia. MM ’20, New York, NY, USA, pp. 1047–1055.
Association for Computing Machinery (2020). https://doi.org/10.1145/3394171.
3413961
22. Si, T., Zhang, Z., Liu, S.: Compact triplet loss for person re-identification in cam-
era sensor networks. Ad Hoc Netw. 95, 101984 (2019). https://doi.org/10.1016/j.
adhoc.2019.101984
23. Wang, D., Zhang, S.: Unsupervised person re-identification via multi-label clas-
sification. In: IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pp. 10978–10987 (2020). https://doi.org/10.1109/CVPR42600.2020.
01099
24. Wei, L., Zhang, S., Gao, W., Tian, Q.: Person transfer GAN to bridge domain
gap for person re-identification. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pp. 79–88 (2018). https://doi.org/10.1109/CVPR.2018.
00016
25. Zeng, K., Ning, M., Wang, Y., Guo, Y.: Hierarchical clustering with hard-batch
triplet loss for person re-identification. In: IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 13654–13662 (2020). https://doi.org/
10.1109/CVPR42600.2020.01367
26. Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-
identification: a benchmark. In: 2015 IEEE International Conference on Computer
Vision (ICCV), pp. 1116–1124 (2015). https://doi.org/10.1109/ICCV.2015.133
Boosting Image Super-Resolution:
Incorporating Locally-Enhanced FFN
and Data Augmentation in the Swin
Transformer Architecture
1 Introduction
2 Related Work
Data Augmentation. Augmentation strategies for SR can be classified into
two types based on where they are applied: pixel-domain and feature-domain
techniques [1]. Pixel-domain augmentations, such as CutBlur [1], CutMix [21],
or Cutout [7], operate directly on the raw image data. Meanwhile, feature-domain
In this work, we propose the SwinIR-LeCut model, which incorporates two key
improvements into the SwinIR model. First, we apply the CutBlur data
augmentation technique to increase the diversity of training images. Second,
we integrate the LeFF layer into the Swin Transformer blocks to improve the
model's ability to capture both channel-wise and spatial feature interactions
effectively. These enhancements leverage diverse training data and more efficient
feature modeling to boost overall model performance [25]. More precisely, the
number of training images is first increased by using the CutBlur [1] data
augmentation technique to build the final training set defined as $D_{Mixed\ Final}$ in
Sect. 3.1 above. The training then starts with the Shallow Feature Extraction,
where the initial representation of each training LR image is extracted. After
that, the New Deep Feature Extraction (nDF) phase begins with the Residual
Swin Transformer LeCut Blocks (RSTLB) ($m$ blocks), each containing multiple LeSwin
Transformer layers ($k$ layers) in which the LeFF layer [19] is added.
Each LeSwin Transformer layer (LeSTL) consists of three main components:
Window Self-Attention (WSA), LeFF, and an MLP with two FC layers, with
LayerNorm applied before each component to normalize the input (see Fig. 1).
At the final stage, the new HR image is constructed from the learned features,
which are finally upsampled by PixelShuffle.
modeled as the following steps:
Shallow Feature Extraction: The shallow feature extraction component captures
the LR features from the inputs. As defined in [12], given an input LR image
$I_{LR} \in D_{Mixed\ CutBlur\ LR}$, the operation can be formulated as:
$$F_0 = H_{SFE}(I_{LR}) \qquad (4)$$
where $F_0 \in \mathbb{R}^{W \times H \times C}$ is the feature map extracted from the LR input image $I_{LR}$
by the Shallow Feature Extraction layer $H_{SFE}$.
New Deep Feature Extraction: After the first extraction, the deep features
are captured by this second component. The enhancement behind this step is
motivated by the need to address specific limitations in SwinIR's handling of
intricate local textures. LeFF [19] introduces locally enhanced convolutional
layers that focus on refining texture details at the pixel level, making it well-suited
for this purpose. The process can be modeled as follows:
$$F_{nDF} = H_{nDF}(F_0) \qquad (5)$$
where $F_{nDF} \in \mathbb{R}^{W \times H \times C}$ is the deep feature map extracted by the proposed
Deep Feature Extraction module $H_{nDF}$, which can be formulated as:
$$H_{nDF}(F_0) = \big(H_{LeSTL_k} \circ \cdots \circ H_{LeSTL_1}\big)(F_0), \quad \text{i.e.,}\ F_i = H_{LeSTL_i}(F_{i-1}),\ i = 1, \dots, k \qquad (6)$$
where $H_{LeSTL_i}(\cdot)$ is the proposed LeSwin Transformer layer (LeSTL) inside the
RSTLB block. Each RSTLB block contains $k$ LeSTL layers.
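For reference, a minimal PyTorch sketch of a LeFF layer in the spirit of [19] is shown below: a pointwise expansion, a 3x3 depthwise convolution applied on the spatial layout of the tokens, and a pointwise projection back. The hidden ratio and activation are illustrative choices, and this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Locally-enhanced feed-forward layer sketch (after [19])."""

    def __init__(self, dim: int, hidden_ratio: int = 4):
        super().__init__()
        hidden = dim * hidden_ratio
        self.fc1 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C) token sequence; h, w give the spatial layout of the tokens
        b, n, c = x.shape
        x = self.fc1(x)                                   # (B, N, hidden)
        x = x.transpose(1, 2).reshape(b, -1, h, w)        # to (B, hidden, H, W)
        x = self.dwconv(x)                                # local spatial mixing
        x = x.flatten(2).transpose(1, 2)                  # back to (B, N, hidden)
        return self.fc2(x)                                # (B, N, C)
```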
Image Construction: This component is the final step before producing the
super-resolved output image. In this step, the output image is upsampled from
the extracted deep representation while the predicted high-quality details are
preserved. According to [12], this process can be modeled as:
4 Experiments
4.1 Datasets
Set5 [2] (5 images) and Set14 [22] (14 images) contain mixtures of simple scenes,
including animals, people, and landscapes. They are commonly used for initial
testing of SR models due to their small sizes.
BSD100 [13] consists of 100 natural images, featuring diverse subjects like build-
ings, animals, and landscapes. It challenges models with more detailed textures
and variety, compared to Set5 and Set14.
4.2 Baselines
We compare the proposed model with the following baselines: Residual Channel
Attention Network (RCAN) [24], Second-order Attention Network (SAN) [6],
Internal Graph Neural Network (IGNN) [26], Holistic Attention Network (HAN)
[16], Non-Local Sparse Attention (NLSA) [15], and SwinIR [12].
Fig. 2. Visual comparison of the images super-resolved by SwinIR (left) and our pro-
posed SwinIR-LeCut model (right).
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 38.27 34.12 32.41 33.34 39.44
SAN [6] 38.31 34.07 32.42 33.1 39.32
IGNN [26] 38.24 34.07 32.41 33.23 39.35
HAN [16] 38.27 34.16 32.41 33.35 39.46
NLSA [15] 38.34 34.08 32.43 33.4 39.59
SwinIR [12] 38.35 34.14 32.44 33.40 39.60
SwinIR-LeCut 38.37 34.17 32.45 33.41 39.61
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 0.9614 0.9216 0.9027 0.9384 0.9786
SAN [6] 0.962 0.9213 0.9028 0.937 0.9792
IGNN [26] 0.9613 0.9217 0.9025 0.9383 0.9786
HAN [16] 0.9614 0.9217 0.9027 0.9385 0.9785
NLSA [15] 0.9618 0.9231 0.9027 0.9394 0.9789
SwinIR [12] 0.9620 0.9227 0.903 0.9393 0.9792
SwinIR-LeCut 0.9622 0.9232 0.903 0.9393 0.9792
and SSIM metrics slightly increase, showing that the refinements made to the
model enhance the quality of super-resolved images on both simple and medium-
complexity datasets (Set5 and Set14). For example, Set5 moves from 38.35 to
38.37 PSNR, and Set14 benefits from an increase in PSNR from 34.14 to 34.17,
demonstrating the model’s adaptability to varied textures. In more complex
datasets like BSD100, the improvements are minimal. Although BSD100 is more
challenging due to its intricate textures and noise patterns, our model main-
tains a competitive performance, suggesting that the changes made are effective
in handling these intricacies. On the highly detailed Manga109 and Urban100
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 32.63 28.87 27.77 26.82 31.22
SAN [6] 32.64 28.92 27.78 26.79 31.18
IGNN [26] 32.57 28.85 27.77 26.84 31.28
HAN [16] 32.64 28.9 27.8 26.85 31.42
NLSA [15] 32.59 28.87 27.78 26.96 31.27
SwinIR [12] 32.72 28.94 27.83 27.07 31.67
SwinIR-LeCut 32.74 28.96 27.82 26.72 31.68
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] 0.9002 0.7889 0.7436 0.8087 0.9173
SAN [6] 0.9003 0.7888 0.7436 0.8068 0.9169
IGNN [26] 0.8998 0.7891 0.7434 0.809 0.9182
HAN [16] 0.9002 0.789 0.7442 0.8094 0.9177
NLSA [15] 0.9 0.7891 0.7444 0.8109 0.9184
SwinIR [12] 0.9021 0.7914 0.7459 0.8164 0.9226
SwinIR-LeCut 0.9023 0.7916 0.7459 0.8112 0.9227
datasets, the model performs similarly to its baseline (SwinIR), indicating that
while the improvements help with general datasets, further refinements are
needed to deliver more precise super-resolved images on such complex datasets,
where intricate architectural elements occur at high frequency [11].
×4 Scale Image Super-Resolution. For scale ×4, the performance of the
SwinIR-LeCut is more variable. Set5 and Set14 continue to exhibit increases
in both PSNR and SSIM metrics, confirming that the model still handles sim-
pler datasets well (Table 3 and Table 4). However, the results for more complex
datasets such as Urban100 show a slight decline in performance at the ×4 scale
factor, where PSNR drops from 27.07 to 26.72, and SSIM decreases from 0.8164
to 0.8112, indicating that the model struggles with upscaling fine-grained tex-
tures and highly detailed images. This reduction in performance highlights the
need for additional enhancement of high-frequency details and complex structural
features in urban data [4], a limitation already visible in the model's results at the
×2 scale factor on these datasets in the SSIM metric (Table 2). Meanwhile,
BSD100 maintains its performance in both metrics; additional enhancements may
be necessary to address the noise level of this dataset before visible gains can be
obtained. The results for Manga109
remain strong but exhibit minimal changes, reflecting that the augmentation
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
RCAN [24] −0.2% −0.05% −0.09% −0.1% −0.4%
SAN [6] −0.1% −0.2% −0.06% −0.8% −0.7%
IGNN [26] −0.2% −0.2% −0.09% −0.5% −0.6%
HAN [16] −0.2% +0.05% −0.09% −0.1% −0.3%
NLSA [15] −0.02% −0.1% −0.03% 0% −0.02%
SwinIR-LeCut +0.05% +0.08% +0.03% +0.02% +0.02%
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
SwinIR [12] 38.35 34.14 32.44 39.6 33.4
SwinIR-CutBlur 38.36 34.16 32.45 39.6 33.4
Method Set5 [2] Set14 [22] BSD100 [13] Urban100 [10] Manga109 [14]
SwinIR [12] 38.35 34.14 32.44 39.6 33.4
SwinIR-LeFF 38.35 34.15 32.44 39.61 33.41
5 Ablation Study
For the ablation study, we conduct experiments on SwinIR [12] with the
CutBlur data augmentation technique [1] and on SwinIR [12] with the additional
LeFF layer [19] in the Swin Transformer blocks, separately, on all introduced
datasets. The results are presented in Table 6 and Table 7.
Impact of CutBlur on the Performance of SwinIR. Our experimental
findings from this ablation study demonstrate that integrating CutBlur [1] into
SwinIR [12] positively impacts the model’s overall performance, particularly on
simpler datasets like Set5 [2], Set14 [22], and even moderately complex ones such
as BSD100 [13] (Table 6). This validates the effectiveness of CutBlur [1].
Impact of LeFF on the Performance of SwinIR. The ablation study on
the incorporation of the LeFF layer [19] in SwinIR [12] reveals that the additional
LeFF layer enhances the model's ability to capture potential local features, leading
to improved performance on more complex datasets like Urban100 [10] and
Manga109 [14] (Table 7).
References
1. Ahn, N., Yoo, J., Sohn, K.A.: Data augmentation for low-level vision: cutblur and
mixture-of-augmentation, pp. 2041–2059 (2024)
2. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-Complexity
Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding.
BMVA Press (2012)
3. Chen, Z., Guo, Y., Zhou, Z.: Pretrained image transformer for image super-
resolution. In: Proceedings of the European Conference on Computer Vision
(ECCV) (2020)
4. Conde, M.V., Choi, U.J., Burchi, M., Timofte, R.: Swin2sr: Swinv2 transformer
for compressed image super-resolution and restoration (2022). https://arxiv.org/
abs/2209.11345
5. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning
augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 113–123 (2019)
6. Dai, T., Cai, J., Zhang, Y., Xia, S.T., Zhang, L.: Second-order attention network
for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 11065–11074 (2019)
7. DeVries, T.: Improved regularization of convolutional neural networks with cutout
(2017)
8. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convo-
lutional networks (2016)
9. Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th Interna-
tional Conference on Pattern Recognition, pp. 2366–2369. IEEE (2010)
10. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed
self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 5197–5206 (2015)
11. Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from trans-
formed self-exemplars. In: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 5197–5206 (2015). https://doi.org/10.1109/CVPR.2015.
7299156
12. Liang, J., Cao, J., Sun, G., Zhang, K., Gool, L.V., Timofte, R.: Swinir: image
restoration using SWIN transformer. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV), pp. 1833–1844 (2021)
13. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural
images and its application to evaluating segmentation algorithms and measuring
ecological statistics. In: Proceedings eighth IEEE international conference on com-
puter vision. ICCV 2001, vol. 2, pp. 416–423. IEEE (2001)
14. Matsui, Y., et al.: Sketch-Based Manga Retrieval Using Manga109 Dataset, vol. 76,
pp. 21811–21838. Springer (2017)
15. Mei, Y., Fan, Y., Zhou, Y.: Image super-resolution with non-local sparse attention.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 3517–3526 (2021)
16. Niu, B., et al.: Single image super-resolution via a holistic attention network. In:
Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol.
12357, pp. 191–207. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-
58610-2_12
17. Pires, T.P., Lopes, A.V., Assogba, Y., Setiawan, H.: One wide feedforward is all
you need (2023). https://arxiv.org/abs/2309.01826
18. Pratt, W.K.: Digital Image Processing. Wiley (2001)
19. Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: a general u-shaped
transformer for image restoration. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 17683–17693 (2022)
20. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse rep-
resentation (2010)
21. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization
strategy to train strong classifiers with localizable features. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
22. Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-
representations. In: Curves and Surfaces: 7th International Conference, Avignon,
France, June 24-30, 2010, Revised Selected Papers 7, pp. 711–730. Springer (2012)
23. Zhang, H.: Mixup: beyond empirical risk minimization (2017)
24. Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution
using very deep residual channel attention networks. vol. abs/1807.02758 (2018).
http://arxiv.org/abs/1807.02758
25. Zheng, Q., Xu, H., Bian, M.: Image super-resolution using a enhanced SWIN trans-
former network. In: 2023 3rd International Symposium on Computer Technology
and Information Science (ISCTIS), pp. 1151–1155 (2023). https://doi.org/10.1109/
ISCTIS58954.2023.10213090
26. Zhou, S., Zhang, J., Zuo, W., Loy, C.C.: Cross-scale internal graph neural network
for image super-resolution. Adv. Neural. Inf. Process. Syst. 33, 3499–3509 (2020)
Dual-Domain Reconstruction Network
for Enhancing Sparse-View and Low-Dose
CT Imaging
1 Introduction
Computed tomography (CT) imaging, while indispensable for diagnostic pur-
poses, poses significant radiation exposure risks to patients due to its widespread
use [1]. To address this concern, numerous strategies have been proposed to min-
imize radiation dose while maintaining image quality. Sparse-view scanning [2],
which involves acquiring projections from a reduced quantity of angles, offers
a promising approach. However, conventional reconstruction methods, such as
Filtered Back Projection (FBP) [3], struggle to generate high-quality images
from sparse data, often resulting in artifacts and noise. FBP, although computa-
tionally efficient, is prone to noise amplification and aliasing artifacts, especially
2 Related Work
2.1 Computed Tomography Reconstruction
The Filtered Back Projection (FBP) algorithm [3] is a widely used method for CT
image reconstruction, but it has notable limitations, such as failing to account
for noise, X-ray spectrum variability, and sensor characteristics. Dual-domain
deep learning methods, such as DRONE [11], CDDCN [12], CLRecon [13], and
DualCNN [14], have been developed to fully leverage information from both
sinogram and image domains, enabling simultaneous enhancement of sinograms
and reconstructed images. Despite their effectiveness, these dual-domain joint
3 Proposed Method
3.1 DD-ReconNet Architecture
As illustrated in Fig. 1, the DD-ReconNet is designed with three primary stages:
Sinogram Restoration, employing an SR-Module; CT Reconstruction, utilizing
the FBPConvNet method with direct inversion followed by a CNN to deal
with normal-convolutional inverse problems (as discussed in detail by [26]); and
CT Restoration, leveraging an IR-Module. Given a sparse-view sinogram input
denoted as $Y \in \mathbb{R}^{H_S \times W_S}$, an enhanced sinogram, represented by $\tilde{Y} \in \mathbb{R}^{H_S \times W_S}$,
is generated through the SR-Module. Subsequently, the FBPConvNet method
is applied to sequentially reconstruct two low-quality CT images, $\tilde{X}_1$ and $\tilde{X}_2$,
both of which belong to the space $\mathbb{R}^{H_I \times W_I}$, from $Y$ and the enhanced sinogram
$\tilde{Y}$. Lastly, the concatenated output of $\tilde{X}_1$ and $\tilde{X}_2$ is fed into the IR-Module to
yield the final, high-quality CT image, denoted as $\tilde{X} \in \mathbb{R}^{H_I \times W_I}$.
$$F_i = \text{IE-Conv}\big(\mathrm{SwinV2}_N(\cdots \mathrm{SwinV2}_1(F_{i-1}))\big) + F_{i-1},$$
where the $N$ SwinV2 blocks within a group are applied sequentially before the
IE-Conv layer and the residual connection.
Within the IE-Conv layer, a Sobel branch extracts the first-order gradient, where
$B_{S_x}$ and $B_{S_y}$ are the biases of the horizontal and vertical Sobel convolutions,
respectively. Additionally, the Laplacian filter is utilized to extract the second-order
gradient, with Laplacian operators based on 4-connected and 8-connected neighborhoods.
The second-order gradient of the latent feature map is obtained from these Laplacian
convolutions, where $B_{L_4}$ and $B_{L_8}$ are the biases of the 4- and 8-connected-neighborhood
Laplacian convolutions, respectively. Let the parameters $\alpha_1$, $\alpha_2$, and $\alpha_3$ serve as
learnable competitive coefficients for each branch. These coefficients are regulated
by a simple softmax function, ensuring that the IE-Conv framework effectively
preserves high-frequency feature information. The complete feature extraction
process within the IE-Conv layer is as follows:
$$F_{IE} = \alpha_1 F_{3\times 3} + \alpha_2 F_S + \alpha_3 F_L.$$
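A training-time sketch of such a three-branch combination is given below in PyTorch; the kernel initialisation of the Sobel and Laplacian branches and the reparameterised single-convolution inference form are omitted, and all details beyond the softmax-weighted sum are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IEConv(nn.Module):
    """Training-time sketch of an Improved Edge Convolution: a standard 3x3
    branch, a Sobel-gradient branch, and a Laplacian branch, blended with
    softmax-normalised learnable coefficients (the F_IE equation above)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.sobel = nn.Conv2d(channels, channels, 3, padding=1)      # F_S branch (assumed form)
        self.laplacian = nn.Conv2d(channels, channels, 3, padding=1)  # F_L branch (assumed form)
        self.alpha = nn.Parameter(torch.zeros(3))  # competitive coefficients alpha_1..3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = F.softmax(self.alpha, dim=0)           # keep the coefficients normalised
        return a[0] * self.conv3x3(x) + a[1] * self.sobel(x) + a[2] * self.laplacian(x)
```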
Through the reparameterization process, the three branches are merged into
a single Convolution operation. The feature extraction process of the IE-Conv
layer in the inference phase is as follows:
The Variance loss function $L_{Var}$ was developed to address the shortcomings
of the Mean Squared Error loss $L_{MSE}$ in deep learning-based image reconstruction tasks.
Gradient maps are first extracted from the reconstructed image $\hat{I}$ and the ground
truth image $I_{gt}$ using Sobel operators. These maps are subsequently divided into
$n \times n$ non-overlapping patches to form matrices $\tilde{G}_x^{\hat{I}}$, $\tilde{G}_y^{\hat{I}}$, $\tilde{G}_x^{I_{gt}}$, and $\tilde{G}_y^{I_{gt}}$,
each of size $n^2$. With $\mu$ as the mean value, the variance of each gradient map is then
calculated as $v = \frac{1}{n^2 - 1}\sum_{i=1}^{n^2} (\tilde{G}_i - \mu)^2$.
Therefore, the Variance loss $L_{Var}$ is formulated as:
$$L_{Var} = \big\| v_x^{\hat{I}} - v_x^{I_{gt}} \big\|_2 + \big\| v_y^{\hat{I}} - v_y^{I_{gt}} \big\|_2.$$
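A minimal PyTorch sketch of this Variance loss is shown below for single-channel images; the patch size, the padding, and the norm over the patch-variance vectors are illustrative choices.

```python
import torch
import torch.nn.functional as F

def variance_loss(pred: torch.Tensor, target: torch.Tensor, patch: int = 8):
    """Compare the per-patch variances of the Sobel gradient maps of a
    reconstruction and its ground truth. Inputs are (B, 1, H, W) tensors."""
    kx = torch.tensor([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]],
                      device=pred.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)  # vertical Sobel kernel

    def patch_var(img, k):
        g = F.conv2d(img, k, padding=1)                      # Sobel gradient map
        p = F.unfold(g, kernel_size=patch, stride=patch)     # (B, patch*patch, L) patches
        return p.var(dim=1, unbiased=True)                   # variance of each patch

    loss = 0.0
    for k in (kx, ky):
        loss = loss + torch.norm(patch_var(pred, k) - patch_var(target, k), p=2)
    return loss
```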
In addition, a Gaussian high-pass weighting
$$W = 1 - e^{-\frac{f_x^2 + f_y^2}{2\sigma^2}}$$
is employed to modulate frequencies, suppressing lower frequencies while emphasizing
higher ones. The parameter $\sigma$ controls the Gaussian's spread, influencing the
filter's emphasis on high frequencies; $f_x$ and $f_y$ denote the frequency components
along the $x$ and $y$ axes. Using the Fast Fourier Transform $\mathrm{FFT}$, $L_{Edge}$ is
computed as follows:
$$L_{Edge} = \big\| W \odot |\mathrm{FFT}(\hat{I})| - W \odot |\mathrm{FFT}(I_{gt})| \big\|.$$
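The Edge loss can be sketched as follows in PyTorch; the value of sigma, the use of normalized FFT frequencies, and the mean absolute penalty are our assumptions.

```python
import torch

def edge_loss(pred: torch.Tensor, target: torch.Tensor, sigma: float = 0.15):
    """Weight the FFT magnitudes of the prediction and the ground truth with a
    Gaussian high-pass filter W = 1 - exp(-(fx^2 + fy^2) / (2*sigma^2)) and
    penalise their difference. Inputs are (B, 1, H, W); sigma is a placeholder."""
    b, c, h, w = pred.shape
    fy = torch.fft.fftfreq(h, device=pred.device).view(h, 1)   # normalized frequencies
    fx = torch.fft.fftfreq(w, device=pred.device).view(1, w)
    weight = 1.0 - torch.exp(-(fx ** 2 + fy ** 2) / (2 * sigma ** 2))  # high-pass weight W

    mag_pred = torch.fft.fft2(pred).abs()
    mag_target = torch.fft.fft2(target).abs()
    return torch.mean(torch.abs(weight * mag_pred - weight * mag_target))
```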
The abdominal CT dataset leveraged in this study was contributed by the Mayo
Clinic [28]. It consists of 5,388 slices, each with a thickness of 1 mm and a pixel
resolution of $512 \times 512$. From these slices, 5,388 sinograms were generated, with
each sinogram derived from 120 projection angles. The dataset was divided into
a training set of 4,839 images and a testing set of 549 images. Label images were
In the field of medical imaging, the quality of CT images is critical for ensur-
ing accurate treatment decisions. However, acquiring the large number of pro-
jection angles necessary for high-quality image reconstruction is often time-
consuming and inconvenient for patients. To address this challenge, we propose
DD-ReconNet, a novel CT image reconstruction model based on deep neural
networks. This model capitalizes on its ability to learn complex image features,
leading to substantial improvements in reconstruction quality. In comparison
with conventional CT reconstruction methods like FBPConvNet [26], as well as
advanced deep learning approaches such as GMSD [15], DuDoNet [9], and DDP-
Net [29], DD-ReconNet demonstrates marked superiority. The model not only
achieves significantly better PSNR, SSIM, and MSE values (as summarized in
Table 1), but also surpasses the state-of-the-art GMSD by approximately 2.46
dB in PSNR and 0.01 in SSIM. Furthermore, DD-ReconNet shows remarkable
improvement in preserving fine details, such as vascular structures and small
lesions. As illustrated in Fig. 3, local details are effectively maintained.
function $L_{MSE}$. This superiority is evident across all three evaluation metrics:
MSE (0.07), PSNR (1.74), and SSIM (0.16).
5 Conclusion
This work presents the DD-ReconNet model as an innovative solution for enhanc-
ing image quality in sparse-view CT reconstruction. By leveraging both the
Sinogram and Image domains, along with advanced techniques such as the
Swin Transformer V2 block and the Improved Edge Convolution layer, DD-
ReconNet effectively addresses challenges posed by reduced projection angles
and radiation dose concerns. The three-stage reconstruction process, comprising
Sinogram Restoration, FBPConvNet-based Image Reconstruction, and Image
Restoration, demonstrates superior performance compared to traditional meth-
ods like Filtered Back Projection. Experimental results confirm the model’s abil-
ity to generate higher-quality images, supporting its potential to improve diag-
nostic accuracy while reducing patient radiation exposure. These findings posi-
tion DD-ReconNet as a promising approach for advancing low-dose and sparse-
view CT imaging, with significant implications for both clinical practice and
patient safety.
References
1. Brenner, D., Elliston, C., Hall, E., Berdon, W.: Estimated risks of radiation-induced
fatal cancer from pediatric CT. Am. J. Roentgenol. 176, 289–296 (2001). https://
doi.org/10.2214/ajr.176.2.1760289
2. Hsieh, J.: Computed Tomography: Principles, Design, Artifacts, and Recent
Advances. SPIE Press, vol. PM259, p. 666 (2015)
3. Schäfer, D., Grass, M., Haar, P.: FBP and BPF reconstruction methods for cir-
cular X-ray tomography with off-center detector. Med. Phys. 38, S85–S94 (2011).
https://doi.org/10.1118/1.3578342
4. Lauritsch, G., Haerer, W.: Theoretical framework for filtered back projection
in tomosynthesis. Med. Imaging 1998 Image Process. 3338, 1127–1137 (1998).
https://doi.org/10.1117/12.310839
5. Ye, J.: Compressed sensing MRI: a review from signal processing perspective. BMC
Biomed. Eng. 1, 1–17 (2019). https://doi.org/10.1186/s42490-019-0006-z
6. Koetzier, L., et al.: Deep learning image reconstruction for CT: technical principles
and clinical prospects. Radiology 306, e221257 (2023). https://doi.org/10.1148/
radiol.221257
7. Zhang, X., Pan, G., Chen, B., Sun, K., Meng, Z.: Integral algorithm of exponential
observables for interacting fermions in quantum Monte Carlo simulations. Phys.
Rev. B 109, 205147 (2024). https://doi.org/10.1103/PhysRevB.109.205147
8. Chauhan, S., Malik, N., Vig, R.: UNet with ResNextify and IB modules for low-
dose CT image denoising. Int. J. Inf. Technol. 16, 4677–4692 (2024). https://doi.
org/10.1007/s41870-024-01898-8
9. Lin, W., et al.: DuDoNet: dual domain network for CT metal artifact reduction.
In: Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 10512–10521 (2019). https://doi.org/10.1109/CVPR.2019.01076
10. Liu, Z., et al.: Swin Transformer V2: scaling up capacity and resolution. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion, pp. 12009–12019 (2022). https://doi.org/10.1109/CVPR52688.2022.01170
11. Wu, W., Hu, D., Niu, C., Yu, H., Vardhanabhuti, V., Wang, G.: DRONE: dual-
domain residual-based optimization network for sparse-view CT reconstruction.
IEEE Trans. Med. Imaging 40(11), 3002–3014 (2021). https://doi.org/10.1109/
TMI.2021.3078067
12. Li, Q., et al.: A cascade-based dual-domain data correction network for sparse
view CT image reconstruction. Comput. Biol. Med. 165, 107345 (2023). https://
doi.org/10.1016/j.compbiomed.2023.107345
13. Hu, J., Xing, S., Shan, X., Yu, X., Li, G.: Research on key processing parameters
of parallel seam welding of micro crystal resonator based on simulation experi-
ment. Ferroelectrics 565(1), 88–98 (2020). https://doi.org/10.1080/00150193.2020.
1761722
14. Chao, L., et al.: Sparse-view cone beam CT reconstruction using dual CNNs in pro-
jection domain and image domain. Neurocomputing 493, 536–547 (2022). https://
doi.org/10.1016/j.neucom.2021.12.096
15. Guan, B., et al.: Generative modeling in sinogram domain for sparse-view CT
reconstruction. IEEE Trans. Radiat. Plasma Med. Sci. 8, 195–207 (2024). https://
doi.org/10.1109/TRPMS.2023.3309474
16. Xia, W., et al.: Parallel diffusion model-based sparse-view cone-beam breast CT.
ArXiv Preprint ArXiv:2303.12861, pp. 1–16 (2023). https://doi.org/10.48550/
arXiv.2303.12861
17. Vasconcelos, F., He, B., Singh, N., Teh, Y.: UncertaINR: uncertainty quantification
of end-to-end implicit neural representations for computed tomography. ArXiv
Preprint ArXiv:2202.10847, pp. 1–57 (2022). https://doi.org/10.48550/arXiv.2202.
10847
18. Li, D., et al.: Noise characteristics modeled unsupervised network for robust CT
image reconstruction. IEEE Trans. Med. Imaging 41, 3849–3861 (2022). https://
doi.org/10.1109/TMI.2022.3197400
19. Fu, J., et al.: Dual attention network for scene segmentation. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3141–
3149 (2019). https://doi.org/10.1109/CVPR.2019.00326
20. Habib, G., Singh, D., Malik, I., Lall, B.: optimizing vision transformers with
data-free knowledge transfer. ArXiv Preprint ArXiv:2408.05952, pp. 1–20 (2024).
https://doi.org/10.48550/arXiv.2408.05952
21. Liang, J., et al.: SwinIR: image restoration using swin transformer. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision, pp. 1833–1844
(2021). https://doi.org/10.1109/ICCVW54120.2021.00210
22. Lin, A., et al.: DS-TransuNet: dual swin transformer U-Net for medical image
segmentation. IEEE Trans. Instrum. Measur. 71, 1–15 (2022). https://doi.org/10.
1109/TIM.2022.3178991
23. Lei, Y., et al.: Diffeomorphic transformer-based abdomen MRI-CT deformable
image registration. Med. Phy. 51(9), 1–18 (2024). https://doi.org/10.1002/mp.
17235
24. Lian, J., Liu, T.: Lesion identification in fundus images via convolutional neural
network-vision transformer. Biomed. Signal Process. Contr. 88, 105607 (2024).
https://doi.org/10.1016/j.bspc.2023.105607
25. Chi, J., et al.: Low-dose CT image super-resolution with noise suppression based
on prior degradation estimator and self-guidance mechanism. IEEE Trans. Med.
Imaging (2024). https://doi.org/10.1109/TMI.2024.3454268
26. Jin, K., McCann, M., Froustey, E., Unser, M.: Deep convolutional neural network
for inverse problems in imaging. IEEE Trans. Image Process. 26, 4509–4522 (2017).
https://doi.org/10.1109/TIP.2017.2713099
27. Zhang, X., Zeng, H., Zhang, L.: Edge-oriented convolution block for real-time super
resolution on mobile devices. In: Proceedings of the 29th ACM International Con-
ference on Multimedia, pp. 4034–4043 (2021). https://doi.org/10.1145/3474085.
347529
28. Moen, T., et al.: Low-dose CT image and projection dataset. Med. Phys. 48, 902–
911 (2021). https://doi.org/10.1002/mp.14594
29. Ge, R., et al.: DDPNet: a novel dual-domain parallel network for low-dose CT
reconstruction. In: International Conference on Medical Image Computing and
Computer-Assisted Intervention, pp. 748–757 (2022). https://doi.org/10.1007/978-
3-031-16446-0_71
DehazeCLNet: A Contrastive Learning
Framework with Advanced Feature
Extraction for Image Dehazing
1 Introduction
Image dehazing has garnered considerable attention recently due to its significance in various tasks, such as autonomous driving, aerial imaging, and outdoor
surveillance. The presence of haze, caused by the scattering of light by atmo-
spheric particles, severely degrades image quality by reducing contrast, blurring
textures, and distorting colors. Numerous dehazing methods have been devel-
oped, focusing on restoring image clarity and enhancing visibility under chal-
lenging weather conditions. Early methods depended on priors, such as the Dark Channel Prior (DCP) [1] and atmospheric scattering models [2]. While effective in many cases, such prior-based methods can struggle in complex real-world scenes.
2 Related Works
2.1 Image Dehazing
In recent years, image restoration, including dehazing, has gained substantial
attention due to its importance in enhancing vision-based tasks such as object
detection and recognition. Various dehazing techniques have been classified into
three primary approaches: image enhancement, image fusion, and image restora-
tion [6]. The non-aligned supervision approach introduced in [7] provides significant insights into new dehazing strategies. Moreover, the dark channel prior method, developed in [8], has since become a fundamental benchmark in dehazing research. Alongside dehazing, deep learning has enabled significant improvements in related restoration tasks, including motion deblurring [9] and defocus deblurring [8].
The atmospheric scattering model mathematically describes the formation of hazy images as follows:

I_hazy(x) = I_clean(x) · τ(x) + A · (1 − τ(x)),

where I_hazy(x) is the observed hazy image, I_clean(x) stands for the scene radiance or clean image, A is the global atmospheric light, and τ(x) is the transmission map at spatial position x, defined as τ(x) = e^{−β·d(x)}, which quantifies the fraction of light reaching the camera through the hazy medium. Here, β is the scattering coefficient related to the haze density and d(x) represents the depth, i.e., the distance between the object and the camera.
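To make this formation model concrete, the following sketch synthesizes a hazy image from a clean image and a depth map according to the equation above; the parameter values chosen for β and A are illustrative and are not prescribed by this paper.

import numpy as np

def synthesize_haze(clean, depth, beta=1.0, atmospheric_light=0.9):
    """Apply the atmospheric scattering model I_hazy = I_clean * t + A * (1 - t).

    clean: float array in [0, 1] of shape (H, W, 3)
    depth: float array of shape (H, W), scene depth d(x)
    beta:  scattering coefficient controlling haze density
    """
    transmission = np.exp(-beta * depth)          # t(x) = exp(-beta * d(x))
    transmission = transmission[..., None]        # broadcast over RGB channels
    hazy = clean * transmission + atmospheric_light * (1.0 - transmission)
    return np.clip(hazy, 0.0, 1.0)

# Example: a depth ramp produces haze that thickens with distance.
clean = np.random.rand(64, 64, 3)
depth = np.tile(np.linspace(0.0, 5.0, 64), (64, 1))
hazy = synthesize_haze(clean, depth)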
3 Proposed Method
The overall architecture of DehazeCLNet is outlined in Fig. 1(a). We then
provide a detailed explanation of the proposed modules within Groups 1, 2, and
3 (Fig. 1(b)), which include the Channel Attention Block (CAB) (Fig. 1(c)) and
the Dual Convolutional Block (DCB) (Fig. 1(d)). The discussion concludes with
an explanation of the loss function and the implementation of the Contrastive
Learning process.
Fig. 1. (a) The overall architecture of the proposed DehazeCLNet is designed for net-
work training using the Contrastive Learning method. (b) Groups 1, 2, 3 consist of
N blocks containing convolutional layers, channel attention blocks (CAB) and dual
convolutional blocks (DCB) to extract feature maps through various filters. (c) The
Channel Attention Block (CAB) improves feature learning by focusing on the most
relevant channels. (d) The Dual Convolutional Block (DCB) comprises two distinct
convolutional branches: one with smaller kernels to learn local features and the other
with larger kernels to capture global context.
Attention Blocks (Fig. 1(c)) and Dual Convolutional Blocks (Fig. 1(d)). These components are strategically designed to capture and extract detailed information, thereby strengthening the network's performance in image restoration tasks.
Additionally, the network layers have been adapted to accommodate contrastive
learning tasks [11].
The architecture of our network (Fig. 1(a)) is organized into three groups,
each responsible for extracting features at different levels of granularity. These
extracted features are then concatenated into a unified feature map comprising
192 filters. This feature map is processed through the Channel Attention Block,
which analyzes the 192 channels to emphasize important feature representations.
Following this, the Dual Convolutional Block is employed to capture both local
and global features, further improving the network’s ability to generalize and
effectively process hazy regions, thereby enhancing dehazing performance.
Additionally, the network architecture includes 3 × 3 convolutional layers at
the input and output stages, along with skip connections, to prevent the loss of
critical information during forward propagation and preserve essential geometric
features throughout the model.
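The internal layers of the CAB are not spelled out in the text above, so the following PyTorch sketch should be read only as one plausible, squeeze-and-excitation-style realization of channel attention over the 192 concatenated channels; the reduction ratio is an assumed hyperparameter.

import torch
import torch.nn as nn

class ChannelAttentionBlock(nn.Module):
    """SE-style channel attention: weight each of the C channels by a learned score."""

    def __init__(self, channels=192, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling per channel
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x):
        weights = self.mlp(self.pool(x))             # shape (N, C, 1, 1)
        return x * weights                           # re-weight the feature map

x = torch.randn(1, 192, 32, 32)
out = ChannelAttentionBlock()(x)                     # same shape as the input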
The Dual Convolutional Block (DCB) consists of two parallel branches that
utilize distinct convolutional layers to process input feature maps, which have
already been weighted for importance by the Channel Attention Block (CAB),
as described in Sect. 3.1. Each branch is designed differently, focusing on specific
tasks to capture unique aspects of the features. These branches work concur-
rently to extract various characteristics from the input data.
Specifically, as shown in Fig. 1 (d), the right branch has a structure similar
to the CAB, but the final normalization step is omitted. This branch is primarily
responsible for learning local features. In contrast, the left branch consists of convolutional layers with larger kernel sizes, aimed at capturing more complex features, such as global contextual information.
Once the global context information, denoted as h̃, is obtained, the remaining information, 1 − h̃, is multiplied by the output of the right branch to extract finer details. Simultaneously, the left branch refines the global context by multiplying the feature map by h̃. These outputs are then fused to generate new feature maps
that encapsulate both generalized global context and critical local features.
For simplicity in implementation and configuration, the Dual Convolutional Block (DCB) can be expressed in conjunction with Eq. (1), where F_DCB(X) represents a function that yields the output feature map of the Dual Convolutional Block (DCB) after processing the input feature map X. The term h̃ denotes the global context, while 1 − h̃ signifies the complement of the global context information, effectively capturing the local features that are not addressed by h̃.
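Because the exact expression for F_DCB is not reproduced above, the following PyTorch sketch is only one plausible reading of the description: a small-kernel branch for local detail, a large-kernel branch producing a global-context gate h̃, and a fusion of the h̃-weighted and (1 − h̃)-weighted features. The kernel sizes and channel count are assumptions.

import torch
import torch.nn as nn

class DualConvolutionalBlock(nn.Module):
    """Sketch of a DCB: blend a local-detail branch and a global-context branch via a gate h~."""

    def __init__(self, channels=192):
        super().__init__()
        # Right branch: small kernels for local features.
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Left branch: larger kernels to capture global context, squashed to a gate in (0, 1).
        self.global_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        h = self.global_branch(x)                  # global context gate h~
        local = self.local_branch(x)
        # h~ refines the global context; (1 - h~) routes attention to finer local details.
        return self.fuse(h * x + (1.0 - h) * local)

x = torch.randn(1, 192, 32, 32)
y = DualConvolutionalBlock()(x)                    # same spatial shape as the input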
In this study, the loss function is constructed by combining two distinct com-
ponents: the pixel loss function and the contrastive loss function. The pixel loss
function is used to compare the clear image with the dehazed output gener-
ated by the DehazeCLNet model. In contrast, the contrastive loss function is
designed to strengthen the model’s capability to distinguish between outputs.
It works by minimizing the distance between the DehazeCLNet’s output and
the clear image (positive image), while maximizing the distance between the
DehazeCLNet’s output and the output generated by the Negative Model.
The pixel objective function is based on the L1 loss [8] and is specifically represented by the following equation:

L_Pixel = ||Î_a − I_p||_1,    (2)

where I_p represents the clear image considered as the positive image, while Î_a denotes the reconstructed image from the anchor model (DehazeCLNet).
The contrastive loss function is formulated by calculating the distances
between the anchor image and the positive image, as well as between the anchor
image and the negative image, across various feature depth levels. A deep neural
network architecture is leveraged to extract features from input images at mul-
tiple depths, capturing feature sizes that range from large to small. Specifically,
we utilize the pre-trained VGG19 network [13] to perform feature extraction at
these varying depth levels.
Additionally, we define the influence weights of the feature maps at different depths by the following equation:

W^i_contrastive = 2^{−(η−i)},    (3)

where η denotes the total number of blocks utilized in the computation and i represents the index of the block within the VGG19 architecture. The feature maps at the different depth levels are then extracted as

Ĩ_a = F_vgg19(I_a),  Ĩ_p = F_vgg19(I_p),  Ĩ_n = F_vgg19(I_n),  Ĩ_x = F_vgg19(I_x).    (4)
In this context, I_{a,p,n,x} represents the input images, namely the anchor, positive, negative, and hazy input images, while Ĩ_{a,p,n,x} denotes the corresponding feature maps extracted at the five depth levels of the VGG19 network. The function F_vgg19 is employed for this feature extraction process. By leveraging Eqs. (3), (4), and the L1 loss function, we can derive the formula for the contrastive loss function as follows:
L_Contrastive = Σ_{i=1}^{η} [ W^i_contrastive · L1 Loss(Ĩ^i_a, Ĩ^i_p) / ( L1 Loss(Ĩ^i_a, Ĩ^i_n) · W^i_negative + L1 Loss(Ĩ^i_a, Ĩ^i_x) · W^i_input ) ].    (5)
In this equation, W^i_negative and W^i_input signify the influence weights of the negative image and of the hazy input image, respectively, relative to the image generated by the anchor model.
Having established the computation formulas for the pixel loss function in Eq. (2) and the contrastive loss function in Eq. (5), the overall objective function for the task combines these two terms. The influence of the contrastive loss function on the training phase is controlled by the parameter λ; λ = 0.2 has been selected for the purposes of this study.
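A sketch of the resulting training objective is given below. It assumes the overall loss takes the common form L_Pixel + λ·L_Contrastive with λ = 0.2 (the combined formula itself is not reproduced above) and uses torchvision's pre-trained VGG19 for the multi-depth feature extraction; the chosen layer indices and the values used for W_negative and W_input are illustrative assumptions.

import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Feature extractor: VGG19 layers up to a few chosen depths (indices are illustrative).
_vgg = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_DEPTH_LAYERS = [3, 8, 17, 26, 35]          # assumed cut points for the five depth levels

def vgg_features(img):
    """Return the list of feature maps I~ at the selected VGG19 depths."""
    feats, x = [], img
    for idx, layer in enumerate(_vgg):
        x = layer(x)
        if idx in _DEPTH_LAYERS:
            feats.append(x)
    return feats

def total_loss(anchor, positive, negative, hazy_input, lam=0.2,
               w_negative=1.0, w_input=1.0):
    """L = L_Pixel + lam * L_Contrastive (combination form assumed).

    anchor:     output of DehazeCLNet, positive: clear image,
    negative:   output of the Negative Model, hazy_input: the hazy input image.
    """
    pixel_loss = F.l1_loss(anchor, positive)

    fa, fp = vgg_features(anchor), vgg_features(positive)
    fn, fx = vgg_features(negative), vgg_features(hazy_input)
    eta = len(fa)
    contrastive = 0.0
    for i in range(eta):
        w_i = 2.0 ** -(eta - (i + 1))                       # Eq. (3) with 1-based block index
        num = w_i * F.l1_loss(fa[i], fp[i])                 # pull towards the clear image
        den = (w_negative * F.l1_loss(fa[i], fn[i])         # push away from the negative output
               + w_input * F.l1_loss(fa[i], fx[i]))         # and from the hazy input
        contrastive = contrastive + num / (den + 1e-8)
    return pixel_loss + lam * contrastive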
4 Experiments
4.1 Experiment Setup
In Fig. 2, it can be seen that the outputs of various models yield relatively
high results. However, our model, DehazeCLNet, demonstrates notably supe-
rior performance, achieving Peak Signal-to-Noise Ratio (PSNR) and Structural
Similarity Index (SSIM) values that are significantly elevated, with some exper-
iments even exceeding 45 dB. These experimental results indicate that Dehaze-
CLNet, developed in conjunction with the methods presented in our research,
has yielded impressive outcomes in the task of image dehazing on the SOTS-
Indoor dataset. This highlights the validity of our approach in restoring image
quality and enhancing visual clarity in challenging hazy conditions.
This study proposes the DehazeCLNet network, which incorporates groups and
blocks to enable advanced feature extraction. Additionally, a depth-wise loss
function is introduced, integrated with contrastive learning methods to enhance
the efficiency and performance of DehazeCLNet in image dehazing tasks. The
proposed network demonstrates relatively high performance, achieving superior
PSNR and SSIM scores. Future improvements could include replacing the Nega-
tive Model, currently FFA-Net, with a more effective alternative, and adjusting
the influence parameters of the contrastive loss function to further optimize the
dehazing process.
References
1. He, K., Sun, J., Tang, X.: Single image haze removal using dark channel prior.
IEEE Trans. Pattern Anal. Mach. Intell. 33, 2341–2353 (2010). https://doi.org/
10.1109/TPAMI.2010.168
2. Fattal, R.: Single image dehazing. ACM Trans. Graph. (TOG) 27, 1–9 (2008).
https://doi.org/10.1145/1360612.1360671
3. Dong, H., et al.: Multi-scale boosted dehazing network with dense feature fusion.
In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2154–2164 (2020). https://doi.org/10.1109/CVPR42600.
2020.00223
4. Qin, X., Wang, Z., Bai, Y., Xie, X., Jia, H.: FFA-Net: feature fusion attention
network for single image dehazing. Proc. AAAI Conf. Artif. Intell. 34, 11908–11915
(2020)
5. Guo, C., et al.: Image dehazing transformer with transmission-aware 3D position
embedding. In: Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5802–5810 (2022). https://doi.org/10.1109/
CVPR52688.2022.00572
6. Wang, W., Yuan, X.: Recent advances in image dehazing. IEEE/CAA J. Automat-
ica Sinica 4, 410–436 (2017). https://doi.org/10.1109/JAS.2017.7510532
7. Fan, J., et al.: Non-aligned supervision for real image dehazing. ArXiv Preprint
arXiv:2303.04940 (2023). https://doi.org/10.48550/arXiv.2303.04940
8. He, X., Cheng, J.: Revisiting L1 loss in super-resolution: a probabilistic view and
beyond. ArXiv Preprint arXiv:2201.10084, pp. 1–13 (2023). https://doi.org/10.
48550/arXiv.2201.10084
9. Zamir, S., et al.: Restormer: efficient transformer for high-resolution image
restoration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5718–5729 (2022). https://doi.org/10.1109/
CVPR52688.2022.00564
10. Khosla, P., et al.: Supervised contrastive learning. In: Proceedings of the 34th
International Conference on Neural Information Processing Systems, pp. 18661–
18673. (2020). https://doi.org/10.5555/3495724.3497291
11. Cheng, D., et al.: Progressive negative enhancing contrastive learning for image
dehazing and beyond. IEEE Trans. Multimedia 26, 8783–8798 (2024). https://doi.
org/10.1109/TMM.2024.3382493
12. Bieder, F., Sandkühler, R., Cattin, P.: Comparison of methods generalizing max-
and average-pooling. ArXiv Preprint arXiv:2103.01746, pp. 1–16 (2023). https://
doi.org/10.48550/arXiv.2103.01746
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Proceedings of 3rd International Conference on Learning
Representations (ICLR 2015), pp. 1–14 (2015). https://doi.org/10.48550/arXiv.
1409.1556
14. Li, B., et al.: Benchmarking single-image dehazing and beyond. IEEE Trans. Image
Process. 28, 492–505 (2019). https://doi.org/10.1109/TIP.2018.2867951
15. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. ArXiv Preprint
arXiv:1412.6980, pp. 1–15 (2017). https://doi.org/10.48550/arXiv.1412.6980
16. Horé, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: 2010 20th Interna-
tional Conference on Pattern Recognition, pp. 2366–2369 (2010). https://doi.org/
10.1109/ICPR.2010.579
17. Nilsson, J., Akenine-Möller, T.: Understanding SSIM. ArXiv Preprint
arXiv:2006.13846, pp. 1–8 (2020). https://doi.org/10.48550/arXiv.2006.13846
18. Tu, Z., et al.: MAXIM: multi-axis MLP for image processing. In: Proceedings
Of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 5759–5770 (2022). https://doi.org/10.1109/CVPR52688.2022.00568
19. Ye, T., et al.: Perceiving and modeling density is all you need for image dehazing.
In: Computer Vision - ECCV 2022: 17th European Conference, Part XIX, pp.
130–145 (2022). https://doi.org/10.1007/978-3-031-19800-7_8
20. Cui, Y., et al.: Selective frequency network for image restoration. In: Proceedings
of 11th International Conference on Learning Representations, pp. 1–13 (2023)
21. Cui, Y., Ren, W., Knoll, A.: Omni-Kernel network for image restoration. Proc.
AAAI Conf. Artif. Intell. 38, 1426–1434 (2024). https://doi.org/10.1609/aaai.
v38i2.27907
Distortion-Resilient DIBR for Novel View
Synthesis from a Single Image
1 Introduction
Novel view synthesis (NVS) has become a crucial technology in various fields,
including 3D modeling [1], autonomous driving [2], virtual reality [3], medi-
cal imaging [4], and industrial scanning [5]. NVS renders three-dimensional scenes from unsampled viewpoints, enhancing the realism and interactivity of digital environments. It typically assumes a static scene under fixed lighting in which the camera moves freely, so that the parallax changes between frames convey the spatial structure of the 3-D scene.
Most existing NVS methods rely on multiple views input to better recon-
struct the geometry proxies [6–9] or sample sufficient optical information [10–13].
However, obtaining multiple views is not always feasible due to several practical
constraints. In many real-world scenarios, only a single image may be available,
such as in surveillance footage, historical photographs, or casual daily photos.
Additionally, capturing multiple views often requires specialized equipment, like
multi-camera setups or depth sensors, which can be expensive and not readily
accessible. Single-image NVS, which is our research focus, in contrast provides a flexible and cost-effective solution. The challenge of working from a single image is the limited ability to perceive geometry, together with unknown optical information (e.g., occluded areas). Traditional methods apply 3-D warping [14] to the input image with known depth to obtain a sparse novel
view with significant distortions, and fixing these distortions usually requires more input images. FTV View Generation [15] first warps the depth and refines it to
re-sample the missing texture for its stereo input. 3D Photo Inpainting [16] estimates a monocular depth and locates the depth-discontinuity areas, then inpaints the disocclusion regions from the neighboring context. MPI [17] encodes every input into a fixed-layer representation at pre-defined depths; it further models each layer with an alpha channel encoding transparency to resolve the depth discontinuities caused by the discrete layers.
In this paper, we propose a comprehensive DIBR-based model that can handle different types of distortions when synthesizing a novel view from a single image input. Specifically, when receiving a single arbitrary RGB image input
from the user, the proposed model estimates the depth and combines it with
input RGB to form an RGB-D as the input of the synthesizing procedure. In
our method, the scene is represented by two distinct layers: the foreground,
derived from the input RGB; and the background, assumed to constitute the
occluded areas. This representation allows us to deal with different distortion
types independently. A depth-guided segment-inpainting approach is used to
generate the background, effectively addressing disocclusions by filling in missing
textures with contextually appropriate information. For the foreground, a reverse
depth mapping test is proposed to re-sample the unknown pixels and generate
an alpha map simultaneously to enable a soft blend with the background. This
approach aims to restore texture in distorted areas and preserve the sharpness
of other regions.
Our proposed model is qualitatively and quantitatively evaluated using scenes
from the Real-Estate 10K dataset [18]. The results demonstrate sharper and
clearer synthesized images with better NIQE [19] and LPIPS [20] scores.
2 Problem Statement
3D Warping is a universal approach for most DIBR methods to render novel
views. It has a clear physical meaning and a direct, fast computation procedure,
but it suffers from distortion problems, which our work addresses. Given an RGB-D image as input, 3D warping (Eq. (2)) is a one-to-one mapping composed of the camera pose (R, T) and the intrinsics K, where every point p of the input view is mapped to a new position p′ in the novel view.
[x′, y′, 1]^T = (1/z′) K ( R^T ( K^{−1} ( z · [x, y, 1]^T ) ) + T ),    (2)

where z′ denotes the depth of the warped point in the novel view.
3D Warping considers only the pixels sampled from the reference view and
thus suffers from severe distortions as shown in Fig. 1.
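As a concrete illustration of Eq. (2), the sketch below warps a single pixel with known depth from the reference view into a novel view given intrinsics K and a relative pose (R, T); the matrix values are purely illustrative and the formulation is generic pinhole warping rather than the exact convention of this paper.

import numpy as np

def warp_point(u, v, depth, K, R, T):
    """Map pixel (u, v) with known depth from the reference view into the novel view.

    K: 3x3 intrinsics, R: 3x3 rotation, T: 3-vector translation of the novel camera.
    Returns the pixel coordinates (u', v') and the new depth z'.
    """
    pixel = np.array([u, v, 1.0])
    point_3d = depth * np.linalg.inv(K) @ pixel   # back-project to camera space
    point_new = R @ point_3d + T                  # move into the novel camera frame
    projected = K @ point_new                     # project with the (shared) intrinsics
    z_new = projected[2]
    return projected[0] / z_new, projected[1] / z_new, z_new

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                                     # identity rotation for the example
T = np.array([0.1, 0.0, 0.0])                     # small sideways camera shift
print(warp_point(320, 240, 2.0, K, R, T))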
3 Our Proposal
In this research, to tackle the above-listed distortions, we propose a comprehen-
sive DIBR-based model for synthesizing novel views from a single input image. We assume that the target novel view of a desired scene is composed of two images: a foreground image C_[H×W] and a background image B_[H×W]. All information in the foreground image is provided by the input view, while the background image reflects the information of the occluded areas. Figure 2 shows our pipeline, which synthesizes the foreground and background images simultaneously and blends them into the final image S_[H×W].
Fig. 2. The pipeline of our proposed single-view DIBR method. In our approach, the scene is decomposed into foreground and background to efficiently handle different types of distortions. The novel view is produced by soft-blending the foreground and background, controlled by an alpha channel.
We calculate the foreground image of the novel view using 3D warping and Z-buffering with all the points p_(uv) = <u, v, d> from the input image, where <u, v> are the coordinates in image space and d = D_(uv) is the depth value at that point.
D̂ = −1_[H×W],   D̂_(u′v′) = d′,
p_vm = <u′, v′, d_min>,  where D̂_(u′v′) = −1,  d_min = min_{i = (u′v′) ± W} (D̂_i).    (4)
By doing this, we create a virtual mapping p_vm over the distorted areas; the mapping's position depends on the depth relationship between the occluded area and the occluder. In undersampled areas, this mapping will be very close to the local samples and its effect is almost negligible.
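A minimal sketch of this virtual-mapping step, under the assumption that D̂ is the warped depth map with −1 marking unknown pixels and that W is a square search window of illustrative size, is:

import numpy as np

def fill_virtual_mapping(depth_hat, window=5):
    """For each unknown pixel (value -1), assign the minimum known depth in a local window."""
    filled = depth_hat.copy()
    half = window // 2
    for y, x in zip(*np.where(depth_hat == -1)):
        patch = depth_hat[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
        known = patch[patch != -1]
        if known.size:                      # virtual mapping: take the closest (minimum) local depth
            filled[y, x] = known.min()
    return filled

depth_hat = -np.ones((8, 8))
depth_hat[::2, ::2] = np.random.uniform(1.0, 3.0, size=(4, 4))   # sparse warped samples
print(fill_virtual_mapping(depth_hat))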
After filling all unknown depths, we compute the inverse 3D warping W_3^{−1}, projecting the filled points p_vm from the novel view back to the reference view:

<u_rm, v_rm, d_rm> ∼ p_rm = W_3^{−1}(p_vm).    (5)

At this point, the virtual mapping overlaps with certain points in the reference view; these overlapped points are called the reverse mapping. Based on the reverse mapping <u_rm, v_rm>, we resample the RGB value C_(uv_rm) and the depth value D_(uv_rm). The RGB values of the reverse mapping are directly mapped to the previously recorded unknown points,

Ĉ_(uv_vm) = C_(uv_rm),    (6)

while the re-sampled depth values are compared with the depths from the virtual mapping to generate a depth-difference map for the previously recorded unknown points:

diff_(uv_vm) = |D_(uv_rm) − d_rm|.    (7)
This depth difference exists in the distorted regions throughout the novel view; it reflects the relative gradient changes in the occluded regions and maintains global continuity. We found that this information can be well modeled as an alpha map to control the soft-blending ratio between the background and foreground. The absolute values of the depth difference vary with the complexity of the scene and the camera poses of the novel views. However, for normalized scene and camera matrices, we found that the distribution pattern of the depth difference is consistent when the same camera movement occurs in different scenes; thus, we normalize the depth-difference map to an alpha map with range [0, 1]:

a_[H×W] = norm(diff_[H×W]).    (8)
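The normalization and the subsequent soft blend can be sketched as follows; since the text describes the blending only qualitatively, the direction chosen here (larger depth difference gives more weight to the background layer) is an assumption.

import numpy as np

def normalize_alpha(diff_map):
    """Normalize the depth-difference map to an alpha map in [0, 1]."""
    lo, hi = diff_map.min(), diff_map.max()
    return (diff_map - lo) / (hi - lo + 1e-8)

def soft_blend(foreground, background, alpha):
    """Soft blend: alpha selects the background in heavily distorted (large depth-diff) regions."""
    a = alpha[..., None]                    # broadcast over the RGB channels
    return a * background + (1.0 - a) * foreground

fg = np.random.rand(16, 16, 3)
bg = np.random.rand(16, 16, 3)
diff = np.random.rand(16, 16) * 4.0
novel_view = soft_blend(fg, bg, normalize_alpha(diff))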
4 Evaluation
To evaluate the performance of our NVS method, we apply our method and the baselines to be compared on the Real-Estate 10K dataset [18]. This dataset contains video frames with corresponding camera matrices (intrinsics, poses) in static scenes with fixed illumination. We use the first frame as the input view, synthesize the novel views at the camera poses of the other frames, and then compare them with the actual images of those frames (ground truth).
NIQE_stand = (2/π) · arctan( |NIQE_pred − NIQE_gt| / NIQE_gt ).    (10)
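Equation (10) translates directly into a small helper; the sketch below assumes the per-image NIQE values are computed elsewhere by an existing NIQE implementation.

import math

def stand_niqe(niqe_pred, niqe_gt):
    """Standardized NIQE of Eq. (10): maps the relative NIQE error into [0, 1)."""
    return (2.0 / math.pi) * math.atan(abs(niqe_pred - niqe_gt) / niqe_gt)

print(stand_niqe(5.2, 4.0))   # about 0.19 for a 30% relative degradation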
Monocular vision cannot perceive the scale of a scene, so the relative pose of the novel view does not match the real pose of the ground truth. Therefore, we utilize the "scale-invariant loss" used in MPI [17] to compute a factor σ that corrects the normalized depth to the proper scale.
4.3 Results
Stand-NIQE (↓)    A      B      C      D      E      F      G      H
RAW-DIBR          0.863  0.75   0.747  0.771  0.638  0.870  0.435  0.164
MPI               0.217  0.122  0.189  0.213  0.117  0.238  0.232  0.217
OURS              0.184  0.043  0.036  0.11   0.05   0.027  0.114  0.087

LPIPS (Alex) (↓)  A      B      C      D      E      F      G      H
RAW-DIBR          1.053  1.031  1.04   1.104  1.131  1.010  0.88   0.512
MPI               0.361  0.238  0.278  0.424  0.334  0.242  0.315  0.481
OURS              0.331  0.206  0.23   0.485  0.303  0.205  0.254  0.359
Fig. 5. The visual quality compared to Raw-DIBR, MPI, and Ground Truth.
our method shows the capability to fix the distortions and adaptively restore the disocclusions while maintaining the sharpness of the novel view. For quantitative evaluation, we calculate the average stand-NIQE and LPIPS scores over all the frames of every scene to show the overall performance, as in Tables 1 and 2; our method shows a relatively lower error across the 8 different scenes. Figure 6 shows the average stand-NIQE over all the scenes at every frame (in all 8 scenes, the camera moves steadily away from the position of the first frame). We observe that as the target camera baseline increases, all the methods tend to produce larger errors in the novel view. Our method maintained a relatively low error in all the frames and a steady slope as the baseline increased.
Fig. 6. The average standard NIQE score at different frames, reflecting the error that grows in the novel views when rendering at a wider baseline.
5 Conclusion
We have proposed a DIBR-based model to tackle the distortions in views synthesized from a single image input, which is challenging for traditional approaches. Our method represents the scene as two layers to handle the different distortions in foreground and background, which are then softly blended by an alpha map generated by the proposed reverse depth mapping test. Experimental results show that our method is able to recover unknown textures (e.g., disocclusions) while maintaining the sharpness of the novel image; this performance is also reflected in the quantitative evaluations on the dedicated dataset.
References
1. Verykokou, S., Ioannidis, C.: An overview on image-based and scanner-based 3D
modeling technologies. Sensors 23(2), 596 (2023)
2. Cheng, J., et al.: A review of visual SLAM methods for autonomous driving vehi-
cles. Eng. Appl. Artif. Intell. 114, 104992 (2022)
3. Fachada, S., et al.: Depth image based view synthesis with multiple reference views
for virtual reality. In: 2018-3DTV-Conference: The True Vision-Capture, Transmis-
sion and Display of 3D Video (3DTV-CON). IEEE (2018)
4. Wolterink, J.M., et al.: Deep MR to CT synthesis using unpaired data. In: Simula-
tion and Synthesis in Medical Imaging: Second International Workshop, SASHIMI
2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, Septem-
ber 10, 2017, Proceedings 2. Springer International Publishing (2017)
5. Usamentiaga, R., Molleda, J., García, D.F.: Fast and robust laser stripe extraction
for 3D reconstruction in industrial environments. Mach. Vis. Appl. 23, 179–196
(2012)
6. Dyer, C.R.: Volumetric scene reconstruction from multiple views. In: Foundations
of Image Understanding. Boston, MA: Springer US, pp. 469–489 (2001). https://
doi.org/10.1007/978-1-4615-1529-6_16
7. Sinha, S., Steedly, D., Szeliski, R.: Piecewise planar stereo for image-based render-
ing. In: 2009 International Conference on Computer Vision (2009)
8. Penner, E., Zhang, L.: Soft 3D reconstruction for view synthesis. ACM Trans.
Graph. (TOG) 36(6), 1–11 (2017)
9. Hedman, P., et al.: Casual 3D photography. ACM Trans. Graph. (TOG) 36(6),
1–15 (2017)
10. Buehler, C., et al.: Unstructured lumigraph rendering. In: Seminal Graphics
Papers: Pushing the Boundaries, vol. 2, pp. 497–504 (2023)
11. Kalantari, N.K., Wang, T.C., Ramamoorthi, R.: Learning-based view synthesis for
light field cameras. ACM Trans. Graph. (TOG) 35(6), 1–10 (2016)
12. Mildenhall, B., et al.: Local light field fusion: practical view synthesis with pre-
scriptive sampling guidelines. ACM Trans. Graph. (TOG) 38(4), 1–14 (2019)
13. Mildenhall, B., et al.: NeRF: representing scenes as neural radiance fields for view
synthesis. Comm. ACM 65(1), 99–106 (2021)
14. Mark, W.R., McMillan, L., Bishop, G.: Post-rendering 3D warping. In: Proceedings
of the 1997 symposium on Interactive 3D graphics (1997)
15. Mori, Y., et al.: View generation with 3D warping using depth information for
FTV. Signal Process. Image Comm. 24(1-2), 65–72 (2009)
16. Shih, M.L., et al.: 3D photography using context-aware layered depth inpainting.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (2020)
17. Tucker, R., Snavely, N.: Single-view view synthesis with multiplane images. In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni-
tion (2020)
18. Zhou, T., et al.: Stereo magnification: learning view synthesis using multiplane
images. arXiv preprint arXiv:1805.09817 (2018)
19. Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image
quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2012)
20. Zhang, R., et al.: The unreasonable effectiveness of deep features as a perceptual
metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2018)
21. Yang, L., et al.: Depth anything: unleashing the power of large-scale unlabeled
data. arXiv preprint arXiv:2401.10891 (2024)
22. Kirillov, A., et al.: Segment anything. In: Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (2023)
23. Suvorov, R., et al.: Resolution-robust large mask inpainting with Fourier convo-
lutions. In: Proceedings of the IEEE/CVF Winter Conference on Applications of
Computer Vision (2022)
24. Lin, W., Kuo, C.C.J.: Perceptual visual quality metrics: a survey. J. Vis. Comm.
Image Representation 22(4), 297–312 (2011)
Towards Real-Time Open World Instance
Segmentation
1 Introduction
Instance segmentation is a crucial task in computer vision with wide-ranging
applications in fields such as education, medicine, and autonomous driving. How-
ever, traditional deep learning models for instance segmentation face limitations
due to their reliance on fixed training sets, reducing their effectiveness in dynamic
real-world scenarios.
To address these limitations, the field of open-world learning has emerged,
introducing approaches like novel feature representations and text-based features
Fig. 1. Example output of our enhanced model: Highlighting both known and unknown
instances in a single image, demonstrating the capability of real-time open-world
instance segmentation.
2 Related Works
2.1 Open World Object Detection (OWOD)
known and unknown instances) and background, delivering better results than
unsupervised methods commonly used in earlier works.
Incremental learning remains a challenge in OWIS, particularly due to the
complexity of generating accurate segmentation masks for new classes without
degrading performance on previously learned ones. Our approach addresses this
with dynamic training strategies, ensuring that the model can continuously learn
new classes while preserving its segmentation capabilities for known objects.
Real-time instance segmentation models must balance fast inference (over 30
FPS) with reasonable accuracy, typically measured by COCO mAP@50-95 scores
ranging from 24 to 40. The YOLO family, including models like YOLACT [3]
and maYOLACT [20], is known for prioritizing speed while delivering sufficient
accuracy, making them popular for applications requiring high frame rates.
Nevertheless, for our needs, SparseInst [5] provides a better balance between
speed and segmentation quality. With a ResNet-50 backbone, it achieves a mAP
of 32.8, placing it among the higher-performing real-time models while maintain-
ing competitive speed. Additionally, SparseInst’s flexible architecture, similar to
D-DETR [34], allows for future improvements, such as domain adaptation or enhanced segmentation, without sacrificing real-time performance. This makes it
not only strong in its current form but also adaptable to open-world segmenta-
tion challenges.
3 Problem Statement
In the context of open-world instance segmentation, during training, a model f is trained on a dataset D = {I, Y}, which contains K known instance classes. The dataset comprises N images and corresponding labels, where I = {I_1, I_2, ..., I_N} represents the images and Y = {Y_1, Y_2, ..., Y_N} represents the labels. Each label Y_i, i ∈ [1, 2, ..., N], consists of J annotated instances, denoted as Y_i = {y_1, y_2, ..., y_J} ⊂ Y, where each y_j is an instance label.
Each instance label contains two parts:
1. A mask, which is a set of points defining the boundary of the instance.
2. An instance label l_j ∈ {0, 1}^K, which is a one-hot vector encoding the class of the instance.
We extend this framework to the open-world setting, following the formulation introduced by Joseph et al. [12], which builds upon the work of Bendale and Boult [2] by enabling models to update themselves incrementally over multiple training episodes. At a given task or time t, there are K_t known in-distribution classes, and the dataset D_t = {I_t, Y_t} consists of N_t images and labels. The object class label is now a (K_t + 1)-dimensional vector l_j ∈ {0, 1}^{K_t+1}, where the additional dimension represents unknown instances.
While there may be an unbounded number of unknown classes, a subset U_t denotes the unknown classes of interest that we aim to segment. When the model identifies unknown instances, they are sent to an oracle (e.g., a human annotator) for labeling. The newly labeled objects are then used to generate a new dataset D_{t+1}, which contains instances of the U_t newly introduced instance classes.
The model is then updated using the new dataset D_{t+1}, the current model f_t, and a limited subset of the previous datasets D_i, where i ∈ {0, 1, ..., t}, to produce an updated model f_{t+1} capable of segmenting K_{t+1} = K_t + U_t instance classes. This cycle can repeat as necessary. Note that "unknown" is not treated as a class, and there are no annotations for instances that are not introduced at time t.
4 Proposed Method
For our Real-Time Open-World Instance Segmentation (ROWIS) model, we
extend SparseInst [5], leveraging its Sparse Instance Activation Map (IAM) for
instance representation. To adapt it to the open-world setting, we introduce a
novel approach.
Like PROB [35], we use separate objectness and class heads for novel objects.
Figure 2 shows the architecture, where both heads are trained and used during
Fig. 2. Our method builds on the efficient encoder-decoder framework of SparseInst [5].
The decoder has two branches: one for generating masks and another for instance acti-
vation maps. Each map is processed by three heads: the kernel head (for mask mul-
tiplication), the classification head (for instance class prediction), and the objectness
head (for foreground-background classification). To handle unknown objects, we train
the objectness head with an advanced energy-based approach using both matched and
unmatched examples. Additionally, a Self-Attention CNN Module is applied to the
mask branch, enhancing semantic understanding and mask quality.
Let m denote the instance activation map, i the foreground instance, and c the class of the instance. Traditional approaches aim to answer the question: "Does this instance activation map correspond to an instance of a specific class?" However, inspired by [35], we decouple instance prediction i and instance class prediction c|i, treating them independently during training and inference. The objectness now becomes p(o|m), and the instance class prediction becomes p(c|i, m).
The objectness head f_o^t(m) is used to predict the probability that an instance activation map corresponds to a foreground object, while the classification head f_c^t(m) predicts the class of the activation map, assuming it corresponds to a foreground object. The class probabilities are obtained with a temperature-scaled softmax:

p_c = exp(z_i / temperature_t) / Σ_{k=1}^{n} exp(z_k / temperature_t).    (3)
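Equation (3) is a standard temperature-scaled softmax over the class logits; a minimal sketch (with an arbitrary temperature value) is:

import torch

def temperature_softmax(logits, temperature=2.0):
    """Eq. (3): p_c = exp(z_i / T) / sum_k exp(z_k / T); larger T gives a softer distribution."""
    return torch.softmax(logits / temperature, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.1])
print(temperature_softmax(logits))          # softer than a plain softmax over the same logits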
– Key, Query, and Value Generation: Key (K), query (Q), and value (V) vectors are generated from the input features X using 1×1 convolution layers.
– Attention Score Computation: The attention scores are computed by taking the dot product of the query and key vectors, followed by softmax normalization: A = SoftMax(QK^T / √d_k), where d_k is the dimensionality of the key vectors.
– Weighted Sum of Values: The weighted sum of values is obtained by multiplying the attention scores with the value vectors: Z = AV, where Z represents the output feature map.
Fig. 3. The self-attention layer takes input features and generates key, query, and value
vectors via 1x1 convolutions. The query and value vectors are downsampled, while the
key remains unchanged. Attention scores are computed using the dot product of key
and query, followed by softmax normalization. The weighted sum of values produces
the self-attention features, which are output by the layer.
We downsample the key and query vectors using bottleneck 1x1 convolution
layers to reduce the parameter overhead while preserving essential information
for generating instance activation maps. Specifically, the key and query vectors
are downsampled with a reduction factor of r. To maintain the original shape of the input features, the value vectors' dimensions remain unchanged, while the shape of the key and query vectors is adjusted to [n, c/r, w, h], where n represents the batch size, and w and h denote the width and height of the input features, respectively.
This downsampling approach enables efficient computation of attention
scores while preserving the spatial resolution of the input features. With this
lightweight implementation, we observe an increase in Average Precision (AP),
particularly at higher IoU thresholds like AP@75, due to the generation of high-
quality masks.
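Following the description in the text (channel-reduced key and query obtained with bottleneck 1×1 convolutions, full-channel value), a lightweight self-attention layer along these lines might look like the sketch below; the channel count and reduction factor are illustrative assumptions rather than the exact configuration used in the model.

import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with bottleneck 1x1 convs reducing key/query channels by a factor r."""

    def __init__(self, channels=256, reduction=8):
        super().__init__()
        reduced = channels // reduction
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)   # [n, c/r, h, w]
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)     # [n, c/r, h, w]
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # value keeps c channels
        self.scale = reduced ** 0.5

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (n, h*w, c/r)
        k = self.key(x).flatten(2)                     # (n, c/r, h*w)
        v = self.value(x).flatten(2).transpose(1, 2)   # (n, h*w, c)
        attn = torch.softmax(q @ k / self.scale, dim=-1)     # A = SoftMax(QK^T / sqrt(d_k))
        z = (attn @ v).transpose(1, 2).reshape(n, c, h, w)   # Z = AV, reshaped to the input size
        return z

x = torch.randn(1, 256, 16, 16)
print(EfficientSelfAttention()(x).shape)     # torch.Size([1, 256, 16, 16])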
4.5 Dataset
We construct our dataset by sampling from MS COCO [14], diverging from pre-
vious methods due to the lack of instance segmentation data in Pascal VOC [9],
commonly used in earlier open-world object detection tasks.
In alignment with the approaches from [10, 12], we split our dataset into four
tasks to progressively evaluate model adaptability:
The COCO classes are divided into four tasks, each with 20 classes. Classes
within tasks are semantically similar, while classes across tasks are semantically
distinct, based on super-category criteria.
At each task t, we sample images from the COCO training set that contain the highest percentage of instances belonging to the classes in task t. For instances labeled with classes from future tasks t′ (where t′ > t), we remove their annotations, both the mask and the label, to ensure there is no overlap between tasks in the initial state.
In the evaluation set, we follow a similar sampling strategy on the COCO validation set. However, for classes in future tasks t′, while the instances remain present in the images, their labels are changed to "unknown" for evaluation purposes, as their true class is yet to be introduced.
To evaluate the model’s incremental learning ability, we increase the difficulty
by accumulating images from the previous evaluation sets. This means that
for task t, the evaluation not only tests the model's performance on the newly
learned classes but also enforces the evaluation on earlier tasks’ datasets to ensure
the model retains knowledge and does not experience significant forgetting.
5 Experiments
Since Pascal VOC lacks clear annotations for instance segmentation tasks, we
created a new dataset based on MS COCO. While this differs slightly from the
OWOD benchmark, we perform comparisons with both the base model and our
proposed updates. Given that COCO evaluation is more challenging than Pascal
VOC, for mAP comparison in the open-world domain, we focus on the incre-
mental learning ability by measuring the percentage change in mAP between
previously and newly learned tasks.
Training Settings. We trained on 2 RTX 4090 GPUs with a batch size of 32.
The base learning rate was set to 5 × 10^{−5} for all tasks, with an initial warm-up
phase followed by three learning rate reductions. We applied data augmentation
techniques such as cropping to enhance dataset robustness. Unlike the standard
OWOD setup, which typically involves training and fine-tuning for each task
(resulting in 8 steps), we perform sampling in the early stages. As a result, we
reduce the total training time, requiring only 4 training steps for the 4 tasks (see
Table 1 for more details).
Evaluation Metrics. For known classes, we use the mean average precision
(mAP) metric. To better understand the quality of continual learning, mAP is
divided into previously seen and newly introduced object classes. For unknown
objects, we adopt the unknown object recall (U-recall) metric, which measures
the ratio of detected unknown objects to the total labeled unknown objects, as
mAP cannot be applied (due to the lack of annotations for all unknown objects).
We also study the confusion between unknown and known objects [12, 31, 35].
Implementation Details. Our implementation is based on the SparseInst
model, using a ResNet-50 backbone with an FPN. The number of Instance Acti-
vation Maps (IAMs) is set to 100, and the dimensionality D of the IAMs is 256.
Due to the limited training data compared to the original dataset and eval-
uation on the same test set, we observed a reduction in mAP relative to the
base model. As expected, the shift to an open-world domain creates a trade-off
between unknown object recall and mAP for known instances. Nonetheless, we
successfully kept the mAP reduction to around 5% compared to the base model,
while significantly enhancing unknown object recall and improving incremen-
tal learning capabilities. Our method outperformed the base model in mAP on
Tasks 3 and 4, demonstrating its robustness in handling incremental learning
tasks. See Table 2.
Table 2. Comparison of mAP and unknown object recall across tasks between the base model and our method. Training on a smaller subset for Task 1 shows a ∼2.1 mAP drop in the base model, while our method dropped ∼2.14% compared with the base model but achieved superior unknown recall and incremental learning performance, maintaining mAP on previous tasks and sustaining performance through Tasks 3 and 4. FPS was measured on a single RTX 4090 24G.
In comparison with open-world object detection models, one of our key evalua-
tion metrics is unknown recall, which measures how well the model can identify
unknown instances or objects in images. To ensure a fair comparison with object
detection models, we convert the segmentation masks into bounding boxes by
creating bounding rectangles from the masks and then recalculating the unknown
recall. The results are shown in Table 3; all OWOD results are reported on the M-OWOD benchmark.
Table 3. Comparison of unknown recall across tasks between our method and other
open-world object detection models. Although benchmarked on different datasets, the
use of the same metrics and evaluation methods provides a reference for understanding
our model’s performance. Our model achieved promising unknown recall, with the
highest recall observed in Task 2.
Table 4. Applying the efficient self-attention layer improves the precision of mask
generation, particularly on small objects and higher precision metrics such as AP@75.
However, this comes with a slight trade-off in inference speed, reducing performance
by approximately 1 FPS.
6 Conclusion
The open-world domain presents significant challenges due to its complexity and
the requirement for models to handle unknown objects. In this work, we adapted
Open World Object Detection to the Open World Instance Segmentation task
and introduced ROWIS, an end-to-end model specifically designed for Open
World Instance Segmentation. We also provided a new dataset to encourage
further exploration in this field, not only for instance segmentation but also for
open-world object detection.
ROWIS demonstrated its ability to adapt to real-world scenarios while bal-
ancing precision, speed, and the ability to handle unknown objects. However,
there are limitations that need further improvement. Notably, there is a trade-
off between precision and recall, which could be addressed in future iterations.
Additionally, the model is currently highly sensitive to hyperparameter tuning
to achieve optimal results. In future work, we aim to develop a more robust
approach that reduces dependency on hyperparameters. We also acknowledge
the potential inconsistencies in our dataset and will continue to refine it for the
benefit of the research community.
References
1. Bello, I., Zoph, B., Le, Q., Vaswani, A., Shlens, J.: Attention augmented con-
volutional networks. In: 2019 IEEE/CVF International Conference on Computer
Vision, ICCV 2019, Seoul, Korea (South), 27 October–2 November 2019, pp. 3285–
3294. IEEE (2019)
2. Bendale, A., Boult, T.E.: Towards open world recognition. In: IEEE Conference on
Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, 7–12
June 2015, pp. 1893–1902. IEEE Computer Society (2015)
3. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmenta-
tion. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV
2019, Seoul, Korea (South), 27 October–2 November 2019, pp. 9156–9165. IEEE
(2019)
4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox,
T., Frahm, J. (eds.) Computer Vision - ECCV 2020 - 16th European Conference,
Glasgow, UK, 23–28 August 2020, Proceedings, Part I. Lecture Notes in Computer
Science, vol. 12346, pp. 213–229. Springer, Cham (2020)
5. Cheng, T., et al.: Sparse instance activation for real-time instance segmentation.
In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR
2022, New Orleans, LA, USA, 18–24 June 2022, pp. 4423–4432. IEEE (2022)
6. Chu, X., et al.: Twins: revisiting the design of spatial attention in vision transform-
ers. In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W.
(eds.) Advances in Neural Information Processing Systems 34: Annual Conference
on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December
2021, virtual, pp. 9355–9366 (2021)
7. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec-
tional transformers for language understanding. In: Burstein, J., Doran, C., Solorio,
T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies,
NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and
Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
8. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image
recognition at scale. In: 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, 3–7 May 2021. OpenReview.net (2021)
9. Everingham, M., Gool, L.V., Williams, C., Winn, J.M., Zisserman, A.: The pascal
visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
10. Gupta, A., Narayan, S., Joseph, K.J., Khan, S., Khan, F.S., Shah, M.: OW-DETR:
open-world detection transformer. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022,
pp. 9225–9234. IEEE (2022)
11. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: 2018 IEEE Confer-
ence on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City,
UT, USA, 18–22 June 2018, pp. 7132–7141. Computer Vision Foundation/IEEE
Computer Society (2018)
12. Joseph, K.J., Khan, S.H., Khan, F.S., Balasubramanian, V.N.: Towards open world
object detection. In: IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2021, virtual, 19–25 June 2021, pp. 5830–5840. Computer Vision
Foundation/IEEE (2021)
13. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)
14. Lin, T., et al.: Microsoft COCO: common objects in context. In: Fleet, D.J., Pajdla,
T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision - ECCV 2014 - 13th Euro-
pean Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V.
Lecture Notes in Computer Science, vol. 8693, pp. 740–755. Springer (2014)
15. Liu, S., et al.: Grounding DINO: marrying DINO with grounded pre-training for
open-set object detection. CoRR abs/2303.05499 (2023)
16. Ma, S., et al.: CAT: localization and identification cascade detection transformer
for open-world object detection. In: IEEE/CVF Conference on Computer Vision
and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023,
pp. 19681–19690. IEEE (2023)
17. Mallya, A., Lazebnik, S.: Packnet: adding multiple tasks to a single network by
iterative pruning. In: 2018 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018, pp. 7765–
7773. Computer Vision Foundation/IEEE Computer Society (2018)
18. Miller, D., Dayoub, F., Milford, M., Sünderhauf, N.: Evaluating merging strategies
for sampling-based uncertainty techniques in object detection. In: International
Conference on Robotics and Automation, ICRA 2019, Montreal, QC, Canada, 20–
24 May 2019, pp. 2348–2354. IEEE (2019)
19. Miller, D., Nicholson, L., Dayoub, F., Sünderhauf, N.: Dropout sampling for robust
object detection in open-set conditions. In: 2018 IEEE International Conference
on Robotics and Automation, ICRA 2018, Brisbane, Australia, 21–25 May 2018,
pp. 1–7. IEEE (2018)
20. Oksuz, K., Cam, B.C., Kahraman, F., Baltaci, Z.S., Kalkan, S., Akbas, E.: Mask-
aware IOU for anchor assignment in real-time instance segmentation. In: 32nd
British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November
2021, p. 228. BMVA Press (2021)
21. Park, J., Woo, S., Lee, J., Kweon, I.S.: BAM: bottleneck attention module. In:
British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, 3–6 Septem-
ber 2018, p. 147. BMVA Press (2018)
22. Radford, A., et al.: Learning transferable visual models from natural language
supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International
Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event.
Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)
23. Rajasegaran, J., Khan, S.H., Hayat, M., Khan, F.S., Shah, M.: iTAML: an incremental task-agnostic meta-learning approach. In: 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19
June 2020, pp. 13585–13594. Computer Vision Foundation/IEEE (2020)
24. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object
detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.
39(6), 1137–1149 (2017)
25. Srinivas, A., Lin, T., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck
transformers for visual recognition. In: IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2021, virtual, 19–25 June 2021, pp. 16519–16529.
Computer Vision Foundation/IEEE (2021)
26. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances
in Neural Information Processing Systems 30: Annual Conference on Neural Infor-
mation Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp.
5998–6008 (2017)
27. Wang, C., Xu, H., Zhang, X., Wang, L., Zheng, Z., Liu, H.: Convolutional embed-
ding makes hierarchical vision transformer stronger. In: Avidan, S., Brostow, G.J.,
Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision - ECCV 2022 - 17th
European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part XX.
Lecture Notes in Computer Science, vol. 13680, pp. 739–756. Springer (2022)
28. Wang, W., Feiszli, M., Wang, H., Malik, J., Tran, D.: Open-world instance seg-
mentation: exploiting pseudo ground truth from learned pairwise affinity. In:
IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022,
New Orleans, LA, USA, 18–24 June 2022, pp. 4412–4422. IEEE (2022)
29. Wang, W., Feiszli, M., Wang, H., Tran, D.: Unidentified video objects: a benchmark
for dense, open-world segmentation. In: 2021 IEEE/CVF International Conference
on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp.
10756–10765. IEEE (2021)
30. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense pre-
diction without convolutions. In: 2021 IEEE/CVF International Conference on
Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021, pp.
548–558. IEEE (2021)
31. Wang, Y., Yue, Z., Hua, X., Zhang, H.: Random boxes are open-world object
detectors. In: IEEE/CVF International Conference on Computer Vision, ICCV
2023, Paris, France, 1–6 October 2023, pp. 6210–6220. IEEE (2023)
32. Xue, X., et al.: Transformer-based open-world instance segmentation with cross-
task consistency regularization. In: El-Saddik, A., et al. (eds.) Proceedings of
the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON,
Canada, 29 October 2023–3 November 2023, pp. 2507–2515. ACM (2023)
33. Zhang, H., et al.: DINO: DETR with improved denoising anchor boxes for end-to-
end object detection. In: The Eleventh International Conference on Learning Rep-
resentations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. OpenReview.net (2023)
34. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable
transformers for end-to-end object detection. In: 9th International Conference on
Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021.
OpenReview.net (2021)
35. Zohar, O., Wang, K., Yeung, S.: PROB: probabilistic objectness for open world
object detection. In: IEEE/CVF Conference on Computer Vision and Pattern
Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023, pp. 11444–
11453. IEEE (2023)
An Attempt to Develop a Neural Parser
Based on Simplified Head-Driven Phrase
Structure Grammar on Vietnamese
1 Introduction
Natural Language Processing (NLP) has witnessed significant advancements
in recent years, propelled by the development of sophisticated models and
algorithms. A critical area within NLP is the development of efficient and accu-
rate parsers, particularly for languages with limited computational resources,
like Vietnamese. Vietnamese, characterized by its tonal nature, complex mor-
phology, and unique syntactic structure, presents particular challenges for parsing
technologies [1, 2]. This study aims to address these challenges by developing
a Vietnamese neural parser using a simplified version of Head-Driven Phrase
Structure Grammar (HPSG) [3, 4].
Our approach involves addressing inconsistencies within the VietTreebank
and VnDT corpora, which are pivotal for Vietnamese NLP [5, 6]. We incorporated
advanced text encoding models, PhoBERT and XLM-RoBERTa, hypothesizing
that these models would enhance the parser’s performance due to their robust
linguistic representation capabilities [7, 8]. Our experiments demonstrate that
the parser achieves an 82% F-score in constituency parsing and shows promising
performance in dependency parsing, outperforming others in the field despite
lower scores in the Labeled Attachment Score (LAS) [9, 10].
In the context of the VLSP 2023 - Vietnamese Constituency Parsing Chal-
lenge, our study also ventures into transforming constituency trees into depen-
dency trees using proposed head-rules implemented with the ClearNLP toolkit
[11, 12]. This transformation is particularly significant, as it was achieved with-
out direct linguistic input. Remarkably, the HPSG Neural Parser [13] achieved
a marginally higher F-score of 89.04%, surpassing established parsers like the
Stanza Constituency Parser, which scored 88.73% [14]. This outcome under-
scores the potential of incorporating linguistic expertise into the development of
Vietnamese NLP tools, an area that has been relatively underexplored.
The rest of this paper is structured in the following manner. Section 2 surveys several current works on Vietnamese parsing. Section 3 briefly describes and analyzes the datasets used for our methodology and baseline. Our methodology is presented in detail in Sect. 4. Section 5 illustrates processes for
experimental settings, implementing models, and our experimental results on
each dataset, and describes the result analysis and discussion of the proposed
approach. In summary, Sect. 6 serves as the final part of our research and outlines
our conclusions and any potential areas for future exploration.
Neural parsing techniques have seen significant innovations, shifting from tra-
ditional rule-based methods to more advanced neural network-based approaches.
Key developments include the minimal span-based neural constituency parsers
by Stern, Andreas, and Klein [17] and the analytical insights into neural con-
stituency parsers provided by Gaddy, Stern, and Klein [18]. These studies have
significantly influenced the field, moving it towards more efficient and accurate
parsing solutions.
The integration of pre-trained models like PhoBERT and XLM-RoBERTa
into parsing has been a game-changer. The introduction of PhoBERT by Nguyen and Tuan Nguyen [7] and the work of Conneau et al. [8] on unsupervised cross-lingual representation learning have demonstrated the potential of these models in
enhancing parsing accuracy and efficiency, especially for languages like Viet-
namese that lack extensive computational resources.
Despite these advancements, Vietnamese parsing faces specific challenges,
such as the complexity of its syntactic structure and limited linguistic resources.
Recent studies have proposed innovative solutions, including the use of head-
rules for tree transformations and leveraging toolkits like ClearNLP to improve
parsing efficiency [1]. Nguyen, Nguyen, and Nguyen [19] presented a Depen-
dency Tree-LSTM approach for Vietnamese sentiment analysis. Trang et al. [20]
proposed a prosodic boundary prediction model to improve Vietnamese speech
synthesis, using both traditional and novel features like syntactic blocks and
links.
In comparing our parser’s performance with others, such as the Stanza Con-
stituency Parser, it becomes evident that while there are similarities in method-
ological approaches, each parser has its unique strengths and limitations. Our
parser’s slightly higher F-score highlights the potential impact of incorporating
linguistic expertise in parser development, a concept that has been relatively
underutilized in Vietnamese NLP [21, 22].
The future of Vietnamese parsing looks promising, with potential impacts
extending beyond the immediate field. The methodologies and findings from this
study could influence future research directions, not only in Vietnamese NLP but
also in the broader context of computational linguistics for other low-resource
languages [23, 24].
3 Corpora
3.1 VTB and VnDT
tagged for parts of speech. This dataset is sourced from “Tuoi Tre1 ”, a Vietnamese
newspaper. In addition, Nguyen et al. [1] introduced a technique to convert Viet-
namese treebanks into dependency trees, particularly important for addressing
Vietnamese linguistic peculiarities. This conversion process led to the creation
of the VnDT Treebank, featuring 10,200 sentences. The treebank was evaluated
using two parsers, MSTParser and MaltParser, with MSTParser showing su-
perior performance in Vietnamese dependency parsing. The VnDT Treebank is
now a publicly available resource that offers significant value for research on
Vietnamese natural language processing (Fig. 1).
Fig. 1. Constituent, dependency, and joint span structures, extracted from the
training datasets of VTB and VnDT and intended solely for visualization
purposes, may contain slight labeling errors. These structures represent the same
Vietnamese sentence, indexed from 1 to 7 and assigned an interval range for each node.
The sentence
The VLSP 2023 Vietnamese Treebank [16] is a collection of about 10,000 Viet-
namese sentences, mostly from news articles and socio-political texts. The cre-
ators used various linguistic methods to handle language ambiguities, and anno-
tators were assisted by automatic tools.
In the VLSP 2023 shared task [15], participants are asked to develop a con-
stituency parser. This parser takes a sentence and produces a tree that shows the
grammatical structure of the sentence. Participants can improve their parsers us-
ing extra Vietnamese text or pre-trained language models. The evaluation uses
Parseval metrics, and the test data includes texts from the same domain as the
training data, as well as new areas like legal and biomedical fields (Fig. 2).
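As a concrete reference for the Parseval evaluation used in the shared task, the sketch below computes labeled-bracket precision, recall, and F-score; representing each constituent as a (label, start, end) triple is a simplifying assumption, not the official scorer.

```python
# A minimal Parseval sketch: labeled constituent spans compared as multisets.
from collections import Counter

def parseval(gold_spans, pred_spans):
    """gold_spans / pred_spans: lists of (label, start, end) triples for one sentence."""
    gold, pred = Counter(gold_spans), Counter(pred_spans)
    matched = sum((gold & pred).values())              # multiset intersection
    precision = matched / max(sum(pred.values()), 1)
    recall = matched / max(sum(gold.values()), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f_score

# Toy example: the prediction misses one gold NP span.
gold = [("S", 0, 7), ("NP", 0, 2), ("VP", 2, 7), ("NP", 3, 5)]
pred = [("S", 0, 7), ("NP", 0, 2), ("VP", 2, 7)]
print(parseval(gold, pred))
```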
We analyzed the training data from the VLSP 2023 Vietnamese Treebank to
understand the structure of Vietnamese sentences. The data shows that nouns
and verbs are the most common parts of speech, appearing 42,584 and 32,456
times respectively. This means that Vietnamese sentences in this dataset often
focus on nouns and verbs, which is typical in formal writing like news articles.
Punctuation marks are also frequent, with 22,819 instances, highlighting the
structured nature of written Vietnamese. The dataset includes various types of
The scoring mechanism within the HPSG Neural Parser utilizes a biaffine at-
tention model [9]. This model accurately scores potential dependency relations
among words, allowing for precise parsing and syntactic relationship establish-
ment in complex sentences.
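To illustrate the biaffine scoring idea of Dozat and Manning [9] referenced here, a minimal PyTorch sketch follows; the projection sizes, activation, and bias handling are illustrative assumptions rather than the parser's exact implementation.

```python
import torch
import torch.nn as nn

class BiaffineScorer(nn.Module):
    """Scores every (dependent, head) word pair with a biaffine form."""

    def __init__(self, dim: int):
        super().__init__()
        self.head_mlp = nn.Linear(dim, dim)   # head-specific projection
        self.dep_mlp = nn.Linear(dim, dim)    # dependent-specific projection
        # The extra row/column absorbs the linear and bias terms.
        self.W = nn.Parameter(torch.randn(dim + 1, dim + 1) * 0.01)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, dim) contextual encodings, e.g. from PhoBERT.
        head = torch.relu(self.head_mlp(h))
        dep = torch.relu(self.dep_mlp(h))
        ones = h.new_ones(*h.shape[:2], 1)
        head = torch.cat([head, ones], dim=-1)
        dep = torch.cat([dep, ones], dim=-1)
        # scores[b, i, j]: score of word j being the head of word i.
        return torch.einsum("bid,de,bje->bij", dep, self.W, head)

scores = BiaffineScorer(dim=768)(torch.randn(1, 7, 768))   # shape (1, 7, 7)
```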
2 https://github.com/stanfordnlp/stanza/blob/main/stanza/utils/training/run_pos.py
Stanza Constituency Parser. Our study involved refining the Stanza con-
stituency parser3 [14] with PhoBERTlarge [7], yielding an 81% F-score. We
trained a solitary model, consisting of one parser, on 90% of the data, allo-
cating the remaining 10% for validation. The training parameters were set with
BERT fine-tuning from epoch 0 to 300, employing the AdamW optimizer, and
a batch size of 32 for training.
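For orientation, a generic sketch of the fine-tuning setup described here (a PhoBERT-large encoder optimized with AdamW at batch size 32) is shown below; it is not Stanza's actual training script, and the learning rate, tokenization, and omitted parser head are assumptions.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-large")
encoder = AutoModel.from_pretrained("vinai/phobert-large")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)   # learning rate assumed

def encode(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state     # contextual token embeddings

# Sentences would come from the 90% training split; word segmentation, the
# constituency head, and the loss/backward step are omitted in this sketch.
loader = DataLoader(["Đây là một câu ví dụ ."], batch_size=32, shuffle=True)
for batch in loader:
    features = encode(list(batch))
```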
Fig. 3. Balancing Constituency and Dependency in Joint Span HPSG Parsing on the
VTB & VnDT Development Sets.
Fig. 4. An attempted version of head rules for the VLSP 2023 Vietnamese Treebank
was developed with a non-linguistic engineering background.
Table 2. The performances of dependency parsing on the VnDT test set. [] represents our replicated PhoNLP result. [] represents the average results of five runs, with the lowest result indicated by [♣]. The red bold font indicates that our result is significantly different from the result of PhoBERTbase [] under a paired t-test. POS denotes part-of-speech.
Model                                   LAS    UAS
PhoNLP w/ PhoBERTbase [30]              78.17  84.95
PhoNLP w/ PhoBERTbase []                77.38  84.44
W/o POS tags  HPSG w/ PhoBERTbase [♣]   77.40  85.05
W/o POS tags  HPSG w/ PhoBERTbase []    77.60  85.18
W/ POS tags   HPSG w/ PhoBERTbase [♣]   77.28  85.01
W/ POS tags   HPSG w/ PhoBERTbase []    77.48  85.22
[♣]) were significantly different from the replicated PhoNLP result, as shown by
the red bold font (LAS: 77.40, UAS: 85.05), while the average of these runs ([])
showed a slightly higher LAS of 77.60 and the best UAS of 85.18. With POS
tags included, the lowest and average results were 77.28 (LAS) and 85.01 (UAS)
for the former, and 77.48 (LAS) and 85.22 (UAS) for the latter, with the lowest
results again marked significantly different in red bold font. This analysis under-
scores the nuanced impact of POS tags on parsing accuracy and demonstrates
the robustness of the HPSG model with PhoBERTbase in dependency parsing
tasks.
Fig. 5. Our tagging and parsing results on the public set of the VLSP 2023 Vietnamese
Treebank.
0.735, which is better than Stanza’s 0.549. In noun phrases (NP), both parsers
did almost the same, with Stanza slightly ahead at 0.826 compared to HPSG’s
0.825.
Both parsers struggled with certain categories like WHADVP, UCP, and VCP,
where they scored zero. This indicates these areas need more attention and
improvement. By looking closely at the scores for different categories, we can
better understand each parser’s strengths and where they need to improve in
processing natural language.
Table 3 presents the results of the VLSP 2023 Shared Task6 , comparing three
parsers: Stanza, HPSG, and Attach-Juxtapose. The performance is measured
using Precision (P), Recall (R), and F-score (F) on both public and private test
datasets.
By comparing the results, we see that the Stanza and HPSG parsers per-
form similarly in the public test, with F-scores of 85.87 and 86.05, respectively.
However, in the private test, the HPSG parser performs slightly better, achiev-
ing an F-score of 89.04 compared to Stanza’s 88.73. This indicates that the
HPSG parser is better at handling unseen data. The Attach-Juxtapose parser
performs slightly lower than Stanza and HPSG overall but shows promising re-
sults when combined with specific models like PhoBERTbase-v2 . This suggests
that the Attach-Juxtapose parser has potential for improvement in future appli-
cations.
6 The VLSP 2023 Workshop Program is available at https://vlsp.org.vn/vlsp2023.
Table 3. Results of the VLSP 2023 Shared Task: Performance metrics of the Stanza
parser [27], HPSG parser [13], and Attach-Juxtapose parser [31]. Note that the result
for the Attach-Juxtapose parser was reported by another team participating in the
shared task. The ‘&’ symbol denotes an ensemble of two language models.
Model Public Test Private Test
P R F P R F
Attach-Juxtapose w/ PhoBERTbase – – 80.55 – – 84.66
Attach-Juxtapose w/ PhoBERTbase-v2 – – 81.09 – – 84.79
Attach-Juxtapose w/ PhoBERTlarge – – 80.44 – – 84.45
Attach-Juxtapose w/ [PhoBERTbase & PhoBERTlarge ] – – 80.87 – – 84.60
Attach-Juxtapose w/ [PhoBERTbase-v2 & PhoBERTlarge ] 82.25 79.97 81.09 83.70 86.06 84.86
Stanza w/ PhoBERTlarge 86.78 84.97 85.87 89.56 87.91 88.73
HPSG w/ PhoBERTlarge 86.84 85.28 86.05 89.56 88.53 89.04
In summary, the results show that the HPSG parser is slightly stronger over-
all, especially in more challenging datasets, while Stanza remains highly com-
petitive. The Attach-Juxtapose parser, although not the strongest here, shows
room for growth with further refinements.
This paper developed a neural parser for Vietnamese using a simplified Head-
Driven Phrase Structure Grammar (HPSG). To address the 15% of tree pairs in
the VietTreebank and VnDT corpora that did not conform to HPSG rules, we
permuted samples from the training and development sets. We then modified
the original parser by incorporating PhoBERT and XLM-RoBERTa models for
Vietnamese text encoding. Our experiments showed that the parser achieved an
82% F-score in constituency parsing and outperformed previous studies in de-
pendency parsing with a higher Unlabeled Attachment Score (UAS). The lower
Labeled Attachment Score (LAS) likely resulted from not consulting linguistic
experts. These results suggest the need for greater linguistic input when devel-
oping Vietnamese treebanks.
References
1. Nguyen, D.Q., Nguyen, D.Q., Pham, S.B., Nguyen, P.-T., Nguyen, M.L.: From
treebank conversion to automatic dependency parsing for Vietnamese. In: Métais,
E., Roche, M., Teisseire, M. (eds.) Natural Language Processing and Information
Systems, pp. 196–207. Springer, Cham (2014). ISBN: 978-3-319-07983-7
2. Nguyen, Q.T., Miyao, Y., Le, H.T.T., Nguyen, N.T.H.: Ensuring annotation consis-
tency and accuracy for Vietnamese treebank. Lang. Resour. Eval. 52(1), 269–315
(2018). ISSN: 1574-0218
3. Pollard, C., Sag, I.A.: Head-Driven Phrase Structure Grammar. The University of
Chicago Press, Chicago (1994)
4. Do, B.L., Le, T.H.: Implementing a Vietnamese syntactic parser using HPSG. In:
The International Conference on Asian Language Processing (IALP) (2008)
5. Nguyen, K.-H.: BKTreebank: building a Vietnamese dependency treebank. In: Pro-
ceedings of the Eleventh International Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources Asso-
ciation (ELRA) (2018)
6. Thi, L.N., My, .H., Viet, H.N., Minh, H.N.T., Hong, P.L.: Building a treebank for
Vietnamese dependency parsing. In: The 2013 RIVF International Conference on
Computing & Communication Technologies - Research, Innovation, and Vision for
Future (RIVF), pp. 147–151 (2013)
7. Nguyen, D.Q., Nguyen, A.T.: PhoBERT: pre-trained language models for Viet-
namese. In: Findings of the Association for Computational Linguistics: EMNLP
2020, pp. 1037–1042. Association for Computational Linguistics (2020)
8. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale.
In: Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 8440-8451. Association for Computational Linguistics (2020)
9. Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing.
In: 5th International Conference on Learning Representations, ICLR 2017, Toulon,
France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net (2017)
10. Chomsky, N.: The Pisa Lectures. De Gruyter Mouton, Berlin, New York (1993).
ISBN: 9783110884166
11. de Marneffe, M.-C., MacCartney, B., Manning, C.D.: Generating typed depen-
dency parses from phrase structure parses. In: Proceedings of the Fifth Interna-
tional Conference on Language Resources and Evaluation (LREC’06). European
Language Resources Association (ELRA), Genoa, Italy (2006)
12. Ma, X., Zhang, X., Zhao, H., Lu, B.-L.: Dependency parser for Chinese constituent
parsing. In: CIPS-SIGHAN Joint Conference on Chinese Language Processing
(2010)
13. Zhou, J., Zhao, H.: Head-driven phrase structure grammar parsing on PENN tree-
bank. In: Proceedings of the 57th Annual Meeting of the Association for Compu-
tational Linguistics, Florence, Italy, pp. 2396–2408. Association for Computational
Linguistics (2019)
14. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: a python nat-
ural language processing toolkit for many human languages. In: Celikyilmaz, A.,
Wen, T.-H. (eds.) Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics: System Demonstrations, pp. 101-108. Association for
Computational Linguistics (2020)
15. Nguyen, T.-M.-H., Vu, X.-L., Ha, M.-L.: VLSP 2023 challenge on Vietnamese Constituency Parsing (2023)
16. Nguyen, P.-T., Vu, X.-L., Nguyen, T.-M.-H., Nguyen, V.-H., Le, H.-P.: Building a
large syntactically-annotated corpus of Vietnamese. In: Proceedings of the Third
Linguistic Annotation Workshop (LAW III), pp. 182–185. Association for Compu-
tational Linguistics, Suntec, Singapore (2009)
17. Stern, M., Andreas, J., Klein, D.: A minimal span-based neural constituency parser.
In: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 818-827. Association
for Computational Linguistics (2017)
18. Gaddy, D., Stern, M., Klein, D.: What’s Going on in neural constituency parsers?
An analysis. In: Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies,
Volume 1 (Long Papers), New Orleans, Louisiana, pp. 999–1010. Association for
Computational Linguistics (2018)
19. Nguyen, V.D., Nguyen, K.V., Nguyen, N.L.-T.: Variants of long short-term memory
for sentiment analysis on Vietnamese students’ feedback corpus. In: 2018 10th
International Conference on Knowledge and Systems Engineering (KSE), pp. 306–
311 (2018)
20. Trang, N.T.T., Ky, N.H., Rilliard, A., d’Alessandro, C.: Prosodic boundary pre-
diction model for Vietnamese text-to-speech. In: Interspeech 2021, Brno, Czech
Republic, pp. 3885–3889. ISCA (2021)
21. Linh, H.M., Huyen, N.T.M., Luong, V.X., Luong, N.T., Hue, P.T., Cuong, L.V.:
VLSP 2020 shared task: universal dependency parsing for Vietnamese. In: Pro-
ceedings of the 7th International Workshop on Vietnamese Language and Speech
Processing, Hanoi, Vietnam, pp. 77–83. Association for Computational Linguistics
(2020)
22. Nguyen, K.V., Nguyen, N.L.-T.: Vietnamese transition-based dependency parsing
with supertag features. In: 2016 Eighth International Conference on Knowledge
and Systems Engineering (KSE), pp. 175–180 (2016)
23. Nguyen, B.D., Nguyen, K.V., Nguyen, N.L.-T.: LSTM easy-first dependency pars-
ing with pre-trained word embeddings and character-level word embeddings in
Vietnamese. In: 2018 10th International Conference on Knowledge and Systems
Engineering (KSE), pp. 187–192 (2018)
24. Nguyen, D.Q.: A neural joint model for Vietnamese word segmentation, POS tag-
ging and dependency parsing. In: Proceedings of the The 17th Annual Workshop of
the Australasian Language Technology Association. Sydney, Australia, pp. 28–34.
Australasian Language Technology Association (2019)
25. Nguyễn, T.M.H., Romary, L., Rossignol, M., Vũ, X.L.: A lexicon for Vietnamese
language processing. Lang. Resour. Eval. 40, 291–309 (2006)
26. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st In-
ternational Conference on Neural Information Processing Systems, Long Beach,
California, USA, pp. 5998–6008. Curran Associates Inc. (2017)
27. Bauer, J., Bui, H., Thai, V., Manning, C.: In-order transition-based parsing for
Vietnamese. J. Comput. Sci. Cybernet. 39(3), 207–221 (2023)
28. Vu, T., Nguyen, D.Q., Nguyen, D.Q., Dras, M., Johnson, M.: VnCoreNLP: a Viet-
namese natural language processing toolkit. In: Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguistics,
Demonstrations, New Orleans, Louisiana, pp. 56–60. Association for Computa-
tional Linguistics (2018)
29. Tran, T.-V., Pham, X.-T., Nguyen, D.-V., Nguyen, K.V., Nguyen, N.L.-T.: An
empirical study for Vietnamese constituency parsing with pre-training. In: 2021
RIVF International Conference on Computing and Communication Technologies
(RIVF), pp. 1–6 (2021)
30. Nguyen, L.T., Nguyen, D.Q.: PhoNLP: a joint multi-task learning model for Viet-
namese part-of-speech tagging, named entity recognition and dependency parsing.
In: Proceedings of the 2021 Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Language Technologies: Demon-
strations, pp. 1–7. Association for Computational Linguistics (2021)
31. Yang, K., Deng, J.: Strongly incremental constituency parsing with graph neural
networks. In: Proceedings of the 34th International Conference on Neural Infor-
mation Processing Systems. NIPS ’20, Vancouver, BC, Canada. Curran Associates
Inc. (2020). ISBN: 9781713829546
Knowledge Distillation for Lumbar Spine
X-ray Classification
1 Introduction
The human spine encompasses seven cervical vertebrae, twelve thoracic verte-
brae, five lumbar vertebrae, five sacral vertebrae, and four coccygeal vertebrae
[17]. The spine has a double S-shaped curve, facilitating movement flexibility.
Spondylosis symptoms frequently manifest in the lumbar and cervical regions, as
these areas endure the most significant strain, supporting the whole body [12].
Lumbar spondylosis is a chronic, progressively degenerative disorder that
induces discomfort, limits mobility, and results in deformity of the lumbar spine
[9]. The principal cause of this disorder is the aging process [1]. Moreover, addi-
tional factors encompass heredity [14], adverse living environment, and inade-
quate nourishment for the body [9]. As society advances and working circum-
stances enhance, the prevalence of office labor has risen; individuals now spend
most of their time seated and are less physically active [13]. This explains why, while the principal etiology of this disorder is aging (impacting 80% of patients over 40 years old), the disease is progressively affecting younger demographics, evidenced by a notable increase in patients aged 20 to 29 exhibiting symptoms of lumbar spondylosis [10].
Numerous imaging modalities are presently accessible for evaluating spinal
disease, including X-rays, computed tomography (CT), and magnetic resonance
imaging (MRI). Doctors frequently advise patients to choose lumbar X-rays [17],
as this technique offers extensive insights into spinal health, encompassing spinal
alignment, vertebral anatomy, bone cortex integrity, and degenerative or trau-
matic conditions [9], while also yielding prompt results at a minimal cost and
being readily available at numerous clinics and healthcare establishments [17].
Medical imaging informatics denotes using information and communication
technology (ICT) in healthcare imaging services. Due to the prevailing trend of
global aging [17] and a decline in physical activity linked to work patterns [13],
spine illnesses are becoming more prevalent. With the rising patient population,
physicians encounter an escalating workload that affects their diagnostic efficacy
[11]. The authors acknowledge the significance and necessity of facilitating the
swift identification of lumbar spondylosis by X-ray imaging methods.
Our main contributions are summarized as follows:
The paper is organized as follows. Section 2 of the paper reviews related works.
Section 3 discusses the pathogenesis. The proposed system is presented in Sect. 4.
Section 5 presents the dataset and experiment, and Sect. 6 is a conclusion.
2 Related Work
In 2022, Trinh et al. conducted an extensive study on techniques for identify-
ing lumbar disc herniation using deep learning networks applied to X-ray images
[17]. Their research highlighted the potential of neural networks in medical image
analysis, particularly in identifying disc herniation, a common condition affect-
ing the lumbar spine. In the same year, Trinh et al. introduced the LumbarNet
model, a specialized deep-learning network designed to diagnose lumbar disc her-
niation from X-ray images autonomously. The goal of LumbarNet was to enhance
both the accuracy and efficiency of diagnostic processes. After a thorough evalua-
tion, the model achieved an impressive accuracy of 88.83% in vertebrae detection,
underscoring its potential in clinical settings [18].
Further developments in the field were seen with the work of Zhang et al.,
who proposed a novel approach using deep learning techniques for identifying
osteoporosis from a dataset comprising 1,616 X-ray scans, augmented by two
additional datasets of 204 and 396 images, respectively. The findings of this
research were promising, demonstrating the viability of using deep learning for
osteoporosis screening based on X-ray images [19]. This body of work illustrated
the growing interest in leveraging AI for bone disease detection, showing promis-
ing results in both accuracy and computational efficiency.
In a related study, Kong et al. advanced the diagnosis of fractures through
deep learning models, further expanding the application of AI in musculoskele-
tal imaging [8]. Moreover, a noteworthy contribution by Hong et al. involved the
development of a model capable of simultaneously detecting osteoporosis and
fractures, demonstrating the potential for multi-condition detection using a sin-
gle model framework [5]. These studies underscore the breadth of deep learning
applications in spinal and bone-related conditions.
Discussion. Despite these advancements, the application of deep learning tech-
niques specifically for identifying lumbar spondylosis from X-ray images has not
yet received significant attention. Existing research has largely focused on indi-
vidual issues affecting the lumbar spine, such as disc degeneration, herniation,
osteoporosis, and vertebral displacement. While these conditions are components
of lumbar spondylosis, comprehensive studies exploring the use of deep learning
for the full detection of this condition remain scarce. Lumbar spondylosis, which
involves the degeneration of intervertebral discs and joints and the formation
of bone spurs, requires a more focused approach to deep learning-driven diag-
nosis. This paper addresses this gap by proposing and evaluating the effective-
ness of two deep learning models-EfficientNet-B4 and MobileNetV2-employed
within a knowledge distillation framework. Specifically, EfficientNet-B4 acts as
the Teacher model during the initial training phase, guiding the learning pro-
cess for MobileNetV2, which functions as the Student model. This approach
aims to optimize both the efficiency and accuracy of the detection and classi-
fication of lumbar spondylosis in X-ray images. Knowledge distillation ensures
that the more computationally efficient MobileNetV2 model inherits the strong
performance characteristics of the larger EfficientNet-B4 model, facilitating its
3 Pathogenesis
The lumbar spine has five vertebrae designated as L1, L2, L3, L4, and L5.
According to Kirkaldy-Willis, each vertebra possesses an intervertebral linking
structure including a complex of three joints (The information is laid out in
Fig. 1), which contains one disc anteriorly and two facet joints posteriorly. This
construction facilitates flexible joint movement while ensuring stability. Due to
the interrelated nature of this system, injury to one joint adversely impacts the
others.
Initially, lumbar spondylosis tends to manifest in a limited number of verte-
brae, most commonly at the L4-L5 or L5-S1 levels. These vertebrae are located
at the lower segment of the vertebral column, where they support the body’s
weight, and are subjected to significant mechanical stress due to their position in
the spinal curvature. Over time, if the condition is not diagnosed and managed
early, these degenerative changes may spread to other adjacent vertebrae within
the lumbar spine.
Figure 2 presents a detailed depiction of the spectrum of degenerative change,
illustrating the initial damage occurring in two directions (posterior joint and intervertebral disc), leading to the development of intricate multi-level degenera-
tive lesions. These modifications underscore the significance of early identifica-
tion and prompt action to avert future decline in spine health.
For damage oriented toward the posterior joint, synovial reactions, cartilage destruction, and osteophyte formation contribute to capsular laxity, subluxa-
tion, and lateral nerve entrapment. These biomechanical alterations can lead to
spinal instability, which worsens over time, impacting the overall function of the
lumbar spine. As these conditions persist, the enlargement of articular processes
and osteophytes at the vertebral bodies leads to multilevel degenerative lesions, which characterize advanced stages of lumbar spondylosis. The entire process is illustrated in Fig. 3, and the specifics will be addressed in Sect. 3.1.
Lesions in the direction of the intervertebral disc can manifest as circumfer-
ential and radial tears, which may further progress to internal disruption and
herniation. Herniation, in particular, is a critical event that affects the overall
stability of the lumbar region, potentially resulting in a reduction in disc height
and disc resorption. Specifics will be addressed in Sect. 3.2, and Fig. 4 illustrates the entire process.
Since the changes caused by these microtraumas are minimal at first, patients
often find it difficult to identify any abnormal alterations in their body, and no
significant symptoms may be apparent. However, as these injuries accumulate
over time, they progressively undermine the integrity of the intervertebral disc,
setting the stage for more significant structural damage.
As the process advances, the microtraumas give rise to circumferential tears
within the outer layers of the annulus fibrosus, which is the tough, fibrous ring
surrounding the softer core of the disc. These tears mark the beginning of disc
degeneration and can induce discomfort, particularly on the external surface
of the disc. This represents the initial phase of the degeneration process, during
which the disc’s ability to maintain its structure and function is compromised. As
these circumferential tears accumulate and spread, they may eventually coalesce
into radial tears. These deeper tears extend from the outer layers of the annulus
fibrosus into the central, gelatinous nucleus pulposus of the disc. The formation
of radial tears significantly weakens the disc, making it more susceptible to disc
herniation, which occurs when the nucleus pulposus is displaced outward through
the compromised annulus fibrosus.
As degeneration progresses, the disc’s height begins to decrease. This reduc-
tion in height is directly related to the disc’s declining ability to retain water,
as the structural damage impairs the disc’s capacity to absorb and hold onto
moisture. The desiccation or drying out of the disc leads to a loss of its cushion-
ing properties, which are essential for absorbing mechanical forces in the spine.
The onset of instability in the spinal joints often accompanies this loss of disc
height. The weakening of the disc, combined with the loosening of the joints,
compromises the stability of the entire lumbar region.
As these degenerative changes continue to accumulate, the disc sustains fur-
ther damage. New tears form within the annulus fibrosus, compounding the
existing injuries and further reducing disc height. This ongoing degeneration
reduces the overall space within the disc, effectively diminishing its ability to
perform its biomechanical functions. At the same time, in response to the insta-
bility and reduced disc space, the body begins to form osteophytes, or bone
spurs, around the affected vertebrae. These bony projections, which develop as
the body attempts to stabilize the spine, can contribute to additional complica-
tions, such as nerve impingement and further joint stiffness. As the degeneration
advances to this stage, the once-flexible and resilient intervertebral disc becomes
severely compromised, and the affected spinal segments may experience chronic
pain and reduced mobility.
4 Proposed System
4.1 Convolutional Neural Network
A Convolutional Neural Network (CNN) is a deep learning model specifically designed to process grid-like data, such as images, by automatically extracting features through different filters across multiple layers. CNNs are widely applied in various fields such as image classification and object detection.
872 images met the standard for image quality. The images were manually anno-
tated by a Level 1 specialist radiologist and later validated by another radiolo-
gist. The labeled images were then divided into two groups: Normal images and
Abnormal images (Table 1 shows an overview of the number of photographs in
each group; Fig. 5 showcases sample images of cases from the two groups).
Fig. 5. The image illustrates our dataset. The top 4 images are normal and the bottom
4 images are abnormal.
5.2 Experiments
The Teacher model, EfficientNet-B4, was employed for training on the dataset
described in Table 1. This dataset includes 872 images, consisting of 422 normal
images and 450 images showing various stages of lumbar degenerative disease.
Each image was carefully curated to provide a balanced dataset for the clas-
sification task, allowing the model to learn key features associated with both
healthy and degenerated lumbar conditions. However, one of the primary chal-
lenges with this dataset was the variability in the original image sizes. The width
of the images ranged from 918 to 1853 pixels, while the height varied between
477 and 957 pixels, making it necessary to preprocess the images to a standard
size before feeding them into the model.
To ensure uniformity and compatibility with the EfficientNet-B4 architec-
ture, all images were resized to a fixed dimension of 380 × 380 pixels. This resiz-
ing was essential for maintaining the model’s performance, as EfficientNet-B4
relies on consistent input sizes to perform efficiently, especially in large-scale
image classification tasks. By standardizing the image dimensions, we not only
facilitated the model’s training process but also ensured that the spatial fea-
tures of the images were preserved as much as possible, allowing the Teacher model to better detect subtle differences between normal and degenerated lum-
bar spine images. This preprocessing step was crucial to adapting the dataset to
the EfficientNet-B4 architecture, optimizing the model’s ability to learn from a
diverse range of input images.
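A minimal sketch of this resizing step, assuming standard torchvision transforms; the normalization statistics are ImageNet defaults and are an assumption, not a value reported by the authors.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((380, 380)),   # fixed input size used for EfficientNet-B4
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# tensor = preprocess(Image.open("lumbar_xray.png").convert("RGB"))
```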
We executed training for 500 epochs with a batch size of 16 to facilitate regular model updates. The learning rate was set at 0.001 to avert instability and ensure efficacy throughout training. The Teacher model (EfficientNet-B4) was pre-trained to produce soft labels for the dataset samples. These labels provide additional information on the probability of each class for training the Student model (MobileNetV2). This aims to assist the Student model in boosting its capacity to learn from the intricate and more substantial traits identified by the Teacher model, hence improving its generalization ability.
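A minimal sketch of this teacher-student step, following the classic soft-label distillation loss of Hinton et al. [4]; the temperature, mixing weight, optimizer, and two-class heads are assumptions used for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

teacher = models.efficientnet_b4(weights="IMAGENET1K_V1")
student = models.mobilenet_v2(weights=None)
teacher.classifier[-1] = torch.nn.Linear(teacher.classifier[-1].in_features, 2)
student.classifier[-1] = torch.nn.Linear(student.classifier[-1].in_features, 2)
teacher.eval()                       # assumed already trained on the X-ray dataset

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)                  # ground-truth term
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),       # soft-label term
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)   # lr 0.001 as in the text
images = torch.randn(16, 3, 380, 380)                         # one batch of size 16
labels = torch.randint(0, 2, (16,))
with torch.no_grad():
    teacher_logits = teacher(images)
optimizer.zero_grad()
loss = distillation_loss(student(images), teacher_logits, labels)
loss.backward()
optimizer.step()
```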
Table 2 points out that the Teacher model has commendable classification
accuracy on the dataset. The F1-score is high at 93%, signifying a desirable
equilibrium between Precision and Recall. This is noteworthy, signifying that the
model not only accurately identifies positive cases but also reduces the incidence
of incorrect predictions for the positive class.
The Student model (MobileNetV2) will utilize the Knowledge Distillation technique to assimilate the Teacher model's outcomes via labels, enhancing the model's image recognition proficiency in classifying lumbar spine X-ray pictures. The experimental procedure of the Knowledge Distillation model on the Test set is depicted in Fig. 6, demonstrating the model's exceptional capacity to predict True Positive and False Positive values, with a negligible number of wrong predictions. Therefore, our experimental results show that the MobileNetV2 model after distillation has higher accuracy (91%) than the MobileNetV2 model without distillation (90%). In addition, compared to the Teacher model, the Student model is 5.5 times smaller in size but the Accuracy is only reduced by 2% (the details can be seen in Table 2).
Fig. 6. Confusion matrix for the Knowledge Distillation model on the Test set.
Fig. 7. The loss function values of the Teacher and Student models.
Moreover, Fig. 7 illustrates the training process of both the Teacher and Stu-
dent models over 500 epochs. The initial loss value of the Teacher model was
0.6525, higher than that of the Student model, which started at 0.5254. The
loss for both models began to decrease rapidly and then gradually stabilized
around epoch 150, indicating that the models were converging well. By the end
of the training process, the loss value of the Teacher model was 0.0385, repre-
senting a 94% reduction compared to the initial value. This final loss was also lower than that of the Student model, which ended at 0.0513. These results suggest
that although the Student model initially had an advantage due to its simpler
structure, allowing it to quickly capture basic features, the more complex archi-
tecture of the Teacher model ultimately outperformed it by better learning and
optimizing important features from the data. This lays the groundwork for the
Knowledge Distillation process, as the lower loss of the Teacher model indicates
higher accuracy, enabling the Teacher’s knowledge to be transferred to enhance
the performance of the Student model.
6 Conclusion
In this research, we have collected and published a comprehensive dataset of
lumbar spine X-ray images, which includes 872 images. The dataset we offer is
dependable, as all photos have undergone quality verification with the EGQC-
DRI scale, a respected instrument that assists radiologists in evaluating the qual-
ity of medical images prior to diagnosis. The dataset labels were initially assigned
by one radiologist and subsequently validated by another radiologist. Addition-
ally, this paper delves into the study of Convolutional Neural Network (CNN)
architectures and the technique of knowledge distillation to effectively detect
and classify images based on our dataset. The results obtained from the testing
process demonstrate that our model exhibits high efficiency in image recognition
and classification tasks. Although the distilled model does not achieve the same
level of accuracy as the teacher model, it still outperforms the original model.
This promising outcome paves the way for future advancements, offering signif-
icant potential for improving diagnostic accuracy and effectiveness in medical
image analysis.
References
1. Buckwalter, J.A., Saltzman, C., Brown, T.: The impact of osteoarthritis: implica-
tions for research. Clin. Orthop. Relat. Res. (1976-2007) 427, S6–S15 (2004)
2. Doktor, K., Vilholm, M.L., Hardardóttir, A., Christensen, H.W., Lauritsen, J.:
European guidelines on quality criteria for diagnostic radiographic images of the
lumbar spine: an intra- and interobserver reproducibility study. Chiropractic Manual
Ther. 27, 1–6 (2019)
3. Grivas, T.B., et al.: Are the spinal changes in the course of scoliogeny primary but
secondary? J. Clin. Med. 13(8), 2163 (2024)
4. Hinton, G.: Distilling the Knowledge in a Neural Network. arXiv preprint
arXiv:1503.02531 (2015)
5. Hong, N., et al.: Deep-learning-based detection of vertebral fracture and osteo-
porosis using lateral spine X-ray radiography. J. Bone Mineral Res. 38(6), 887–895
(2020)
6. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861 126 (2017)
7. Kabir, M.M., Mridha, M.F., Rahman, A., Hamid, M.A., Monowar, M.M.: Detec-
tion of COVID-19, pneumonia, and tuberculosis from radiographs using AI-driven
knowledge distillation. Heliyon 10(5) (2024)
8. Kong, S.H., et al.: Development of a spine X-ray-based fracture prediction model
using a deep learning algorithm. Endocrinol. Metab. 37(4), 674–683 (2022)
9. Middleton, K., Fish, D.E.: Lumbar spondylosis: clinical presentation and treatment
approaches. Curr. Rev. Musculoskelet. Med. 2, 94–104 (2009)
10. Rothschild, B.: Lumbar spondylosis. Emedicine Publication (2008)
11. Sabri, N., Hamed, H.N.A., Ibrahim, Z., Ibrahim, K.: 2D photogrammetry image of
scoliosis Lenke type classification using deep learning. In: 2019 IEEE 9th Interna-
tional Conference on System Engineering and Technology (ICSET), pp. 437–440.
IEEE (2019)
12. Sasiadek, M.J., Bladowska, J.: Imaging of degenerative spine disease-the state of
the art. Adv. Clin. Exp. Med. 21(2), 133–142 (2012)
13. Shrestha, N., et al.: Workplace interventions for reducing sitting at work. Cochrane
Database Syst. Rev. 6 (2018)
14. Spector, T.D., MacGregor, A.J.: Risk factors for osteoarthritis: genetics.
Osteoarthr. Cartil. 12, 39–44 (2004)
15. Srinivasu, P.N., et al.: Classification of skin disease using deep learning neural
networks with MobileNet V2 and LSTM. Sensors 21(8), 2852 (2021)
16. Tan, M.: Efficientnet: rethinking model scaling for convolutional neural networks.
arXiv preprint arXiv:1905.11946 (2019)
17. Trinh, G.M., et al.: Detection of lumbar spondylolisthesis from X-ray images using
deep learning network. J. Clin. Med. 11(18), 5450 (2022)
18. Trinh, G.M., et al.: LumbarNet: A Deep Learning Network for the Automated
Detection of Lumbar Spondylolisthesis From X-Ray Images (2022)
19. Zhang, B., et al.: Deep learning of lumbar spine X-ray for osteopenia and osteoporo-
sis screening: a multicenter retrospective cohort study. Bone 140, 115561 (2020)
20. Zhang, P., Yang, L., Li, D.: EfficientNet-B4-Ranger: a novel method for green-
house cucumber disease recognition under natural complex environment. Comput.
Electron. Agric. 176, 105652 (2020)
Forecasting Traffic Flow Under
Uncertainty: A Case Study in Da Nang
Doan Phuoc Mien1(B), Tran The Vu2(B), and Ngo Van Sy3
1 Tra Vinh University, Tra Vinh City, Viet Nam
[email protected]
2 VN-UK Institute for Research and Executive Education, The University of Da Nang, Da Nang, Viet Nam
[email protected]
3 Vietnam Research Institute of Electronics, Informatics and Automation Da Nang, Da Nang, Viet Nam
1 Introduction
2 Literature Review
2.1 Related Research on Traffic Flow Prediction
The study [12] introduced Graph Neural Networks (GNNs), a type of neural net-
work based on graphs, utilized in structured data processing tasks like graphs or
networks. GNNs can handle structured and heterogeneous data sequences, aiding
in the prediction of traffic flow based on traffic network information. In [7], the
development of a deep learning model for predicting traffic flow was discussed. The
authors developed a deep learning model capable of accurately predicting non-
linear, space-time dependent traffic flows. However, this method was only tested
on two specific events and requires further experimentation and verification across
a broader range of data and traffic scenarios. The research [13] presented the use of
IoT and intelligent algorithms for collecting data from multiple sources and pro-
cessing this information to enhance traffic flow performance. The authors focused
on evaluating and comparing intelligent techniques used in traffic flow prediction
to understand the strengths and limitations of each method. However, the lack of
detailed presentation regarding the quantity of data could impact the objectivity
of the comparison results. In the study [8, 9, 14], the authors employed multiscale
temporal smoothing to address data loss and fill missing time intervals in traffic
flow data within an LSTM network. Although the results achieved high accuracy,
further research and testing on diverse datasets are needed to ensure the method’s
generalizability and reliability.
3 Methodology
3.1 Problem
Equation (1) represents the simplified problem and its components when the conditions of the road network have been predetermined.
\[
\hat{x}_{t+1} = f(x_t, p_t, x_{t-1}, p_{t-1}, x_{t-2}, p_{t-2}, \ldots, x_{t-n}, p_{t-n}) \qquad (1)
\]
In this context, $x$ represents the current traffic state at time $t$, $p$ refers to the parameters affecting the traffic state, $n$ is the sample size, $\hat{x}$ is the predicted traffic state, and $f$ denotes the model applied to the historical data of the traffic state and its parameters to make predictions. Each component is now detailed further before exploring different prediction models.
ARIMA Model. The ARIMA model is a popular method in time series anal-
ysis for predicting future data based on past information. This model integrates
three main components:
where:
– $X_t$: the time series data at time $t$.
– $p$: the number of autoregressive parameters.
– $d$: the degree of differencing required to make the time series stationary.
– $q$: the number of moving average parameters.
– $L$: the lag operator, $L^i X_t = X_{t-i}$.
– $\phi_i$: the coefficients of the AR part.
– $\theta_i$: the coefficients of the MA part.
– $\varepsilon_t$: the error (residual) at time $t$.
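Putting these components together, a standard ARIMA($p$, $d$, $q$) specification consistent with the symbols above is (a sketch using the usual sign convention, which the original equation may state slightly differently):
\[
\Bigl(1 - \sum_{i=1}^{p} \phi_i L^{i}\Bigr)(1 - L)^{d} X_t = \Bigl(1 + \sum_{i=1}^{q} \theta_i L^{i}\Bigr)\varepsilon_t
\]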
The ARIMA model remains a fundamental tool in time series analysis, widely
used in forecasting economics, finance, weather, and many other application
fields. It provides a systematic method to consider both trend and cyclicality in
historical data, allowing users to generate well-founded and informed predictions
[10].
Additionally, other statistical models such as Holt-Winters and Seasonal
ARIMA (SARIMA) can also be utilized for traffic flow prediction. These models
consider seasonality in the data, which is particularly useful for predicting traffic
patterns that exhibit regular seasonal variations.
cell state. It takes $h_{t-1}$ as input and outputs a number in $[0, 1]$ for each number in the cell state $C_{t-1}$. An output of 1 means the information is retained (remembered), and 0 means it is discarded (forgotten).
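In the usual LSTM notation, this forget-gate step can be sketched as (with $x_t$ the current input; the exact parameterization is assumed here rather than given in the text):
\[
f_t = \sigma\bigl(W_f [h_{t-1}, x_t] + b_f\bigr), \qquad f_t \odot C_{t-1}
\]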
Time series forecasting models are often represented as the sum of linear and
nonlinear components as shown in Eq. 8.
\[
y_t = L_t + A_t \qquad (8)
\]
where $A_t$ represents the linear component in the time series while $L_t$ represents the nonlinear component. In the hybrid model, $A_t$ is predicted using the ARIMA model, then $L_t$ is predicted using the LSTM model. The error values are calculated according to formulas 9 and 10.
\[
ARI_{ts} = 2 - LS_{ts} \qquad (12)
\]
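A minimal sketch of the hybrid decomposition described here, fitting ARIMA for the linear part and a small LSTM on its residuals before summing the two forecasts; the synthetic data, order (2, 1, 2), window length, and training settings are assumptions, and this is not the authors' HAIVAN-ALSTM implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from statsmodels.tsa.arima.model import ARIMA

flow = np.random.default_rng(0).poisson(50, size=500).astype("float64")  # toy counts

# 1) Linear component A_t: ARIMA fit, keep in-sample residuals.
arima = ARIMA(flow, order=(2, 1, 2)).fit()
residuals = np.asarray(arima.resid)

# 2) Nonlinear component L_t: LSTM trained on sliding windows of the residuals.
win = 12
X = np.stack([residuals[i:i + win] for i in range(len(residuals) - win)])
y = residuals[win:]
X_t = torch.tensor(X, dtype=torch.float32).unsqueeze(-1)
y_t = torch.tensor(y, dtype=torch.float32).unsqueeze(-1)

class ResidualLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = ResidualLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X_t), y_t)
    loss.backward()
    opt.step()

# 3) Combined forecast: ARIMA level forecast plus LSTM residual forecast.
a_next = float(arima.forecast(steps=1)[0])
l_next = model(torch.tensor(residuals[-win:], dtype=torch.float32).view(1, win, 1)).item()
print("hybrid one-step forecast:", a_next + l_next)
```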
The author utilized a web and mobile app for collecting and storing online data. From the application, we can track both historical and current data of camera nodes installed in the surveillance camera system at 0511.vn. The data, collected from 2017 to May 2023, are stored for each monitored roadway across three collection time frames: 5–9 h, 9–12 h, and 13–17 h.
For predicting traffic flow, the author utilized 190,000 input images, dividing
the data into an 80-20 split (80% for training, equating to 152,000 images, and
20% for testing, equating to 38,000 images). The data in both the training and
testing sets represent all classes, which is particularly important for addressing
class imbalance. To ensure this, the author employed techniques like upsampling
and downsampling during the automatic data splitting process.
4.2 Evaluation
To assess the effectiveness of the proposed method, the author conducted exper-
iments on real data from the website 0511.vn, at a traffic junction in Da Nang
city, Vietnam. The data included information on traffic flow, congestion status,
details about special events, and other traffic-related information.
The model was compared with other currently popular models such as SAEs,
Random Forest, CNNs, LSTM based on criteria such as RMSE (Root Mean
Squared Error of the model), Accuracy (Ratio of correct predictions out of the
total samples), Precision (Ratio of true positive predictions to the total posi-
tive predictions (true positives + false positives)), Recall (Ratio of true positive
predictions to the total actual positives (true positives + false negatives)), and
F1-score (The harmonic mean of precision and recall). The results are shown in
Table 1.
From the table above, it can be seen that the HAIVAN-ALSTM model
achieves better or equivalent results compared to other models on most eval-
uation criteria such as RMSE (1), accuracy (2), precision (3), recall (4), and
F1-score (5).
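For reference, the criteria above can be computed as in the sketch below, assuming scikit-learn and NumPy; the labels and flow values are toy data, not the study's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 1])      # toy congestion labels
y_pred = np.array([1, 0, 0, 1, 0, 1])
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))

y_obs = np.array([52.0, 47.5, 60.2, 55.1])  # toy observed flows
y_hat = np.array([50.0, 49.0, 58.0, 57.0])
print("RMSE     :", np.sqrt(np.mean((y_obs - y_hat) ** 2)))
```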
The respective MAEs of the HAIVAN-ALSTM, SAEs, and GRU models are
(7.224, 7.577, 7.376). In this case, HAIVAN-ALSTM has the lowest MAE, indi-
cating it has the lowest average absolute error and therefore is the most accurate
model according to this metric.
Meanwhile, the MSE values for the HAIVAN-ALSTM, SAEs, and GRU mod-
els are (100.947, 107.234, 103.015) with HAIVAN-ALSTM having the lowest
MSE, indicating that HAIVAN-ALSTM is less affected by large errors. The
RMSE values for the HAIVAN-ALSTM, SAEs, and GRU models are (10.047,
10.355, 10.150), showing that HAIVAN-ALSTM has the lowest RMSE, indicat-
ing that HAIVAN-ALSTM has the smallest average error when considering the
distribution of errors.
Therefore, based on these metrics, HAIVAN-ALSTM is currently the highest-
performing model in forecasting, with the lowest error rates in all three mea-
surements. GRU is second, while SAEs perform the least effectively. However,
selecting the appropriate model should not only be based on these metrics but
also consider other factors such as model complexity, training time and resources
required, and the specific characteristics of the problem.
The results presented in Fig. 3(a) demonstrate that, following a 15-min inter-
val, the observed roadway segment exhibits low traffic density (free-flowing). At
Fig. 3(c), the traffic density is high due to red light stops. In Fig. 3(b), the pre-
dicted traffic density at 5:15 AM is expected to be at a medium level.
5 Conclusion
References
1. De Jong, G., Daly, A., Pieters, M., Miller, S., Plasmeijer, R., Hofman, F.: Uncer-
tainty in traffic forecasts: literature review and new results for The Netherlands.
Transportation 34(4), 375–395 (2007)
2. Ketu, S., Mishra, P.K.: A hybrid deep learning model for covid-19 prediction and
current status of clinical trials worldwide. Comput. Mater. Continua 66(2) (2021)
3. Laña, I., Del Ser, J., et al.: Measuring the confidence of traffic forecasting models:
techniques, experimental comparison and guidelines towards their actionability.
arXiv preprint arXiv:2210.16049 (2022)
4. Maas, T., Bloem, P.: Uncertainty intervals for graph-based spatio-temporal traffic
prediction. arXiv preprint arXiv:2012.05207 (2020)
1 Introduction
The usual setting of the OP ([12]) involves each location being assigned a con-
stant score and visited at most once. The OP tour starts at a given departure
and visits a subset of locations within a given time limit. The objective of the
OP is to maximize the total score of the selected locations. However, it is not
easy to measure the scores as predefined values in many real-world applications
because of uncertain dependent factors. As such, the score can be estimated via a
choice model, which is more practical. The variant proposed in this paper replaces
the scores by a result of a Maximum Capture Problem (MCP). The MCP is a
Let us denote $[m] = \{1, 2, \ldots, m\}$ as the set of locations that can be used to locate new facilities and $N$ as the set of customer zones/types. Let $q_n$ be the number of customers at zone $n \in N$. Assume that $C$ represents the set of the competitor's facilities. Let $T_{max}$ be the time budget for the OP tour and $t_{ij}$ be the amount of time to travel from location $i$ to $j$. In our research, two tasks are handled at the same time: selecting a subset of locations $S \subset [m]$, which satisfies the customer demand $(q_n)$ as much as possible, to establish new facilities; and creating a tour which starts at a depot (0), visits several opened facilities within $T_{max}$, and returns to the depot.
Figure 1 visualizes the basic idea of our problem. In Fig. 1(a), there are 10 customer zones $(|N| = 10)$ shown in yellow. Each zone is depicted as a circle, with the size reflecting the number of customers $(q_n)$ (larger circles imply more
customers). The black dots represent the candidate locations $(m = 17)$ to open new facilities. The sets of customers in different zones have different utilities associated with each location. For example, the utilities of customers in zone A for nearer candidate locations are higher because they prefer a facility in their neighborhood, while those in zone B are opposed to opening a new facility near their houses; thus, their utilities are higher for farther locations. Assume that Fig. 1(b) shows an optimal solution wherein 6 facilities $(|S| = 6)$, which maximize customers' utilities, are opened. There is a new facility in zone A, while zone B does not contain any facility, based on the customers' utilities. In particular, a tour, denoted by black arrows, is constructed to visit each of the new facilities once. The total length of this tour must be less than or equal to a limit set by the decision-maker. This explains why only one facility can be opened in zone A.
part and can be modeled based on characteristics of the alternative and/or the decision-maker, and $\epsilon_{ni}$ is the random term, which is unknown to the analyst. Under the MNL model, the choice probability that the facility at location $i$ is chosen by an individual $n$ is
\[
P_n(i|S) = \frac{e^{v_{ni}}}{\sum_{j \in S} e^{v_{nj}}}
\]
Let $y$ be decision variables such that $y_{ij} = 1$ if there exists a visit from location $i$ to location $j$, and 0 otherwise, for all $i, j \in [m] \cup \{0\}$. To explore the potential of CP on the problem, we introduce a formulation using the “Circuit” statement as follows:
\[
\max_{x, y} \; f(x) = \sum_{n \in N} \frac{q_n \sum_{i \in [m]} V_{ni} x_i}{U_n^c + \sum_{i \in [m]} V_{ni} x_i} \qquad \text{(CP)}
\]
s.t.
\[
\mathrm{Circuit}(y_{ij};\ i, j \in [m] \cup \{0\}) \qquad (1)
\]
\[
\sum_{i \in [m]} y_{0i} = 1 \qquad (2)
\]
The “Circuit” statement adds a circuit constraint from a sparse list of arcs that encode the graph. A circuit is a unique Hamiltonian path in a subgraph of the total graph. In case a node $i$ is not in the path, there must be a loop arc $i \to i$ associated with a true literal; otherwise this constraint will fail. Constraint (1) assures that the $y$ variables take values in such a way as to form a valid tour, eventually with self-loops for the variables associated with unvisited locations. Constraint (2) ensures that the tour must include the depot (0), and constraints (3) guarantee that if a location contains a self-loop, it cannot be chosen to open a new facility. Constraint (4) presents the time budget restricting the OP tour.
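To make the construction concrete, the sketch below expresses constraints (1)-(4) with OR-Tools CP-SAT's AddCircuit; the distance matrix, time budget, and the fixed location weights standing in for the nonlinear MNL capture objective are illustrative assumptions only.

```python
from ortools.sat.python import cp_model

m = 5                                   # candidate locations 1..m; node 0 is the depot
T_max = 30
t = [[0, 4, 6, 8, 9, 7],
     [4, 0, 3, 5, 8, 6],
     [6, 3, 0, 4, 6, 5],
     [8, 5, 4, 0, 3, 4],
     [9, 8, 6, 3, 0, 2],
     [7, 6, 5, 4, 2, 0]]
weight = [0, 10, 7, 12, 5, 9]           # stand-in scores, not the MNL capture term

model = cp_model.CpModel()
x = [model.NewBoolVar(f"x_{i}") for i in range(m + 1)]
y, arcs = {}, []
for i in range(m + 1):
    for j in range(m + 1):
        if i == j == 0:
            continue                     # no self-loop for the depot: it must be visited
        y[i, j] = model.NewBoolVar(f"y_{i}_{j}")
        arcs.append((i, j, y[i, j]))

model.AddCircuit(arcs)                                            # constraint (1)
model.Add(sum(y[0, j] for j in range(1, m + 1)) == 1)             # constraint (2)
for i in range(1, m + 1):
    model.Add(x[i] + y[i, i] == 1)       # one reading of (3): self-loop <=> not opened
model.Add(sum(t[i][j] * y[i, j] for (i, j) in y if i != j) <= T_max)  # constraint (4)

model.Maximize(sum(weight[i] * x[i] for i in range(1, m + 1)))
solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("opened facilities:", [i for i in range(1, m + 1) if solver.Value(x[i])])
```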
In the traditional OP, a path starts from a depot, ends at a given destination, which can be different from the depot, and visits a subset of locations in $[m]$. Building upon previous research ([13, 29]), it is possible to represent the routing constraint by a set of linear constraints formulated according to the well-known Miller-Tucker-Zemlin (MTZ) formulation (see [23]). To adapt the requirement that the tour both starts and ends at the depot, we extend the set of locations $[m]$ to $M = \{0, 1, \ldots, m, m+1\}$, wherein locations 0 and $m+1$ are the depots. A mixed-integer nonlinear program (MINLP) for the problem is introduced as a baseline to compare to the (CP):
\[
\max_{x, y, p} \; f(x) = \sum_{n \in N} \frac{q_n \sum_{i \in [m]} V_{ni} x_i}{U_n^c + \sum_{i \in [m]} V_{ni} x_i} \qquad \text{(MINLP)}
\]
s.t. (4)
\[
\sum_{i=1}^{m+1} y_{0i} = \sum_{i=0}^{m} y_{i,m+1} = 1 \qquad (5)
\]
\[
\sum_{j=0}^{m} y_{ji} = \sum_{k=1}^{m+1} y_{ik} = x_i \qquad \forall i \in [m] \qquad (6)
\]
In the above formulation, constraints (5) and (6) establish the incoming and outgoing flow at each location. Constraints (5) guarantee that there exists exactly one arc starting from the depot and exactly one arc ending at the depot. Constraints (6) ensure that if location $i$ is chosen $(x_i = 1)$, there is exactly one incoming arc and one outgoing arc at $i$. Constraints (7) and (8) stand for the MTZ subtour elimination constraints.
3 Solution Methods
3.1 The Mixed-Integer Linear Programming Reformulation
The (MINLP) cannot be directly solved by any off-the-shelf solvers because its objective function is neither concave nor convex. However, it can be linearized to become a mixed-integer linear programming (MILP) formulation, which can be handled by commercial solvers such as CPLEX or Gurobi.
Let us first denote $s_n = \frac{1}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$ and $z_{ni} = x_i s_n$ for all $i \in [m], n \in N$, and then implement the following inequalities to represent the linear relation between $z_{ni}$, $x_i$ and $s_n$:
\[
z_{ni} \le s_n^U x_i \qquad \forall i \in [m], n \in N \qquad (9)
\]
\[
z_{ni} \ge s_n^L x_i \qquad \forall i \in [m], n \in N \qquad (10)
\]
\[
z_{ni} \le s_n + s_n^L (x_i - 1) \qquad \forall i \in [m], n \in N \qquad (11)
\]
\[
z_{ni} \ge s_n + s_n^U (x_i - 1) \qquad \forall i \in [m], n \in N \qquad (12)
\]
\[
\sum_{i \in [m]} V_{ni} z_{ni} + U_n^c s_n = 1 \qquad \forall n \in N \qquad (13)
\]
where $s_n^U = 1/U_n^c$ and $s_n^L = 1/(U_n^c + \sum_{i \in [m]} V_{ni})$ are upper and lower bounds of $s_n$.
Constraints (9), (10), (11), (12) restrict the values of variables $z_{ni}$, while constraints (13) are generated by combining $z_{ni} = x_i s_n$ and $s_n = \frac{1}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$.
s.t. (4), (5), (6), (7), (8), (9), (10), (11), (12), (13)
\[
s \in \mathbb{R}_+^{|N|}, \quad z \in \mathbb{R}^{|N| \times m}, \quad x \in \{0, 1\}^m, \quad y \in \{0, 1\}^{(m+2) \times (m+2)}, \quad p \in \mathbb{N}^{m+2}
\]
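A minimal gurobipy sketch of the linearization (9)-(13) on a toy instance is given below; the utilities, zone sizes, and the use of the equivalent linear capture objective $\sum_n q_n (1 - U_n^c s_n)$ are assumptions for illustration, and the routing constraints (4)-(8) are omitted.

```python
import gurobipy as gp
from gurobipy import GRB

V = [[0.6, 0.2, 0.9], [0.3, 0.8, 0.1]]     # V[n][i], toy zone-location utilities
q = [100, 80]                              # customers per zone
Uc = [1.0, 1.5]                            # competitor utility per zone
N, M = range(len(V)), range(len(V[0]))

sU = [1.0 / Uc[n] for n in N]                       # upper bound on s_n
sL = [1.0 / (Uc[n] + sum(V[n])) for n in N]         # lower bound on s_n

mdl = gp.Model("mcp_linearized")
x = mdl.addVars(M, vtype=GRB.BINARY, name="x")
s = mdl.addVars(N, lb=0.0, name="s")
z = mdl.addVars(N, M, lb=0.0, name="z")

for n in N:
    for i in M:
        mdl.addConstr(z[n, i] <= sU[n] * x[i])                # (9)
        mdl.addConstr(z[n, i] >= sL[n] * x[i])                # (10)
        mdl.addConstr(z[n, i] <= s[n] + sL[n] * (x[i] - 1))   # (11)
        mdl.addConstr(z[n, i] >= s[n] + sU[n] * (x[i] - 1))   # (12)
    mdl.addConstr(gp.quicksum(V[n][i] * z[n, i] for i in M)
                  + Uc[n] * s[n] == 1)                        # (13)

# Since s_n = 1/(U_n^c + sum_i V_ni x_i), the capture objective equals
# sum_n q_n (1 - U_n^c s_n), which is linear in s.
mdl.setObjective(gp.quicksum(q[n] * (1 - Uc[n] * s[n]) for n in N), GRB.MAXIMIZE)
mdl.optimize()
```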
\[
\Psi_n(x) = q_n - \frac{q_n U_n^c}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}, \qquad \forall n \in N
\]
It can be seen that each component $\frac{q_n U_n^c}{U_n^c + \sum_{i \in [m]} V_{ni} x_i}$ is convex in $x$. Therefore, $\Psi_n(x)$ is concave in $x$ for any $n \in N$. Then the following inequality holds for any solution $x \in \{0, 1\}^m$:
Besides the concavity, it is known that the objective function $f(S)$ is submodular. Then, $\Psi_n(S)$ is monotonically increasing and submodular in $S$. According to the properties of submodular functions, the following inequalities hold for any $n \in N$, $S \subseteq S' \subset [m]$, $k \in [m]$:
\[
\Psi_n(S + k) \ge \Psi_n(S)
\]
\[
\Psi_n(S + k) - \Psi_n(S) \ge \Psi_n(S' + k) - \Psi_n(S')
\]
The functions .ψnk (S) are often referred to as marginal gains, i.e., gains from
adding an item .k to the set .S. The submodular properties imply that .ψnk (S) ≥ 0
for any .S ⊂ [m] and .ψnk (S) ≥ ψnk (S ), ∀S ⊂ S ⊂ [m]. These properties offer
the following inequalities that hold for any subset .S, S ⊂ [m] (see [24]):
.Ψn (S) ≤ ψnk (S) − ψnk ([m] − k) + Ψn (S)
k∈([m]\S)∩S k∈S\S
Ψn (S) ≤ ψnk (∅) − ψlk (S − k) + Ψn (S)
k∈([m]\S∩S) k∈S\S
To transform the above inequalities into valid linear cuts, the marginal gains can be written in their binary representation as \psi_{nk}(x) = \Psi_n(x + e_k) - \Psi_n(x), where e_k is a vector of size m whose elements are zero except for the k-th element, which takes the value 1. The linear cuts deduced from the submodular inequalities are then:

\Psi_n(x) \le \Psi_n(\bar{x}) + \sum_{k \in [m]} \psi_{nk}(\bar{x}) (1 - \bar{x}_k) x_k - \sum_{k \in [m]} \psi_{nk}(e - e_k) \bar{x}_k (1 - x_k),          (16)
\Psi_n(x) \le \Psi_n(\bar{x}) + \sum_{k \in [m]} \psi_{nk}(0) (1 - \bar{x}_k) x_k - \sum_{k \in [m]} \psi_{nk}(\bar{x} - e_k) \bar{x}_k (1 - x_k),          (17)

where \bar{x} is the binary vector associated with the fixed subset S.
Some prior studies have demonstrated that a cutting plane method incor-
porating outer-approximation cuts can yield an optimal solution after a finite
number of iterations (see [4, 6]). The seminal work of [24] also demonstrates
that the submodular maximization problem can be reformulated equivalently
as a MILP whose constraints are submodular cuts generated at every point
within the binary domain. These allow us to sequentially generate the outer-
approximation and submodular cuts and add them to a master problem that is
solved iteratively until an optimal solution is obtained.
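As a hedged illustration of the quantities entering these cuts, the sketch below evaluates \Psi_n and the marginal gains \psi_{nk} at a candidate binary solution; the resulting coefficients could then be assembled into linear cuts of the form (16)–(17). All helper names and the toy data are ours, not the paper's.

```python
import numpy as np

def psi(u_c, V_n, x, q_n):
    """Psi_n(x) = q_n - q_n * U_n^c / (U_n^c + sum_i V_ni * x_i)."""
    return q_n - q_n * u_c / (u_c + V_n @ x)

def marginal_gains(u_c, V_n, x, q_n):
    """psi_nk(x) = Psi_n(x + e_k) - Psi_n(x); zero if location k is already chosen."""
    m = len(V_n)
    gains = np.zeros(m)
    base = psi(u_c, V_n, x, q_n)
    for k in range(m):
        if x[k] == 0:
            x_plus = x.copy()
            x_plus[k] = 1
            gains[k] = psi(u_c, V_n, x_plus, q_n) - base
    return gains

# Toy data (purely illustrative).
V_n, U_c, q_n = np.array([0.4, 1.2, 0.7]), 2.0, 10.0
x_bar = np.array([1, 0, 0])
print(psi(U_c, V_n, x_bar, q_n), marginal_gains(U_c, V_n, x_bar, q_n))
```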
For any feasible solution \bar{x}, the first master formulation is created by replacing the nonlinear objective function f(x) of the (CP) with a linear function g(\theta) = \sum_{n \in N} \theta_n, where each \theta_n is a non-negative variable whose value equals \Psi_n(\bar{x}). It can be written as follows:

max_{x,y,\theta}  g(\theta) = \sum_{n \in N} \theta_n          (Master – CP)
Outer-approximation and submodular cuts are then added at each candidate solution found. This process ends when the value of the linear approximation g(\bar{\theta}) computed from \bar{\theta} is sufficiently close to the value of the MNL objective f(\bar{x}) provided by \bar{x}.
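A minimal sketch of the resulting loop is given below, with the solver-specific parts abstracted away: `solve_master`, `evaluate_f` and `add_cuts` are placeholders standing for the CP-SAT master problem, the MNL objective evaluation and the cut generation, and the tolerance value is our choice rather than the paper's.

```python
def cutting_plane(solve_master, evaluate_f, add_cuts, tol=1e-6, max_iter=1000):
    """Iteratively solve the master problem and add outer-approximation /
    submodular cuts until the linear surrogate g(theta) matches f(x)."""
    x_bar, f_val = None, None
    for _ in range(max_iter):
        x_bar, theta_bar = solve_master()     # current master optimum
        g_val = sum(theta_bar)                # linear approximation g(theta)
        f_val = evaluate_f(x_bar)             # true MNL objective f(x)
        if g_val - f_val <= tol:              # g over-estimates f, so stop when the gap closes
            return x_bar, f_val
        add_cuts(x_bar)                       # tighten the master and repeat
    return x_bar, f_val
```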
4 Experiment Results
Since there are no benchmark instances available in the literature for the proposed problem, we generate a new set of benchmark instances to test the performance of the methods. We reuse the OP instances provided by [1] (available at: https://
or-brescia.unibs.it/instances) from which we take the coordinates of locations,
the distance matrix between any two vertices, and the value of .Tmax . To create
customer utilities, we use instances from three datasets ORlib, HM14, and NYC,
which are widely used in prior MCP studies. The set ORlib contains 15 instances
with the number of zones varying in {50, 100, 200, 400, 800} and the number
of locations in {25, 50, 100}, while HM14 includes 3 instances with 1000 zones
and 100 locations. The last set NYC has a single instance coming from a large-
scale Park-and-Ride location problem in New York City, with 82,341 customer
zones and 58 locations. The number of locations in the MCP data is modified to
match the number of vertices in the OP data. For example, 14 locations and their
corresponding utilities are randomly sampled from the ORlib instance of 25, 50,
or 100 locations, and then they are combined with the coordinates of vertices in
the OP instance burma14 containing 14 vertices to create a new instance.
A 64-bit Windows machine with an AMD Ryzen 7 3700X 8-core processor, 3.60 GHz, and 16 GB of RAM is used to run the experiments. All models are implemented in C++. We link to CPLEX to formulate the (MILP) model and to design the cutting plane algorithm for the (Master − MINLP) formulation. The CP-SAT solver of Google OR-Tools (https://developers.google.com/optimization/) is used to implement the (Master – CP) embedded in the cutting plane. Note that CP solvers significantly outperform MILP solvers when it comes to leveraging multi-threaded computation; this advantage arises from the distinct solving routines employed by CP solvers. Moreover, modern home computers and laptops feature multi-core architectures, making parallel computation accessible. Therefore, we evaluate how multi-threading impacts solver performance when tackling the OP under the MNL model. We use 8 threads with a time limit of one hour for each instance.
Figures 2, 3 and 4 visualize the comparison between the (MILP) formulation and the formulations embedded into the cutting plane framework. Overall, the cutting plane outperforms the (MILP) in terms of both the number of optimal solutions found and the average runtime. The master problem (Master – CP) (yellow line) provides the most optimal solutions, followed by the (Master − MINLP) (red line), on all datasets. The average runtime for each OP instance is calculated over the runtimes of all related MCP instances, including those yielding optimal and non-optimal solutions. It can be seen that the average runtime of the cutting plane algorithm with the (Master – CP) is far lower than that of the others. While the average runtime of the (Master − MINLP) or (MILP) fluctuates
Fig. 2. The results of the (MILP) model and the cutting plane algorithm on the ORlib
dataset (Color figure online)
Fig. 3. The results of the (MILP) model and the cutting plane algorithm on the HM14
dataset (Color figure online)
Fig. 4. The results of the (MILP) model and the cutting plane algorithm on the NYC
dataset (Color figure online)
wildly, that of the (Master – CP) increases only slowly until it handles the OP instances with over 70 vertices in Figs. 2 and 3 and over 29 vertices in Fig. 4. The runtimes of the (Master − MINLP) and (MILP) are competitive on small OP instances with utilities taken from the ORlib and NYC datasets, while the (Master − MINLP) outperforms the (MILP) on OP instances containing more than 42 vertices and on all OP instances with utilities from the NYC dataset.
The detailed results of the cutting plane framework are shown in Tables 1, 2 and 3. In each table, the first two columns present the OP instance's name and the number of instances created by combining an OP instance, ranging from 14 to 100 vertices, with an MCP instance. The third and fourth columns show the number of optimal solutions found by solving the cutting plane with the (Master − MINLP) master problem and the average runtime computed over only those instances solved to optimality. Similarly, the last two columns report the results of the (Master – CP). Best values (the largest number of optimal solutions, or the lowest average computing time) are shown in bold.
Table 1. The results of the cutting plane algorithm on the ORlib dataset

ORlib                 (Master − MINLP)         (Master – CP)
OP          #Instances  #Optimal  Time(s)   #Optimal  Time(s)
burma14     45          45        1.44      45        0.88
ulysses16   45          44        101.55    45        1.44
gr17        45          45        13.72     45        2.03
gr21        45          45        24.92     45        4.09
ulysses22   45          34        1351.33   45        5.11
gr24        45          44        684.14    45        6.90
fri26       30          9         3087.12   30        10.04
bays29      30          30        121.90    30        9.54
dantzig42   30          0         -         30        39.69
swiss42     30          29        485.80    30        28.81
att48       30          3         3431.18   30        54.32
gr48        30          14        2638.57   30        57.78
hk48        30          3         3418.90   30        88.16
eil51       15          13        1249.77   15        66.60
berlin52    15          14        1048.20   15        64.65
brazil58    15          1         3479.18   15        184.07
st70        15          0         -         15        458.61
eil76       15          2         3293.42   15        330.69
pr76        15          0         -         12        1155.54
gr96        15          0         -         14        1075.44
rat99       15          0         -         10        2225.71

Table 2. The results of the cutting plane algorithm on the HM14 dataset

HM14                  (Master − MINLP)         (Master – CP)
OP          #Instances  #Optimal  Time(s)   #Optimal  Time(s)
burma14     9           9         0.93      9         3.76
ulysses16   9           9         18.38     9         11.75
gr17        9           9         5.62      9         7.64
gr21        9           9         22.33     9         21.21
ulysses22   9           5         2297.35   9         37.89
gr24        9           9         765.40    9         56.66
fri26       9           2         3302.69   9         41.76
bays29      9           9         63.56     9         47.22
dantzig42   9           0         -         9         237.01
swiss42     9           9         312.47    9         132.42
att48       9           4         -         9         202.55
gr48        9           4         2542.17   9         208.24
hk48        9           1         3457.62   9         207.44
eil51       9           9         5.00      9         180.08
berlin52    9           8         1030.15   9         242.03
brazil58    9           3         2555.73   9         446.76
st70        9           0         -         9         1088.84
eil76       9           1         3233.93   9         435.35
pr76        9           0         -         6         1736.46
gr96        9           0         -         9         1490.20
rat99       9           0         -         6         2519.48
kroA        9           0         -         9         739.41
kroB        9           0         -         6         2394.47
kroC        9           0         -         9         1311.41
From the tables, we see that the (Master – CP) outperforms the (Master − MINLP) in terms of both the number of optimal solutions and the average runtime. The MTZ subtour elimination constraints show their disadvantage when tackling instances with a medium or large number of locations, while the “Circuit” statement handles them effectively. For example, the (Master − MINLP) only solves to optimality all instances with up to 21 locations with utilities taken from the ORlib dataset, while the (Master – CP) can handle instances with up to 76 locations. The same phenomenon occurs when solving instances related to the HM14 dataset. In the case of the NYC dataset, which contains 82,341 customer zones, far more than the number of zones in ORlib and HM14, the (Master – CP) successfully handles instances with up to 48 locations, while the (Master − MINLP) can only solve small instances with 14, 16, 17, 21, 22 and 29 locations.
5 Conclusion
In this paper, we have considered a new variant of the OP in which each location is associated with customer demand modeled by an MNL discrete choice model. To address this challenging problem, we have explored two types of cuts, outer-approximation and submodular cuts, to handle the objective function, and constraint programming to tackle the routing part. Experiments conducted on instances of various sizes demonstrate the superiority of our cutting plane approaches, which can solve to optimality instances with a very large number of customer zones. Future work will be dedicated to incorporating more complex and practical routing settings, such as time windows, mandatory visits, and multiple vehicles (Team OP), or to exploring more advanced choice models, such as nested or cross-nested models.
References
1. Angelelli, E., Archetti, C., Filippi, C., Vindigni, M.: The probabilistic orienteering
problem. Comput. Oper. Res. 81, 269–281 (2017)
2. Archetti, C., Carrabs, F., Cerulli, R.: The set orienteering problem. Eur. J. Oper.
Res. 267(1), 264–272 (2018)
3. Benati, S., Hansen, P.: The maximum capture problem with random utilities: Prob-
lem formulation and algorithms. Eur. J. Oper. Res. 143(3), 518–530 (2002)
4. Bonami, P., et al.: An algorithmic framework for convex mixed integer nonlinear
programs. Discret. Optim. 5(2), 186–204 (2008)
5. Dolinskaya, I., Shi, Z.E., Smilowitz, K.: Adaptive orienteering problem with
stochastic travel times. Transport. Res. Part E: Logist. Transport. Rev. 109, 1–19
(2018)
6. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Math. Program. 36, 307–339 (1986)
7. Faigl, J., Pěnička, R.: On close enough orienteering problem with dubins vehicle.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
pp. 5646–5652 (2017)
8. Feillet, D., Dejax, P., Gendreau, M.: Traveling salesman problems with profits.
Transp. Sci. 39(2), 188–205 (2005)
9. Freeman, N.K., Keskin, B.B., Çapar, İ.: Attractive orienteering problem with proximity and timing interactions. Eur. J. Oper. Res. 266(1), 354–370 (2018)
10. Freire, A., Moreno, E., Yushimito, W.: A branch-and-bound algorithm for the
maximum capture problem with random utilities. Eur. J. Oper. Res. 252(1), 204–
212 (2016)
11. Gedik, R., Kirac, E., Bennet Milburn, A., Rainwater, C.: A constraint programming
approach for the team orienteering problem with time windows. Comput. Indust.
Eng. 107, 178–195 (2017)
12. Golden, B.L., Levy, L., Vohra, R.: The orienteering problem. Naval Res. Logist.
(NRL) 34(3), 307–318 (1987)
13. Gunawan, A., Lau, H.C., Vansteenwegen, P.: Orienteering problem: a survey of
recent variants, solution approaches and applications. Eur. J. Oper. Res. 255(2),
315–332 (2016)
14. Haase, K.: Discrete location planning (2009). http://hdl.handle.net/2123/19420
15. Haase, K., Müller, S.: A comparison of linear reformulations for multinomial logit
choice probabilities in facility location models. Eur. J. Oper. Res. 232(3), 689–691
(2014)
16. Hu, W., Fathi, M., Pardalos, P.M.: A multi-objective evolutionary algorithm based
on decomposition and constraint programming for the multi-objective team orien-
teering problem with time windows. Appl. Soft Comput. 73, 383–393 (2018)
17. Jorgensen, S., Chen, R.H., Milam, M.B., Pavone, M.: The matroid team surviving
orienteers problem: Constrained routing of heterogeneous teams with risky traver-
sal. In: IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), pp. 5622–5629 (2017)
18. Jorgensen, S., Chen, R.H., Milam, M.B., Pavone, M.: The team surviving orien-
teers problem: routing teams of robots in uncertain environments with survival
constraints. Auton. Robots 42(4), 927–952 (2018)
19. Kirac, E., Gedik, R., Oztanriseven, F.: Solving the team orienteering problem with
time windows and mandatory visits using a constraint programming approach. Int.
J. Oper. Res. 46, 20–42 (2023)
20. Ljubić, I., Moreno, E.: Outer approximation and submodular cuts for maximum
capture facility location problems with random utilities. Eur. J. Oper. Res. 266(1),
46–56 (2018)
21. Mai, T., Lodi, A.: A multicut outer-approximation approach for competitive facility
location under random utilities. Eur. J. Oper. Res. 284(3), 874–881 (2020)
22. Meyer, F., Glock, K.: Kinematic orienteering problem with time-optimal trajecto-
ries for multirotor UAVs. IEEE Robot. Autom. Lett. 7(4), 11402–11409 (2022)
23. Miller, C.E., Tucker, A.W., Zemlin, R.A.: Integer programming formulation of
traveling salesman problems. J. ACM 7(4), 326–329 (1960)
24. Nemhauser, G.L., Wolsey, L.A.: Maximizing submodular set functions: formula-
tions and analysis of algorithms. In: North-Holland Mathematics Studies, vol. 59,
pp. 279–301. Elsevier (1981)
25. Pěnička, R., Faigl, J., Saska, M.: Physical orienteering problem for unmanned
aerial vehicle data collection planning in environments with obstacles. IEEE Robot.
Autom. Lett. 4(3), 3005–3012 (2019)
26. Pěnička, R., Faigl, J., Váňa, P., Saska, M.: Dubins orienteering problem. IEEE
Robot. Autom. Lett. 2(2), 1210–1217 (2017)
27. Train, K.: Discrete Choice Methods with Simulation. Cambridge University Press
(2003)
28. Tsiogkas, N., Lane, D.M.: Dcop: Dubins correlated orienteering problem optimizing
sensing missions of a nonholonomic vehicle under budget constraints. IEEE Robot.
Autom. Lett. 3(4), 2926–2933 (2018)
29. Vansteenwegen, P., Souffriau, W., Oudheusden, D.V.: The orienteering problem: a
survey. Eur. J. Oper. Res. 209(1), 1–10 (2011)
30. Zhang, Y., Berman, O., Verter, V.: The impact of client choice on preventive healthcare facility network design. OR Spectrum 34, 349–370 (2012)
Operations Research
Cost Optimization in Competitive Facility
Location Under General Demand Model
1 Introduction
Competitive facility location (CFL) has been a key decision-making problem in
modern transportation and logistics, typically involving the selection of opti-
mal locations and the allocation of financial resources for establishing new facil-
ities. The primary goal is to either maximize profit (e.g., by capturing customer
demand or increasing revenue) or minimize costs (e.g., operational or transporta-
tion expenses). In this study, we address a cost optimization problem in facility location within a competitive market, focusing on models that utilize dis-
crete choice frameworks like the random utility maximization (RUM) approach
[7, 24, 31] to predict customer behavior. In this setting, customers are assumed
to choose facilities by maximizing their perceived utility, which depends on facil-
ity features (e.g., service quality or transport costs) and customer characteristics
(e.g., age, income, gender). The RUM models are widely used and effective for pre-
dicting human choices in transportation-related contexts [5, 28]. In the context of
facility cost optimization, prior research has primarily relied on the classical multi-
nomial logit (MNL) model [14], which is one of the most popular choice models in
demand modeling. However, the MNL model has a significant limitation known as
the Independence of Irrelevant Alternatives (IIA) property, which assumes that
the ratio of choice probabilities between two facilities remains unaffected by the
presence of other facilities. This assumption often fails in real-world scenarios,
reducing the model’s effectiveness in accurately capturing customer behavior. To
overcome this limitation, several studies have proposed more advanced models,
such as the nested logit (NL) [3, 4], Generalized Extreme Value (GEV) [26], and
mixed-logit models (MMNL) [29]. For example, the NL model groups locations
into distinct subsets (nests), relaxing the IIA property for choices within the same
nest, though it still applies to alternatives from different nests. In this paper,
we examine the cost optimization in facility location under the cross-nested logit
(CNL) model [33], one of the most flexible and general RUM models. The CNL
model extends the NL model by allowing facilities to belong to multiple overlap-
ping nests, thereby fully relaxing the IIA property. This flexibility enables the
CNL model to approximate any RUM model with high precision [8, 17]. To the
best of our knowledge, this is the first study to apply such a general model to the cost optimization problem in facility location.
The main challenge in facility cost optimization problem is allocating a fixed
budget across predetermined locations to maximize customer demand. Higher
investment in a facility increases its attractiveness and utility to customers. Open-
ing and operating costs directly impact a facility’s ability to provide services and
attract demand. Our goal is to optimize how the budget is allocated to each facil-
ity, assuming the potential locations have already been selected. We model cus-
tomer behavior as a function of costs, where a higher budget leads to better service
capacity and a greater likelihood of being chosen. The problem then can be for-
mulated as a nonlinear optimization problem with continuous variables. Previous
studies show that this problem is highly non-convex with multiple local optima
even with only one nest [13]. To address this, we use a piecewise linear approxi-
mation and then reformulate the approximation problem as a mixed-integer non-
linear convex program, solvable using an outer-approximation method. We also
apply a projected gradient descent (PGD) algorithm with adaptive step size for
the original problem to evaluate the performance of our approach.
Paper Outline: The paper is organized as follows. Section 2 provides a litera-
ture review. Section 3 discusses the problem formulation. Section 4 presents our
solution methods. Section 5 presents our experimental results, and finally, Sect. 6
concludes the paper.
Notation: Boldface characters represent matrices (or vectors), and .ai denotes
the .i-th element of vector .a if it is indexable. We use .[m], for any .m ∈ N, to
denote the set .{1, . . . , m}.
2 Literature Review
The RUM framework has been extensively studied since the 1980s and applied
in numerous fields [25, 27, 31]. In facility location cost optimization under RUM
models, most studies use the MNL model to estimate customer demand. For
example, [14] was the first to propose a cost optimization in CFL problem under
the MNL model, utilizing a CONIC and Mixed-Integer Linear Programming
(MILP) reformulation for a piecewise approximation problem. They also propose
an outer-approximation approach to solve the approximation problem.
Regarding the CNL customer choice model, the model has been primarily
introduced and explored within the transportation research community [6, 30,
32]. [8] provided the theoretical groundwork for the CNL model, demonstrating
its inclusion in the GEV family [26] and introducing a novel estimation method
using nonlinear programming, as opposed to earlier heuristic approaches. The
CNL model is highly flexible, capable of approximating any RUM model with
arbitrary precision [17], as well as the general ranking preference model [1, 21]. Its
versatility has led to successful applications in various transportation problems,
including mode choice [34], departure time choice [11], route choice and revenue
management [19, 22, 23], air travel management [12], and more recently, location
choice in international migration [2]. The CNL model consistently outperforms
other choice models, such as the MNL and NL, in demand modeling [2, 11].
To the best of our knowledge, this study is the first to apply the CNL model
to solve the cost optimization problem in CFL. Our work is closely related to
the work of [14], which addressed cost optimization in CFL using the MNL
model. Their approach guarantees near-optimal solutions through piecewise lin-
ear approximation, which can be easily reformulated into a MILP or CONIC
model after the approximation procedure. However, it still theoretically requires
solving problems of infinite size to achieve an exact solution, as the problem size
depends on the precision of the solution.
In this study, we examine cost optimization in the CFL problem, where a “new-
comer” company seeks to enter a market already dominated by an established
competitor. The company has already planned the locations for opening facili-
ties, and it aims to allocate its budget across its stores to attract customers away
from the existing supermarkets. The main goal is to capture a portion of the mar-
ket share by drawing customers to the new facilities, achieved by optimizing the
investment in each location to maximize expected customer demand.
First, we need to consider the attraction of facilities to customer demand. In
real-world scenarios, estimating customer demand is challenging and inherently
uncertain. To address this, we explore the CFL problem using discrete choice
models, as described in [31], to estimate and predict customer behavior. Among
the various demand modeling approaches, the RUM framework [31] is the most
commonly used to model discrete choice behavior. This method is based on
the theory that a decision-maker’s preference for an option is represented by a
random utility, leading customers to choose the option that offers the highest
utility. According to the RUM framework [16, 26], the probability of individual .t
choosing an option .i from a given choice set .S is determined by .P (uti ≥ utj , ∀j ∈
S), indicating that the individual will opt the alternative with the greatest utility.
The random utility for a given location i is defined as u_{ti} = v_{ti} + \epsilon_{ti}, where v_{ti} represents the deterministic component derived from observable factors, and \epsilon_{ti} represents the random component that is unobservable.
In this work, we adopt the CNL model, which is renowned for its flexibility
and general applicability within RUM frameworks. To formulate the problem,
let .[m] = {1, 2, . . . , m} represent the set of all available locations that will be
invested, and let .[T ] = {1, . . . , T } represent customer types, which can be defined
by factors such as geography, age, income, or gender. Additionally, let .C denote
the set of existing facilities owned by competitors. Under the CNL model, these
locations can be assigned to different subsets (or “nests”). For each customer
type .t ∈ [T ], we assume that .[m] ∪ C can be assigned to .N nests, denoted as
N_1^t, N_2^t, ..., N_N^t, based on shared attributes or characteristics. Importantly, these nests are not necessarily disjoint, meaning each facility/location i can belong to multiple nests. We also use a non-negative quantity \alpha_{in}^t to capture the level of membership of location i in nest N_n^t, where \sum_{n \in [N]} \alpha_{in}^t = 1 for all i \in [m] and \alpha_{in}^t is equal to 0 if facility i is not in nest N_n^t.
Let .vti be the deterministic utility for location .i ∈ [m] ∪ C and customer
type .t ∈ [T ], which can be modeled as a function of both customer and loca-
tion characteristics. The parameters of this utility function can be estimated
through the choice model inference [31]. Specifically, the deterministic utility for
customers from zone .t choosing location .i is defined as .vti = ati xi + bti for all
.t ∈ [T ], i ∈ [m], where .ati represents the sensitivity of customer in zone .t to
the cost invested in facility .i, .xi is investment in facility .i, and .bti accounts for
other factors affecting customer choice, such as distance or parking availability.
The parameter .ati plays a critical role in our model, as it captures the impact
of investment on customer attraction based on proximity. If a customer in zone
.t is far from facility .i, the investment in that facility will have a limited effect
on attracting those customers. On the other hand, if facility .i is close to zone .t,
increasing the budget for that facility will significantly boost its appeal. There-
fore, higher investment .xi generally increases the attractiveness of the facility .i,
and thus, .ati is expected to be positive.
The choice process in the CNL model is conceptualized as a two-stage process
where a customer first selects a nest and then chooses a location or facility within
that nest. Let .S ⊆ [m] represent the set of available locations, and the probability
that a customer selects a nest N_n^t, for any n \in [N], is given by

P(N_n^t | S) = \frac{W_{tn}^{\sigma_{tn}}}{\sum_{n' \in [N]} W_{tn'}^{\sigma_{tn'}}},

where W_{tn} = \sum_{i \in N_n^t \cap (S \cup C)} \alpha_{in}^t e^{v_{ti}/\sigma_{tn}} is the total utility of the alternatives in nest N_n^t, and \sigma_{tn} is the dissimilarity parameter, typically assumed to vary within the unit interval to maintain consistency with the RUM framework [8, 26].
In the second stage, the customer decides to select a facility i \in S from the chosen nest N_n^t with probability

P(i | N_n^t) = \frac{\alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{W_{tn}},  \forall i \in S.

The overall probability that a customer of type t selects facility i is therefore

P_t(i | S) = \sum_{n \in [N]} P(N_n^t | S) P(i | N_n^t) = \sum_{n \in [N]} \frac{W_{tn}^{\sigma_{tn}}}{\sum_{n' \in [N]} W_{tn'}^{\sigma_{tn'}}} \times \frac{\alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{W_{tn}}
           = \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \alpha_{in}^t e^{v_{ti}/\sigma_{tn}}}{\sum_{n' \in [N]} W_{tn'}^{\sigma_{tn'}}},  \forall i \in S.
By applying the law of total expectation and replacing v_{ti} with the utility function v_{ti} = a_{ti} x_i + b_{ti}, the expected captured market share given by the set of available locations S is expressed as

F(x, S) = \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \left( \sum_{i \in S} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \right)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}},

where q_t is the total demand of customers of type t, and x = (x_1, x_2, ..., x_m) \in R_+^m represents the investment costs for the available locations. This problem is commonly referred to as the Maximum Capture Problem (MCP) [7]. Under the assumption that the list of new facilities to be opened has already been determined, it is without loss of generality to assume S = [m]. Then, we can formulate the cost optimization as the following nonlinear program:

max_x  F(x) = \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \left( \sum_{i \in [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \right)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}          (CFL-CNL)

subject to  \sum_{i \in [m]} x_i \le B,
            x_i \in [L_i, U_i],  \forall i \in [m],

where W_{tn} = \sum_{i \in N_n^t \cap ([m] \cup C)} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}}, \forall t \in [T], n \in [N]. Here, B is the maximum budget to spend on the new facilities, and L_i and U_i are the lower and upper bounds on the investment in facility i \in [m]. It is required that \sum_{i \in [m]} L_i \le B to ensure feasibility.
Let U_{tn}^c = \sum_{i \in C \cap N_n^t} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}}; we can then rewrite the objective function of (CFL-CNL) as

F(x) = \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \left( \sum_{i \in [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \right)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}
     = \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} \left( \sum_{i \in N_n^t \cap [m]} \alpha_{in}^t e^{(a_{ti} x_i + b_{ti})/\sigma_{tn}} \right)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}
     = \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} (W_{tn} - U_{tn}^c)}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}
     = \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \frac{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}-1} U_{tn}^c}{\sum_{n \in [N]} W_{tn}^{\sigma_{tn}}}.
4 Solution Methods
For ease of notation, let f_{tni}(x) = e^{(a_{ti} x + b_{ti})/\sigma_{tn}}, where t \in [T] and n \in [N] is such that N_n^t is a nest to which facility i belongs. We can see that f_{tni}(x) is convex and always takes positive values. Each decision variable x_i can vary within the interval [L_i, U_i]. Therefore, similarly to the piecewise-linear approximation method proposed by [14], we divide each interval [L_i, U_i] into K successive closed sub-intervals of equal length \Delta_i = (U_i - L_i)/K, represented as [c_k^i, c_{k+1}^i] for k \in [K], where c_k^i = L_i + (k - 1)\Delta_i for all k \in [K + 1]. Then, we can represent each x_i through binary variables w_{ik}, k \in [K], as x_i \approx L_i + \Delta_i \sum_{k \in [K]} w_{ik}: if k^* \in [K] is such that c_{k^*}^i \le x_i < c_{k^*+1}^i, then w_{ik} = 1 for all k < k^* and w_{ik} = 0 for all k \ge k^*. We then approximate f_{tni}(x_i) as

f_{tni}(x_i) \approx \hat{f}_{tni}(x_i) = f_{tni}(L_i) + \Delta_i \sum_{k \in [K]} \gamma_{ik}^{tn} w_{ik},

where \gamma_{ik}^{tn} = (f_{tni}(c_{k+1}^i) - f_{tni}(c_k^i))/\Delta_i are the slopes of f_{tni}(x) on the sub-intervals [c_k^i, c_{k+1}^i], k \in [K]. We can then approximate W_{tn} as

W_{tn} \approx \hat{W}_{tn} = U_{tn}^c + \sum_{i \in N_n^t \cap [m]} \left( f_{tni}(L_i) + \Delta_i \sum_{k \in [K]} \gamma_{ik}^{tn} w_{ik} \right).
Let \alpha_{tn} = U_{tn}^c + \sum_{i \in N_n^t \cap [m]} f_{tni}(L_i) and \beta_{tnik} = \Delta_i \gamma_{ik}^{tn} for all t \in [T], n \in [N], i \in N_n^t \cap [m], k \in [K]. Then, the approximation of the cost optimization in the CFL problem can be formulated as

max_w  \hat{F}(w) = \sum_{t \in [T]} q_t - \sum_{t \in [T]} q_t \cdot \frac{\sum_{n \in [N]} \hat{W}_{tn}^{\sigma_{tn}-1} U_{tn}^c}{\sum_{n \in [N]} \hat{W}_{tn}^{\sigma_{tn}}}          (MCP-Approx)

s.t.  \hat{W}_{tn} = \alpha_{tn} + \sum_{i \in N_n^t} \sum_{k \in [K]} \beta_{tnik} w_{ik},  \forall t \in [T], n \in [N],
      \sum_{i \in [m]} \left( L_i + \Delta_i \sum_{k \in [K]} w_{ik} \right) \le B.
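The breakpoints and slopes defining this approximation can be precomputed directly, as in the following sketch for a single (t, n, i) triple; the helper names are illustrative only.

```python
import numpy as np

def piecewise_coefficients(a_ti, b_ti, sigma_tn, L_i, U_i, K):
    """Return f_tni(L_i), the slopes gamma_ik^tn of the K equal sub-intervals,
    and the sub-interval length used in the piecewise-linear approximation of
    f_tni(x) = exp((a*x + b)/sigma)."""
    delta = (U_i - L_i) / K
    c = L_i + delta * np.arange(K + 1)              # breakpoints c_1^i, ..., c_{K+1}^i
    f = np.exp((a_ti * c + b_ti) / sigma_tn)        # f_tni evaluated at the breakpoints
    gamma = (f[1:] - f[:-1]) / delta                # slope on each sub-interval
    return f[0], gamma, delta

def approx_f(x_i, a_ti, b_ti, sigma_tn, L_i, U_i, K):
    """Evaluate the piecewise-linear surrogate at x_i (useful for checking the fit)."""
    f0, gamma, delta = piecewise_coefficients(a_ti, b_ti, sigma_tn, L_i, U_i, K)
    k_star = min(int((x_i - L_i) // delta), K)      # sub-interval index, clipped at the top
    w = np.zeros(K)
    w[:k_star] = 1.0                                # w_ik = 1 for k < k*, 0 otherwise
    return f0 + (delta * gamma * w).sum()
```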
4.2 Convexification
It can further be shown that the above nonlinear non-convex program can be reformulated as the following optimization problem:

max_{w,y,z,\theta,W}  \theta          (MCP-Reform)

s.t.  \theta \le \sum_{t \in [T]} q_t - \sum_{t \in [T]} \sum_{n \in [N]} q_t \exp(y_{tn} - z_t),          (4)
      y_{tn} \ge (\sigma_{tn} - 1) \log(W_{tn}) + \log(U_{tn}^c),  \forall t \in [T], n \in [N],          (5)
      z_t \le \log\left( \sum_{n \in [N]} W_{tn}^{\sigma_{tn}} \right),  \forall t \in [T],          (6)
      W_{tn} = \alpha_{tn} + \sum_{i \in N_n^t \cap [m]} \sum_{k \in [K]} \beta_{tnik} w_{ik},  \forall t \in [T], n \in [N],
      \sum_{i \in [m]} \left( L_i + \Delta_i \sum_{k \in [K]} w_{ik} \right) \le B.
5 Numerical Experiments
In this section, we present experimental results to evaluate the performance of
our proposed methods for solving the cost optimization in CFL problem under
the CNL model. We first present the settings of the experiment, followed by a
comparison of the results between the approximation approach and the direct
application of PGD in non-convex settings to evaluate the performance of the
proposed algorithm.
formly generated in the ranges [0, 5] and [5, 10], respectively. The maximum budget is chosen between the total lower bound and the total upper bound of the investment costs: we define a parameter \delta for the maximum budget B such that B = \sum_{i \in [m]} L_i + (\sum_{i \in [m]} (U_i - L_i)) \times \delta. The parameter \delta is chosen in {0.1, 0.3, 0.5}. Consequently, each problem setting has a total of 30 instances.
that each nest contains at least two distinct locations. The allocation parameter \alpha_{in}^t for i \in [m] \cup C is randomly generated from a uniform distribution over the interval [0, 1]. These parameters are normalized such that \alpha_{in}^t = 0 if location i is not a member of nest N_n^t and \sum_{n \in [N]} \alpha_{in}^t = 1. Different customer types have distinct sets of available locations within each nest.
In this experiment, we compare our approach, specifically CP, which solves the mixed-integer convex program (MCP-Reform) using outer-approximation cuts. We also include a PGD algorithm with the ADAM optimizer [18] as the baseline for comparison. This algorithm starts from the lower bounds of the costs and iteratively selects locations one by one to maximize the objective value. By considering this gradient-based method, we aim to compare the performance of a heuristic approach with our approximation method. The learning rate of PGD is set to 10^{-2} and the algorithm is executed for 10000 epochs. The parameters of the ADAM optimizer are as follows: \beta_1 = 0.9, \beta_2 = 0.999, \epsilon = 10^{-3} [35].
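A hedged sketch of such a baseline is shown below: projected gradient ascent with Adam updates on the box- and budget-constrained problem. The clip-then-rescale projection heuristic is an assumption of ours, not necessarily the projection used in the paper.

```python
import numpy as np

def pgd_adam(grad_F, L, U, B, lr=1e-2, epochs=10000,
             beta1=0.9, beta2=0.999, eps=1e-3):
    """Maximize F(x) over {L <= x <= U, sum(x) <= B} with Adam steps and projection."""
    x = L.copy()                          # start from the lower bounds
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, epochs + 1):
        g = grad_F(x)                     # ascent direction (gradient of F)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x + lr * m_hat / (np.sqrt(v_hat) + eps)
        # Projection heuristic: clip to the box, then push excess budget back.
        x = np.clip(x, L, U)
        excess = x.sum() - B
        if excess > 0:
            x = np.clip(x - excess / len(x), L, U)
    return x
```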
The algorithms were written in C++, and the MILP models were solved by Gurobi version 11.0.3, with the number of CPUs set to 8 cores. All experiments were run on a 13th Gen Intel(R) Core(TM) i5-13500 @ 4.8 GHz with 64 GB of RAM. The CP algorithm terminates when it reaches the maximum number of iterations nb_iter = 10000 or when the running time exceeds 3600 seconds. The optimality gap \epsilon is set to 10^{-4}. The number of intervals K in the approximation procedure is set to 20.
Table 1. Numerical results for cost optimization in competitive facility location under
CNL model, average computing times are given in seconds.
In terms of optimality, our method always returns the optimal solution for
the approximation problem, and these solutions outperform the local optima
obtained by PGD when .δ ≥ 30%. However, in larger instances, particularly with
.T = 200 and .δ = 10%, the cutting plane algorithm returns optimal solutions
for only 8 instances when .m = 25, .1 instance for .m = 50 and .4 instances
for .m = 100. Regarding the computing time, as shown in Table 1, the PGD
algorithm converges in less than .100 seconds for most problem instances, while
the cutting plane algorithm takes longer to find optimal solutions, especially for
lower .δ. Specifically, in larger-sized instances with .T ≥ 100, m ≥ 50 and .δ = 10%,
our method takes more than 1500 s, whereas for instances with .δ ≥ 30%, it
consistently takes less than .250 seconds.
To analyze the impact of budget, we conduct some small experiments with
.T = 50, examining the relative change in objective value and average computing
time of our approach as .δ varies. Figure 1 shows the results for .δ ranging from
.0.1 to .1. In Fig. 1a, we observe that as the budget increases (i.e., higher .δ), the
expected captured market share rises. Notably, when .δ ≤ 50%, the objective
value grows more rapidly than when .δ > 50%, suggesting that a few key facili-
ties have the largest impact on customer demand. Thus, it is more effective to
concentrate the budget on these facilities. For the remaining facilities, increas-
ing investment yields diminishing returns in terms of customer attraction. This
explains why we chose .δ ∈ {0.1, 0.3, 0.5} for the main experiments. In terms of
computation time, the problem becomes harder to solve as .δ decreases, indicat-
ing that with a limited budget, the cutting plane method requires more time to
identify the most impactful facilities.
Fig. 1. Relative change of the objective value and average computing time of cutting
plane algorithm with respect to .δ.
6 Conclusion
We have studied the cost optimization problem in competitive facility loca-
tion, where customer demand is predicted using the cross-nested logit model.
Given the high non-convexity of the problem, it poses significant challenges.
References
1. Aouad, A., Farias, V., Levi, R., Segev, D.: The approximability of assortment
optimization under ranking preferences. Oper. Res. 66(6), 1661–1669 (2018)
2. Beine, M.A., Bierlaire, M., Docquier, F.: New York, Abu Dhabi, London or Stay at
Home? using a cross-nested logit model to identify complex substitution patterns
in migration. IZA Discussion Paper No. 14090 (2021)
3. Ben-Akiva, M., Lerman, S.R.: Discrete Choice Analysis: Theory and Application
to Travel Demand. MIT Press, Cambridge, Massachusetts (1985)
4. Ben-Akiva, M.: The structure of travel demand models. Ph.D. thesis, MIT (1973)
5. Ben-Akiva, M., Bierlaire, M.: Discrete Choice Methods and their Applications to
Short Term Travel Decisions, pp. 5–33. Springer US, Boston, MA (1999)
6. Ben-Akiva, M., Bierlaire, M.: Discrete choice methods and their applications to
short term travel decisions. Handbook of Transportation Science, pp. 5–33 (1999)
7. Benati, S., Hansen, P.: The maximum capture problem with random utilities: Prob-
lem formulation and algorithms. Eur. J. Oper. Res. 143(3), 518–530 (2002)
8. Bierlaire, M.: A theoretical analysis of the cross-nested logit model. Ann. Oper.
Res. 144, 287–300 (2006)
9. Bonami, P., et al.: An algorithmic framework for convex mixed integer nonlinear
programs. Discret. Optim. 5(2), 186–204 (2008)
10. Şen, A., Atamtürk, A., Kaminsky, P.: A conic integer programming approach to
constrained assortment optimization under the mixed multinomial logit model.
Oper. Res. 66(4), 994–1003 (2018)
11. Ding, C., Mishra, S., Lin, Y., Xie, B.: Cross-nested joint model of travel mode and departure time choice for urban commuting trips: case study in the Maryland-Washington, DC region. J. Urban Plann. Develop. 141(4), 04014036 (2015)
12. Drabas, T., Wu, C.L.: Modelling air carrier choices with a segment specific cross
nested logit model. J. Air Transp. Manag. 32, 8–16 (2013)
13. Duong, N.H., Dam, T.T., Ta, T.A., Mai, T.: Joint location and cost planning
in maximum capture facility location under random utilities. Comput. Oper.
Res. 159, 106336 (2023). https://doi.org/10.1016/j.cor.2023.106336, https://www.
sciencedirect.com/science/article/pii/S0305054823002009
14. Duong, N.H., Ta, T.A.: Approximation methods for a nonlinear competitive facility
cost optimization problem. In: 2022 14th International Conference on Knowledge
and Systems Engineering (KSE), pp. 1–6. IEEE (2022)
15. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of
mixed-integer nonlinear programs. Math. Program. 36, 307–339 (1986)
16. Fosgerau, M., Bierlaire, M.: Discrete choice models with multiplicative error terms.
Transp. Res. Part B 43(5), 494–505 (2009)
17. Fosgerau, M., McFadden, D., Bierlaire, M.: Choice probability generating func-
tions. J. Choice Modell. 8, 1–18 (2013)
18. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017). https://
arxiv.org/abs/1412.6980
19. Lai, X., Bierlaire, M.: Specification of the cross-nested logit model with sampling
of alternatives for route choice models. Transport. Res. Part B: Methodol. 80,
220–234 (2015)
20. Le, B.L., Mai, T., Ta, T.A., Ha, M.H., Vu, D.M.: Competitive facility location
under cross-nested logit customer choice model: Hardness and exact approaches
(2024). https://arxiv.org/abs/2408.02925
21. Le, C., Mai, T.: Constrained assortment optimization under the cross-nested logit
model. Production and Operations Management p. 1 (2024). https://doi.org/10.
1177/10591478241263857
22. Mai, T.: A method of integrating correlation structures for a generalized recursive
route choice model. Transport. Res. Part B: Methodol. 93, 146–161 (2016)
23. Mai, T., Frejinger, E., Fosgerau, M., Bastin, F.: A dynamic programming approach
for quickly estimating large network-based MEV models. Transport. Res. Part B:
Methodol. 98, 179–197 (2017)
24. Mai, T., Lodi, A.: A multicut outer-approximation approach for competitive facility
location under random utilities. Eur. J. Oper. Res. 284(3), 874–881 (2020)
25. McFadden, D.: Conditional logit analysis of qualitative choice behaviour. In:
Zarembka, P. (ed.) Frontiers in Econmetrics, pp. 105–142. Academic Press, New
York (1973)
26. McFadden, D.: Modelling the choice of residential location. Transportation
Research Record (1978)
27. McFadden, D.: Econometric models of probabilistic choice. In: Manski, C., McFad-
den, D. (eds.) Structural Analysis of Discrete Data with Econometric Applications,
chap. 5, pp. 198–272. MIT Press (1981)
28. McFadden, D.: Economic choices. American Economic Review, pp. 351–378 (2001)
29. McFadden, D., Train, K.: Mixed MNL models for discrete response. J. Appl.
Econom. 447–470 (2000)
30. Small, K.A.: A discrete choice model for ordered alternatives. Econometrica 55(2),
409 (1987). https://doi.org/10.2307/1913243
31. Train, K.E.: Discrete Choice Methods with Simulation. Cambridge University Press (2009)
32. Vovsha, P.: Application of cross-nested logit model to mode choice in Tel Aviv,
Israel, metropolitan area. Transp. Res. Rec. 1607(1), 6–15 (1997)
33. Vovsha, P., Bekhor, S.: Link-nested logit model of route choice: overcoming route overlapping problem. Transp. Res. Rec. 1645, 133–142 (1998)
34. Yang, L., Zheng, G., Zhu, X.: Cross-nested logit model for the joint choice of
residential location, travel mode, and departure time. Habitat Int. 38, 157–166
(2013)
35. Zaheer, M., Reddi, S., Sachan, D., Kale, S., Kumar, S.: Adaptive methods for
nonconvex optimization. In: Advances in Neural Information Processing Systems,
vol. 31 (2018)
A Historical GPS Trajectory-Based
Framework for Predicting Bus Travel Time
1 Introduction
Large volumes of bus operation data, such as GPS trajectory data and Integrated Circuit (IC) card data, have been collected. Extracting accurate bus operation information from these data has become a key research topic and is important for improving bus services, particularly passenger satisfaction, efficient bus operation, etc.
The goal of predicting bus travel time is to estimate the time a bus takes to
travel between two bus stops [27]. However, unpredictable factors may cause an
early arrival or delay at a specific bus stop, causing passengers to lose confidence
in public transport and prefer private cars. Cyclical factors, like weather conditions [9] (e.g., rain, snow, fog), time intervals [11] (e.g., peak and off-peak hours), and holidays have a significant effect on bus travel time prediction. Moreover, there are various challenges associated with the processing of GPS data, such as discontinuities, non-uniformities, poor network coverage, and human errors. Additionally, the systematic errors and uneven reporting intervals of the original GPS data make it rather difficult to estimate the bus travel time between any two consecutive stops.
A large amount of research based on bus GPS data has been conducted on
the bus evaluation methodologies [28], bus arrival time prediction models [8] and
influence factors to improve the performance of bus travel time prediction [22].
Over the years, many traditional approaches have been explored to predict bus
travel time. In earlier studies, methods such as the k-nearest neighbors (KNN)
algorithm were applied to address this problem [4]. Other research has combined
different model types to enhance prediction accuracy [2]. More recently, Pang
et al. [16] utilized Recurrent Neural Networks with Long Short-Term Memory
blocks (RNN-LSTM) to improve accuracy further. Additionally, Han et al. [8]
divided entire bus routes into segments and proposed a hybrid-BAT (Bus Arrival
Time) model, incorporating weighted LSTM to predict multi-stop bus arrival
times more effectively. However, many of them have not considered stochastic
features or the variability of bus travel time throughout a particular interval of
a day, which can significantly affect the travel time. While traffic varies a lot
between road segments, most prior studies have used a single model to predict
the travel time for the entire route, which cannot effectively capture sophisti-
cated dependencies among different road segments at various periods. Moreover,
the existing studies usually focused on the delay at special bus operation facil-
ities such as intersections and travel speed at the route level, and few research
studies were supported by the route map based on bus line data [30]. There-
fore, few studies address how to accurately estimate bus arrival times using GPS
trajectory data or predict bus travel times at different intervals of the day.
In this study, we propose a prediction framework to estimate bus travel times
by leveraging collected GPS trajectory data. This approach uses historical GPS
data to estimate travel times. Based on the pre-processing of the GPS trajectory data and bus route data, and on mapping the GPS points onto the route map, the bus stop arrival times are estimated; the bus travel times are then obtained using a clustering method and an information decay function that downweights older records.
The main contributions of this study are the following: (1) This study pro-
poses a bus travel time prediction framework that utilizes historical GPS tra-
jectory data and mapping techniques to accurately estimate bus stop arrival
times. By dividing the bus route into segments and considering multiple similar
trajectories, the framework addresses variability in travel conditions, improving
the accuracy of travel time prediction. (2) The framework takes advantage of
the positional relationship between bus stop locations and nearby mapped GPS
points to estimate bus stop arrival times, rather than relying on GPS instanta-
neous speed values. (3) This study conducts extensive experiments using real-
world GPS bus trajectory data from Sri Lanka and Vietnam to validate the pro-
posed framework. The results demonstrate that the model significantly improves
prediction accuracy compared to existing methods by considering time intervals
and providing a more robust solution for travel time estimation.
The rest of this study is organized as follows. Section 2 reviews the related
work, Sect. 3 introduces terminology, and formulates the research problem. Later,
Sect. 4 gives an overview of the framework, detailing its prediction model. Then,
a comprehensive experimental study is conducted using the collected real dataset
of bus trajectories in Sect. 5. Finally, the conclusions are outlined in Sect. 6.
2 Related Work
The problem of bus travel time prediction was studied by considering different
models and various essential factors. In a study by Gaikwad et al. [7], the crucial
features for bus travel time prediction and standard evaluation metrics were pre-
sented. Former studies have demonstrated that travel time tends to be affected
by traffic congestion and weather conditions. Thus, it can be periodic (e.g., the
daily growth in traffic over the peak interval on weekdays) or non-periodic (like
accidents and abnormal weather) over many time intervals in a day or day-to-
day. Over the last several years, many studies have been proposed to predict the
bus travel time in urban networks, mainly between two consecutive bus stops
along the path. Multiple prediction methods and mechanisms have been devel-
oped to predict bus travel time using historical trajectory or GPS data. These
methods can be categorized into statistical methods, machine learning methods,
and neural network methods.
Statistical approaches for predicting bus travel times can be classified into
historical average methods [6], regression methods [3], and time-series meth-
ods [8]. Historical average methods predict travel times by averaging past travel
times for the same time interval across multiple days [6]. This method follows
a decentralized model that doesn’t rely on specific training data or assumptions
about the data’s underlying patterns. However, it is often difficult to gather suf-
ficient trip data for each bus stop across different time intervals. Furthermore,
this approach requires stable traffic conditions to be effective, and its accuracy
significantly decreases when there are major fluctuations in traffic patterns [21].
Regression methods are designed to assess the impact of various factors on bus
travel time, explaining travel time (the dependent variable) through a set of
independent variables (e.g., passenger loads, and street characteristics). Traffic
patterns, journey distance, and dwell time are treated as separate variables. It
should be noted that nonlinear relationships between independent and dependent
variables may also exist [3]. The time-series method assumes that the future status of a bus depends on the previous status of the same bus. These methods often lead to minimal delays between predicted and actual times. However, they are highly sensitive to complex and unusual circumstances, such as congestion or delays at intersections, which are common on bus routes. Commonly applied time-series methods include ARIMA [20] and generalized autoregressive conditional heteroscedasticity [26].
Machine learning techniques have been increasingly applied to predict bus
travel time. Methods like Support Vector Machine (SVM) and Artificial Neural
Networks (ANN) are particularly popular due to their ability to capture com-
plex relationships efficiently. In study [1], an ANN-based model was developed to
predict bus travel times using GPS data from a route in Delhi, India. Similarly,
a study focusing on bus arrival predictions in both urban and rural regions of
China [23] utilized a combination of Support Vector Regression (SVR) and KNN
to compare with traditional prediction methods, using data from a bus route
that spanned urban and rural areas. Leveraging the Internet of Things, cluster-
ing method, study [12] showed that accurate arrival time under different traffic
conditions can be predicted using the bus and route with different parameters
such as average speed, number of passengers, rush hour information, and num-
ber of bus stops. To further enhance prediction accuracy, many studies [13] have
incorporated the Kalman Filter algorithm into machine learning-based methods.
In addition, with recent advancements in deep learning, researchers have
demonstrated that bus travel time prediction can be further enhanced by leverag-
ing deep learning architectures like LSTM networks [10], which excel at learning
temporal correlations. A more recent study [8] introduced a GPS position calibra-
tion technique to improve arrival time accuracy through the use of LSTM models.
Furthermore, a hybrid model as a combination of convolutions and LSTM layers
into a single structure, ConvLSTM [24] was applied [18]. Here, the convolution
layer could learn spatial dependencies between segments. In parallel, ensemble
learning methods have gained popularity in travel time prediction. Gradient
boosting ensembles, such as XGBoost [29], have been widely used, improving
prediction accuracy while mitigating the risk of overfitting.
Furthermore, there is no standard dataset used by these researchers, which prevents a direct comparison.
bus travel time, such as stochastic features or variability across different intervals
of the day, which can significantly impact accuracy. Although traffic conditions
vary widely between road segments, most prior studies have used a single model
to predict travel time for entire routes, which fails to effectively capture complex
dependencies across segments at various times.
In this section, we define the essential concepts that are employed throughout
this study and formulate the targeted research problem.
3.1 Terminology
To support our discussions throughout this study, we begin by defining several
key terms, including bus route, route segment, road link, road map, GPS point,
raw GPS trajectory, mapped GPS trajectory, and segment travel time.
Definition 1. Bus route: A bus route .R is represented as a sequence of bus
stops, .R = (s1 , s2 , ..., sn ) where .si stands for a .(i)th bus stop and .n is the
number of bus stops. Additionally, in public transportation, a bus regularly runs
on a fixed route, which can be represented by a sequence of two-dimensional
(latitude and longitude) route points .(p1 , p2 , ..., pm ) where .pi = (xi , yi ) and .m is
the number of route points.
Definition 2. Route segment: In this study, based on bus stops, we divide
a route into route segments which are partial segments between two consecu-
tive bus stops. Accordingly, a bus route .R can be represented by a sequence of
segments, .(S1 , S2 , ..., Sn−1 ).
Definition 3. Road link: A road link is a directed link, defined as e = (e_id, p_id^start, p_id^end), where e_id represents the road link identifier, p_id^start represents the road link start point, and p_id^end represents the road link end point.
Definition 4. Road map: The road map is represented by a single directed
graph .G = (V, E), where .V is the set of road link nodes, expressed as .V =
(v1 , v2 , ..., vm ), .E is the set of all road links, expressed as .E = (e1 , e2 , ..., em−1 ).
Definition 5. GPS point: The GPS point (also called trajectory point) rep-
resents the location information acquired by the GPS device within a certain
time interval, defined as .tri = (lati , lngi , ti ), where .lati and .lngi represent the
latitude and longitude coordinates of the current location point, respectively, .ti
represents the time of the current location point of the vehicle.
Definition 6. Raw GPS trajectory: The raw GPS trajectory (also called
original GPS trajectory) consists of a set of GPS points ordered by timestamps,
defined as .T R = (tr1 , tr2 , ..., tru ), where .u is the number of GPS points contained
in .T R.
Definition 7. Mapped GPS trajectory: Given an original GPS trajectory,
its mapped trajectory .T Rm represents the actual driving link of the bus in the
road map, defined as .T Rm = (c1 , c2 , ..., cu ).
Definition 8. Segment travel time: A bus trajectory is represented as a sequence ((s_1, T_1), (s_2, T_2), ..., (s_n, T_n)), where T_i denotes the arrival time of the bus at bus stop s_i, which can be calculated from the mapped trajectory. With the arrival time at each bus stop, it is straightforward to derive the segment travel times between consecutive bus stops, which can be represented as a sequence (\Delta T_1, \Delta T_2, ..., \Delta T_{n-1}), where \Delta T_i denotes the travel time on the i-th segment. In this study, the segment travel time is defined as the time from arrival at one bus stop to arrival at the next bus stop. Therefore, the dwell time at the bus stop is included in the segment travel time.
Fig. 2. A GPS point has several candidate-mapped points in the roundabout area.
where p_1 and p_2 are the two endpoints of the road link e_j, and c is the vertical projection point from tr_i onto e_j. If c is not located on e_j, then dis_p(tr_i, c) is set to +\infty; \sigma is the standard deviation of the location measurement, which is generally set to 20 meters [14]. However, mapped road links determined only by the spatial probability will contain errors. Therefore, the transmission probability is introduced to improve the mapping accuracy.
Transmission probability: Given two candidate points .ci−1 and .ci for two
neighboring GPS sampling points .tri−1 and .tri respectively, the transmission
probability from .ci−1 to .ci is defined as the likelihood that the path from .tri−1
to .tri follows the path from .ci−1 to .ci . We compute transmission probability as:
G_2(c_{i-1}, c_i) = \frac{dis_p(tr_{i-1}, tr_i)}{dis_p(c_{i-1}, c_i)}          (3)
By combining the spatial probability and the transmission probability, this study defines the comprehensive probability G, which is calculated by

G = G_1 \times G_2          (4)
According to the above calculation, there is at least one eligible road link around each trajectory point, so there is at least one comprehensive probability. Based on the comprehensive probability values, we select the candidate road link with the highest probability as the best-mapped link. Additionally, we ensure that the direction of travel is consistently forward, eliminating any instances of the bus moving backward along the route.
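A compact sketch of how candidate links might be scored by the combined probability G = G1 × G2 is given below; the Gaussian form of the spatial probability (with σ = 20 m) follows the usual map-matching convention, and all function and field names are ours.

```python
import math

def spatial_probability(d_proj, sigma=20.0):
    """G1: Gaussian weight of the projection distance from a GPS point to a link."""
    if math.isinf(d_proj):
        return 0.0
    return math.exp(-d_proj ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def transmission_probability(d_gps, d_path):
    """G2: ratio of the straight GPS displacement to the candidate-path distance."""
    return d_gps / d_path if d_path > 0 else 0.0

def best_candidate(candidates):
    """candidates: list of dicts with keys 'd_proj' (projection distance),
    'd_gps' (displacement between GPS points) and 'd_path' (path distance
    between candidate points); return the index of the highest G = G1 * G2."""
    scores = [spatial_probability(c["d_proj"]) *
              transmission_probability(c["d_gps"], c["d_path"]) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])
```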
According to the positional relationship between the bus stop site and the adjacent mapped GPS points, the bus stop arrival time can be estimated. Different positional relationships call for different arrival time estimation methods, as shown in Fig. 3.
Scenario 1: The mapped GPS point is near the bus stop site. A bus stop serving many bus routes often exhibits a queuing phenomenon caused by vehicles waiting to dwell. Moreover, bus station platforms generally have a fixed length. Therefore, this study defines a bus stop area as the 20-meter stretch in front of the bus stop site. If the mapped GPS point falls within this range, the return time T_{m-1} of that mapped GPS point is regarded as the bus arrival time T_s. The bus dwell time is then added to the next travel time interval between bus stops.
Scenario 2: The mapped GPS point is outside the bus stop area. During the bus
operation, the behaviors of accelerating or decelerating happen frequently. To
process conveniently, this study established a hypothesis that the speed between
two GPS-mapped points is even. As a result, if the GPS mapped point is outside
the scope of the bus stop area, the bus arrival time with the return time of two
continuous GPS mapped points, as expressed by Eq. 5.
where $T_s$ is the bus arrival time at the $s$-th bus stop, $T_{m-1}$ and $T_m$ are the return times of the $(m-1)$-th and $m$-th GPS-mapped points, and $d_{(m-1,s)}$ and $d_{(m-1,m)}$ are the distances along the bus route map from the $(m-1)$-th GPS-mapped point to bus stop $s$ and to the $m$-th GPS-mapped point, respectively.
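To make the two scenarios concrete, the following sketch computes a bus stop arrival time from two consecutive mapped GPS points; the function and variable names are illustrative assumptions, not the authors' implementation.

```python
def estimate_arrival_time(t_prev, t_next, d_prev_to_stop, d_prev_to_next,
                          stop_area_m=20.0, d_stop_ahead=None):
    """Estimate the arrival time T_s at a bus stop from two consecutive
    GPS-mapped points, following the two scenarios described above.

    t_prev, t_next : return times (seconds) of the (m-1)-th and m-th mapped points
    d_prev_to_stop : route distance (m) from the (m-1)-th mapped point to the stop
    d_prev_to_next : route distance (m) between the two mapped points
    stop_area_m    : length of the bus stop area in front of the stop site
    d_stop_ahead   : distance (m) from the mapped point to the stop site, used to
                     test whether the point lies inside the bus stop area
    """
    # Scenario 1: the mapped point falls inside the 20-m bus stop area,
    # so its return time is taken directly as the arrival time.
    if d_stop_ahead is not None and 0.0 <= d_stop_ahead <= stop_area_m:
        return t_prev

    # Scenario 2: assume constant speed between the two mapped points and
    # interpolate the arrival time linearly along the route distance.
    fraction = d_prev_to_stop / d_prev_to_next
    return t_prev + (t_next - t_prev) * fraction


# Example: points returned at t = 100 s and t = 130 s; the stop lies 60 m of
# the 150 m separating them, so the arrival time is interpolated to 112 s.
print(estimate_arrival_time(100.0, 130.0, 60.0, 150.0, d_stop_ahead=60.0))
```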
The prediction model adopts a stop-to-stop approach to predict the travel time between two consecutive bus stops, i.e., Cho Lon Terminal $\rightarrow$ Bus Stop 1, Bus Stop 1 $\rightarrow$ Bus Stop 2, and so on, as seen in Fig. 4. Therefore, the bus travel time is predicted by summing the predicted travel times of the segments not yet traveled:
$T_{OD} = \sum_{i=O}^{D-1} T_{S_i}$ (6)
where $T_{OD}$ is the travel time from the origin station $O$ to the destination station $D$, and $T_{S_i}$ is the predicted travel time on segment $S_i$.
In this study, the prediction model also assumes that bus travel time is cyclical, i.e., similar for the same day of the week and time of day; for example, travel times on successive Tuesdays are characterized by similar values for similar segments. Moreover, the bus travel time
Fig. 4. A stop-to-stop approach to predict the travel time between two consecutive bus
stops.
changes over time. Thus, the model for travel time prediction also considers the variation across periods. Before clustering similar historical trajectories, the model first selects trajectories from the same day of the week and then considers the time interval for each segment. The time interval depends on when the bus enters the segment. Take segment $S_i$, for example; the time interval $\tau$ to which $S_i$ belongs:
groups. To partition the segment travel times into groups, we use a clustering method that partitions the travel times on a segment into multiple groups with minimized variance. Given the list $S^k$ of all travel times on a segment in time interval $k$, we first sort $S^k$ by value and then recursively perform binary partition of the sorted $S^k$ into sub-lists. In each iteration, we compute the variances of all the data in $S^k$ to find the best split point according to the minimal weighted average variance (WAV), defined as:
$WAV(i, S^k) = \dfrac{|S_A^k(i)|}{|S^k|}\,Var(S_A^k(i)) + \dfrac{|S_B^k(i)|}{|S^k|}\,Var(S_B^k(i))$ (8)
where $S_A^k(i)$ and $S_B^k(i)$ are the two sub-lists of $S^k$ split at the $i$-th element and $Var$ denotes the variance. The best split point leads to a maximum decrease $\Delta V(i)$:
$\Delta V(i) = Var(S^k) - WAV(i, S^k)$ (9)
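A minimal sketch of the recursive binary partition guided by the WAV criterion in Eq. (8) is given below; the stopping thresholds (`min_size`, `min_gain`) are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def best_split(values):
    """Return (index, wav) of the split that minimises the weighted average
    variance (WAV) of a sorted 1-D array, as in Eq. (8)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    best_i, best_wav = None, np.inf
    for i in range(1, n):                      # split into values[:i] and values[i:]
        left, right = values[:i], values[i:]
        wav = (len(left) / n) * left.var() + (len(right) / n) * right.var()
        if wav < best_wav:
            best_i, best_wav = i, wav
    return best_i, best_wav

def partition_travel_times(values, min_size=5, min_gain=1.0):
    """Recursively binary-partition sorted segment travel times into groups.
    The recursion stops when a group is small or the variance decrease
    (Delta V) falls below min_gain; both thresholds are illustrative."""
    values = np.sort(np.asarray(values, dtype=float))
    if len(values) < 2 * min_size:
        return [values]
    i, wav = best_split(values)
    gain = values.var() - wav                  # decrease Delta V achieved by the split
    if i is None or gain < min_gain:
        return [values]
    return (partition_travel_times(values[:i], min_size, min_gain)
            + partition_travel_times(values[i:], min_size, min_gain))

# Example: travel times (seconds) with two clear regimes split into two groups.
times = [58, 60, 61, 63, 64, 65, 118, 120, 122, 125, 127, 130]
for group in partition_travel_times(times, min_size=3, min_gain=50.0):
    print(group)
```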
– No decay: $f(a) = 1$, where all records are treated equally regardless of their timestamp.
– Exponential decay: $f(a) = (1 - \lambda)^a$ for $\lambda > 0$, where $\lambda$ is the decay constant.
– Polynomial decay: $f(a) = (a + 1)^{-a}$; this function is used because it decays more slowly than the exponential decay, which decays too quickly.
In this study, based on the idea of a decay function, we use polynomial decay to weight the historical segment travel times. The mean lifetime $a$ can be calculated from domain knowledge of the dataset by taking the week difference between the most significant timestamp (MST) for prediction and the timestamp $t$ of the segment, $a = \frac{MST - t}{7}$. All segments whose timestamps are greater than the MST are given a weight of one, under the assumption that records newer than the MST are the best predictors of the future. Therefore, given a sample set of similar trajectories $T_p$ obtained through the segment filtering above and the weights calculated by $f(a) = (a + 1)^{-a}$, the travel time on segment $S_i$ is predicted as:
$W = \sum_{i=1}^{N_k} w_i$ (10)
$w_i' = \dfrac{w_i}{W}$ (11)
$T_{S_i} = \dfrac{1}{N_k}\sum_{i=1}^{N_k} (\Delta T_i \times w_i)$ (12)
where $w_i$ and $w_i'$ are the weights and normalized weights of segment $S_i$, $S_i$ is a segment of a trajectory in $T_p$, and $N_k$ is the number of trajectories in $T_p$.
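The sketch below illustrates the polynomial decay weighting and the weighted segment prediction. It combines the decayed weights as a normalized weighted mean, which is one reading of Eqs. (10)–(12); the dates, function names, and example values are assumptions for illustration only.

```python
from datetime import datetime

def poly_decay_weight(segment_ts, mst_ts):
    """Polynomial decay weight f(a) = (a + 1) ** (-a), where a is the age in
    weeks of a historical segment relative to the most significant timestamp
    (MST). Segments newer than the MST get weight one."""
    age_days = (mst_ts - segment_ts).days
    if age_days <= 0:                 # timestamp greater than MST
        return 1.0
    a = age_days / 7.0                # age in weeks
    return (a + 1.0) ** (-a)

def predict_segment_time(history, mst_ts):
    """Weighted prediction of a segment travel time from similar historical
    trajectories. `history` is a list of (timestamp, travel_time_seconds);
    the normalised weights play the role of w'_i in Eqs. (10)-(12)."""
    weights = [poly_decay_weight(ts, mst_ts) for ts, _ in history]
    total = sum(weights)
    return sum(w / total * dt for w, (_, dt) in zip(weights, history))

# Example: three similar trajectories; the most recent one weighs the most.
mst = datetime(2012, 11, 16)
history = [(datetime(2012, 11, 13), 125.0),
           (datetime(2012, 11, 6), 140.0),
           (datetime(2012, 10, 30), 150.0)]
print(round(predict_segment_time(history, mst), 1))
```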
5 Experiments
This section evaluates the results of the proposed travel time prediction frame-
work on real-world trajectory datasets. We evaluated the accuracy and reliability
using the mean absolute error (MAE) and the root mean square error (RMSE).
5.1 Datasets
In this study, we used two real-world datasets from buses in Kandy, Sri Lanka,
and Ho Chi Minh City, Vietnam:
– Kandy, Sri Lanka: The raw GPS data were processed to extract the segment
running times and dwell times. The detailed explanation of processing the raw
GPS data was presented in study [19]. The published dataset of Route No.
654 from Kandy to Digana terminals with 14 bus stops was considered.1 From
the data of 4 months (from Oct 2021 to Jan 2022), we used 3 months of data
for training and 1 month for testing.
1 https://www.kaggle.com/datasets/shiveswarranr/bus-travel-time-data
Table 1. Comparison of the performance of the proposed model with other existing
approaches.
– Ho Chi Minh City, Vietnam: The raw bus GPS trajectory data used in
this study (a week of November 9 to 16, 2012) was provided by Ho Chi Minh
City Department of Transportation. The GPS trajectory data for Route No.
94 (Cho Lon Terminal to Cu Chi Terminal) with 26 bus stops and Route No.
74 (An Suong Terminal to Cu Chi Terminal) with 27 bus stops were extracted
and mapped onto the road network to estimate bus stop arrival times, then
used to predict travel times. We used 4 d of data (Monday to Thursday) for
training and 1 day (Friday) for testing.
Table 1 shows the overall evaluation results of the proposed framework against other existing approaches. The results show that our proposal outperforms the other existing models.
We also test the proposed framework on all bus stops on the route. The
estimated arrival time and the measured arrival time of one bus for 26 bus stops
of route No. 94 and 27 bus stops of route No. 74 are compared and the results
are presented in Fig. 5. Overall, we can see that the predicted values are close to
the actual values. However, as distance increases, the deviation also increases.
Moreover, we conducted practical experiments to compare the prediction
results for different minimum numbers of trajectories (MNT) in the segment
filtering step. Figure 6 shows the comparison results of route No. 654 and route
No. 94.
Fig. 5. The comparison of the predicted and measured time for 26 bus stops of route
No. 94 and 27 bus stops of route No. 74.
Fig. 6. The comparison of the prediction results (MAE) for different minimum numbers of trajectories (MNT).
6 Conclusions
In this study, we investigate the limitations of existing methods and the low
accuracy of state-of-the-art models in predicting bus arrival times using GPS
trajectory data only, especially in capturing the high variability of travel times.
To address this, we implemented a historical trajectory-based framework to pro-
cess GPS trajectory data using a mapping method and to predict travel times
with consideration of date and time intervals. This proposed approach enables
more accurate long-term predictions compared to traditional approaches, where
each segment’s travel time is predicted independently. Additionally, we applied
polynomial decay to down-weight records corresponding to older data. We con-
ducted comprehensive experiments using real datasets collected from buses in
Kandy, Sri Lanka, and Ho Chi Minh City, Vietnam, demonstrating that our
model outperforms other popular and recent methods. The data required for the
proposed model is straightforward, consisting of the standard outputs generated
by most ITS commonly used in public transport.
References
1. Amita, J., Jain, S., Garg, P.K.: Prediction of bus travel time using ANN: a case study in Delhi. Transport. Res. Proc. 17, 263–272 (2016)
2. As, M., Mine, T., Yamaguchi, T.: Prediction of bus travel time over unstable
intervals between two adjacent bus stops. Int. J. Intell. Transp. Syst. Res. 18,
53–64 (2020)
3. Chen, M., Liu, X., Xia, J., Chien, S.I.: A dynamic bus-arrival time prediction model
based on APC data. Comput.-Aided Civil Infrastruct. Eng. 19(5), 364–376 (2004)
4. Coffey, C., Pozdnoukhov, A., Calabrese, F.: Time of arrival predictability horizons
for public bus routes. In: Proceedings of the 4th ACM SIGSPATIAL International
Workshop on Computational Transportation Science, pp. 1–5 (2011)
5. Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B.: Forward decay: a practi-
cal time decay model for streaming systems. In: 2009 IEEE 25th International
Conference on Data Engineering, pp. 138–149. IEEE (2009)
6. Dhivya Bharathi, B., Anil Kumar, B., Achar, A., Vanajakshi, L.: Bus travel time
prediction: a log-normal auto-regressive (AR) modelling approach. Transportmet-
rica A: Transp. Scie. 16(3), 807–839 (2020)
7. Gaikwad, N., Varma, S.: Performance analysis of bus arrival time prediction using
machine learning based ensemble technique. In: Proceedings 2019: Conference on
Technologies for Future Cities (CTFC) (2019)
8. Han, Q., Liu, K., Zeng, L., He, G., Ye, L., Li, F.: A bus arrival time prediction
method based on position calibration and LSTM. IEEE Access 8, 42372–42383
(2020)
9. He, P., Jiang, G., Lam, S.K., Sun, Y.: Learning heterogeneous traffic patterns for
travel time prediction of bus journeys. Inf. Sci. 512, 1394–1406 (2020)
10. He, P., Jiang, G., Lam, S.K., Tang, D.: Travel-time prediction of bus journey with
multiple bus trips. IEEE Trans. Intell. Transp. Syst. 20(11), 4192–4205 (2018)
11. Huang, Y., et al.: Bus arrival time prediction and reliability analysis: an experimental comparison of functional data analysis and Bayesian support vector regression. Appl. Soft Comput. 111, 107663 (2021)
12. Jalaney, J., Ganesh, R.: Highly accurate bus arrival time prediction using k-nearest neighbor prediction in the Internet of Things (IoT) environment. J. Green Eng. 10(9), 4752–62 (2020)
13. Liu, H., Van Lint, H., Van Zuylen, H., Zhang, K.: Two distinct ways of using Kalman filters to predict urban arterial travel time. In: 2006 IEEE Intelligent Transportation Systems Conference, pp. 845–850. IEEE (2006)
14. Lou, Y., Zhang, C., Zheng, Y., Xie, X., Wang, W., Huang, Y.: Map-matching for low-sampling-rate GPS trajectories. In: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 352–361 (2009)
15. Müller, M.: Dynamic time warping. Information retrieval for music and motion,
pp. 69–84 (2007)
16. Pang, J., Huang, J., Du, Y., Yu, H., Huang, Q., Yin, B.: Learning to predict bus
arrival time from heterogeneous measurements via recurrent neural network. IEEE
Trans. Intell. Transp. Syst. 20(9), 3283–3293 (2018)
17. Paterson, M., Dančík, V.: Longest common subsequences. In: International Sym-
posium on Mathematical Foundations of Computer Science, pp. 127–142. Springer
(1994)
18. Petersen, N.C., Rodrigues, F., Pereira, F.C.: Multi-output deep learning for bus
arrival time predictions. Transport. Res. Proc. 41, 138–145 (2019)
19. Ratneswaran, S., Thayasivam, U.: Extracting potential travel time information from raw GPS data and evaluating the performance of public transit - a case study in Kandy, Sri Lanka. In: 2023 3rd International Conference on Intelligent Communication and Computational Techniques (ICCT), pp. 1–7. IEEE (2023)
20. Thomas, T., Weijermars, W., Van Berkum, E.: Predictions of urban volumes in
single time series. IEEE Trans. Intell. Transp. Syst. 11(1), 71–80 (2009)
21. Vanajakshi, L., Rilett, L.R.: Support vector machine technique for the short term
prediction of travel time. In: 2007 IEEE Intelligent Vehicles Symposium, pp. 600–
605. IEEE (2007)
22. Wang, L., Zhang, D., Wang, Y., Chen, C., Han, X., M’hamed, A.: Sparse mobile
crowdsensing: challenges and opportunities. IEEE Commun. Mag. 54(7), 161–167
(2016)
23. Wang, Y.: Intellectualization of the urban and rural bus: the arrival time prediction
method. J. Intell. Syst. 30(1), 689–697 (2021)
24. Wu, J., Wu, Q., Shen, J., Cai, C.: Towards attention-based convolutional long
short-term memory for travel time prediction of bus journeys. Sensors 20(12),
3354 (2020)
25. Xie, Z.Y., He, Y.R., Chen, C.C., Li, Q.Q., Wu, C.C.: Multistep prediction of bus
arrival time with the recurrent neural network. Math. Probl. Eng. 2021(1), 6636367
(2021)
26. Yang, M., Liu, Y., You, Z.: The reliability of travel time forecasting. IEEE Trans.
Intell. Transp. Syst. 11(1), 162–171 (2009)
27. Yu, B., Lam, W.H., Tam, M.L.: Bus arrival time prediction at bus stop with
multiple routes. Transport. Res. Part C: Emerg. Technol. 19(6), 1157–1170 (2011)
28. Zhang, J., Yu, X., Tian, C., Zhang, F., Tu, L., Xu, C.: Analyzing passenger density
for public bus: Inference of crowdedness and evaluation of scheduling choices. In:
17th International IEEE Conference on Intelligent Transportation Systems (ITSC),
pp. 2015–2022. IEEE (2014)
29. Zhu, L., Shu, S., Zou, L.: XGBoost-based travel time prediction between bus stations and analysis of influencing factors. Wirel. Commun. Mob. Comput. 2022(1), 3504704 (2022)
30. Zhu, T., Ma, F., Ma, T., Li, C.: The prediction of bus arrival time using global
positioning system data and dynamic traffic information. In: 2011 4th Joint IFIP
Wireless and Mobile Networking Conference (WMNC 2011), pp. 1–5. IEEE (2011)
Influence Maximization with Fairness Allocation
Constraint
1 Introduction
Influence Maximization (IM) in Online Social Networks (OSNs) has recently been a hot research topic due to its wide range of applications in commerce, viral marketing, and social network analysis. Kempe et al. [13] first mathematically formulated the problem of IM,
which aimed at finding a set of .k (budget) influential users (called seed set) in an online
social network to begin an influence process that could possibly influence the largest
number of users. Since then, the .IM problem has demonstrated its important role in
various domains, not only in product promotion and social influence [15, 18], but also in
other applications such as social network monitoring [21, 31], epidemic prevention [16,
22] and recommendation systems [30].
In some realistic scenarios, the seed set can be selected from several groups of users, where each group usually has a group budget. The group budget constraint ensures that the budget is distributed evenly across groups. A typical example is product promotion in viral marketing, where a company conducts a marketing strategy targeting
some interested groups of users, where each group is associated with a local area. They want to distribute sample products fairly among the groups so that the total number of influenced users is maximized. Motivated by the aforementioned phenomena, in this paper, we
study Influence Maximization with Fairness Allocation (.IMFA) Constraint, defined as
follows:
Definition 1. Given a social network represented by a graph $G = (V, E)$, where $V$ is the set of users and $E$ is the set of links, under an information diffusion model $M$. Given a budget $k$, a set of $K$ groups $C_1, C_2, \ldots, C_K$, $C_i \subseteq V, i \in [K]$, and a fairness ratio $\alpha \in (0, 1)$, the problem asks to find a seed set $S$ that maximizes the number of influenced users such that its size is at most $k$ and the number of its elements in each group $C_i$ is at most $\alpha k$, i.e., $|S| \le k$ and $|S \cap C_i| \le \alpha k$.
We focus on developing an efficient approximation algorithm for the problem. Our
contributions are as follows:
– We first formulate the Influence Maximization with Fairness Allocation Constraint
under the well-known Independent Cascade (.IC) model.
– We design an efficient approximation algorithm, named SIMF, that returns an approximation ratio of $1/2 - \epsilon$ with probability at least $1 - \delta$ and has $O\left((m + \log(\frac{k}{\epsilon}))\frac{n}{\epsilon^2}(k\log n + \log(\frac{1}{\delta}))\right)$ time complexity, where $\epsilon, \delta \in (0, 1)$ are constants.
– Finally, we conduct several extensive experiments on real-world social networks.
The results show that our algorithm produces comparable solution quality while taking less running time than state-of-the-art algorithms.
Additional Related Works. Inspired by the advantages of social media platforms for product promotion, Kempe et al. [13] first mathematically introduced two information diffusion models, Independent Cascade (IC) and Linear Threshold (LT), and then studied Influence Maximization (IM) as a discrete optimization problem. They showed that IM is NP-hard and proposed a greedy algorithm that returns a $(1 - 1/e)$-approximation ratio. Kempe et al.'s work has inspired later research on efficient and scalable algorithms for IM [6, 7, 14, 17, 18, 20, 26], important variants of IM [4, 11, 18–20, 23], etc.
From an algorithmic perspective, several fast heuristic algorithms have been proposed that improve the running time in practice for large networks, including converting the original graph to a directed acyclic graph [6, 7], path-based algorithms [12], and community-based algorithms [32]. However, these algorithms do not provide any approximation guarantee and may not perform well on large networks (billions of vertices and links). Borgs et al. [2] made a breakthrough by introducing a sampling method, Reverse Influence Sampling (RIS), which achieved a ratio of $(1 - 1/e - \epsilon)$ within near-linear time complexity. Recently, several efforts have kept this ratio while further reducing the running time to $O((m + n)\log n)$ by modifying the RIS model [17, 26] (see [26] for an overview of RIS-based algorithms).
Several variants of IM that capture practical applications have been studied. Nguyen
et al. [19] investigated IM at the Community level (IMC) that asked to find a seed
set that could influence the largest number of groups and proposed several efficient
algorithms with theoretical guarantees. The authors of [33, 34] investigated the problem of Group Influence Maximization (GIM), which also asks to find a set of users with the largest influence over groups; a sandwich approximation algorithm with a data-dependent ratio was proposed. Recently, the authors of [23] studied the problem of finding a seed set with minimal total cost that can influence all (given) target groups. More recently, Tsang et al. [29] proposed an algorithmic approximation framework with constant approximation ratios for IM under fairness constraints on groups. The authors of [9] tried to solve the problem using an integer programming formulation. However, this method only works with a fixed-size sample graph and may not scale to medium-sized networks.
In general, these studies focused on influencing individuals or groups of users under various constraints, which differs from our setting of fair distribution over groups. Therefore, the existing algorithms cannot be directly applied to the proposed problem with an approximation guarantee.
Organization. The rest of the paper is organized as follows. Section 2 provides some useful notations. Section 3 presents our proposed algorithm for IMFA. The experiments and results are presented in Sect. 4. Finally, Sect. 5 concludes this work and discusses future studies.
Notation Description
$G = (V, E)$: a graph representing a social network, where node set $V$ represents the set of users and edge set $E$ represents the set of links.
$n, m$: $n = |V|$, $m = |E|$.
$N_{out}(v), N_{in}(v)$: the sets of outgoing and incoming neighbor nodes of $v$.
A graph $G = (V, E)$ represents an OSN, where $V$ is the set of nodes/vertices and $E$ is the set of edges, with $|V| = n$ and $|E| = m$. Denote by $N_{out}(u)$ ($N_{in}(u)$) the set of out-neighbors (in-neighbors) of node $u$, respectively. For any initial set of influenced users (called a seed set) $S \subseteq V$, the influence propagation process happens in discrete steps, and more nodes can become influenced. Under the IC model, each edge $e = (u, v) \in E$ has a propagation probability $p(e) \in [0, 1]$ representing the information transmission from node $u$ to node $v$. The diffusion process from a seed set $S$ proceeds as follows.
– At the beginning, i.e., at step $t = 1$, all nodes in $S$ are active and all other nodes in $V$ are inactive.
– At step $t > 1$, each node $u$ activated at step $t - 1$ has a single chance to activate each currently inactive node $v \in N_{out}(u)$ with success probability $p(e)$.
– If a node is activated, it remains active until the end of the diffusion process. The propagation process terminates at step $t$ if no new node is activated in this step.
Kempe et al. [13] showed that the IC model is equivalent to a live-edge model defined as follows. From the graph $G = (V, E)$, a random sample graph $g$ is generated from $G$ by selecting each edge $e \in E$ with probability $p(e)$ and not selecting $e$ with probability $1 - p(e)$. We refer to $g$ as a sample of $G$, write $g \sim G$, and denote by $\Omega$ the space of all sample graphs. The probability of generating a sample graph $g$ from $G$ is calculated by:
$\Pr[g \sim G] = \prod_{e \in E_g} p(e) \prod_{e \in E \setminus E_g} (1 - p(e))$ (1)
where $E_g$ is the set of edges in the graph $g$. The influence spread from a set of nodes $S$ to any node $u$ is calculated as follows:
$I(S, u) = \sum_{g \sim G} \Pr[g \sim G] \cdot R_g(S, u)$ (2)
where $R_g(S, u) = 1$ if $u$ is reachable from $S$ in $g$ and $R_g(S, u) = 0$ otherwise. The influence spread of $S$ in network $G$ (the expected number of influenced nodes) is:
$I(S) = \sum_{u \in V} I(S, u).$ (3)
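The following sketch estimates $I(S)$ by Monte Carlo simulation of the live-edge interpretation of the IC model in Eqs. (1)-(3); the toy graph, probabilities, and sample count are illustrative assumptions.

```python
import random

def estimate_spread(graph, seed_set, num_samples=1000, rng=random.Random(0)):
    """Monte Carlo estimate of the influence spread I(S) under the IC model,
    using the live-edge interpretation: sample a graph g ~ G by keeping each
    edge (u, v) with probability p(u, v), then count nodes reachable from S.

    graph: dict u -> list of (v, p) outgoing edges with propagation probabilities.
    """
    total = 0
    for _ in range(num_samples):
        # BFS on the sampled live-edge graph: each edge is kept with its
        # probability the first time it is examined, which is equivalent
        # to sampling g up front.
        active = set(seed_set)
        frontier = list(seed_set)
        while frontier:
            u = frontier.pop()
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / num_samples

# Tiny example network; the expected spread of seed {0} is 2.75.
G = {0: [(1, 0.5), (2, 0.5)], 1: [(3, 1.0)], 2: [(3, 1.0)], 3: []}
print(estimate_spread(G, {0}, num_samples=5000))
```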
2.3 Matroid
We introduce some notions about matroids which are useful for designing our algorithm. Given a ground set $V$ and $\mathcal{M} \subseteq 2^V$, we call the system $(V, \mathcal{M})$ a matroid if it satisfies:
1) Downward closed property: if $S \subseteq T$ and $T \in \mathcal{M}$, then $S \in \mathcal{M}$.
2) Augmentation property: if $S, T \in \mathcal{M}$ and $|S| < |T|$, then there exists $e \in T \setminus S$ such that $S \cup \{e\} \in \mathcal{M}$.
The rank of a matroid, $rank(\mathcal{M})$, is the maximum size of a set $S \in \mathcal{M}$.
3 Proposed Algorithm
From the analysis and discussion in Sect. 2, one can use $\hat{I}(S)$ to effectively estimate $I(S)$ if the number of samples $|\mathcal{R}|$ is sufficiently large. For a set of RR sets $\mathcal{R}$, we need to find a set of nodes $S$ that satisfies the fairness allocation constraint and has a high objective function value. Thus, we investigate the Maximum Cover with Fairness Allocation Constraint (MCFA) problem, defined as follows:
Definition 3 (MCFA). Given a graph $G = (V, E)$ under the IC model, a set of RR sets $\mathcal{R} = \{R_1, R_2, \ldots, R_T\}$ generated from $G$, and a collection $\mathcal{C} = \{C_1, \cdots, C_K\}$ of $K$ disjoint subsets of $V$, $C_i \subseteq V$, $C_i \cap C_j = \emptyset$. Given a total budget $k$ and a positive constant $\alpha \in (0, 1)$, MCFA asks to find a set $S$ satisfying $|S| \le k$ and $|S \cap C_i| \le \alpha k$ so that $\hat{I}_{\mathcal{R}}(S) = \dfrac{n\sum_{R_i \in \mathcal{R}} \min\{|S \cap R_i|, 1\}}{|\mathcal{R}|}$ is maximized.
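A small sketch of the estimator $\hat{I}_{\mathcal{R}}(S)$ used in Definition 3 is shown below, assuming RR sets are represented as Python sets of node ids.

```python
def coverage_estimate(rr_sets, S, n):
    """RIS estimate of the influence spread: n times the fraction of random
    reverse-reachable (RR) sets intersected by the candidate seed set S."""
    S = set(S)
    covered = sum(1 for R in rr_sets if S & R)   # sum of min{|S ∩ R_i|, 1}
    return n * covered / len(rr_sets)

# Example: 4 RR sets over a 6-node graph; S = {0} hits two of them.
rr_sets = [{0, 2}, {1}, {0, 3, 4}, {5}]
print(coverage_estimate(rr_sets, {0}, n=6))      # 6 * 2/4 = 3.0
```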
$\hat{I}_{\mathcal{R}}(S)$ is monotone and submodular [17, 26]. We show that MCFA is a form of the well-known problem of maximizing a monotone submodular function under a matroid constraint. One can adapt the Greedy algorithm with a ratio of $1/2$ for MCFA [10]. However, this algorithm takes $O(|\mathcal{R}|k)$ time and may be impractical for medium-sized social networks. Therefore, in this work we propose a Threshold Greedy (ThGreedy) algorithm that reduces the time complexity to $O(|\mathcal{R}|\log k)$ while returning an approximation ratio of $1/2 - \epsilon$ for the MCFA problem, where $\epsilon \in (0, 1/2)$ is an arbitrarily small constant.
Our algorithm adapts the idea of the threshold greedy method [1] to select good elements without violating the matroid constraint in each iteration. Notably, it first initializes a solution $S$ as empty and finds $M$, the best estimation value $\hat{I}(\cdot)$ of a single element. The algorithm primarily operates in the main loop (Lines 2-11), with at most $\log_{1/(1-\epsilon)}(k/\epsilon) + 1$ iterations. In iteration $i$, it considers all elements in $V \setminus S$ and adds those that do not violate the allocation constraint (Line 4) and whose marginal gain is larger than or equal to $\theta$, where $\theta = (1 - \epsilon)^i M$ (Line 5). At the end of each iteration (outer loop), the threshold $\theta$ is decreased by a factor of $1 - \epsilon$ (Line 7). Finally, the algorithm terminates after $\log_{1/(1-\epsilon)}(k/\epsilon) + 1$ iterations and returns the solution $S$. The pseudocode of ThGreedy is depicted in Algorithm 1.
Algorithm 1: ThGreedy($\mathcal{R}, \mathcal{C}, \epsilon$)
Input: A set of RR sets $\mathcal{R}$, a set of groups $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$, $\alpha \in (0, 1)$, budget $k$, $\epsilon > 0$
Output: A set $S$
1: $S \leftarrow \emptyset$, $M \leftarrow \max_{e \in V} \hat{I}_{\mathcal{R}}(\{e\})$, $\theta \leftarrow M$
2: while $\theta \ge \epsilon M / k$ do
3:   foreach $e \in V$ do
4:     if $|S \cap C(e)| < \alpha k$ and $|S| \le k$ then
5:       if $\hat{I}_{\mathcal{R}}(S \cup \{e\}) - \hat{I}_{\mathcal{R}}(S) \ge \theta$ then
6:         $S \leftarrow S \cup \{e\}$
7:   $\theta \leftarrow (1 - \epsilon)\theta$
8: return $S$
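The sketch below mirrors the structure of Algorithm 1 on a coverage objective; the data structures (RR sets as Python sets, a node-to-group mapping) are assumptions for illustration, and no lazy-evaluation speedups are included.

```python
def threshold_greedy(rr_sets, groups, alpha, k, eps, n):
    """Threshold greedy in the spirit of Algorithm 1 (ThGreedy) for MCFA.

    rr_sets : list of RR sets, each a set of node ids
    groups  : dict node -> group id, i.e. the group C(e) of each node
    Returns a seed set S with |S| <= k and at most alpha*k nodes per group.
    """
    def f(S):
        # Coverage-based estimate I_hat_R(S) from Definition 3.
        S = set(S)
        return n * sum(1 for R in rr_sets if S & R) / len(rr_sets)

    S, fS, group_count = set(), 0.0, {}
    M = max(f({e}) for e in groups)          # best single-element value
    if M <= 0:
        return S
    theta = M
    while theta >= eps * M / k:              # threshold loop
        for e in groups:
            if e in S:
                continue
            g = groups[e]
            if group_count.get(g, 0) < alpha * k and len(S) < k:
                gain = f(S | {e}) - fS
                if gain >= theta:            # keep elements clearing the threshold
                    S.add(e)
                    fS += gain
                    group_count[g] = group_count.get(g, 0) + 1
        theta *= (1.0 - eps)                 # decrease threshold by factor 1 - eps
    return S


# Example: 5 nodes in 2 groups; alpha*k = 1 allows one node per group.
rr = [{0, 1}, {0, 2}, {3}, {3, 4}, {2, 4}]
grp = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}
print(threshold_greedy(rr, grp, alpha=0.5, k=2, eps=0.2, n=5))
```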
The number of iterations of the outer loop is bounded by
$\log_{\frac{1}{1-\epsilon}}(\frac{k}{\epsilon}) + 1 \le \log_{1+\epsilon}(\frac{k}{\epsilon}) + 1 = \dfrac{\log(\frac{k}{\epsilon})}{\log(1+\epsilon)} + 1 \le \dfrac{\log(\frac{k}{\epsilon})}{\epsilon/2} + 1.$ (12)
Each iteration takes at most $O(|\mathcal{R}|)$ time, so the time complexity of the algorithm is $\left(\frac{\log(k/\epsilon)}{\epsilon/2} + 1\right)O(|\mathcal{R}|) + O(n) = O(\frac{|\mathcal{R}|}{\epsilon}\log(\frac{k}{\epsilon}))$. Now, assume that $S = \{s_1, s_2, \ldots, s_k\}$ is $S$ after ending the main loop (if $|S| < k$, we pad $S$ with empty elements). Denote by $s_i$ the $i$-th element added into $S$, by $S_i = \{s_1, s_2, \ldots, s_i\}$, and by $\theta(s_i)$ the value of $\theta$ at the iteration in which $s_i$ is added to $S$. Let $O = \{o_1, o_2, \ldots, o_k\}$ be ordered such that $\{s_1, s_2, \ldots, s_{i-1}, o_i\} \in \mathcal{M}$ for all $i \le k - 1$, which exists by the augmentation property of matroids. We first prove that
$f(s_i \mid S_{i-1}) \ge (1 - \epsilon) f(o_i \mid S_{i-1}).$ (13)
If $s_i$ is added into $S$ at the first iteration, we have $f(s_i \mid S_{i-1}) = f(e_{max}) \ge (1 - \epsilon)f(o_i \mid S_{i-1})$. Now suppose $s_i$ is added into $S$ at an iteration $t \ge 2$. If $o_i \in S_i$, (13) holds. If
$o_i \notin S_i$, then $o_i$ was not added into $S$ in previous iterations, so we have:
$f(s_i \mid S_{i-1}) \ge \theta(s_i) = (1 - \epsilon)\dfrac{\theta(s_i)}{1 - \epsilon} \ge (1 - \epsilon)f(o_i \mid S_{i-1}).$ (14)
Therefore inequality (13) holds. By the submodularity of $f$ and the selection rule of Algorithm 1, we have:
$f(S) = \sum_{i=1}^{k} f(s_i \mid \{s_1, s_2, \ldots, s_{i-1}\}) = \sum_{i=1}^{k} f(s_i \mid S_{i-1})$ (15)
$\ge \sum_{i=1}^{k} (1 - \epsilon) f(o_i \mid S_{i-1})$ (due to (13)) (16)
$\ge \sum_{i=1}^{k} (1 - \epsilon) f(o_i \mid S \cup \{o_1, \ldots, o_{i-1}\})$ (due to the submodularity of $f$) (17)
$\ge (1 - \epsilon) f(O \mid S) \ge (1 - \epsilon)(f(O) - f(S)),$ (18)
which implies $f(S) \ge \frac{1-\epsilon}{2-\epsilon} f(O) \ge (1/2 - \epsilon) f(O)$.
We now present the Sampling Approximation Algorithm for IMFA (SIMF), a $(1/2 - \epsilon)$-approximation algorithm for IMFA. This algorithm combines the solution of Algorithm 1 with the sampling framework of [17, 26] for the IM problem.
The SIMF algorithm receives an instance of the IMFA problem as input: a graph $G = (V, E)$, groups $\mathcal{C} = \{C_1, \ldots, C_K\}$, a budget $k$, a fairness ratio $\alpha$, and accuracy parameters $\epsilon, \delta \in (0, 1/2)$. The algorithm initializes two collections of RR sets, $\mathcal{R}_1$ and $\mathcal{R}_2$ (Line 2). It then works in at most $i_{max} \leftarrow \log_2(N_{max}/N_1)$ iterations, where $N_{max}$ is given in Lemma 6. At each iteration $i$, it finds a candidate solution $S$ over $\mathcal{R}_1$ by calling Algorithm 1 and then checks the solution-quality condition in Line 5. If the condition is true, the algorithm immediately returns the solution $S$. If not, it doubles the sizes of $\mathcal{R}_1$ and $\mathcal{R}_2$ simultaneously (Line 7) and moves to the next iteration. The pseudocode of SIMF is depicted in Algorithm 2.
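A structural sketch of the doubling loop of Algorithm 2 is given below, under several assumptions: the RR-set sampler and the inner solver are passed in as placeholder callbacks, $\delta$ is split as $\delta/3$ as in the proof of Theorem 2, and the bound formulas follow the statements of Lemmas 4 and 5 below.

```python
import math

def lower_bound(cov, num_rr, n, delta):
    """Lower bound on I(S) in the spirit of Lemma 4; cov is the number of RR
    sets in R1 that S intersects."""
    a = math.log(1.0 / delta)
    val = (math.sqrt(cov + 2 * a / 9) - math.sqrt(a / 2)) ** 2 - a / 18
    return val * n / num_rr

def upper_bound(cov, num_rr, n, delta, eps):
    """Upper bound on I(O) in the spirit of Lemma 5; cov is the coverage of
    the candidate solution S on the independent collection R2."""
    a = math.log(1.0 / delta)
    val = (math.sqrt(cov / (0.5 - eps) + a / 2) + math.sqrt(a / 2)) ** 2
    return val * n / num_rr

def simf(generate_rr, solve, n, eps, delta, n1, i_max):
    """Skeleton of the SIMF doubling loop (Algorithm 2).

    generate_rr(count) -> list of RR sets (placeholder for the sampler of [27])
    solve(rr_sets)     -> candidate seed set (placeholder for Algorithm 1)
    i_max should be chosen as log2(N_max / N_1) as in the text.
    """
    delta1 = delta / 3.0
    R1, R2 = generate_rr(n1), generate_rr(n1)
    S = set()
    for _ in range(i_max):
        S = solve(R1)
        cov1 = sum(1 for R in R1 if S & R)
        cov2 = sum(1 for R in R2 if S & R)
        lb = lower_bound(cov1, len(R1), n, delta1)
        ub = upper_bound(cov2, len(R2), n, delta1, eps)
        if ub > 0 and lb / ub >= 0.5 - eps:   # solution-quality check (Line 5)
            return S
        R1 += generate_rr(len(R1))            # double both collections (Line 7)
        R2 += generate_rr(len(R2))
    return S
```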
For theoretical analysis, we first recap Lemma 4.2 in [26] to give a lower bound on $I(S)$ for any set $S$:
Lemma 4 (Lemma 4.2 in [26]). For any $\delta \in (0, 1)$, a set of RR sets $\mathcal{R}$, and a set $S$, we have $\Pr[I(S) \ge I^l_{\mathcal{R}}(S)] \ge 1 - \delta$, where $a = \ln(1/\delta)$ and
$I^l_{\mathcal{R}}(S) = \left(\left(\sqrt{\Lambda_{\mathcal{R}}(S) + \frac{2a}{9}} - \sqrt{\frac{a}{2}}\right)^2 - \frac{a}{18}\right)\cdot\dfrac{n}{|\mathcal{R}|}.$
Lemma 5. For any $\delta \in (0, 1)$ and a set of RR sets $\mathcal{R}$, assume that $S$ is returned by ThGreedy with $\mathcal{R}$ in Line 5 of Algorithm 2; then $\Pr[I(O) \le I^u_{\mathcal{R}}(O)] \ge 1 - \delta$, where $a = \ln(1/\delta)$ and
$I^u_{\mathcal{R}}(O) = \left(\sqrt{\dfrac{\Lambda_{\mathcal{R}}(S)}{1/2 - \epsilon} + \dfrac{a}{2}} + \sqrt{\dfrac{a}{2}}\right)^2\cdot\dfrac{n}{|\mathcal{R}|}.$
By adapting reasoning similar to Lemma 6.1 in [26], we establish the number of RR sets required to obtain the performance guarantee as follows:
Lemma 6. For any $\epsilon, \delta \in (0, 1)$ and a set of RR sets $\mathcal{R}$, assume that $S$ is returned by ThGreedy with $\mathcal{R}$ in Line 5 of Algorithm 2. If
$|\mathcal{R}| \ge N_{max} = \dfrac{2n\left((1/2 - \epsilon)\sqrt{\ln(\frac{2}{\delta})} + \sqrt{(1/2 - \epsilon)\left(\ln\binom{n}{k} + \ln(\frac{2}{\delta})\right)}\right)^2}{\epsilon^2 k}$ (23)
then $S$ is a $(1/2 - \epsilon)$-approximation solution with probability at least $1 - \delta$.
Finally, we give the performance guarantee of the SIMF algorithm in Theorem 2.
Theorem 2. Algorithm 2 returns a solution $S$ satisfying $I(S) \ge (1/2 - \epsilon)\, I(O)$ with probability at least $1 - \delta$.
Proof. If the algorithm terminates at iteration $i = i_{max}$, the number of RR sets is $N_{max}$. By Lemma 6, the algorithm returns an approximation ratio of at least $1/2 - \epsilon$ with probability at least $1 - \delta_1$.
If the algorithm terminates at some iteration $i < i_{max}$, i.e., it meets the condition in Line 6, then by Lemma 4 and Lemma 5, with probability at least $1 - \delta_1$, we have
$I(S) \ge I^l_{\mathcal{R}_1}(S), \qquad I(O) \le I^u_{\mathcal{R}_2}(O).$ (24)
Therefore, with probability at least $1 - 2\delta_1$, we have
$\dfrac{I(S)}{I(O)} \ge \dfrac{I^l_{\mathcal{R}_1}(S)}{I^u_{\mathcal{R}_2}(O)} \ge \dfrac{1}{2} - \epsilon.$ (25)
By the union bound, the probability that the algorithm fails is at most $\delta_1 + i_{max}\cdot\delta_1 = \delta/3 + 2\delta/3 = \delta$. The running time of the algorithm has two components: the time to generate RR sets and the time to find the solution. The algorithm needs at most $N_{max}$ RR sets. By previous work [27], the time to generate an RR sample is at most $O(\frac{m}{n} I(\{v\}))$, where $v$ is a node selected randomly from those in $G$ with probability proportional to its in-degree. The time to generate $N_{max}$ RR sets is bounded by
$O\left(N_{max}\dfrac{m}{n} I(\{v\})\right) = O\left(\dfrac{n}{\epsilon^2}\left(k\log n + \log\left(\dfrac{1}{\delta}\right)\right)m\right)$ (26)
On the other hand, by Theorem 1, the time complexity to find the solution is at most
$O\left(\dfrac{N_{max}}{\epsilon}\log\left(\dfrac{k}{\epsilon}\right)\right) = O\left(\dfrac{n}{\epsilon^2}\left(k\log n + \log\left(\dfrac{1}{\delta}\right)\right)\log\left(\dfrac{k}{\epsilon}\right)\right)$ (27)
Therefore, the time complexity of the algorithm is at most
$O\left(\left(m + \log\left(\dfrac{k}{\epsilon}\right)\right)\dfrac{n}{\epsilon^2}\left(k\log n + \log\left(\dfrac{1}{\delta}\right)\right)\right),$ (28)
which completes the proof.
4 Experiment Evaluation
4.1 Settings
In this section, we conduct experiments showing the performance of our proposed SIMF algorithm compared with OPIM, the state-of-the-art algorithm for the Influence Maximization (IM) problem [26], on two major metrics: solution quality (the influence spread) and running time. OPIM is one of the best solutions for the IM problem, which finds a seed set $S$ with $|S| \le k$ so that $I(S)$ is maximized. Since a solution of the IM problem may not be a feasible solution to the IMFA problem, we adapt OPIM to the IMFA problem by following these steps.
– For each group $C_i$, we run OPIM to find a solution $S_i \subseteq C_i$ with $|S_i| \le \alpha k$.
– After finding the solutions $S_i$, we check the overall solution $S = \bigcup_{i=1}^{K} S_i$. If there exists a group $i$ that violates the group budget constraint, i.e., $|S \cap C_i| > \alpha k$, we remove some nodes from $S$ so that $|S \cap C_i| \le \alpha k$.
Table 2. Datasets
Parameters Setting. All experiments are under the IC model with edge probabilities set to $p(u, v) = 1/|N_{in}(v)|$. This weight setting is adopted from prior works [13, 17, 26, 27]. We vary $k \in \{100, 200, 300, 400, 500\}$ for each dataset, and set $K = 10$ (the number of groups) and $\alpha = 0.1$. We also set the parameters $\epsilon = 0.1$ and $\delta = 0.1$ according to previous works [17, 26].
Fig. 1. The influence spread of algorithms: (a) Facebook, (b) Google Plus.
Fig. 2. The running time of algorithms: (a) Facebook, (b) Google Plus.
4.2 Results
Figures 1 and 2 display the performance of the compared algorithms on the four datasets for the IMFA problem. Figure 1 shows the influence spread of the algorithms. It can be observed that the lines of SIMF and OPIM are almost identical. Although OPIM gives a better influence spread on Ego-Facebook, the gap is insignificant (below 5%). Our algorithm gives a better influence
spread on the Google-Plus network. This result may be because our algorithm does not always select a full set of $k$ nodes in the seed set.
Figure 2 shows that our SIMF significantly outperforms OPIM in terms of running time. Specifically, SIMF runs 1.12 to 4.61 times faster than OPIM. The above results show that our algorithm provides comparable solution quality while requiring the least running time.
5 Conclusions
In this paper, we investigated a novel IMFA problem that asks to find a seed set in a social network that maximizes the influence spread subject to a group fairness allocation constraint. We proposed a $(1/2 - \epsilon)$-approximation algorithm with near-linear time complexity for the problem. We further investigated the practical performance of our algorithm compared to the state-of-the-art algorithm for the Influence Maximization problem. The results show our superiority in terms of running time. In the future, we will improve the quality of our algorithm's solutions in theory and practice.
References
1. Badanidiyuru, A., Vondrák, J.: Fast algorithms for maximizing submodular functions. In:
Proceedings of the 2014 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA),
pp. 1497–1514 (2014)
2. Borgs, C., Brautbar, M., Chayes, J.T., Lucier, B.: Maximizing social influence in nearly opti-
mal time. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete
Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pp. 946–957 (2014)
3. Borodin, A., Filmus, Y., Oren, J.: Threshold models for competitive influence in social net-
works. In: Saberi, A. (ed.) WINE 2010. LNCS, vol. 6484, pp. 539–550. Springer, Heidelberg
(2010). https://doi.org/10.1007/978-3-642-17572-5_48
4. Chen, N.: On the approximability of influence in social networks. SIAM J. Discret. Math.
23(3), 1400–1415 (2009)
5. Chen, W., Lakshmanan, L., Castillo, C.: Information and Influence Propagation in Social
Networks. Morgan & Claypool Publishers, Synthesis Lectures on Data Management (2013)
6. Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral market-
ing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pp. 1029–1038. KDD ’10, Associa-
tion for Computing Machinery, New York, NY, USA (2010)
7. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under the
linear threshold model. In: ICDM 2010, Proceedings of the 10th IEEE International Confer-
ence on Data Mining, Sydney, Australia, 14-17 December 2010, pp. 88–97 (2010)
8. Ene, A., Nguyen, H.L.: Streaming algorithm for monotone k-submodular maximization with
cardinality constraints. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G.,
Sabato, S. (eds.) International Conference on Machine Learning, ICML 2022, 17-23 July
2022, Baltimore, Maryland, USA. Proceedings of Machine Learning Research, vol. 162, pp.
5944–5967. PMLR (2022)
9. Farnadi, G., Babaki, B., Gendreau, M.: A unifying framework for fairness-aware influence
maximization. In: Companion Proceedings of the Web Conference, Taipei, Taiwan, April
20-24, 2020, pp. 714–722 (2020)
10. Fisher, M.L., Nemhauser, G.L., Wolsey, L.A.: An analysis of approximations for maximizing
submodular set functions—ii. In: Polyhedral Combinatorics: Dedicated to the memory of
D.R. Fulkerson, pp. 73–87. Springer Berlin Heidelberg (1978)
11. Goyal, A., Bonchi, F., Lakshmanan, L., Venkatasubramanian, S.: On minimizing budget and
time in influence propagation over social networks. Soc. Netw. Anal. Min. 3(2), 179–192
(2013)
12. Goyal, A., Lu, W., Lakshmanan, L.V.S.: SIMPATH: an efficient algorithm for influence max-
imization under the linear threshold model. In: Cook, D.J., Pei, J., Wang, W., Zaïane, O.R.,
Wu, X. (eds.) Proceedings of the 11th IEEE International Conference on Data Mining, ICDM
2011, Vancouver, BC, Canada, December 11-14, 2011, pp. 211–220. IEEE Computer Soci-
ety (2011)
13. Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence through a social
network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, Washington, DC, USA, August 24 - 27, 2003, pp. 137–146
(2003)
14. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J.M., Glance, N.S.: Cost-
effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, San Jose, California, USA,
August 12-15, 2007, pp. 420–429 (2007)
15. Li, Y., Zhang, D., Tan, K.: Real-time targeted influence maximization for online advertise-
ments. Proc. VLDB Endow. 8(10), 1070–1081 (2015)
16. Nguyen, H.T., Cano, A., Tam, V., Dinh, T.N.: Blocking self-avoiding walks stops cyber-
epidemics: a scalable gpu-based approach. IEEE Trans. Knowl. Data Eng. 32(7), 1263–1275
(2020)
17. Nguyen, H.T., Thai, M.T., Dinh, T.N.: Stop-and-stare: Optimal sampling algorithms for viral
marketing in billion-scale networks. In: Proceedings of the 2016 International Conference
on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 -
July 01, 2016, pp. 695–710 (2016)
18. Nguyen, H.T., Thai, M.T., Dinh, T.N.: A billion-scale approximation algorithm for maximiz-
ing benefit in viral marketing. IEEE/ACM Trans. Network. 25(4), 2419–2429 (2017)
19. Nguyen, L.N., Zhou, K., Thai, M.T.: Influence maximization at community level: A new
challenge with non-submodularity. In: Proceedings of the 39th IEEE International Confer-
ence on Distributed Computing Systems, ICDCS 2019, Dallas, TX, USA, July 7-10, 2019,
pp. 327–337 (2019)
20. Pham, C.V., Duong, H.V., Thai, M.T.: Importance sample-based approximation algorithm for
cost-aware targeted viral marketing. In: Tagarelli, A., Tong, H. (eds.) CSoNet 2019. LNCS,
vol. 11917, pp. 120–132. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34980-
6_14
21. Pham, C.V., Pham, D.V., Bui, B.Q., Nguyen, A.V.: Minimum budget for misinformation
detection in online social networks with provable guarantees. Optimiz. Lett. 16(2), 515–544
(2022)
22. Pham, C.V., Phu, Q.V., Hoang, H.X., Pei, J., Thai, M.T.: Minimum budget for misinformation
blocking in online social networks. J. Comb. Optim. 38(4), 1101–1127 (2019). https://doi.
org/10.1007/s10878-019-00439-5
23. Pham, P., Pham, C.V., Duong, H.V., Snásel, V., Nguyen, T.T.: Minimizing cost for influencing
target groups in social network: a model and algorithmic approach. Comput. Commun. 212,
182–197 (2023)
24. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics
and visualization. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-Ninth AAAI
Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, pp. 4292–
4293. AAAI Press (2015)
25. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: Gemsec: Graph embedding with self
clustering. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining 2019, pp. 65–72. ACM (2019)
26. Tang, J., Tang, X., Xiao, X., Yuan, J.: Online processing algorithms for influence maxi-
mization. In: Proceedings of the 2018 International Conference on Management of Data,
pp. 991–1005. SIGMOD ’18, Association for Computing Machinery, New York, NY, USA
(2018)
27. Tang, Y., Shi, Y., Xiao, X.: Influence maximization in near-linear time: a martingale app-
roach. In: Proceedings of the 2015 ACM SIGMOD International Conference on Manage-
ment of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pp. 1539–1554 (2015)
28. Tang, Y., Xiao, X., Shi, Y.: Influence maximization: near-optimal time complexity meets
practical efficiency. In: Proceedings of the International Conference on Management of Data,
SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 75–86 (2014)
29. Tsang, A., Wilder, B., Rice, E., Tambe, M., Zick, Y.: Group-fairness in influence maximiza-
tion. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intel-
ligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 5997–6005 (2019)
30. Ye, M., Liu, X., Lee, W.: Exploring social influence for recommendation: a generative model
approach. In: Proceedings of the 35th International ACM SIGIR conference on research and
development in Information Retrieval, SIGIR ’12, Portland, OR, USA, August 12-16, 2012,
pp. 671–680 (2012)
31. Zhang, H., Kuhnle, A., Zhang, H., Thai, M.T.: Detecting misinformation in online social
networks before it is too late. In: Proceedings of the IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining, ASONAM, San Francisco, CA, USA,
August 18-21, 2016, pp. 541–548 (2016)
32. Zhang, X., Zhu, J., Wang, Q., Zhao, H.: Identifying influential nodes in complex networks
with community structure. Knowl.-Based Syst. 42, 74–84 (2013)
33. Zhu, J., Ghosh, S., Wu, W.: Group influence maximization problem in social networks. IEEE
Trans. Comput. Social Syst. 6(6), 1156–1164 (2019)
34. Zhu, J., Ghosh, S., Wu, W., Gao, C.: Profit maximization under group influence model in
social networks. In: Tagarelli, A., Tong, H. (eds.) CSoNet 2019. LNCS, vol. 11917, pp. 108–
119. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34980-6_13
Exemplar-Embed Complex Matrix
Factorization with Elastic-Net Penalty:
An Advanced Approach for Data Representation
Abstract. This paper presents an advanced method for complex matrix factoriza-
tion, termed exemplar-embed complex matrix factorization with elastic net penalty
(ENEE-CMF). The proposed ENEE-CMF integrates both L1 and L2 regulariza-
tions on the encoding matrix to enhance the sparsity and effectiveness of the pro-
jection matrix. Utilizing Wirtinger's calculus for differentiating real-valued functions of complex variables, ENEE-CMF efficiently addresses complex optimization challenges through gradient descent, enabling more precise adjustments during factorization. Experimental evaluations on the facial expression recognition task demonstrate that
ENEE-CMF significantly outperforms traditional non-negative matrix factoriza-
tion (NMF) and similar complex matrix factorization (CMF) models, achieving
superior recognition accuracy. These findings highlight the benefits of incorpo-
rating elastic net regularization into complex matrix factorization for handling
challenging recognition tasks.
1 Introduction
Data representation is crucial in face imaging tasks such as facial expression recogni-
tion (FER), as the quality and effectiveness of the representation significantly influence
the system’s ability to accurately interpret facial features. The performance of an FER
system can be affected by various factors, including age, ethnicity, gender, facial hair,
makeup, gestures, occlusions, and lighting conditions [1]. An effective representation
must preserve key characteristics of facial expressions to facilitate accurate recognition
and classification of emotions. Developing a robust FER system remains challenging,
with feature extraction playing a pivotal role. Recent advancements have focused on
subspace projection techniques for appearance-based features, utilizing matrix factor-
ization in both real and complex domains to create a new feature matrix that effectively
maps data into a lower-dimensional subspace.
In the real domain, techniques such as Principal Component Analysis (PCA) [2], Lin-
ear Discriminant Analysis (LDA) [3, 10], and Nonnegative Matrix Factorization (NMF)
[4, 5] have been widely utilized to represent facial images as linear combinations of low-
rank basis images. Lee and Seung [4, 5] found that NMF exhibits superior performance in
parts-based representation. To further refine this method, Hoyer [6] introduced a sparsity
function that enhances decompositions by incorporating sparsity into NMF. Addition-
ally, Yuan and Oja [7] extended the technique by developing Projective NMF (PNMF),
which is specifically designed to capture localized features. Over the years, numerous
extensions of NMF have been developed to advance the field of FER. For example, Niki-
tidis et al. [8] introduced NMF variants that incorporate discriminant criteria, such as
Clustering-Based Discriminant Analysis (CDA) [9] and Linear Discriminant Analysis
(LDA) [10]. Lee and Chellappa [11] proposed the integration of sparsity constraints to
derive localized dictionaries from dense motion flow image sequences. These advance-
ments emphasize the critical role of regularization in most NMF frameworks to optimize
FER performance. Nevertheless, traditional NMF methods are inherently limited by their
strict requirement for nonnegative entries in the data matrix, which significantly restricts
their applicability. To circumvent this limitation, Semi-NMF and Convex NMF (Con-
NMF) algorithms have been introduced [12]. Notably, the Con-NMF algorithm ensures
that basis vectors are convex or linear combinations of data points with mixed signs.
Motivated by the work of Liwicki et al. [13], which demonstrated that the squared
Frobenius norm in the complex domain is equivalent to a robust dissimilarity measure in
the real domain, Duong et al. [14, 15] transformed real data into the complex domain for
complex matrix factorization and proposed unsupervised learning algorithms to enhance
FER models. By utilizing Wirtinger's calculus and the gradient descent method [18, 19], these methods effectively tackle complex optimization challenges, offering significantly enhanced recognition accuracy compared to traditional matrix factorization techniques.
In this work, we present a novel algorithm for complex matrix factorization incor-
porating an elastic net penalty, known as ENEE-CMF. This method advances image
representation techniques in the complex domain. The key contributions of this study
are as follows:
• The introduction of the ENEE-CMF method for image analysis within the complex
domain.
• Derivation of updating rules for ENEE-CMF using gradient descent techniques
specifically designed for the complex domain.
• A thorough experimental evaluation of facial expression recognition, revealing that
the proposed ENEE-CMF method, enhanced with an elastic net penalty, outperforms
both standard and extended NMF and CMF techniques.
The structure of this paper is organized as follows. Section 2 provides an overview of key
topics, including NMF and CMF techniques relevant to our model. Section 3 describes
the elastic net-constrained ENEE-CMF framework, with a focus on applying gradient
descent to solve the constrained optimization problem in the complex domain. Section 4
presents a comparison of experimental results obtained from ENEE-CMF with those
from various NMF and EE-CMF methods. Section 5 demonstrates the generalizability
of the proposed methods in comparison with standard NMF and EE-CMF approaches.
2 Preliminaries
2.1 Nonnegative Matrix Factorization
Consider an $N \times M$ input data matrix $S = (s_1, s_2, \ldots, s_M)$, where $M$ denotes the number of facial images and each column $s_m$ represents an image of size $p$ by $q$ ($N = p \times q$). The goal of the NMF problem is to identify matrices $W \in \mathbb{R}_+^{N \times K}$ and $V \in \mathbb{R}_+^{K \times M}$ that minimize the following objective function:
$\min_{W \ge 0, V \ge 0} \; \mathcal{F}_{NMF}(W, V) = \frac{1}{2}\|S - WV\|_F^2$ (1)
Here, the matrix W consists of K basic vectors, which are combined linearly using the
coefficients in V to approximate the data matrix. Lee and Seung [4, 5] proposed iterative
algorithms to solve this problem, updating V and W as follows:
$V_{ij} \leftarrow V_{ij}\,\dfrac{(W^T S)_{ij}}{(W^T W V)_{ij}}$ (2)
$W_{ij} \leftarrow W_{ij}\,\dfrac{(S V^T)_{ij}}{(W V V^T)_{ij}}$ (3)
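A minimal NumPy sketch of the multiplicative updates in Eqs. (2)-(3) follows; the random initialization, iteration count, and the small constant added to the denominators are illustrative choices, not part of the original algorithm.

```python
import numpy as np

def nmf_multiplicative(S, K, iters=200, seed=0, eps=1e-9):
    """Multiplicative update rules of Lee and Seung (Eqs. (2)-(3)) for
    S ≈ W V with non-negative factors; eps avoids division by zero."""
    rng = np.random.default_rng(seed)
    N, M = S.shape
    W = rng.random((N, K))
    V = rng.random((K, M))
    for _ in range(iters):
        V *= (W.T @ S) / (W.T @ W @ V + eps)     # Eq. (2)
        W *= (S @ V.T) / (W @ V @ V.T + eps)     # Eq. (3)
    return W, V

# Example on a small random non-negative matrix.
S = np.random.default_rng(1).random((20, 12))
W, V = nmf_multiplicative(S, K=4)
print(round(float(np.linalg.norm(S - W @ V)), 3))   # residual of the approximation
```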
To address the limitation of the non-negativity constraint on the data, Ding et al. [12] developed Convex Nonnegative Matrix Factorization (Con-NMF), which accommodates mixed-sign data matrices. Con-NMF imposes the constraint that the basis vectors must reside within the column space of $S$, i.e., $W = S\Pi$, where $\Pi$ is an auxiliary adaptive weight matrix. This leads to the following modified objective function [17]:
$\min \; \mathcal{F}_{conNMF}(\Pi, V) = \frac{1}{2}\|S - S\Pi V^T\|_F^2 \quad \text{s.t.} \quad \Pi \in \mathbb{R}_+^{M \times K}, \; V \in \mathbb{R}_+^{M \times K}$ (4)
where $S^T S = (S^T S)^+ - (S^T S)^-$, $(x_{ij})^+ = \max\{0, x_{ij}\}$, and $(x_{ij})^- = \max\{0, -x_{ij}\}$.
Sparse NMF. Sparse representation has recently gained significant attention in relation
to the NMF problem due to its effective classification performance and robust properties
[6, 25, 26]. Given a non-negative matrix M ∈ RN ×P , NMF seeks to find two non-negative
matrices W and V such that: M ≈ WV, here W ∈ RN ×K and V ∈ RK×P . In sparse
NMF, the goal is to obtain a sparse projection matrix for explicit data representation and
effective interpretation. Generally, sparse dimensionality reduction NMF is expressed
compactly as [26]:
$\min_{W \ge 0, V \ge 0} \; \|M - WV\|_F^2 + p(V)$
where $p(V)$ is a penalty term that enforces sparsity in the learning process.
This sparse term is designed to restrict the number of nonzero elements in each
column of the projection matrix. In such scenarios, the L1 -norm is often used as a
relaxation of the L0 penalty [25, 26]. While the L1 -norm (lasso) is convex, it is not
differentiable, making it challenging to find solutions for the lasso-regularized model. To
address this issue, W. Liu et al. [25] imposed an L2 -norm on the lasso-penalized problem,
overcoming the limitations of the L1 -norm while preserving its beneficial properties.
The elastic net penalty, which combines the L1 -norm and the L2 -norm of the projection
matrix, is convex with respect to the projection matrix and thus provides the grouping
effect property [25].
3 Proposed Method
3.1 Exemplar-Embedded Complex Matrix Factorization with the Elastic Net
Penalty (ENEE-CMF)
This section proposes a new model based on EE-CMF by incorporating a complex elastic net constraint on the coefficient matrix.
As previously discussed, the EE-CMF model aims to factorize a complex data matrix
T ∈ CN ×M into the form T = BE, where B is given by B = TW. The proposed
model introduces constraints on both the L1 -norm and the L2 -norm of the projection
matrix, incorporating these as regularization terms in the objective function to enhance
performance and stability. This leads to the comprehensive definition of our model,
Exemplar-Embedded Complex Matrix Factorization with Elastic Net Penalty (ENEE-
CMF), as follows:
$\min_{W,E} f_{ENEE\text{-}CMF}(W, E) = \min_{W,E}\left[\frac{1}{2}\|T - TWE\|_F^2 + \omega_1\|E\|_1 + \omega_2\|E\|_2^2\right]$ (12)
where $\|E\|_1 = \sum_{j=1}^{M}\|E_{:j}\|_1 = \sum_{j=1}^{M}\sum_{i=1}^{K}|E_{ij}|$, $\;\|E\|_2^2 = \sum_{j=1}^{M}\sum_{i=1}^{K}|E_{ij}|^2$ (13)
and $\omega_1, \omega_2$ are regularization parameters that balance the trade-off between the approximation error and the sparsity constraints.
To solve the minimization problem, we also apply the complex gradient descent
algorithm, utilizing Wirtinger’s calculus [18, 19].
First, by fixing W, the objective function in (12) is reduced to a function of the single variable E. As a result, the objective function is transformed as follows:
$\min_{E} O_{ENEE\text{-}CMF}(E) = \min_{E}\left[\frac{1}{2}\|T - TWE\|_F^2 + \omega_1\|E\|_1 + \omega_2\|E\|_2^2\right]$ (14)
Then, W is updated with fixed E based on the Moore–Penrose pseudo-inverse [28] and computed by $W = (T^{\dagger} T)E^{\dagger}$, where $\dagger$ denotes the pseudo-inverse.
To solve the sub-problem (14), the function $O(E)$ is treated as $O(E, E^*)$. According to [19], at a given iteration round $t$, the following update rule is employed:
$E^{(t+1)} = E^{(t)} - \beta_t \nabla_{E^*} O(E^{(t)}, E^{*(t)})$
where $\beta_t$ is the learning step size for the $t$-th iteration, estimated using the Armijo rule [29]. Based on the Armijo rule, $\beta_t = \mu^{s_t}$, $0 < \mu < 1$, where $s_t$ is the first non-negative integer such that the following inequality is satisfied:
$O(E^{(t+1)}, E^{*(t+1)}) - O(E^{(t)}, E^{*(t)}) \le 2\sigma \operatorname{Re}\left\langle \nabla_{E^*} O(E^{(t)}, E^{*(t)}),\; E^{(t+1)} - E^{(t)} \right\rangle$ (17)
The first-order partial derivative with respect to $E^*$ is evaluated as follows:
$\nabla_{E^*} O_{ENEE\text{-}CMF}(E, E^*) = -W^H T^H T + W^H T^H TWE + \alpha_1 \dfrac{E}{\operatorname{abs}(E)} + \alpha_2 E$ (18)
The condition specified by (16) ensures that the objective value decreases with each iteration. Finally, a pre-defined threshold $\Gamma$ can be selected, and the stopping criterion is set as follows:
$\|\nabla_{E^*} O(E, E^*)\|_F \le \Gamma$ (19)
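The following sketch performs one Wirtinger gradient-descent step with Armijo backtracking for the sub-problem in Eq. (14); the parameter names `w1`, `w2` stand for the regularization weights, and the step-size constants and random example data are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def enee_cmf_gradient_step(T, W, E, w1, w2, mu=0.5, sigma=1e-4, eps=1e-12):
    """One Wirtinger gradient-descent step on the encoding matrix E for the
    objective in Eq. (14), with an Armijo-style backtracking line search.
    This is a sketch of the update described above, not the authors' code."""
    B = T @ W                                      # exemplar-embedded basis

    def objective(E):
        return (0.5 * np.linalg.norm(T - B @ E, 'fro') ** 2
                + w1 * np.sum(np.abs(E))
                + w2 * np.sum(np.abs(E) ** 2))

    # Conjugate (Wirtinger) gradient, cf. Eq. (18); E / |E| is the L1 term.
    grad = (-B.conj().T @ T + B.conj().T @ B @ E
            + w1 * E / (np.abs(E) + eps) + w2 * E)

    f0, beta = objective(E), 1.0
    for _ in range(30):                            # Armijo backtracking, beta = mu ** s
        E_new = E - beta * grad
        decrease = objective(E_new) - f0
        bound = 2 * sigma * np.real(np.vdot(grad, E_new - E))
        if decrease <= bound:                      # sufficient-decrease check, cf. Eq. (17)
            return E_new
        beta *= mu
    return E

# Example with random complex matrices.
rng = np.random.default_rng(0)
T = rng.standard_normal((30, 10)) + 1j * rng.standard_normal((30, 10))
W = rng.standard_normal((10, 4)) + 1j * rng.standard_normal((10, 4))
E = rng.standard_normal((4, 10)) + 1j * rng.standard_normal((4, 10))
E = enee_cmf_gradient_step(T, W, E, w1=0.1, w2=0.1)
print(E.shape)
```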
4 Experiments
This section evaluates the performance of the proposed ENEE-CMF framework for
FER. The classification ability of the derived encoding coefficient vector is compared
with that of various NMF-based methods and EE-CMF. The basis matrix Wtrain was
generated from Wtrain = Ttrain (Etrain Ttrain )† during the training phase. The test sample
was encoded as etest = (TWtr )† ttest , and classification was conducted using a nearest
neighbor classifier following projection.
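A toy sketch of the evaluation protocol described above (projection of a test sample onto the learned basis followed by nearest-neighbour classification) is shown below; all matrix shapes, names, and the random data are assumptions for illustration.

```python
import numpy as np

def classify_test_sample(T_train, W_train, E_train, labels_train, t_test):
    """Project a test sample onto the learned exemplar-embedded basis
    B = T_train W_train via the pseudo-inverse, then classify it with a
    nearest-neighbour rule on the encoding vectors."""
    B = T_train @ W_train                       # learned basis
    e_test = np.linalg.pinv(B) @ t_test         # e_test = (T W)^† t_test
    # Nearest neighbour in the encoding space; training encodings are the
    # columns of E_train.
    dists = np.linalg.norm(E_train - e_test[:, None], axis=0)
    return labels_train[int(np.argmin(dists))]

# Toy example with random complex data: 3 training samples, 2 classes.
rng = np.random.default_rng(0)
T_train = rng.standard_normal((8, 3)) + 1j * rng.standard_normal((8, 3))
W_train = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
E_train = np.linalg.pinv(T_train @ W_train) @ T_train   # training encodings
labels = np.array([0, 0, 1])
print(classify_test_sample(T_train, W_train, E_train, labels, T_train[:, 2]))
```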
Fig. 1. Cropped facial images displaying six distinct expressions from the JAFFE dataset [30] in
the first row, and from the CK + dataset [31] in the second row.
(weiNMF), which applies binary weights to the data matrix [32], (7) NeNMF, utilizing
Nesterov’s optimal gradient method for efficient optimization [33], and (8)(9) unsu-
pervised and supervised robust nonnegative graph embedding (uRNGE and sRNGE),
which replace the L2 -norm with the L21 -norm for greater robustness [34, 35], and (10)
EE-CMF [15].
To meet the condition specified in Eq. (16), the step size reduction rate $\mu$ was set to 0.01. The stopping criterion, as described in Eq. (18), used a relative tolerance $\epsilon$ of $10^{-4}$ or a maximum of 10,000 iterations. For both the sparse complex ENEE-CMF and the sparse real NMF algorithms, a sparseness control parameter $\omega$ of 0.1 was used in all simulations.
Table 1. The accuracy rate (%) on the JAFFE dataset with various subspace dimensionalities
No. of bases ENEE-CMF EE-CMF NMF Con-NMF Eud_SNMF Kull_SNMF EN-NMF weiNMF NeNMF GSNMF uRNGE sRNGE
20 67.51 66.99 65.24 45.53 35.66 64.20 69.65 63.36 66.85 68.11 22.38 27.62
30 71.61 66.36 68.11 49.09 48.53 68.67 71.19 68.32 61.05 70.07 22.10 33.78
40 70.07 72.31 70.84 48.60 57.20 68.39 70.21 68.95 63.28 69.23 28.18 39.79
50 73.99 72.03 71.68 52.17 64.76 70.21 70.77 69.02 61.82 69.09 32.66 43.15
60 73.15 72.45 71.12 46.36 64.48 71.47 69.93 72.38 63.71 70.28 36.78 47.41
70 71.47 72.31 69.79 27.34 48.81 70.07 59.44 69.16 62.52 70.98 42.24 51.19
80 72.73 72.59 26.15 27.90 14.41 72.03 46.29 23.50 62.87 69.93 46.64 59.86
90 74.97 71.68 16.01 26.01 11.33 70.91 50.21 26.71 61.40 70.91 49.37 64.55
100 74.83 73.57 18.60 35.59 14.41 70.07 43.50 15.25 63.36 70.56 53.64 66.36
Ave 72.26 71.14 53.06 39.84 39.95 69.56 61.24 52.96 62.98 69.91 37.11 48.19
Table 2. The accuracy rate (%) on the CK+ dataset with various subspace dimensionalities
No. of bases ENEE-CMF EE-CMF NMF Con-NMF Eud_SNMF Kull_SNMF EN-NMF weiNMF NeNMF GSNMF uRNGE sRNGE
20 97.23 95.43 85.41 41.28 46.61 93.06 96.24 85.06 94.30 95.60 30.87 41.74
30 96.94 92.25 90.99 58.99 65.83 94.59 96.36 91.07 89.30 96.69 40.93 53.28
40 96.65 91.24 93.88 73.55 79.50 94.38 96.32 94.17 90.27 96.01 45.37 61.18
50 96.49 95.06 94.5 80.27 87.64 95.45 96.45 94.75 91.80 96.71 51.94 66.39
60 96.70 96.14 95.06 87.40 91.57 95.41 76.94 94.92 92.34 96.57 55.45 70.39
70 96.85 96.59 95.18 90.42 94.46 96.28 90.41 95.62 93.31 96.71 55.19 70.74
80 96.57 96.74 95.93 91.57 95.66 95.66 53.72 95.58 93.27 96.84 60.29 74.74
90 97.15 96.63 95.95 92.42 95.91 95.79 48.39 95.87 93.26 96.82 62.46 74.68
100 96.86 96.78 96.03 92.03 96.65 96.24 79.83 95.62 94.17 96.86 78.60 74.35
Ave 96.83 95.21 93.66 78.66 83.76 95.21 81.63 93.63 92.45 96.54 53.46 65.28
Fig. 2. Occluded images of six facial expressions, along with a neutral expression, from the CK+
dataset.
No. of bases ENEE-CMF EE-CMF NMF Con-NMF Eud_SNMF Kull_SNMF EN-NMF weiNMF NeNMF GSNMF uRNGE sRNGE
20 74.46 73.26 50.62 20.04 24.71 60.66 76.12 48.60 68.97 50.50 21.78 26.90
30 81.65 68.77 58.39 24.59 28.68 61.86 80.00 58.97 63.88 75.21 22.93 32.69
40 84.13 68.48 62.27 26.16 28.02 70.74 85.29 63.43 61.07 74.09 28.55 35.33
50 85.45 68.71 65.29 27.52 32.48 71.73 53.39 65.00 59.71 68.22 28.06 39.21
60 85.87 72.20 70.37 27.19 36.86 72.45 63.64 67.85 55.54 59.01 34.17 40.95
70 86.28 71.31 70.33 26.82 35.21 71.27 67.52 72.36 52.98 67.85 34.92 45.08
80 87.85 73.91 73.31 27.81 37.77 75.60 47.77 72.40 47.23 65.62 38.60 46.61
90 80.74 73.38 73.39 27.73 36.61 78.41 75.37 73.31 39.09 72.98 41.78 48.97
100 75.45 73.20 75.25 29.34 38.93 78.17 62.73 74.38 37.60 70.08 40.54 54.79
Ave 82.43 71.47 66.58 26.36 33.25 71.21 67.98 66.26 54.01 67.06 32.37 41.17
Table 3 displays the recognition results for occlusions in random facial regions.
The ENEE-CMF consistently outperforms other methods, achieving the highest average
recognition accuracy of 82.43% across all occlusion levels, demonstrating exceptional
performance in handling occlusions. The EE-CMF also performs well, with an average
accuracy of 71.47%, but falls short of ENEE-CMF. Among advanced methods, EN-
NMF and weiNMF achieve moderate accuracies of 67.98% and 66.26%, respectively,
showing improvement over basic methods yet still lagging behind ENEE-CMF. NeNMF and GSNMF yield average accuracies of 54.01% and 67.06%, respectively, with GSNMF outperforming several basic methods but still not reaching ENEE-CMF's accuracy. Overall, the ENEE-CMF model demonstrates superior accuracy and robustness in
handling occlusions compared to its counterparts.
5 Analyzing Generalizability
To evaluate the generalizability of the proposed methods against standard NMF and EE-
CMF approaches, we conducted subject-independent and cross-dataset experiments. The
training phase used a subset of the JAFFE dataset, while testing involved images from
both JAFFE and CK+. The training set included images from seven individuals, and the
testing set consisted of images from the remaining three JAFFE individuals and three
randomly selected CK+ individuals. The average recognition accuracy for all algorithms
was below 40%. However, the ENEE-CMF models exhibited superior generalizability
compared to their counterparts, as shown in Fig. 3.
Fig. 3. The accuracy rate (%) on the JAFFE data for training and a combination of the JAFFE and CK+ datasets for testing across various subspace dimensionalities
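To make the subject-independent protocol above concrete, the following is a minimal Python sketch of such a split, assuming per-subject image dictionaries named jaffe_subjects and ck_subjects (hypothetical names, not from the paper):

```python
import random

def subject_independent_split(jaffe_subjects, ck_subjects, n_train=7, n_ck_test=3, seed=0):
    """Train on 7 JAFFE subjects; test on the remaining JAFFE subjects
    plus 3 randomly chosen CK+ subjects. Inputs map subject id -> list of images."""
    rng = random.Random(seed)
    jaffe_ids = sorted(jaffe_subjects)
    rng.shuffle(jaffe_ids)
    train_ids, test_jaffe_ids = jaffe_ids[:n_train], jaffe_ids[n_train:]
    test_ck_ids = rng.sample(sorted(ck_subjects), n_ck_test)

    train = [img for sid in train_ids for img in jaffe_subjects[sid]]
    test = [img for sid in test_jaffe_ids for img in jaffe_subjects[sid]]
    test += [img for sid in test_ck_ids for img in ck_subjects[sid]]
    return train, test
```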
Figure 4 presents the basis images extracted from the training data of the CK+ dataset.
Fig. 4. Basis images learned from the training data in the CK+ dataset by (a) the proposed
ENEE-CMF, (b) EE-CMF, (c) Con_NMF, (d) Eud_SNMF, (e) Kull_SNMF, and (f) EN_NMF
6 Conclusions
This work presents a novel complex matrix factorization approach with an elastic net
penalty. To solve complex matrix factorization problems, the gradient descent method
with Wirtinger calculus was employed. The model was evaluated on two facial expression
datasets, achieving high recognition accuracy and outperforming both EE-CMF and
NMF algorithms. Future work will focus on enhancing the model by incorporating
additional regularization techniques and advanced optimization methods to improve
convergence and robustness, particularly in handling more complex occlusions and noisy
data. Another promising direction is extending the model to process dynamic facial
expressions in real-time by leveraging spatio-temporal features. Additionally, applying
the model to other domains, such as object recognition or medical image analysis, could
further expand its applicability.
A Method Combining the Reference
Information of the Adaptive Adjustment
Method and the Decision Maker
of Multi-objective Evolutionary
Algorithms
1 Introduction
In reality, optimization problems often have more than one objective, and these objectives frequently conflict; this gives rise to the class of multi-objective optimization problems. A multi-objective optimization problem is an optimization problem with at least two conflicting objectives that need to be optimized simultaneously. Mathematically, a multi-objective optimization problem (MOP) with k objectives is expressed as follows:
f(x) = [f_1(x), f_2(x), ..., f_k(x)]        (1)
in which x is a vector of decision variables in the v-dimensional space R^v. In evolutionary
computation (EC), x represents an individual in the population to be evolved.
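Spelled out, equation (1) is the usual MOP formulation, conventionally written as a simultaneous minimization; the feasible set Ω below is our own notation and is not taken from this paper:

```latex
\min_{x \in \Omega \subseteq \mathbb{R}^{v}} \; f(x) = \bigl(f_1(x), f_2(x), \ldots, f_k(x)\bigr), \qquad k \ge 2 .
```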
ary process to simultaneously meet two goals: adaptive self-adjustment and the requirements from the decision maker. The paper is structured into five sections: Sect. 1 is a general introduction, Sect. 2 presents the usage of reference information in adaptive adjustment and multi-objective interaction, Sect. 3 presents the methodology for combining reference information from the DM and from the self-adaptation process, Sect. 4 covers the experiment and results, and Sect. 5 is the conclusion.
3 Methodology
3.1 Combining Reference Information
In this section, we propose a method to use appropriate reference information to adjust the algorithm so that it meets both the requirements of self-adjustment and the wishes of the decision maker. First of all, it must be noted that the reference information provided by the self-adjustment process and by the decision maker are both very important for the effectiveness of the algorithm, in theory and in practice. Our survey shows that the use of multiple reference points is a quite popular interactive method [2,5,7,9,12,15]; therefore, in our proposal, we combine the information from the adaptive adjustment process with the interactive information from the decision maker of the multi-objective evolutionary algorithm.
X_j = P_j ∪ Q_j        (2)
We can see that, in the early generations of the evolution process, the solu-
tions are still quite scattered and far from the Pareto optimal region. Therefore,
the role of auto-adjustment is very important to ensure the balance between the
ability to search extensively (or the ability to explore) and the ability to con-
verge quickly (or the ability to exploit). Conversely, in the later stages of the evolution process, once the solutions have approached the Pareto optimal region, the impact of auto-adjustment is limited. Therefore, we propose to use the two
sets X and Y according to the following strategy: We divide the evolution process
of the algorithm into the early phase and the late phase based on the number of
generations or the number of evaluations (the algorithm can be designed accord-
ing to the number of generations or the number of evaluations of the fitness
function). In the interactions in the early phase of the evolution process, we use
the set X as the reference set to adjust the search process of the algorithm. And
in the interactions in the late phase, we use the set Y as the reference set for
adjustment.
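A minimal Python sketch of this two-phase strategy, assuming the bliss points of the adaptive process and the decision maker's reference points are available as sets (the function name, the 50% phase split, and the sample points are our illustrative assumptions):

```python
def combined_reference_set(adaptive_points, dm_points, generation, max_generations,
                           early_fraction=0.5):
    """Return the reference set used at an interaction.

    Early phase: union (X) keeps both self-adjustment and DM information.
    Late phase: intersection (Y) prioritizes regions the DM also prefers.
    """
    early_phase = generation <= early_fraction * max_generations
    if early_phase:
        return adaptive_points | dm_points   # X = P ∪ Q, formula (2)
    return adaptive_points & dm_points       # Y = P ∩ Q, formula (3)

# Example: an interaction at generation 400 of 1000 uses the union,
# an interaction at generation 700 uses the intersection.
X = combined_reference_set({(0.2, 0.8), (0.5, 0.5)}, {(0.5, 0.5), (0.9, 0.1)}, 400, 1000)
Y = combined_reference_set({(0.2, 0.8), (0.5, 0.5)}, {(0.5, 0.5), (0.9, 0.1)}, 700, 1000)
```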
After determining the reference information combined from the adaptive adjustment process and the decision maker, the next step is to use this combined reference information to adjust the evolutionary process. In principle, reference information can be expressed in many forms, such as trade-off information, reference points, reference directions, and classification of objective functions. Depending on the algorithm, this information is used through different mechanisms, such as guiding the search direction or the search area, to prioritize fast convergence and an even spread over the area created by the references.
Fig. 2. Illustration of determining reference points via empty regions
such as selection, mutation, crossover. Therefore, in the step of guiding the algorithm according to the synthesized reference information in the proposal, it is necessary to evaluate the control information to orient the algorithm toward the newly created reference region. This method is basically the same as using reference information from the proposals on interaction and adaptive adjustment. For example, the DMEA-II+ algorithm [2] uses reference points (called bliss points) to create a new ray system via the mechanism of automatically determining the reference points from the centers of the empty regions shown in Fig. 2; the MOEA/D++ algorithm calculates reference points in its decomposition subproblems using information combined with the input reference point set [9].
4 Experiments
4.1 Building up on Reference Points
To build up the technique, in this paper we apply it to two algorithms, MOEA/D [17] and DMEA-II [10], on the ZDT benchmark sets [19]; these are two recent, competitive algorithms in the field of multi-objective optimization that use reference points as reference information in self-adjustment and interaction.
The experiment is conducted using the DMEA-II algorithm. The algorithm is run for 1000 generations, divided into two testing phases: the early phase from generation 1 to 500 and the late phase from generation 501 to 1000. The early phase performs interactions at generations 150 and 400, and the late phase performs interactions at generations 600, 700 and 800. The bliss points determined at the time of interaction are combined with the reference points specified by the decision maker in the objective space: following the method above, we use the union of the two sets of reference points (set X) at the interactions in the early stage, and the intersection of the two sets of reference points (set Y) in the late stage. Using the X and Y sets when interacting with DMEA-II, a new ray system is built by adding rays passing through the reference points and removing rays in regions far from the area containing the reference point set. At each interaction, the reference points (bliss points) of the adaptive adjustment process are derived using the method defined in [2]. The newly created ray system serves as a reference for the RD (ray-based density) niching function used to control the algorithm's search process (as in Fig. 3).
Similarly, when experimenting with the MOEA/D algorithm, the reference points of the adaptive adjustment process are generated at each interaction time according to the method in [9]. Combined with the reference points inserted intuitively by the decision maker into the objective space, the union X and the intersection Y of the two sets of reference points are generated and used in the early and late stages, respectively, as described in the method. Analogously to the determination of bliss points as reference points in DMEA-II through the "empty region" concept, the reference points of the adaptive auto-tuning process of the MOEA/D algorithm are determined as in [9], together with the reference points introduced by the decision maker, following the two-stage strategy of the search process proposed in this paper. At each interaction, the sets X and Y are selected together with the reference points currently used by the algorithm and synthesized to create a new reference point. This final reference point is used to guide the algorithm to prioritize the choice of solutions for the evolution process according to the algorithm's decomposition mechanism (as in Fig. 4).
4.2 Results
In the reference information fusion step, the strategy uses the union of the two sets according to formula (2) in the early stage, to simultaneously satisfy the priority requirements of the adaptive adjustment process and the decision maker, and the need to broaden the global search information. In the late stage, the intersection of the two sets according to formula (3) is used to prioritize solutions in the regions desired by the decision maker, when the auto-adaptation process has only a weak impact in the late stages of the evolution.
In the step of guiding the algorithm according to the combined reference information, that information becomes, under each algorithm's evolutionary mechanism, important control information that gives priority to individuals satisfying the regions determined by the adaptive adjustment process and the decision maker's wishes in the objective space. Specifically, with DMEA-II, through the niching mechanism based on ray-based density (RD) [10], the algorithm creates a new solution set that is converged and spread evenly over the Pareto layer of the reference point area synthesized from the self-adjustment process and entered by the decision maker. With MOEA/D, the mechanism uses reference points (or ideal points) that combine the current reference point of the algorithm with the reference point synthesized from the center of the envelope containing the reference points obtained from formulas (2) and (3) for the early and late stages of the evolution process, respectively. The new reference point created in this way serves as the basis for the algorithm to determine neighboring solutions when decomposing the subproblems in the algorithm's decomposition mechanism.
In general, the results were obtained through four interactions with random points in the objective space in the early and late stages of the two experimental algorithms, both of which use reference points in adaptive adjustment and interaction. Visual observation of the graphical representation in the objective space shows a clear adjustment toward the region desired by the decision maker while, importantly, preserving the convergence speed and the dispersion, that is, diversity evenly distributed along the Pareto optimal layer. The results confirm that combining the reference information of the adaptive adjustment process with the reference information provided by the decision maker has a positive impact and practical value for real-world problems.
5 Conclusion
At the same time, the adaptive adjustment of the algorithm ensures the population quality and the search ability while meeting the wishes of the decision maker, which is very important when MOPs are solved by evolutionary algorithms in practice. The paper has proposed a method of synthesizing reference information, in the form of reference points, from the adaptive control process and the decision maker. The synthesis of reference information is carried out with a two-stage evolutionary strategy, an early stage and a late stage, which suits the evolutionary characteristics of the population and still meets the above simultaneous requirements. The reference information, in the form of reference points, is used according to each algorithm's mechanism to guide the algorithm so that it meets the requirements of each stage at the interaction times.
The experimental results on the DMEA-II and the MOEA/D algorithms in 2-
dimensional space have confirmed the significance of the process of combining
reference information, which is valuable for practical problems. Further studies can extend the approach to algorithms with higher-dimensional objective spaces, to reference information represented as reference lines, and to value functions, in order to effectively combine reference information from the adaptive adjustment process and from the decision maker's wishes.
References
1. Binh, M., Nguyen, L.: An approach to enhance the equilibrium of search capa-
bilities for multi-objective evolutionary algorithms based on differential evolution.
In: 2024 7th International Conference on Information and Computer Technolo-
gies (ICICT), pp. 145–150. IEEE Computer Society, Los Alamitos (2024). https://doi.org/10.1109/ICICT62343.2024.00029
2. Binh, M.T., Nguyen, L., Duc, D.N.: Using bliss points to enhance direction based
multi-objective algorithms. In: 2022 14th International Conference on Knowl-
edge and Systems Engineering (KSE), pp. 1–6 (2022). https://doi.org/10.1109/
KSE56063.2022.9953747
3. Binh, M.T., Nguyen, L., Duc, D.N.: An approach to maintain the balance between
exploitation and exploration of the evolutionary process in multi-objective algo-
rithms. In: 2023 6th International Conference on Information and Computer Tech-
nologies (ICICT), pp. 29–34. IEEE (2023)
4. Corrente, S., Greco, S., Matarazzo, B., Slowiński, R.: Explainable interactive evo-
lutionary multiobjective optimization. Omega 122, 102925 (2024)
5. Duc, D.N., Nguyen, L., Trung, K.T.: An interactive method for surrogate-assisted
multi-objective evolutionary algorithms. In: 2020 12th International Conference on
Knowledge and Systems Engineering (KSE), pp. 195–200. IEEE (2020)
6. Gong, C., Nan, Y., Shu, T., Pang, L.M., Ishibuchi, H., Zhang, Q.: Interactive
final solution selection in multi-objective optimization. In: 2024 IEEE Congress on
Evolutionary Computation (CEC), pp. 1–9. IEEE (2024)
7. Li, S., Zhang, Y., Wang, Q., He, L., Li, H., Ye, B.: A surrogate-assisted multi-
objective evolutionary algorithm guided by hybrid reference points. In: Tan, Y.,
Shi, Y. (eds.) Advances in Swarm Intelligence (2024)
8. Li, W., Tang, J., Wang, L.: Many-objective evolutionary algorithm with multi-
strategy selection mechanism and adaptive reproduction operation. J. Supercom-
put. 1–48 (2024)
9. Minh, T.B., Long, N., Kien, T.T.: An adaptive reference point technique to improve
the quality of decomposition based multi-objective evolutionary algorithm. J. Mil.
Sci. Technol. (CSCE7) 3–14 (2023)
10. Nguyen, L., Bui, L.T., Abbass, H.A.: DMEA-II: the direction-based multi-objective
evolutionary algorithm-II. Soft. Comput. 18(11), 2119–2134 (2014)
11. Nguyen, L., Bui, L.T.: A ray based interactive method for direction based multi-
objective evolutionary algorithm. In: Huynh, V.N., Denoeux, T., Tran, D.H., Le,
A.C., Pham, S.B. (eds.) Knowledge and Systems Engineering. AISC, vol. 245, pp.
173–184. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-02821-7_17
12. Nguyen, L., Duc, D.N., Thanh, H.N.: An enhanced multi-point interactive method
for multi-objective evolutionary algorithms. In: Satapathy, S.C., Bhateja, V.,
Nguyen, B.L., Nguyen, N.G., Le, D.-N. (eds.) Frontiers in Intelligent Computing:
Theory and Applications. AISC, vol. 1013, pp. 42–49. Springer, Singapore (2020).
https://doi.org/10.1007/978-981-32-9186-7_5
13. Rudolph, G., Wagner, M.: Towards adaptation in multiobjective evolutionary algo-
rithms for integer problems. In: 2024 IEEE Congress on Evolutionary Computation
(CEC), pp. 1–8. IEEE (2024)
14. Vargas, D.E., Lemonge, A.C., Barbosa, H.J., Bernardino, H.S.: An interactive
reference-point-based method for incorporating user preferences in multi-objective
structural optimization problems. Appl. Soft Comput. 112106 (2024)
15. Vargas, D.E., Lemonge, A.C., Barbosa, H.J., Bernardino, H.S.: An interac-
tive reference-point-based method for incorporating user preferences in multi-
objective structural optimization problems. Appl. Soft Comput. 165, 112106
(2024). https://doi.org/10.1016/j.asoc.2024.112106. https://www.sciencedirect.
com/science/article/pii/S1568494624008809
16. Zhang, Q., Jiao, R., Zeng, S., Zeng, Z.: Balancing exploration and exploitation
with decomposition-based dynamic multi-objective evolutionary algorithm. Int. J.
Cogn. Inform. Nat. Intell. (IJCINI) 15(4), 1–23 (2021)
17. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on
decomposition. IEEE Trans. Evol. Comput. 11, 712–731 (2008). https://doi.org/
10.1109/TEVC.2007.892759
18. Zhao, W., Bian, X., Mei, X.: An adaptive multi-objective genetic algorithm for
solving heterogeneous green city vehicle routing problem. Appl. Sci. 14(15), 6594
(2024)
19. Zitzler, E., Thiele, L., Deb, K.: Comparison of multiobjective evolutionary algorithms: empirical results. Evol. Comput. 8(1), 173–195 (2000)
Modeling Information Diffusion
in Bibliographic Networks Using
Pretopology
1 Introduction
1.1 Problem Definition
Information diffusion is a process in which information is propagated from one
object to another in a network. Numerous fields, including social science [10, 17],
computer science [20], and the medical field [2, 12], have extensively researched information dissemination.
A node participating in the information propagation process is in one of two states: active or inactive. A node is active if it has already taken the action related to the diffused information, and inactive otherwise. For instance, a scientist is considered active with respect to the topic “deep learning” if he has studied and published articles related to that topic; in a marketing campaign, a customer is marked as active if he has bought the product.
The majority of previous studies have focused on homogeneous networks, that is, networks with only one kind of object and one kind of connection. Instances of such networks include co-author networks, with author objects and co-author links, or Twitter user networks connected by follow links. The research can be divided into two branches: diffusion models and influence maximization. Various diffusion models have been proposed, such as the independent cascade (IC) model and the linear threshold (LT) model [8, 10, 17]. Moreover, diverse algorithms have been proposed for influence maximization [9, 14, 16, 21], which addresses the challenge of identifying a small subset of nodes with the greatest potential for spreading impact.
However, the majority of networks in reality are heterogeneous, containing
a variety of object types and multiple relations. For example, a bibliographic
network is a heterogeneous network that includes a diversity of objects, such as authors, papers, venues, and affiliations, and several relationships
among objects, such as co-author, common co-author, and so on. In this study,
we focus on the dissemination in heterogeneous bibliographic networks since this
problem plays a significant role in promoting scientific development and research
collaboration.
There are several studies of information diffusion on bibliographic networks [6, 11, 13, 18, 19]. In those studies, the authors proposed methods to estimate the activation probability from an active node to an inactive node based on meta-paths. However, the propagation process is simulated using the IC and LT models, which determine the neighbor set for spreading based only on the single co-author relation and use graph theory to model the network structure.
Graph theory has been widely utilized to model network structure. However, it has two drawbacks. Firstly, the closure function in a topological space is idempotent, meaning that the closure of a set X is reached in a single step; as a result, we are unable to observe a set's extensibility gradually. Secondly, in graph theory, the neighborhood of a set X is defined as the union of the neighbors of all elements of X, whereas in real-world networks the identification of a set's neighborhood is more intricate. Since pretopology theory has been proved to be an extension of graph theory [4, 5], it provides a solution to the aforementioned problems. Thus, we suggest using pretopology in this study to analyze information spread on bibliographic networks.
In this study, we propose an information diffusion model for bibliographic networks using pretopology. The model is a pretopological independent cascade model, namely Preto_IC, which is an extended version of IC. In Preto_IC, the highlights are seed-set selection based on the concept of elementary closed subsets and the identification of a node's neighborhood set for spreading based on multiple relations via a pseudo-closure function. Firstly, we define a strong pseudo-closure function a_s(.) to capture the neighborhood set of a set A; this function is constructed from multiple relations. Next, we apply the concept of the elementary closed subset to calculate the maximum extensibility of each node and choose the seed set from the nodes with the highest extensibility. Finally, the Preto_IC model is based on IC, with the seed set obtained in the previous step and the node neighborhood sets at each spreading step determined by a_s(.). Experimental results demonstrate that the Preto_IC model obtains higher influence spread in comparison with previous models.
The structure of our paper is organized as follows: Sect. 1 introduces the
problem definition and related works; Sect. 2 reviews preliminaries; our approach
is proposed in Sect. 3; Sect. 4 illustrates experiments and results; we conclude our
work in Sect. 5.
2 Preliminaries
2.1 Heterogeneous Bibliographic Network
2.3 Pretopology
Pretopology [4] is considered an extension of topology obtained by relaxing its axiomatic constraints. Pretopology is a powerful mathematical foundation for the notion of proximity; it allows the gradual observation of a set's extensibility.
3 Our Approach
In this section, we propose the pretopological independent cascade model, namely Preto_IC. This is an extended version of the independent cascade model for modeling propagation on a heterogeneous network. There are two novel points in Preto_IC in comparison with IC. Firstly, the seed set is selected based on elementary closed subsets. Secondly, the neighborhood set of an active node used to activate infections is identified based on multiple relations; this is computed using the strong pseudo-closure function.
To execute the Preto_IC model, we perform the following steps:
1. Construct a pretopological space in which we define a strong pseudo-closure function. This function is used to determine the neighborhood set of a set A and its closure. This step is described in Subsect. 3.1.
2. Calculate elementary closed subsets based on the strong pseudo-closure function defined in Step 1 and select a seed set. This step is illustrated in Subsect. 3.2.
3. Simulate propagation with Preto_IC. This work is revealed in Subsect. 3.3.
V_i(u) = {v ∈ V | u R_i v}
The expansion of a set A with several strength levels is represented by a_s(.). The threshold value s reveals how many relations each element u must satisfy to be accepted into the neighborhood set of A.
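The following minimal Python sketch shows one way such a strong pseudo-closure could be computed; the relation dictionaries and the counting of satisfied relations are our illustrative assumptions, not the authors' implementation:

```python
def strong_pseudo_closure(A, relations, s):
    """a_s(A): A plus every node related to A by at least s of the given relations.
    Each relation maps a node u to the set V_i(u) of its neighbors under R_i."""
    result = set(A)
    candidates = {v for rel in relations for u in A for v in rel.get(u, set())}
    for v in candidates - set(A):
        # count in how many relations v is a neighbor of some element of A
        hits = sum(1 for rel in relations if any(v in rel.get(u, set()) for u in A))
        if hits >= s:
            result.add(v)
    return result

# Example with two relations (e.g., co-author and common co-author):
co_author = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
common_co_author = {"a": {"c", "d"}, "c": {"a"}, "d": {"a"}}
print(strong_pseudo_closure({"a"}, [co_author, common_co_author], s=2))  # {'a', 'c'}
```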
Algorithm 2. Preto_IC
Require: G = (V, {V_i}), S, aA
1: procedure Preto_IC(G, S, aA)
2:   t ← 0; X^total ← S; X^newest ← S
3:   while infection occurs do
4:     t ← t + 1; X_t ← ∅
5:     for u ∈ X^newest do
6:       X_t(u) ← {v ∈ aA[u]^inactive | q ≤ p}, with p, q ∼ U(0, 1)
7:       X_t ← X_t ∪ X_t(u)
8:     end for
9:     X^total ← X^total ∪ X_t; X^newest ← X_t
10:  end while
11:  return X^total    ▷ Output
12: end procedure
nodes from V with the highest size of the elementary closed subset as the seed set for the propagation process. The ECS algorithm is described in Algorithm 1.
Moreover, Algorithm 1 also returns a dictionary aA which contains the nodes and their respective neighbor sets. The neighbor sets in aA are calculated by the pseudo-closure function a_s(.). aA is also an input of Preto_IC, which we describe in the next subsection.
At each step t, where X^newest is the set of nodes newly activated at time t−1, each u ∈ X^newest activates its inactive neighbors v ∈ aA[u] with a probability P(u, v). The spreading continues until no more infections can happen.
The novel points of Preto_IC are that the seed set S is obtained from elementary closed subsets and that the neighbor set of each node u ∈ X^newest used for infection is determined from multiple relations through the pseudo-closure function. Line 6 in Algorithm 2 illustrates this task, in which aA is the dictionary of neighbor sets calculated from the pseudo-closure function a_s(.) in Algorithm 1.
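A compact Python sketch of this propagation loop, mirroring Algorithm 2 (the toy dictionary aA and the uniform draw q ≤ p with p, q ∼ U(0, 1) follow the pseudocode; everything else is simplified):

```python
import random

def preto_ic(seed_set, aA, seed=0):
    """Simulate Preto_IC: start from the seed set and, at each step, let every
    newly activated node u try to activate each inactive neighbor v in aA[u]
    with an independent uniform coin flip (q <= p, as in Algorithm 2, line 6)."""
    rng = random.Random(seed)
    total, newest = set(seed_set), set(seed_set)
    while newest:                       # stop when no infection occurred
        activated_now = set()
        for u in newest:
            for v in aA.get(u, set()):
                if v not in total and rng.random() <= rng.random():
                    activated_now.add(v)
        total |= activated_now
        newest = activated_now
    return total

# Example: aA as returned by the seed-selection step (here a toy dictionary).
aA = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
print(preto_ic({"a"}, aA))
```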
Experimental results are shown in Fig. 3, Fig. 4 and Fig. 5. We evaluate the performance of the models based on two criteria: influence spread and running time. From the figures, it can be seen that the Preto_IC model is more efficient than the other models in terms of influence spread. However, the running time of Preto_IC is slower than that of HD_IC and TIM+_IC and approximately equal to that of CELF++_IC and ECS_IC.
The influence spreads of these algorithms on the different datasets data1, data2, and data3 are shown in Fig. 3a, Fig. 4a and Fig. 5a, respectively. We can see that our ECS algorithm for selecting a seed set based on elementary closed subsets yields a better influence spread than the HD and CELF++ methods when used with the IC model. In particular, the Preto_IC model, combining the seed set from ECS with the determination of the neighbor set by the pseudo-closure function a_s(.), achieves the best performance. These results demonstrate the advantage of propagation over multiple relations.
nodes, m is the number of edges, and R is the number of Monte Carlo samples used to estimate the expected spread of each node set. The complexity of TIM+ is only O((k + l)(n + m) log n / ε²), where k is the size of the seed set, n and m are the number of nodes and edges respectively, and l, ε are parameters. Besides, the
References
1. Akula, R., Yousefi, N., Garibay, I.: DeepFork: Supervised Prediction of Information
Diffusion in GitHub, p. 12 (2019)
2. Anderson, R.M., May, R.M.: Infectious diseases of humans: dynamics and control.
Oxford university press (1991)
3. Banerjee, S., Jenamani, M., Pratihar, D.K.: A survey on influence maximization
in a social network. Knowl. Inf. Syst. 62(9), 3417–3455 (2020). https://doi.org/10.
1007/s10115-020-01461-4
4. Belmandt, Z.: Basics of Pretopology. Hermann (2011). http://ijpam.eu/contents/
2013-86-1/5/5.pdf
5. Bui, Q.V., Ben Amor, S., Bui, M.: Stochastic pretopology as a tool for topological
analysis of complex systems. In: Nguyen, N.T., Hoang, D.H., Hong, T.-P., Pham,
H., Trawiński, B. (eds.) ACIIDS 2018. LNCS (LNAI), vol. 10752, pp. 102–111.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75420-8_10
6. Bui, Q.V., Ho, T.K.T., Bui, M.: Topic diffusion prediction on bibliographic net-
work: new approach with combination between external and intrinsic factors. In:
Nguyen, N.T., Hoang, B.H., Huynh, C.P., Hwang, D., Trawiński, B., Vossen, G.
(eds.) ICCCI 2020. LNCS (LNAI), vol. 12496, pp. 45–57. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-63007-2_4
7. Freeman, L.C., et al.: Centrality in social networks: Conceptual clarification. Social
network: critical concepts in sociology. Londres: Routledge 1, 238–263 (2002)
8. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: a complex systems look
at the underlying process of word-of-mouth. Mark. Lett. 12(3), 211–223 (2001)
9. Goyal, A., Lu, W., Lakshmanan, L.V.: Celf++: Optimizing the greedy algorithm
for influence maximization in social networks (technical report)
10. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6),
1420–1443 (1978)
11. Gui, H., Sun, Y., Han, J., Brova, G.: Modeling topic diffusion in multi-relational
bibliographic information networks. In: CIKM (2014). https://doi.org/10.1145/
2661829.2662000
12. Hethcote, H.W.: The mathematics of infectious diseases. SIAM Rev. 42(4), 599–
653 (2000)
13. Ho, T.K.T., Bui, Q.V., Bui, M.: Homophily independent cascade diffusion model
based on textual information. In: Nguyen, N.T., Pimenidis, E., Khan, Z., Trawiński,
B. (eds.) ICCCI 2018. LNCS (LNAI), vol. 11055, pp. 134–145. Springer, Cham
(2018). https://doi.org/10.1007/978-3-319-98443-8_13
14. Kempe, D., Kleinberg, J.M., Tardos, E.: Maximizing the spread of influence
through a social network. In: KDD (2003).https://doi.org/10.1145/956750.956769
15. Kimura, M., Saito, K.: Tractable models for information diffusion in social net-
works. In: European Conference on Principles of Data Mining and Knowledge
Discovery, pp. 259–271. Springer (2006)
16. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.:
Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.
420–429 (2007)
17. Macy, M.W.: Chains of Cooperation: Threshold Effects in Collective Action. Am.
Sociol. Rev. 56(6), 730–747 (1991)
18. Molaei, S., Babaei, S., Salehi, M., Jalili, M.: Information Spread and Topic Diffusion
in Heterogeneous Information Networks. Sci. Rep. 8(1), 1–14 (2018)
19. Molaei, S., Zare, H., Veisi, H.: Deep learning approach on information dif-
fusion in heterogeneous networks. Knowledge-Based Systems, 105153 (2019). https://doi.org/10.1016/j.knosys.2019.105153, http://www.sciencedirect.com/science/article/pii/S0950705119305076
20. Serazzi, G., Zanero, S.: Computer virus propagation models. In: International
Workshop on Modeling, Analysis, and Simulation of Computer and Telecommuni-
cation Systems. pp. 26–50. Springer (2003)
21. Tang, Y., Xiao, X., Shi, Y.: Influence maximization: Near-optimal time complexity
meets practical efficiency. In: Proceedings of the 2014 ACM SIGMOD International
Conference On Management Of Data, pp. 75–86 (2014)
22. Varshney, D., Kumar, S., Gupta, V.: Modeling information diffusion in social net-
works using latent topic information. In: Huang, D.-S., Bevilacqua, V., Premaratne,
P. (eds.) ICIC 2014. LNCS, vol. 8588, pp. 137–148. Springer, Cham (2014). https://
doi.org/10.1007/978-3-319-09333-8_16
Optimizing Credit Scoring Models
for Decentralized Financial Applications
1 Introduction
offset potential risks. Second, the lack of credit scoring diminishes transparency
and fairness in the lending process, as all borrowers are treated uniformly regard-
less of their credit history. This reduces incentives for maintaining good financial
behavior and negatively impacts the sustainable growth of the decentralized
financial ecosystem.
Due to these issues, there is an increasing demand for assessing the quality
of crypto wallets. Such assessments determine repayment capabilities and cate-
gorize users, enabling lending pools to set tailored loan limits and interest rates
for each participant. Verified and evaluated loans are less likely to be liquidated
compared to those that do not consider credit factors.
In traditional finance (TraFi), banks use the FICO credit score to evalu-
ate the creditworthiness of individuals or organizations. These scores, typically
ranging from 300 to 850, represent the credit risk of a person, with higher scores
indicating lower risk. This scoring system can be adapted to evaluate crypto
wallets in DeFi, as wallets share key characteristics with bank accounts: (1) they
store assets, (2) facilitate transfers, and (3) support activities such as depositing,
collateralizing, and borrowing.
Over recent decades, models for optimizing FICO credit score parameters
have been extensively developed, primarily using regression techniques and deep
learning algorithms. These models rely on variables such as outstanding debt,
payment history, credit usage length, credit mix, and new credit applications [9].
Despite their effectiveness, access to TraFi data [13] remains limited due to its
sensitive nature and the restricted exchange of information between banks.
This paper proposes credit scoring models for DeFi, leveraging unique wallet
parameters such as total current assets, average assets, transaction frequency,
number and types of interacted DApps, transaction amounts, account age, num-
ber of liquidations, loan-to-balance ratio, and loan-to-investment ratio. Our con-
tributions include: (1) the collection and processing of a dataset comprising 14
characteristics of crypto wallets, and (2) the development of four credit evalua-
tion models based on the FICO score, followed by their assessment and compar-
ison.
The structure of this paper is organized as follows: Sect. 2 reviews related
work on credit scoring. Section 3 presents the dataset and methodology. Section 4
discusses the experimental setup and results, while Sect. 5 concludes the paper
with key findings and future research directions.
2 Related Work
This section reviews methods, research, and models for evaluating crypto wallet
credit scores within the DeFi ecosystem. Packin and Lev-Aretz [15] identified two
key approaches to credit scoring in DeFi: the off-chain integration model
and the crypto-native credit score. These approaches aim to integrate user
data from both TraFi and DeFi systems for a comprehensive assessment of cred-
itworthiness across Web2 and Web3 platforms [22].
In the off-chain integration model, data from TraFi is used alone or combined with on-chain data (DeFi data). Machine learning models then process this
data to generate credit scores, which are continuously updated, encrypted, and
publicly stored on the blockchain. Zhu [27] proposed a blockchain-based approach
to identity verification and credit reporting, incorporating multi-dimensional
authentication, weighted score calculation, and encryption for secure identity
management and risk assessment.
Patel et al. [16] introduced KiRTi, a deep-learning-based credit recommender
system operating on a public blockchain. KiRTi facilitates direct lending by lever-
aging historical blockchain transaction data and using a long-short-term mem-
ory model to generate credit scores. Smart contracts automate loan repayments,
removing the need for third-party credit rating agencies.
Uriawan et al. [20] developed a credit score formula combining loan risk,
activity, profile, and social recommendation scores. Hartmann and Hasan [7]
introduced a “social score” using social media data to provide loan opportunities
on a decentralized platform. However, these models primarily depend on off-chain
data, which can limit their accuracy in assessing digital wallets. Additionally,
requiring wallet authentication for off-chain data conflicts with the anonymity
preference of many Web3 users.
Our focus is on crypto-native credit scores, which rely on blockchain
activity data such as loan repayments, trading, and governance participation.
Unlike traditional credit scores, which are tied to individuals, a crypto-native
score is linked to a wallet and dynamically adjusted based on blockchain inter-
actions. Packin and Lev Aretz [14] highlighted notable products including Spectral2, LedgerScore3, and Prestare4. Spectral provides a Multi-Asset Credit Risk
Oracle (MACRO) Score, considering factors such as transaction history and
market conditions. LedgerScore offers autonomous crypto credit reports based
on cryptocurrency transactions and asset portfolios, while Prestare combines
on-chain credit scoring with lending protocol management.
Several Web3 companies, including CreDA5, Quadrata6, Credefi7, and TrueFi8, are also developing DeFi credit scoring solutions. These platforms assess
wallet credit by analyzing transaction history, liquidation events, amounts owed,
and credit mix, though they do not disclose their detailed methodologies or
parameter optimization processes.
Research on blockchain credit scoring is still evolving. Wolf et al. [25] pro-
posed a scoring method for Aave accounts based on account age, historical health
factors, interactions with the Aave protocol, and asset types. Their model is
tailored only for Aave accounts on Ethereum. Austin et al. [1] introduced the
Autonomous Lending Organization on Ethereum (ALOE) system for unsecured
lending on Ethereum, which maintains and updates borrower-specific ratios such
as Underpay Ratio (UPR) and Current Debt Burden Ratio (CDBR). The ALOE
2 https://docs.spectral.finance
3 https://www.ledgerscore.com
4 https://linktr.ee/prestare.finance
5 https://www.creda.app
6 https://quadrata.com
7 https://www.credefi.finance
8 https://truefi.io
3 Methodology
3.1 Dataset
Features Description
total_current_asset Current total value of assets held in the wallet
average_total_asset Average asset value over the last 30 days
frequency_of_DApp_transactions Wallet's activity level on blockchain investment channels
number_of_interacted_DApps Number of high-reputation DApps interacted with in the last 30 days
Types_of_interacted_DApps Number of different types of DApps interacted with in the last 30 days
reputation_of_interacted_DApps Number of high-reputation DApps interacted with in the last 30 days
transaction_amount Total value (in USD) of funds received by the wallet in the last 30 days
frequency_of_transaction Number of transactions made by the wallet in the last 30 days
age_of_accounts Duration since the wallet's first transaction, indicating how long it has been active in the crypto market
number_of_liquidations Total number of liquidation events for the wallet
total_value_of_liquidations Cumulative value of all liquidation events
loan_to_balance_ratio Ratio of total loans to the account balance
loan_to_investment_ratio Penalty points applied for violating acceptable loan-to-investment ratios
investment_to_total_asset_ratio Wallet's investment activity relative to its total assets
Labeling Data. Following the FICO model, we categorize wallets into five
credit levels (see Table 3). However, assigning the exact credit level is challenging
due to the lack of established evaluation mechanisms. For example, a wallet with
a balance of $1 trillion and a recent liquidation event may fall into level 2 due
to the liquidation or level 4 because of its large balance. This ambiguity reflects
the characteristics of Fuzzy Data.
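For orientation, here is a minimal sketch of a FICO-style mapping from a numeric score to the five levels used later in the paper; the thresholds below are the common FICO bands and are only an assumption, since the paper's own Table 3 is not reproduced in this excerpt.

```python
def credit_level(score):
    """Map a FICO-like score (300-850) to one of five credit levels.
    Thresholds follow the common FICO bands; the paper's Table 3 may differ."""
    if score < 580:
        return "Poor"
    if score < 670:
        return "Fair"
    if score < 740:
        return "Good"
    if score < 800:
        return "Very Good"
    return "Excellent"

print(credit_level(715))  # Good
```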
Fuzzy data consists of information that cannot be represented as precise num-
bers or precisely categorized [21]. Research on fuzzy data [2] includes machine
learning algorithms that handle fuzzy logic and data. Denoeux [4] proposed
methods for estimating parameters in statistical models with fuzzy observations.
Jane and Ganesh [10] reviewed how machine learning and fuzzy logic techniques
enable knowledge-based decision-making.
We use Fuzzy Labels for crypto wallet classification, inspired by the Fuzzy
c-means clustering method [3]. This robust unsupervised technique provides a
10 https://www.lendingtree.com/credit-repair/credit-score-stats-page/
bilities, despite increased training time. Mutation involves multiplying the elements resulting from crossover by randomly generated values within the range [1 − MutationRate, 1 + MutationRate].
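A small Python sketch of this mutation operator (the parameter values and variable names are illustrative):

```python
import random

def mutate(weights, mutation_rate=0.1, rng=None):
    """Scale each weight produced by crossover by a random factor drawn
    uniformly from [1 - mutation_rate, 1 + mutation_rate]."""
    rng = rng or random.Random(0)
    return [w * rng.uniform(1 - mutation_rate, 1 + mutation_rate) for w in weights]

print(mutate([0.2, 0.5, 0.3]))
```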
E(i) = y_predict^i − min(y_first^i, y_second^i)    if y_predict^i < min(y_first^i, y_second^i)
E(i) = y_predict^i − max(y_first^i, y_second^i)    if y_predict^i > max(y_first^i, y_second^i)        (4)
SCC (Sparse Categorical Crossentropy) = Σ_{c=1}^{C} one_hot(y_true, c) · log(y_predict[c])

Loss = −(1/N) Σ_{i=1}^{N} min(SCC(y_first[i], y_predict[i]), SCC(y_second[i], y_predict[i]))        (5)
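A minimal NumPy sketch of this fuzzy-label loss, reading the min as scoring each sample against the candidate label its predicted distribution matches best (variable names and the small epsilon for numerical stability are our assumptions):

```python
import numpy as np

def fuzzy_label_loss(y_first, y_second, y_predict):
    """Average over samples of the smaller sparse categorical cross-entropy
    against the two candidate labels, i.e. each sample is scored against the
    fuzzy label that its predicted distribution matches best."""
    y_first = np.asarray(y_first)
    y_second = np.asarray(y_second)
    probs = np.asarray(y_predict)                    # shape (N, C)
    n = np.arange(len(probs))
    ce_first = -np.log(probs[n, y_first] + 1e-12)    # standard SCC per sample
    ce_second = -np.log(probs[n, y_second] + 1e-12)
    return float(np.mean(np.minimum(ce_first, ce_second)))

# Example: two samples, five classes; sample 0 has fuzzy labels (0, 1).
probs = np.array([[0.1, 0.7, 0.1, 0.05, 0.05],
                  [0.05, 0.1, 0.6, 0.2, 0.05]])
print(fuzzy_label_loss([0, 1], [1, 2], probs))
```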
4 Experiment
SGD and Adam: We implemented Stochastic Gradient Descent (SGD) and
Adam optimizers, tuning both the learning rate and number of iterations. For
SGD, we set a decay rate of 0.95. The Adam optimizer was configured with
the parameters β1 = 0.9, β2 = 0.999, and ε = 1 × 10^−8. We tested learning
rates of 0.1, 0.01, and 0.001 across 100, 1000, and 10 000 iterations. The best
results for both algorithms were achieved with a learning rate of 0.001 and 10
000 iterations.
Genetic Algorithm: For GA, we initialized the number of generations to
100 and experimented with various crossover functions and mutation rates. The
two-point crossover and a mutation rate of 0.1 produced the best results after
testing one-point and two-point crossovers and mutation rates of 0.1 and 0.2.
Multilayer Perceptron: We experimented with multiple neural network archi-
tectures, the number of epochs, and batch sizes. The best performance was
achieved with two hidden layers (see Fig. 2), 10 epochs, and a batch size of
64.
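A hedged Keras sketch of such an MLP classifier; the hidden-layer widths are placeholders since the exact architecture is given only in the paper's Fig. 2, the 14 input features and 5 output classes follow the dataset description, and the standard sparse categorical cross-entropy is used here although the paper's fuzzy-label loss of Eq. (5) could be plugged in instead:

```python
import tensorflow as tf

def build_mlp(n_features=14, n_classes=5, hidden=(64, 32)):
    """Two-hidden-layer MLP for wallet credit-level classification."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden[0], activation="relu"),
        tf.keras.layers.Dense(hidden[1], activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training as described in the text: 10 epochs, batch size 64.
# model = build_mlp()
# model.fit(X_train, y_train, epochs=10, batch_size=64)
```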
Table 5 and Fig. 3 present and compare the performance of the four models on
the test dataset, covering three metrics: accuracy (A), precision (P), and recall
(R), for each separated label and for all labels. The key findings are as follows:
(1) The MLP model achieved the highest performance, with 97.93% accuracy
and 90.68% precision. In comparison, the GA model had slightly lower perfor-
mance, with 97.5% accuracy and 90.30% precision, but surpassed the MLP in
recall (62% vs. 58.96%). The SGD model exhibited the lowest performance,
with 64.03% accuracy and 53% precision.
(2) The Fair and Good labels had the most accurate predictions. For the Fair
label, the results (A, P, R) for the GA and MLP models are (97.65%; 99.05%;
97.86%) and (98.66%; 97.09%; 98.91%), respectively. For the Good label,
GA and MLP results are (94.43%; 94.50%; 75.54%) and (98.27%; 97.09%;
91.77%), respectively. Figure 4 further illustrates that the MLP model accu-
rately predicted instances with high confidence in the Fair and Good labels.
(3) Recall values were notably low for certain labels across the models. For the
MLP model, recall is particularly low for the Poor and Excellent labels, at
7.4% and 24.10%, respectively. The GA model also shows low recall for the
Poor and Very Good labels, with values of 22.90% and 15.05%, respectively.
This discrepancy arises from the use of fuzzy labeling in the test dataset,
while the model outputs are a single selection among the five labels: Poor, Fair, Good,
Very Good, and Excellent. Specifically, samples labeled as (Poor, Poor ),
(Poor, Fair ), and (Fair, Fair ) were predominantly predicted as Fair, result-
ing in high recall for the Fair label but low recall for the Poor label.
In the Fuzzy Labeling approach, each wallet address is assigned two labels, reflect-
ing the uncertainty in classification. For example, a label of (0;0) indicates a
definitive classification of Poor, while a label of (0;1) suggests that the wallet
could be classified as either Poor or Fair. Thus, the test dataset contains these
fuzzy label pairs, such as (0;0), (0;1), (1;2), and (4;4).
Each cell in the heatmap shows the number of wallets predicted at a specific
level, compared to their original fuzzy labels. For example, a cell displaying 644
indicates that 644 wallets were predicted to be at level 1 (Fair), while their
original fuzzy labels were (0, 1), meaning they could belong to either Poor or
Fair.
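A small sketch of how such heatmap counts, together with a fuzzy-aware accuracy in which a prediction counts as correct when it matches either candidate label, can be computed; this is our illustration, not the authors' evaluation code:

```python
from collections import Counter

def fuzzy_evaluation(fuzzy_labels, predictions):
    """fuzzy_labels: list of (first, second) label pairs; predictions: list of ints.
    Returns per-(pair, prediction) counts (heatmap cells) and fuzzy accuracy."""
    cells = Counter()
    correct = 0
    for (first, second), pred in zip(fuzzy_labels, predictions):
        cells[((first, second), pred)] += 1
        if pred in (first, second):
            correct += 1
    return cells, correct / len(predictions)

cells, acc = fuzzy_evaluation([(0, 1), (0, 1), (1, 2), (4, 4)], [1, 1, 2, 3])
print(cells[((0, 1), 1)], acc)   # 2 wallets with fuzzy label (0,1) predicted as 1; accuracy 0.75
```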
5 Conclusion
Future research should expand the dataset to include non-EVM chains like
Cosmos11, Solana12, and Polkadot13 to test the models in diverse blockchain
environments. Additionally, an automated labeling system will further enhance
scoring accuracy and streamline evaluation, promoting the integration of robust
credit scoring mechanisms in decentralized finance.
References
1. Austin, T.H., Potika, K., Pollett, C.: Autonomous lending organization on
ethereum with credit scoring. In: 2023 Silicon Valley Cybersecurity Conference
(SVCC), pp. 1–8, IEEE (2023)
2. Bandemer, H., Näther, W.: Fuzzy data analysis, vol. 20. Springer Science & Busi-
ness Media (2012)
3. Bezdek, J.C., Ehrlich, R., Full, W.: Fcm: the fuzzy c-means clustering algorithm.
Comput. Geosci. 10(2–3), 191–203 (1984)
4. Denœux, T.: Maximum likelihood estimation from fuzzy data using the em algo-
rithm. Fuzzy Sets Syst. 183(1), 72–91 (2011)
5. Dixon, W.J., Yuen, K.K.: Trimming and winsorization: a review. Statistische Hefte
15(2), 157–170 (1974)
6. Genc, E., Shin, H.R., Sik Park, J., Song, J.K.: Number recognition of parts
book schematics using convolutional recurrent neural network. In: 2018 Interna-
tional Conference on Information and Communication Technology Robotics (ICT-
ROBOT), pp. 1–3 (2018), https://doi.org/10.1109/ICT-ROBOT.2018.8549859
7. Hartmann, J., Hasan, O.: Privacy considerations for a decentralized finance (defi)
loans platform. Clust. Comput. 26(4), 2147–2161 (2023)
8. Harvey, C.R., Ramachandran, A., Santoro, J.: DeFi and the Future of Finance.
John Wiley & Sons (2021)
9. Homonoff, T., O’Brien, R., Sussman, A.B.: Does knowing your fico score change
financial behavior? evidence from a field experiment with student loan borrowers.
Rev. Econ. Stat. 103(2), 236–250 (2021)
10. Jane, J.B., Ganesh, E.: A review on big data with machine learning and fuzzy logic
for better decision making. Int. J. Sci. Technol. Res. 8(10), 1221–1225 (2019)
11. Katoch, S., Chauhan, S.S., Kumar, V.: A review on genetic algorithm: past,
present, and future. Multimed. Tools Appl. 80, 8091–8126 (2021)
12. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
13. Munkhdalai, L., Munkhdalai, T., Namsrai, O.E., Lee, J.Y., Ryu, K.H.: An empir-
ical comparison of machine-learning methods on bank client credit assessments.
Sustainability 11(3), 699 (2019)
11 https://cosmos.network
12 https://solana.com
13 https://polkadot.network
14. Packin, N.G., Lev Aretz, Y.: Crypto native credit score: Between financial inclusion
and predatory lending. Cardozo Law Review, Forthcoming (2023)
15. Packin, N.G., Lev-Aretz, Y.: Decentralized credit scoring: Black box 3.0. American
Business Law Journal (2023)
16. Patel, S.B., Bhattacharya, P., Tanwar, S., Kumar, N.: Kirti: A blockchain-based
credit recommender system for financial institutions. IEEE Trans. Netw. Sci. Eng.
8(2), 1044–1054 (2020)
17. Pham, V.B., Trinh, T.D.: Analysis model for decentralized lending protocols. In:
Proceedings of the 11th International Symposium on Information and Communi-
cation Technology, pp. 405–412 (2022)
18. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: A lock-free approach to paralleliz-
ing stochastic gradient descent. In: Advances in Neural Information Processing
Systems, vol. 24 (2011)
19. Sun, X., Stasinakis, C., Sermpinis, G.: Liquidity risks in lending protocols: Evidence
from aave protocol. arXiv preprint arXiv:2206.11973 (2022)
20. Uriawan, W., Badr, Y., Hasan, O., Brunie, L.: Decentralized trustworthiness score
management with smart contracts on the trustlend platform. IET Blockchain 4(1),
59–72 (2024)
21. Viertl, R.: Statistical methods for fuzzy data. John Wiley & Sons (2011)
22. Voshmgir, S.: Token economy: How the Web3 reinvents the internet, vol. 2. Token
Kitchen (2020)
23. Werner, S., Perez, D., Gudgeon, L., Klages-Mundt, A., Harz, D., Knottenbelt, W.:
Sok: Decentralized finance (defi). In: Proceedings of the 4th ACM Conference on
Advances in Financial Technologies, pp. 30–46 (2022)
24. Wirth, N.: Algorithms + Data Structures = Programs. Prentice-Hall Series in Automatic Computation. Prentice-Hall, Englewood Cliffs, NJ (1976)
25. Wolf, W., Henry, A., Fadel, H.A., Quintuna, X., Gay, J.: Scoring aave accounts for
creditworthiness. arXiv preprint arXiv:2207.07008 (2022)
26. Yaguchi, A., Suzuki, T., Asano, W., Nitta, S., Sakata, Y., Tanizawa, A.: Adam
induces implicit weight sparsity in rectifier neural networks. In: 2018 17th IEEE
International Conference on Machine Learning and Applications (ICMLA), pp.
318–325, IEEE (2018)
27. Zhu, X.: Blockchain-based identity authentication and intelligent credit reporting.
In: Journal of Physics: Conference Series, vol. 1437, p. 012086, IOP Publishing
(2020)
Application of the SFE Feature Selection
Method for Multi-omic Biomarker
Discovery in Brain Cancer Subtyping
Hien Nguyen Minh, Ha Tang Vinh, Hoang Le, and Diep Thi Hoang(B)
1 Introduction
Cancer progression involves multiple stages, each associated with distinct genetic
mutations. These mutations disrupt normal cell cycle regulation, resulting in
uncontrolled proliferation [2]. A cancer biomarker is a measurable characteristic indicating cancer risk or patient outcome. Molecular biomarkers encompass genes, genetic variations, mRNA or protein expression differences, post-translational protein modifications, and metabolite levels, and they help monitor disease progression or treatment response [3]. One emerging approach to biomarker discovery is to apply evolutionary algorithms to multi-omics data [4], because different types of omics data offer complementary insights into cellular activities and biological processes, making them valuable for understanding complex diseases. However, multi-omics data analysis presents a challenge: integrating high-dimensional, heterogeneous biological data with inherently small sample sizes. Evolutionary computation (EC) techniques have gained significant attention for their efficiency in exploring large solution spaces and achieving near-optimal results within a reasonable timeframe [4]. Furthermore, previous studies
H. N. Minh and H. T. Vinh contributed equally to this work.
2 Related Works
FS within omics data is challenging due to the vast search space, which grows
exponentially with the number of subsets of features, making exhaustive search
impractical. Traditional search methods like greedy, heuristic, and random
approaches often face issues like stagnation in local optima and high computa-
tional costs. To address these limitations, EC techniques have gained popularity
for their global search capabilities and potential to improve FS’s efficiency [5, 6].
In [7], Martinez et al. presented a customized version of the standard binary
particle swarm optimization (PSO) algorithm, designed to improve classifica-
tion accuracy while significantly reducing the number of selected features. Their
method updates a controlled subset of particles to avoid overwhelming the sys-
tem. It enables more efficient identification of small biomarker sets from microar-
ray data. Tested on 11 microarray datasets, the algorithm outperforms other
PSO-based approaches by achieving higher accuracy with fewer selected genes.
These genes can be considered potential biomarkers which are critical in clin-
ical applications. In [8], Kourid and Batouche introduce a scalable method for
biomarker discovery using a two-stage FS process. The first stage employs par-
allel K-means clustering and Signal-to-Noise Ratio (SNR) ranking to filter out
redundant features, selecting top features from each cluster. In the second stage,
the Binary Particle Swarm Optimization (BPSO) algorithm, implemented with
MapReduce, is used to further optimize the feature subset. Wang et al. in [9]
proposed a Feature Weighting Particle Swarm Optimization (FWPSO) method,
which operates in two phases. In the Feature Weighting Phase, PSO assigns
weights to features based on their relevance, discriminating between impor-
tant and irrelevant features. In the FS Phase, the PSO algorithm refines the
search to the most relevant features, enhancing the identification of significant
biomarker genes while reducing data dimensionality. Their method demonstrates
improved classification accuracy and efficiency on microarray datasets, outper-
forming other techniques.
3 Dataset
3.1 Dataset Collection
The data were retrieved from Xena Hub, a cancer genomics platform provided by
the University of California, Santa Cruz, which offers easy access to key datasets,
including TCGA Legacy. The dataset used in this study was the TCGA GBM
cohort, which classifies GBM into four subtypes: classical, mesenchymal, proneu-
ral, and neural. We incorporated three omics data types: copy number variations (CNV), DNA methylation, and level 3 gene expression (GE), selecting only samples with matched data across all three types. Preprocessing was performed to
reduce noise and redundancy. The dataset characteristics are as follows:
– Subtypes: Classical: 71, Neural: 46, Proneural: 72, Mesenchymal: 81.
– Original features: GE: 12,043; Methylation: 27,579; CNV: 24,777.
– Selected features: GE: 2,000; Methylation: 2,000; CNV: 2,000.
The stratified k-fold function from scikit-learn [18] was used to split the
data into training, validation, and test sets for different purposes, which will
be specified later in the Methods section. Additionally, the training and test
sets were employed in phase two to assess the identified biomarkers using other
classical machine learning models, including the softmax classifier and random
forest classifier. The number of samples (i.e., GBM patients) in the training,
validation, and test datasets is as follows: 162 (train), 54 (validation), and 54
(test).
Preprocessing was applied independently for each omics data type to enhance classification accuracy. The
data was split into training-validation (80%) and test (20%) sets using stratified
sampling. We selected the top 2,000 features per omics type based on ANOVA
F-values, ensuring that the first principal component explained less than 50%
of the variance. The training-validation set was further divided into training
(75%) and validation (25%) subsets using stratified 4-fold cross-validation. Min-
max scaling was applied to the gene expression and methylation data (excluding
CNV), with scaling parameters derived from the training set and consistently
applied to the validation and test sets. Finally, the three omics matrices were
concatenated into a single matrix for fitting with the classifier in the fitness
function.
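A minimal sketch of this preprocessing pipeline is given below for one omics matrix, using scikit-learn's stratified splitting, ANOVA-based SelectKBest, and MinMaxScaler. The function and variable names are illustrative rather than the authors' code, and the stratified 4-fold split of the training-validation set is omitted for brevity.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def preprocess_omics(X, y, k=2000, scale=True, seed=0):
    """Select the top-k features by ANOVA F-value and optionally min-max scale them.

    X : (n_samples, n_features) matrix for one omics type
    y : GBM subtype labels, used for the stratified split and the F-test
    """
    # Stratified 80/20 split into training-validation and test sets
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # Keep the k features with the largest ANOVA F-values, fitted on training data only
    selector = SelectKBest(f_classif, k=k).fit(X_tr, y_tr)
    X_tr, X_te = selector.transform(X_tr), selector.transform(X_te)

    if scale:  # applied to gene expression and methylation, skipped for CNV
        scaler = MinMaxScaler().fit(X_tr)  # scaling parameters from the training set only
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    return X_tr, X_te, y_tr, y_te

# The three preprocessed omics matrices are then concatenated column-wise,
# e.g. np.hstack([ge, meth, cnv]), before being passed to the fitness classifier.
```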
4 Methods
4.1 Biomarker Discovery Problem
Researchers often deal with datasets where the number of measured features far
exceeds the sample size in biomarker discovery. These datasets are often derived
from high-throughput omics approaches or transcriptomic sequencing. Finding
biomarkers from omics data can be considered an FS problem. However, the two hold distinct meanings within bioinformatics. FS can be expressed formally as follows. It aims to identify a subset of relevant features S ⊆ {1, 2, . . . , n} from the original set of n features, such that a model f built on this subset S achieves optimal or near-optimal classification performance while reducing computational cost and avoiding overfitting. Given a performance measure Perf (e.g., classification accuracy), the problem can be stated as

    max_{S ⊆ {1, 2, . . . , n}} Perf(f(x_S))    subject to    |S| ≤ k,

where x_S denotes the feature vector reduced to the subset S, and k is the maximum number of features allowed in the subset. Biomarker discovery, on the other hand, is a specialized application of FS. It involves identifying key biological markers, such as specific genes or proteins, that have clinical or biological importance. The assumption behind applying FS techniques for this problem is that the elements chosen to improve the predictive model's performance are also likely to hold biological relevance [14, 15].
SFE encodes each candidate solution X as a binary vector over the original features. Specifically, X is the solution to the FS problem, where each element x_j in X represents the j-th feature: x_j = 0 indicates that the feature is not selected, while x_j = 1 indicates selection. The SFE algorithm operates in two main phases: exploration and exploitation. In the exploration phase, a global search is conducted using a non-selection operator to discard redundant and noisy features. The exploitation phase then applies a selection operator to refine the search locally, focusing on the features deemed non-selectable in the exploration phase. This results in a subset of relevant features.
The SFE-PSO algorithm combines SFE with an evolutionary computation
(EC) method, specifically particle swarm optimization (PSO). The primary
objective of SFE-PSO is to reduce dataset dimensionality with SFE in the initial
stages, followed by PSO to identify an optimal subset in the lower-dimensional
space. More specifically, PSO is applied after SFE has not improved the solution
within the previous 1000 fitness evaluations. Notably, PSO can be substituted
by other EC methods, creating a versatile and adaptable PSO-EC framework.
The fitness function in both SFE and SFE-PSO is based on the accuracy of
a classifier. For instance, GBM has four subtypes: Classical, Neural, Proneural,
and Mesenchymal. Each patient’s subtype (e.g., Neural or Proneural) is included
in the omics data. A classifier, such as K-nearest neighbor as used in the original
study, is trained on omics data to classify patients’ subtypes. In each iteration of
SFE, the classifier is retrained on a new subset of features while using the same
patient data.
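To make the interplay between the binary solution vector and the accuracy-based fitness concrete, the sketch below implements a simplified SFE-style loop with a KNN fitness. The non-selection/selection rates and the acceptance rule are illustrative assumptions and do not reproduce the exact operators of the original study.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of a KNN classifier trained only on the selected features."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, idx], y_tr)
    return knn.score(X_val[:, idx], y_val)

def sfe_like(X_tr, y_tr, X_val, y_val, evaluations=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_features = X_tr.shape[1]
    best = np.ones(n_features, dtype=int)        # start from all features selected
    best_fit = fitness(best, X_tr, y_tr, X_val, y_val)
    for _ in range(evaluations):
        cand = best.copy()
        selected = np.flatnonzero(cand == 1)
        if selected.size > 1:                    # exploration: non-selection operator
            drop = rng.choice(selected, size=max(1, selected.size // 10), replace=False)
            cand[drop] = 0
        unselected = np.flatnonzero(cand == 0)
        if unselected.size > 0:                  # exploitation: selection operator
            cand[rng.choice(unselected)] = 1
        f = fitness(cand, X_tr, y_tr, X_val, y_val)
        if f >= best_fit:                        # keep candidates that are no worse
            best, best_fit = cand, f
    return best, best_fit
```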
The flowcharts for the two algorithms are presented in Fig. 1 and Fig. 2.
Further details and explanations of these algorithms can be found in the original
study.
In the first phase of applying SFE and SFE-PSO to the TCGA GBM dataset, we
propose four key modifications: data preprocessing, introducing a mechanism to
control the number of output features, changing the fitness evaluation function,
and recording the algorithms’ performance across the training, validation, and
test sets (in contrast to the original study, which focused solely on the validation
set). The first modification, data preprocessing, is detailed earlier, along with its
rationale, and is illustrated in Block A of Fig. 1 and Fig. 2.
The second modification (Block B of Figs. 1 and 2) aims to control the num-
ber of output features, increasing it to 400 in certain experiments to enhance the
likelihood of identifying potential biomarkers, while still significantly reducing
the feature set. These feature counts represent only about 0.155%, 0.31%, and 0.62% of the original 64,399 features (genes). This approach aligns with the
Fig. 1. Flow chart version of the SFE algorithm.
Fig. 2. Flow chart version of the SFE-EC framework.
Using the same hyperparameters as in the previous section, but with unprocessed
data, we conducted an investigation using only the SFE algorithm, achieving the
results shown below.
The issue of overfitting is evident when comparing the results from E2 with those
from E1, where both experiments used the same hyperparameter settings but
differed in the dataset. In E2, the disparity between the classifier’s performance
on the training dataset and the test dataset is significantly larger than in E1.
Additionally, the Phase 2 results in E2 were suboptimal, further highlighting the
critical role of data preprocessing.
Upon examining the results of Phase 2 in E3, we observe an increase in
the number of overlapping genes, accompanied by a significant improvement in
the performance of classical machine learning models (the softmax classifier and
random forest), compared to the results in E1. This underscores the effective-
ness of controlling the number of output features. The improvement can also be
attributed to the typical behavior of machine learning models, which tend to
perform optimally and remain stable when working with a suitable number of
features.
In Phase 1, although there is a noticeable performance gap between the train-
ing and test sets, the results on the test set remain relatively robust, especially
considering the drastic reduction in the number of features. The classifier was
trained on the full set of 6,000 features but was subsequently evaluated on the
test set using only a small subset of features: 59, 107, and 400 features obtained
from the modified SFE and SFE-PSO algorithms (59 features in E1 SFE, 107
features in E1 SFE-PSO, and 400 features in E3 SFE and SFE-PSO). Despite
this substantial reduction, the model’s ability to generalize, as reflected in the
test set performance, is commendable. This indicates that the FS algorithm,
SFE, is effective at identifying the most relevant biomarkers. Traditional models
like SVM continue to perform well when applied to a refined subset of features.
The reduced complexity contributes to more efficient computation while main-
taining classification quality, demonstrating the algorithm’s practical utility in
handling high-dimensional data.
7 Conclusion
References
1. Ahadzadeh, B., et al.: SFE: a simple, fast, and efficient feature selection algorithm
for high-dimensional data. IEEE Trans. Evol. Comput. 27(6), 1896–1911 (2023)
2. Hassanpour, S.H., Dehghani, M.: Review of cancer from perspective of molecular.
J. Cancer Res. Pract. 4(4), 127–129 (2017)
3. Maruvada, P., et al.: Biomarkers in molecular medicine: cancer detection and diag-
nosis. Biotechniques 38(sup4), 9–15 (2005)
4. Liang, J., et al.: A survey on evolutionary computation for identifying biomarkers
of complex disease. IEEE Trans. Evol. Comput. (2024)
5. Xue, B., et al.: A survey on evolutionary computation approaches to feature selec-
tion. IEEE Trans. Evol. Comput. 20(4), 606–626 (2016)
6. Abd-Alsabour, N.: A review on evolutionary feature selection. In: 2014 European
Modelling Symposium. IEEE (2014)
7. Martinez, E., Alvarez, M.M., Trevino, V.: Compact cancer biomarkers discovery
using a swarm intelligence feature selection algorithm. Comput. Biol. Chem. 34(4),
244–250 (2010)
8. Amine, A., Bellatreche, L., Elberrichi, Z., Neuhold, E.J., Wrembel, R. (eds.): CIIA
2015. IAICT, vol. 456. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-
19578-0
9. Wang, X., Jia, W.: A feature weighting particle swarm optimization method to
identify biomarker genes. In: 2022 IEEE International Conference on Bioinformat-
ics and Biomedicine (BIBM). IEEE (2022)
10. Khan, N.M., Madhav C, N., Negi, A., Thaseen, I.S.: Analysis on improving the per-
formance of machine learning models using feature selection technique. In: Abra-
ham, A., Cherukuri, A.K., Melin, P., Gandhi, N. (eds.) ISDA 2018 2018. AISC, vol.
941, pp. 69–77. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-16660-
1_7
11. Tang, J., Alelyani, S., Liu, H.: Feature selection for classification: a review. In:
Data Classification: Algorithms and Applications, vol. 37 (2014)
12. Popovic, D., et al.: A self-tuning genetic algorithm with applications in biomarker
discovery. In: 2014 IEEE 27th International Symposium on Computer-Based Med-
ical Systems. IEEE (2014)
13. Panagiotopoulos, K., et al.: MEvA-X: a hybrid multiobjective evolutionary tool
using an XGBoost classifier for biomarkers discovery on biomedical datasets. Bioin-
formatics 39(7), btad384 (2023)
14. Torres, R., Judson-Torres, R.L.: Research techniques made simple: feature selection
for biomarker discovery. J. Investig. Dermatol. 139(10), 2068–2074 (2019)
15. Nair, T.M.: Calliper randomization: an artificial neural network based analysis of
E. coli ribosome binding sites. J. Biomol. Struct. Dyn. 15(3), 611–617 (1997)
16. Vargas, A.J., Harris, C.C.: Biomarker development in the precision medicine era:
lung cancer as a case study. Nat. Rev. Cancer 16(8), 525–537 (2016)
17. Bhavsar, H., Panchal, M.H.: A review on support vector machine for data clas-
sification. Int. J. Adv. Res. Comput. Eng. Technol. (IJARCET) 1(10), 185–189
(2012)
18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
A Reputation Scoring Framework
for Lending Protocols Using the PageRank
Algorithm
1 Introduction
The advent of blockchain technology has led to a fundamental transformation in
the financial system [4], with decentralized finance (DeFi) emerging as a viable
alternative to traditional centralized banking [19]. Within DeFi, lending decen-
tralized applications (Lending DApps), or Lending protocols, play a crucial
role by facilitating cryptocurrency borrowing and lending without intermedi-
aries [24]. As of May 2024, lending ranks second in total value locked (TVL)
among DeFi categories, with over $36 billion.1 This reflects the substantial adop-
tion and influence of lending within the blockchain ecosystem.
M.-T. Nguyen and T.-D. Trinh—Contributed equally to this work.
1 https://defillama.com/categories.
Entities in lending DApps include: (1) user wallets [21], representing indi-
vidual users who participate in lending by depositing or borrowing tokens; (2)
centralized exchanges [25], hot wallets of centralized exchanges (CEX) that oper-
ate like user wallets but provide higher liquidity, enhancing market stability and
efficiency; (3) lending smart contracts [22], which automate lending processes,
ensure transparency, and enable trustless execution; (4) protocol supported tokens
[20], tokens eligible for borrowing or lending within the protocol.
Reputation within a lending DApp is demonstrated by consistent engagement
and lending activities over time. Implementing a reputation scoring algorithm
based on user interactions offers deeper insights into entity behavior and con-
tributions [1], allowing for more informed governance decisions. This scoring
mechanism also fosters healthy competition, motivating users to engage more actively and thereby supporting a vibrant and sustainable ecosystem.
However, evaluating entities within these DApps remains a challenge. Cur-
rent methods primarily rely on token holdings for governance voting [2], which is
vulnerable to manipulation as entities can temporarily inflate their token hold-
ings to gain influence during governance events [7]. To address this issue, we
propose a more robust scoring framework that evaluates entities based on both
token holdings and lending interactions over time, providing a fairer approach
to governance and development within the lending ecosystem [10].
This paper presents a robust reputation and credit scoring framework for
entities in decentralized lending applications. First, we adapt the PageRank
algorithm [18] to rank wallets, centralized exchanges, lending contracts, and
supported tokens. This ranking utilizes a graph constructed from an analysis of
lending activities, yielding a unified reputation score across diverse entity types.
Second, we implement a multi-step score normalization process to align our
results with traditional credit scoring systems, ensuring compatibility with exist-
ing financial frameworks. Finally, we validate our approach through backtest-
ing during market downturns, demonstrating that higher-ranked wallets exhibit
resilience and underscoring the proposed model’s effectiveness in risk assessment
and stability promotion within the lending ecosystem.
The paper is structured as follows: Sect. 2 reviews related work in DeFi and
lending DApps. Section 3 describes the proposed scoring framework and method-
ology. Section 4 presents the implementation and results of our system and dis-
cusses the findings. Finally, Sect. 5 concludes with future research directions and
potential improvements.
2 Related Work
This section begins with an introduction to the PageRank algorithm and its
applications. It then reviews existing evaluation systems relevant to our proposed
framework, including credit scoring systems in traditional banking and Web3
scoring projects.
To address the rank sink issue – where pages accumulate rank without distributing it – the PageRank algorithm uses a damping factor (see Formula 2).
This adjustment balances the rank distribution, preventing certain pages from
disproportionately accumulating rank.
Since its introduction, PageRank has evolved with notable variations [5] such
as the Weighted PageRank (WPR) [23]. Unlike the standard PageRank, which
treats all links equally, WPR enhances ranking by incorporating the significance
of both incoming and outgoing links. This approach assigns different weights to
links, improving the relevance of search results.
    PR(u) = (1 − d) + d · Σ_{v ∈ B(u)} PR(v) / N_v        (2)
The WPR algorithm assigns higher rank values to more important pages by considering both the number of inlinks and outlinks. Formula 3 illustrates how WPR uses weights W_in(v, u) and W_out(v, u), calculated based on the number of inlinks and outlinks. In this formula, R(v) denotes the set of pages to which page v links, I_u and I_p represent the number of inlinks of pages u and p, respectively, while O_u and O_p represent the number of outlinks. W_in(v, u) measures how important node u is compared to other nodes that link to node v. It helps adjust the influence of the inbound link from v to u. W_out(v, u) adjusts the importance of the outbound link from v to u based on how links are distributed from node v to other nodes.

    W_in(v, u) = I_u / Σ_{p ∈ R(v)} I_p,    W_out(v, u) = O_u / Σ_{p ∈ R(v)} O_p
    PR(u) = (1 − d) + d · Σ_{v ∈ B(u)} PR(v) · W_in(v, u) · W_out(v, u)        (3)
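As a concrete illustration of Formulas 2 and 3, the following sketch runs a power-iteration version of Weighted PageRank on a small directed graph. It is a didactic implementation under the definitions above, not code from the cited works.

```python
from collections import defaultdict

def weighted_pagerank(edges, d=0.85, iters=50):
    """edges: list of (v, u) directed links from page v to page u (Formula 3)."""
    nodes = {x for e in edges for x in e}
    out_links = defaultdict(set)   # R(v): pages that v links to
    in_links = defaultdict(set)    # B(u): pages linking to u
    for v, u in edges:
        out_links[v].add(u)
        in_links[u].add(v)

    def w_in(v, u):   # I_u divided by the total inlink count over R(v)
        denom = sum(len(in_links[p]) for p in out_links[v]) or 1
        return len(in_links[u]) / denom

    def w_out(v, u):  # O_u divided by the total outlink count over R(v)
        denom = sum(len(out_links[p]) for p in out_links[v]) or 1
        return len(out_links[u]) / denom

    pr = {n: 1.0 for n in nodes}
    for _ in range(iters):
        pr = {u: (1 - d) + d * sum(pr[v] * w_in(v, u) * w_out(v, u)
                                   for v in in_links[u])
              for u in nodes}
    return pr

print(weighted_pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]))
```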
Our selection of the PageRank algorithm for this research is driven by its
computational efficiency [12] and scalability for large datasets. Recent studies
have demonstrated its application in blockchain contexts. Do and Do [6] used
PageRank to rank Ethereum addresses based on transaction data, effectively
identifying significant nodes such as major exchanges and decentralized appli-
cations. Mitra [14] proposed a cost-efficient method to improve blockchain con-
sensus in Industry 4.0 by integrating a Cellular Automata-based extension of
PageRank, enhancing data security and privacy. Experimental results showed
this method outperformed standard PageRank, NCDawareRank, and HodgeR-
ank. Qin et al. [16] introduced the Segmented PageRank (SPR) algorithm for
evaluating data value in alliance blockchains, which proved broadly applicable
and offered similar complexity to traditional PageRank. Experimental results
confirmed its superior performance.
Gleich [9] noted that PageRank is widely used in bibliometrics, social net-
works, link prediction, and recommendation systems. Boldi et al. [3] applied it
to a voting model in social networks, while François et al. [8] extended PageR-
ank with clustering to detect stealthy botnets in peer-to-peer communication
networks.
Data Crawling. In this initial step, data is gathered from lending DApps to
construct a graph for applying the PageRank algorithm. We begin by identify-
ing and collecting addresses of supported tokens from the official lending DApp
website. Using these token addresses, all related transactions are retrieved, along
with a list of interacting addresses. These are categorized into three groups:
personal wallet addresses, centralized exchange hot wallets, and smart contract
addresses. This comprehensive dataset captures both user and contract interac-
tions, which is essential for accurate graph construction and analysis.
Graph Building. Using the collected data, we create a graph to model interac-
tions between different entities within the lending DApp. Each vertex represents
an entity, while each edge signifies a financial relationship between them. To
ensure accuracy in scoring, entities, such as identical tokens deployed across
multiple chains, are consolidated.
The resulting reputation scores also allow DApps to make more informed decisions, particularly in risk manage-
ment and loan issuance. Furthermore, the automated updates reduce the need
for manual intervention, ensuring that the system can scale to accommodate
growing user and transaction volumes across multiple chains.
Fig. 1. Graph Model for Lending Activities. The DApp token, as a type of supported
token, has the same types of edges as other supported tokens.
the borrower fully repays the loan, the dTokens are burned, and they can retrieve
their collateral. Lenders can redeem their aTokens at any time to withdraw
their initial deposits along with the accrued interest. This entire process, includ-
ing deposits, borrowing, repayments, and withdrawals, operates autonomously,
ensuring the stability of the lending protocol even in volatile markets.
Based on these operations, we propose the graph shown in Fig. 1. Each DApp
has a unique graph for evaluating its entities, which include: (1) the lending
DApp; (2) supported tokens – assets that can be deposited or borrowed
within the DApp; (3) the DApp token – the token issued by the DApp, also
usable for borrowing or depositing; (4) aTokens and dTokens – each supported token has corresponding aTokens and dTokens, such as aETH and dETH for ETH; (5) Addr – addresses interacting with the DApp, including user wallets, centralized exchange hot wallets, or the DApp's lending smart contracts.
The graph is constructed with each vertex representing an entity, analogous
to a page in the PageRank algorithm, where PageRank indicates the entity’s rep-
utation score within the lending ecosystem. The weighted PageRank algorithm
assigns weights to edges based on the percentage of users clicking links to navi-
gate between pages. In Lending DApps, financial connections – such as token
ownership or the amount of tokens locked in the lending DApp – are modeled
as edges, with weights reflecting the monetary value of these relationships. This
approach standardizes edges to a common unit, enabling consistent conversion
to percentage weights.
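A sketch of this graph construction using networkx is shown below; the entity labels and USD amounts are hypothetical, and the bidirectional-edge convention anticipates the discussion that follows.

```python
import networkx as nx

# Vertices are lending-DApp entities; edges carry USD-denominated weights.
G = nx.DiGraph()

def add_relation(graph, a, b, usd_value, bidirectional=True):
    """Model a financial relationship (e.g. tokens held or locked) as weighted edges."""
    graph.add_edge(a, b, weight=usd_value)
    if bidirectional:  # most relations raise both entities' ranks (see below)
        graph.add_edge(b, a, weight=usd_value)

# Hypothetical interactions for illustration only
add_relation(G, "wallet:0xabc", "token:aETH", 12_000)
add_relation(G, "wallet:0xabc", "dapp:lending_pool", 7_500)
add_relation(G, "dapp:lending_pool", "token:ETH", 5_400_000)

# Raw reputation scores from PageRank over the USD-weighted graph
scores = nx.pagerank(G, alpha=0.85, weight="weight")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```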
To establish the edges among the vertices in the graph, we first examine the effects of adding one-way and two-way links. When a one-way link is added from vertex u to vertex v, u creates an outbound link, and v gains an inbound link. While u's PageRank does not decrease immediately, the amount of PageRank it passes to v is divided among all of u's outbound links. The more outbound links u has, the smaller the portion of its PageRank that v receives. Although u's initial PageRank remains unchanged, this redistribution can cause its rank to decrease over several iterations, as its influence becomes more diluted. In contrast, v benefits from the new inbound link, and its PageRank is likely to increase depending on u's rank and how many other outbound links u has.
In the case of a two-way link between vertices u and v, both gain new inbound links, potentially enhancing their PageRanks. The value each vertex receives is moderated by the number of their outbound links. Mutual inbound links generally lead to a net increase in PageRank, particularly when both vertices already have high ranks. Nonetheless, the exact impact depends on the overall graph structure and the distribution of PageRank among other vertices.
Based on these characteristics, the direction of an edge in the graph is determined by which entity's rank benefits from the connection. If only v's rank increases, the edge directs from u to v; if both ranks increase, the edge is bidirectional. Consequently, most edges are bidirectional, with equal weights assigned in both directions, thereby enhancing each vertex's rank through their interrelations.
Figure 1 illustrates the edges, where labels indicate the weights of these
relationships expressed as monetary values in USD, rather than their
names. The edges include:
The rank scores of centralized exchanges, lending contracts, and supported tokens differ significantly from those of individual users. Thus, we isolate user wallets and
normalize their rank scores to a range of 300 to 850, ensuring that these scores
adhere to a normal distribution. This adoption of the FICO score range enhances
the practical applicability of the proposed ranking method. Details on score
normalization are provided in the experiment section.
4 Experimental Results
To evaluate the proposed entity ranking method for lending based on the PageR-
ank algorithm, we conducted experiments with the Aave lending protocol. As of
May 2024, Aave is the leading lending protocol with over $11 billion in TVL.
Leveraging the lending data outlined in Sect. 3.1, we isolated Aave’s data across
six EVM chains, covering both Aave v2 and v3 versions, and constructed a graph
with over 45,000 vertices and 200,000 edges. The average processing time was
15 min for graph construction, and 5 min for running the PageRank algorithm
on the lending DApp graph.
Since personal wallet rankings hold the most practical relevance for lending
applications, we specifically optimized the score normalization for this category.
Initially, we selected the top 6,000 personal wallets by score and plotted their
scores, as illustrated in Fig. 2a. Our objective was to transform the scores to a
range of 355–800 while ensuring that the scores follow a normal distribution,
which required transitioning the graph from Fig. 2a to Fig. 2f.
Through regression calculations, we identified four transformations as follows: (1) We normalized the scores of all wallets to be greater than 10 using the transformation shown in Formula 4. This step prepared us for the subsequent logarithmic transformation using base 10. (2) We applied a double logarithmic transformation using base 10 (see Formula 5). (3) The scores were exponentiated by a factor of α, as shown in Formula 6. To achieve normalization in the FICO score range of 300 to 850, α is chosen such that

    (max(all IntermediateScore2) / min(all IntermediateScore2))^α ≈ 850/300.

(4) Finally, the scores are scaled to the range of 300–850 according to Formula 7.
    IntermediateScore1 = PageRank × 10 / min(all PageRanks)        (4)

    IntermediateScore2 = log10(log10(IntermediateScore1))        (5)

    IntermediateScore3 = (IntermediateScore2)^α        (6)

    FinalScore = IntermediateScore3 / min(all IntermediateScore3) × 300        (7)
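A compact sketch of Formulas 4–7 is given below. It assumes the raw PageRank values of the selected wallets are strictly above the global minimum (so the double logarithm stays positive), and the computation of α is one way to satisfy the ratio condition stated above.

```python
import numpy as np

def normalize_to_fico(top_scores, all_scores, lo=300, hi=850):
    """Map raw PageRank scores of personal wallets onto the 300-850 range (Formulas 4-7)."""
    s = np.asarray(top_scores, dtype=float)

    # Formula 4: rescale so that the global minimum score becomes 10
    s1 = s * 10 / np.min(all_scores)

    # Formula 5: double logarithm, base 10 (requires s1 > 10 for the selected wallets)
    s2 = np.log10(np.log10(s1))

    # Formula 6: exponentiate by alpha; alpha is chosen here so that the max/min ratio
    # of the exponentiated scores is approximately hi/lo
    alpha = np.log(hi / lo) / np.log(s2.max() / s2.min())
    s3 = s2 ** alpha

    # Formula 7: scale so that the minimum lands at 300 and the maximum near 850
    return s3 / s3.min() * lo
```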
The process of transforming the PageRank scores is illustrated in Fig. 2,
which includes six plots: the original PageRank scores, the transformed scores
with a minimum value of 10, the first logarithmic transformation, the second
logarithmic transformation, the exponentiated scores, and the final scores nor-
malized to the range of 300–850. The curve shape in the final plot indicates that
our method has successfully resulted in scores distributed normally from 300 to
850. This is further supported by the histogram of wallet scores in Fig. 3.
Along with personal wallets, the four transformations were also applied to
rank other entity types, including centralized exchanges, lending smart contracts,
and supported tokens of Aave, to derive their overall scores. These scores are
used to analyze the entities in the following sections.
Fig. 5. Financial Holdings of the Top 5,000 Wallets in the Aave DApp
the Poor range, contributing only 4.5% to the Total Value Locked (TVL), indi-
cating their limited impact on the lending pool. In contrast, wallets with scores
above 800, though numbering only 126, represent 40% of the TVL, highlighting
their importance for financial stability. This distribution suggests that AAVE
could refine its lending strategies by offering targeted incentives to different user
segments to optimize participation and maintain platform stability.
Figure 5 presents a scatter plot visualizing the top 5,000 wallets ranked by
score, along with their respective deposit, borrow amounts in AAVE, and the
amount of AAVE tokens held in these wallets. The trend shows a decline in these
amounts as rank decreases. Since the y-axis is on a logarithmic scale, it is clear
that wallets with the highest ranks (on the left side of the plot) have significantly
higher deposit, borrow amounts, and AAVE holdings compared to lower-ranked
wallets.
Fig. 7. Scores and Total Borrowing and Depositing of Top 10 Supported Tokens
on Aave
Fig. 8. Scores and Total Transaction Volume of Top 10 Aave Smart Contracts
where the left chart depicts Bitcoin’s price, while the right chart shows the per-
centage of wallets that have not been liquidated, categorized as Exceptional and
Very Good, Good, and Fair.
The right chart reveals a clear pattern: Exceptional and Very Good wallets
exhibit greater resilience, with approximately 30% remaining by the end of the
period, compared to around 20% for Good and Fair wallets. This finding rein-
forces the validity of our scoring model for lending entities as higher-scoring
wallets are more capable of weathering market downturns.
5 Conclusion
This study presents a robust entity ranking framework for decentralized lend-
ing, utilizing the PageRank algorithm. We detail the graph construction process
based on an analysis of lending DApps and conducted experiments with Aave,
the leading lending protocol in the market. By normalizing PageRank scores to
align with FICO scores, we evaluated various entities within the Aave ecosystem,
including personal wallets, centralized exchanges, lending smart contracts, and
supported tokens. This evaluation provided valuable insights into the influence
of these entities on Aave’s platform stability.
Our analysis demonstrated that a small fraction of high-scoring wallets plays
a crucial role in maintaining platform stability, highlighting the necessity for tar-
geted lending strategies. Additionally, backtesting results confirmed that higher-
ranking wallets tend to be more resilient during market downturns, thereby
reinforcing the reliability of our scoring model.
Future work will focus on refining the graph construction process to incor-
porate additional parameters and expanding our model to include other decen-
tralized finance protocols. This will enable a more comprehensive evaluation of
lending strategies and risk management practices across various platforms, ulti-
mately contributing to the advancement of decentralized finance applications.
References
1. Agrawal, D., Natalia, N., Gopalakrishnan, G., Guzman, M.N., McDonald, M.D.,
Kim, H.M.: Loyalty points on the blockchain. Business Manage. Stud. 4(3), 80–92
(2018)
2. Barbereau, T., Smethurst, R., Papageorgiou, O., Rieger, A., Fridgen, G.: DeFi, not so decentralized: the measured distribution of voting rights. In: Proceedings of the 55th Hawaii International Conference on System Sciences (dec 2022)
3. Boldi, P., Bonchi, F., Castillo, C., Vigna, S.: Voting in social networks, pp. 777–786
(11 2009). https://doi.org/10.1145/1645953.1646052
4. Chen, Y., Bellavitis, C.: Decentralized finance: Blockchain technology and the quest
for an open financial system. In: Stevens Institute of Technology School of Business
Research Paper, Hoboken, NJ 07030-5991 USA (jul 2019)
5. Chung, F.: A brief survey of pagerank algorithms. In: IEEE Transactions on Net-
work Science and Engineering, vol. 1, pp. 38–42 (2014)
6. Do, H.D., Do, T.: Pagerank and hodgerank on ethereum transactions: A measure
for social credit. Int. J. Softw. Innov. (IJSI) 11(1), 1–13 (2023)
7. Fatih, R., Arezki, S., Gadi, T.: A review of blockchain-based e-voting systems:
comparative analysis and findings. Int. J. Interact. Mobile Technol. (iJIM) 17,
49–67 (dec 2023)
8. François, J., Wang, S., State, R., Engel, T.: BotTrack: tracking botnets using net-
flow and pagerank. In: Domingo-Pascual, J., Manzoni, P., Palazzo, S., Pont, A.,
Scoglio, C. (eds.) NETWORKING 2011. LNCS, vol. 6640, pp. 1–14. Springer, Hei-
delberg (2011). https://doi.org/10.1007/978-3-642-20757-0_1
9. Gleich, D.F.: Pagerank beyond the web. SIAM Rev. 57(3), 321–363 (2015)
10. Hassija, V., Bansal, G., Chamola, V., Kumar, N., Guizani, M.: Secure lending:
blockchain and prospect theory-based decentralized credit scoring model. IEEE
Trans. Netw. Sci. Eng. 7(4), 2566–2575 (2020)
11. Li, L., Wu, J., Cui, W.: A review of blockchain cross-chain technology. IET
Blockchain 3(3), 149–158 (2023)
12. Lofgren, P., Banerjee, S., Goel, A.: Bidirectional pagerank estimation: from
average-case to worst-case. In: Algorithms and Models for the Web Graph, pp.
164–176, Springer International Publishing, Cham (2015)
13. Mcwilliams, A.: Corporate social responsibility: a theory of the firm perspective.
Acad. Manage. Rev. 26, 117–127 (01 2001)
14. Mitra, A.: How can we enhance reputation in blockchain consensus for indus-
try 4.0-a proposed approach by extending the pagerank algorithm. Inter-
national Journal of Information Management Data Insights 2(2), 100138
(2022), ISSN 2667-0968, https://doi.org/10.1016/j.jjimei.2022.100138, https://
www.sciencedirect.com/science/article/pii/S2667096822000817
15. myFICO: What’s in my FICO® Scores? (2024). https://www.myfico.com/credit-education/whats-in-your-credit-score
16. Qin, C., et al.: A segmented pagerank-based value compensation method for per-
sonal data in alliance blockchains. Big Data Research 30, 100326 (2022), ISSN
2214-5796, https://doi.org/10.1016/j.bdr.2022.100326, https://www.sciencedirect.
com/science/article/pii/S221457962200020X
17. Qin, K., Zhou, L., Gamito, P., Jovanovic, P., Gervais, A.: An empirical study of defi
liquidations: incentives, risks, and instabilities. In: IMC ’21: Proceedings of the 21st
ACM Internet Measurement Conference, pp. 336–350, Association for Computing
Machinery, New York, NY, United States (nov 2021)
18. Rieder, B.: What is in pagerank? a historical and conceptual investigation of a
recursive status index. Comput. Cult. 2 (sep 2012)
19. Santos, S.D., Singh, J., Thulasiram, R.K., Kamali, S., Sirico, L., Loud, L.: A new
era of blockchain-powered decentralized finance (defi) - a review. In: Proceedings of
the IEEE Annual Computer Software and Applications Conference (COMPSAC),
pp. 1286–1292, IEEE, Los Alamitos, CA, USA (aug 2022)
20. Shirole, M., Darisi, M., Bhirud, S.: Cryptocurrency token: an overview. In: IC-BCT
2019, pp. 133–140, Springer Singapore, Singapore (2020)
21. Suratkar, S., Shirole, M., Bhirud, S.: Cryptocurrency wallet: a review. In: 2020
4th International Conference on Computer, Communication and Signal Processing
(ICCCSP), pp. 1–7 (2020)
22. Taherdoost, H.: Smart contracts in blockchain technology: A critical review. Infor-
mation 14(2), 117 (2023)
23. Xing, W., Ghorbani, A.: Weighted pagerank algorithm. In: Proceedings of the
Second Annual Conference on Communication Networks and Services Research,
2004., pp. 305–314 (2004)
24. Xu, J., Vadgama, N.: From banks to defi: the evolution of the lending market.
In: Enabling the Internet of Value, pp. 53–66, Springer, Cham, Los Alamitos, CA,
USA (jan 2022)
25. Zhou, Z., Shen, B.: Toward understanding the use of centralized exchanges for
decentralized cryptocurrency (2022). https://arxiv.org/abs/2204.08664
Unifying Convolution and Self-attention
for Liver Lesion Diagnosis on Multi-phase
Magnetic Resonance Imaging
1 Introduction
Liver cancer is the sixth most diagnosed cancer and the third leading cause of
cancer-related deaths globally [18]. Hepatocellular carcinoma (HCC) accounts
for most cases, while secondary liver cancers from metastases also contribute
significantly [7]. Non-invasive imaging methods like CT and MRI are essential
for detecting tumors and planning surgeries, despite challenges from the liver’s
complex architecture and lesion variability [12].
2 Related Work
2.1 SDR-Former: A Siamese Dual-Resolution Transformer for Liver
Lesion Classification Using 3D Multi-phase Imaging
Lou et al. [15] introduce the SDR-Former, a framework designed for liver lesion
classification in 3D multi-phase CT and MR imaging. This framework com-
bines a hybrid CNN-Transformer network called DR-Former and an Adaptive
Phase Selection Module (APSM) to enhance feature representation and improve
diagnostic accuracy. The SDR-Former is validated using two clinical datasets: a
three-phase CT dataset with two lesion types and an eight-phase MR dataset
with seven different lesion categories.
The SDR-Former framework employs a dual-stage approach with a Siamese
Neural Network (SNN) for feature extraction and an Adaptive Phase Selection
Module (APSM) for phase-specific feature integration. The SNN ensures scalabil-
ity and adaptability across datasets with varying phase counts, enabling effective
transfer learning. However, SNN alone may struggle to isolate distinctive phase
features, leading to weaker representation.
To address this, the DR-Former network incorporates dual branches: a 3D
CNN for high-resolution spatial details and a 3D Transformer for low-resolution
global context. These complementary methods, connected via a Bidirectional
Convolutional Interaction Module (BCIM), enhance feature exchange and rep-
resentation.
The APSM then dynamically merges phase-sensitive features, emphasizing
diagnostically critical information. The combined features are processed through
Global Average Pooling (GAP) and a Fully-Connected (FC) layer for final clas-
sification. This design ensures robust multi-phase imaging analysis, improving
diagnostic accuracy.
Yin Cui et al. [3] address the issue of imbalanced data in large-scale datasets,
where a few classes dominate while most classes have relatively few samples.
Traditional re-balancing strategies, such as re-sampling and re-weighting based
on class frequency, often fail to yield satisfactory performance on real-world data.
The authors propose a new approach that calculates the effective number of
samples, which considers the diminishing additional benefit of new data points
as the sample size increases. This is achieved through a formula involving a
hyperparameter β, which helps better estimate each class’s true representation.
To address the imbalance, the authors introduce a class-balanced loss func-
tion that re-weights the loss for each class inversely proportional to its effective
number of samples. This method assigns higher weights to under-represented
classes, thereby improving the model’s performance across all classes. The paper
demonstrates that this approach significantly enhances the accuracy of mod-
els on long-tailed datasets like CIFAR, ImageNet, and iNaturalist. The paper
shows substantial improvements in handling data imbalance by integrating the proposed class-balanced term into standard loss functions such as softmax cross-entropy and focal loss.
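For reference, the effective number of samples in [3] is E_n = (1 − β^n)/(1 − β), and the class-balanced weight of a class is proportional to its inverse; a small sketch with hypothetical class counts follows.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Class weights from the effective number of samples E_n = (1 - beta^n) / (1 - beta)."""
    n = np.asarray(samples_per_class, dtype=float)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so that the weights sum to the number of classes (a common convention)
    return weights / weights.sum() * len(n)

# Hypothetical long-tailed counts for a seven-class lesion dataset
print(class_balanced_weights([220, 90, 60, 45, 40, 25, 18]))
```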
3 Proposed Method
3.1 Unifying Convolution and Self-attention
In recent years, convolutional neural networks (CNNs) have revolutionized com-
puter vision by excelling in tasks like image classification and object detec-
tion. Starting with seminal architectures like AlexNet [11], CNNs have evolved
through numerous powerful variants, demonstrating high performance across
a spectrum of image understanding tasks. As video data gains prominence,
researchers have extended CNNs to 3D space, albeit facing challenges of opti-
mization complexity and computational cost. Strategies such as kernel inflation
and dimension factorization have been explored to mitigate these issues, while
temporal modeling enhancements like temporal shifts and spatiotemporal exci-
tation have aimed to improve video understanding.
Vision Transformers (ViTs) have emerged as an alternative approach to cap-
ture long-range dependencies in images. Inspired by Transformer architectures
from natural language processing (NLP), ViTs represent images as tokens and
utilize attention mechanisms to model token relationships. Despite initial depen-
dencies on large datasets and careful augmentation, advancements in patch
embedding, efficient self-attention, and multi-scale architectures have signifi-
cantly enhanced ViT performance across various image tasks. Extensions to
Fig. 1. Unified transFormer. The dimensions highlighted in red only exist for the video
input, while all are equal to one for image input. [13] (Color figure online)
video modeling, such as TimeSformer, have further adapted ViTs for spatiotem-
poral representation learning, though challenges remain in efficiently encoding
low-level features compared to convolution-based methods.
Efforts to combine CNNs and ViTs seek to leverage their respective strengths
for enhanced vision tasks. Integrative approaches include incorporating convo-
lutional stems and position embeddings into ViTs, as well as embedding convo-
lution within Transformer feed-forward networks. While some methods focus on
replacing convolution with self-attention, recent innovations like UniFormer [13]
propose unified architectures that blend both mechanisms. This approach aims
to optimize local and global token relations across diverse vision tasks, achiev-
ing improved accuracy and computational efficiency in both image and video
domains.
    X = DPE(X_in) + X_in        (1)
    Y = MHRA(Norm(X)) + X        (2)
    Z = FFN(Norm(Y)) + Y        (3)

Considering the input token tensor X_in ∈ R^(C×T×H×W) (where T = 1 for image
input), the paper first introduces Dynamic Positional Embedding (DPE) to
dynamically incorporate positional information into all tokens (Eq. 1). Next,
Multi-Head Relation Attention (MHRA) enhances each token by capturing con-
textual relationships with neighboring tokens (Eq. 2). Finally, the paper employs
a Feed-Forward Network (FFN) akin to traditional Vision Transformers (ViTs)
(Eq. 3).
Local MHRA. The paper proposes representing local affinity using a learnable parameter matrix in the shallow layers. Specifically, for an anchor token X_i, the local RA learns the affinity between this token and others in the small neighborhood Ω_i^(t×h×w) (t = 1 for image input):

    A_n^local(X_i, X_j) = a_n^(i−j),        (6)

where j ∈ Ω_i^(t×h×w), a_n ∈ R^(t×h×w) is a learnable parameter, and X_j refers to any neighbor token in Ω_i^(t×h×w). Here, (i − j) denotes the relative position between tokens i and j.
Table 1. Backbones for image classification; ‘L’ and ‘G’ refer to local and global UniFormer blocks, respectively [13].
    DPE(X_in) = DWConv(X_in),        (8)
where DWConv refers to depthwise convolution with zero paddings. The paper
adopts this design for DPE based on several reasons. First, depthwise convolution
adapts well to arbitrary input shapes, such as its straightforward extension to
encode 3D positional information in videos. Second, it is lightweight, balancing
computation and accuracy efficiently. Finally, adding zero paddings helps tokens
understand their absolute positions by progressively querying their neighbors [2]
(Table 1).
4 Experiments
4.1 Dataset
The MR image dataset used in our experiment was obtained from Lou et al. [15]
and consists of 498 annotated multi-phase liver lesions from an equal number of patients.
each of size 512 × 512. The lesion is annotated to be present from the 33rd to
the 42nd slice. To preprocess this data, we create a 3D volume based on the
annotation information. The ROI is defined using the bounding boxes, and to
ensure adequate context, we extend the region by 16 pixels in both the x and
y directions and add two extra slices at each end along the z-axis. This results
in a final cropped 3D volume of size (14, 94, 87), comprehensively capturing the lesion and surrounding areas.
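A small sketch of this ROI cropping is shown below; the bounding-box coordinates are hypothetical and chosen only so that the resulting crop matches the (14, 94, 87) example above.

```python
import numpy as np

def crop_lesion_volume(volume, z_first, z_last, x_min, x_max, y_min, y_max,
                       xy_margin=16, z_margin=2):
    """Crop a lesion ROI from a (slices, H, W) MR phase with extra context.

    The bounding box is widened by 16 pixels in x and y and by two slices
    at each end along z, then clipped to the volume boundaries.
    """
    z0 = max(z_first - z_margin, 0)
    z1 = min(z_last + z_margin + 1, volume.shape[0])
    y0, y1 = max(y_min - xy_margin, 0), min(y_max + xy_margin + 1, volume.shape[1])
    x0, x1 = max(x_min - xy_margin, 0), min(x_max + xy_margin + 1, volume.shape[2])
    return volume[z0:z1, y0:y1, x0:x1]

phase = np.zeros((60, 512, 512), dtype=np.float32)      # one MR phase, hypothetical size
roi = crop_lesion_volume(phase, z_first=32, z_last=41,  # 33rd-42nd slice, 0-based here
                         x_min=200, x_max=254, y_min=180, y_max=241)
print(roi.shape)                                        # (14, 94, 87) with these boxes
```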
Gaussian noise, and blurring effects. Lesion volumes are randomly cropped to dimensions of 14 × 112 × 112.
– Loss function: Given the long-tailed distribution of the dataset, a class-
balanced loss function is employed to compare the model’s output with the
ground truth.
kappa coefficient of 0.7467. These results represent the highest scores among the
comparative methods, with SDR-Former achieving the best accuracy, F1 score,
and kappa coefficient.
Our proposed method using the Uniformer-base variant matched the highest
accuracy score of 0.7885. It also achieved the highest precision of 0.8380, indi-
cating superior ability in correctly identifying positive cases. The F1 score was
0.7719, slightly lower than that of SDR-Former but still competitive, and the
kappa coefficient was 0.7358, reflecting substantial agreement and robustness of
our approach.
Overall, our method demonstrated competitive performance, particularly
excelling in precision and achieving strong results across all metrics, thus vali-
dating the effectiveness of the Uniformer framework for multi-phase liver lesion
detection and classification.
The confusion matrix analysis reveals that many cases are misclassified as
Hepatocellular Carcinoma (HCC), primarily due to several factors. Firstly, HCC
is the most common liver cancer type, resulting in a higher representation in the
training dataset, which can bias the model towards classifying uncertain cases as
HCC. The imaging characteristics of HCC often overlap with other liver lesions
such as Hepatic Metastasis (HM) and Intrahepatic Cholangiocarcinoma (ICC),
leading to confusion. The similarity in texture, shape, and enhancement patterns
across different MRI phases contributes to this misclassification. Moreover, class
imbalance in the dataset and variability in annotation quality can further exac-
erbate this issue, making it challenging for the model to accurately distinguish
HCC from other lesion types. Addressing these challenges may require balancing
the dataset, enhancing feature extraction techniques, employing advanced data
augmentation strategies, and ensuring consistent annotations.
5 Conclusion
References
1. Chu, X., et al.: Twins: Revisiting the design of spatial attention in vision trans-
formers. In: Proceedings of the 35th Conference on Neural Information Processing
Systems (NeurIPS) (2021)
2. Chu, X., Zhang, B., Tian, Z., Wei, X., Xia, H.: Do we really need explicit
position encodings for vision transformers? arXiv preprint arXiv:2102.10882
abs/2102.10882 (2021)
3. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on
effective number of samples. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR) (Jun 2019). https://doi.org/10.
1109/CVPR.2019.01138
4. Liang, D., et al.: Combining convolutional and recurrent neural networks for clas-
sification of focal liver lesions in multi-phase CT images. In: Frangi, A.F., Schn-
abel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018.
LNCS, vol. 11071, pp. 666–675. Springer, Cham (2018). https://doi.org/10.1007/
978-3-030-00934-2_74
5. Dong, X., et al.: Cswin transformer: A general vision transformer backbone with
cross-shaped windows. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 12124–12134 (2022)
6. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. In: Proceedings of the International Conference on Learning
Representations (ICLR) (2021)
7. Ferlay, J., Shin, H.R., Bray, F., Forman, D., Mathers, C., Parkin, D.M.: Estimates
of worldwide burden of cancer in 2008: Globocan 2008. Int. J. Cancer 127(12),
2893–2917 (2010)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition
(2015). https://arxiv.org/abs/1512.03385
9. Khan, M., et al.: Multimodal brain tumor classification using deep learning and
robust feature selection: a machine learning application for radiologists. Diagnostics
10, 565 (08 2020). https://doi.org/10.3390/diagnostics10080565
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization (2017). https://
arxiv.org/abs/1412.6980
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Pereira, F., Burges, C., Bottou, L., Weinberger,
K. (eds.) Advances in Neural Information Processing Systems. vol. 25. Curran
Associates, Inc. (2012). https://proceedings.neurips.cc/paper_files/paper/2012/
file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
12. Lee, S., et al.: Ct and mri liver imaging reporting and data system version 2018 for
hepatocellular carcinoma: A systematic review with meta-analysis. J. Am. Coll.
Radiol. 17(10), 1199–1206 (2020)
13. Li, K., et al.: Uniformer: Unifying convolution and self-attention for visual recog-
nition (2023). https://arxiv.org/abs/2201.09450
14. Liu, Z., et al.: Swin transformer: Hierarchical vision transformer using shifted win-
dows. In: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV), pp. 10012–10022 (2021)
15. Lou, M., Ying, H., Liu, X., Zhou, H.Y., Zhang, Y., Yu, Y.: Sdr-former: A siamese
dual-resolution transformer for liver lesion classification using 3d multi-phase imag-
ing (2024). https://arxiv.org/abs/2402.17246
16. Luo, L., et al.: Rare benign liver tumors that require differentiation from hepato-
cellular carcinoma: focus on diagnosis and treatment. J. Cancer Res. Clin. Oncol.
149 (07 2022). https://doi.org/10.1007/s00432-022-04169-w
17. Qu, T., et al.: M3net: A multi-scale multi-view framework for multi-phase pan-
creas segmentation based on cross-phase non-local attention. Med. Image Anal.
75, 102232 (2022)
18. Sung, H., et al.: Global cancer statistics 2020: Globocan estimates of incidence
and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer J. Clin.
71(3), 209–249 (2021). https://doi.org/10.3322/caac.21660, https://acsjournals.
onlinelibrary.wiley.com/doi/abs/10.3322/caac.21660
19. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training
data-efficient image transformers & distillation through attention. In: Proceedings
of the 38th International Conference on Machine Learning (ICML), pp. 10347–
10357 (2021)
20. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper
with image transformers. arXiv preprint arXiv:2103.17239 abs/2103.17239
(2021)
21. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense pre-
diction without convolutions. In: Proceedings of the IEEE/CVF International Con-
ference on Computer Vision (ICCV), pp. 568–578 (2021)
22. Yasaka, K., Akai, H., Abe, O., Kiryu, S.: Deep learning with convolutional neural
network for differentiation of liver masses at dynamic contrast-enhanced ct: A pre-
liminary study. Radiology 286(3), 887–896 (2018). https://doi.org/10.1148/radiol.
2017170706
23. Zhou, S.K., et al.: A review of deep learning in medical imaging: imaging traits,
technology trends, case studies with progress highlights, and future promises. Proc.
IEEE 109(5), 820–838 (2021). https://doi.org/10.1109/JPROC.2021.3054390
Author Index
A
Agrawal, Kunal 193, 203

B
Binh, Minh Tran 427
Bui, Manh Quan 415
Bui, Marc 439
Bui, Nhu-Nghia 3
Bui, Quang Vu 439

C
Cannon, Ian 203
Cuong, Ngo Xuan 182
Cuong, Nguyen Quoc 275

D
D. Huynh, Vinh-Hien 167
Dang, Thai-Viet 3
Dao, Mai Hoang 155
Dao, Thao Thi Phuong 100
Dao, Trong Hoan 452
Do, Manh Quang 216
Do, Trung-Hieu 100
Duong, Viet Hang 415

G
Giang, Nguyen Long 401

H
Hà, Minh Hoàng 354
Ho, Thi Kim Thoa 439
Hoang, Bao Ly Tran 298
Hoang, Diep Thi 467
Hung, Nguyen An 275
Huynh, Van-Hieu 54
Huynh, Viet-Tham 329

K
Kamioka, Eiji 287

L
Lam, Phat 39
Le, Ba Luat 371
Le, Chau-Anh 167
Le, Duy Minh 155
Le, Hoang 467
Le, Minh-Huan 14
Le, Quang-Khai 25
Le, Trung-Nghia 14, 25, 54, 65, 77, 88, 100, 127, 193
Liu, Yuchen 287

M
Mai, Tien 354
Mai, Tien-Dung 182
Mai, Xuan-Bach 65
Mien, Doan Phuoc 343
Minh, Hien Nguyen 467

N
Ngo, Dat 39
Nguyen Duy, Khang 386
Nguyen Tuan, Minh 386
Nguyen, Anh D. 239
Nguyen, Ba Nghien 216
Nguyen, Cong-Long 25
Nguyen, Duc P. T. 141
Nguyen, Duc-Vu 313
Nguyen, Hai-Dang 112
Nguyen, Hieu 127
Nguyen, Hoa N. 239
Nguyen, Hue T. 401
Nguyen, Huynh-Sang 495
Nguyen, Khanh-Duy 298
Nguyen, Kiet Van 313
Nguyen, Long 427